Author
Listed:
- Wenhao Liu
(School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China)
- Simiao Yuan
(Zibo Medical Emergency Command Center, Zibo 255000, China)
- Zhen Wang
(School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China)
- Xinyi Chang
(School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China)
- Limeng Gao
(School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China)
- Zhenrui Zhang
(School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China)
Abstract
The image-recipe cross-modal retrieval task, which retrieves relevant recipes given food images and vice versa, is attracting widespread attention. The task poses two main challenges. First, a recipe's components (words in a sentence, sentences in an entity, and entities in a recipe) differ in importance. If all components are assigned the same weight, the recipe embedding cannot attend more to the important components, so they contribute less to retrieval. Second, food images exhibit strong locality, and only the local food regions matter; enhancing the discriminative local region features in food images remains difficult. To address these two problems, we propose a novel framework named Dual Cross Attention Encoders for Cross-modal Food Retrieval (DCA-Food). The proposed framework consists of a hierarchical cross attention recipe encoder (HCARE) and a cross attention image encoder (CAIE). HCARE comprises three types of cross attention modules that capture the important words in a sentence, the important sentences in an entity, and the important entities in a recipe, respectively. CAIE extracts global and local region features and computes cross attention between them to enhance the discriminative local features in food images. We conduct ablation studies to validate our design choices. Our proposed approach outperforms existing approaches by a large margin on the Recipe1M dataset; specifically, it improves R@1 by +2.7 and +1.9 on the 1k and 10k test sets, respectively.
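The abstract's CAIE computes cross attention between a global image feature and local region features so that discriminative regions are weighted more heavily. The paper's exact formulation is not reproduced in this listing; the following is a minimal sketch assuming standard scaled dot-product cross attention, with the global feature as the query and the local region features as keys and values. All names and dimensions here are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys_values):
    """Scaled dot-product cross attention: `query` attends over `keys_values`.

    query:       (1, d)  e.g. a global image feature
    keys_values: (n, d)  e.g. n local region features
    Returns a (1, d) weighted sum of the regions, where regions more
    similar to the global feature receive higher attention weights.
    """
    d = query.shape[-1]
    scores = query @ keys_values.T / np.sqrt(d)  # (1, n) similarity scores
    weights = softmax(scores, axis=-1)           # attention distribution over regions
    return weights @ keys_values                 # (1, d) attended feature

rng = np.random.default_rng(0)
global_feat = rng.standard_normal((1, 8))  # hypothetical global image embedding
local_feats = rng.standard_normal((5, 8))  # hypothetical local region embeddings
attended = cross_attention(global_feat, local_feats)
print(attended.shape)  # → (1, 8)
```

In this sketch the attention weights form a probability distribution over the five regions, so the output is a convex combination of the local features that emphasizes regions resembling the global view of the dish. The paper's HCARE presumably applies analogous attention hierarchically (words, sentences, entities), though the exact module design is only described in the full text.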
Suggested Citation
Wenhao Liu & Simiao Yuan & Zhen Wang & Xinyi Chang & Limeng Gao & Zhenrui Zhang, 2024.
"Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders,"
Mathematics, MDPI, vol. 12(20), pages 1-18, October.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:20:p:3181-:d:1496606