
An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

Author

Listed:
  • Liu He

    (Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China)

  • Shuyan Liu

    (Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China)

  • Ran An

    (Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China)

  • Yudong Zhuo

    (Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China)

  • Jian Tao

    (Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China)

Abstract

Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention due to its advantages of fast extraction of remote sensing image information and flexible human–computer interaction. Traditional RSCTIR methods mainly focus on improving uni-modal feature extraction separately, and most rely on pre-trained object detectors to obtain better local feature representations, which not only lack multi-modal interaction information but also introduce a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF) consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module for image local features instead of a pre-trained object detector. Through semantic alignment of visual and text features, the vision transformer module matches the performance of pre-trained object detectors on image local features. In addition, the trained multi-modal encoder can improve the top-one and top-five ranking performance by reranking the retrieved candidates. Experiments on the common RSICD and RSITMD datasets demonstrate that our EnVLF obtains state-of-the-art retrieval performance.
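The two-stage scheme the abstract describes — a fast uni-modal (dual-encoder) ranking over the whole gallery, followed by a more expensive multi-modal rerank of only the top candidates — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `fusion_score` callable stands in for the paper's multi-modal encoder, and the embeddings are assumed to be precomputed vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, image_embs, fusion_score, k=5):
    """Two-stage text-to-image retrieval.

    Stage 1 ranks every gallery image by uni-modal embedding
    similarity (cheap, done for the full gallery). Stage 2 reranks
    only the top-k candidates with a multi-modal scorer (expensive,
    so restricted to k items). Returns gallery indices in final order.
    """
    # Stage 1: rank all images by dual-encoder similarity.
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: cosine(query_emb, image_embs[i]),
                    reverse=True)
    # Stage 2: rerank only the top-k with the fusion scorer,
    # which here is a user-supplied stand-in for a multi-modal encoder.
    topk = ranked[:k]
    topk.sort(key=lambda i: fusion_score(query_emb, image_embs[i]),
              reverse=True)
    return topk + ranked[k:]
```

Restricting the fusion model to the top-k keeps retrieval latency close to that of a pure dual-encoder system while still letting cross-modal interaction correct the head of the ranking — which is why the reranking step improves top-one and top-five metrics specifically.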

Suggested Citation

  • Liu He & Shuyan Liu & Ran An & Yudong Zhuo & Jian Tao, 2023. "An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval," Mathematics, MDPI, vol. 11(10), pages 1-17, May.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:10:p:2279-:d:1146321

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/10/2279/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/10/2279/
    Download Restriction: no

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:11:y:2023:i:10:p:2279-:d:1146321. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to register here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help add them by using this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.