Printed from https://ideas.repec.org/a/gam/jmathe/v10y2022i18p3346-d915697.html

Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

Author

Listed:
  • Fudong Nian

    (School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China
    Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling, Anhui Jianzhu University, Hefei 230601, China)

  • Ling Ding

    (School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China)

  • Yuxia Hu

    (Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling, Anhui Jianzhu University, Hefei 230601, China)

  • Yanhong Gu

    (School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China)

Abstract

This paper aims to improve the performance of video–text retrieval. To date, many algorithms have been proposed to facilitate similarity measurement for video–text retrieval, ranging from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the modeled semantic levels are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space solely through feature-distance measurement is incomplete; and (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval that jointly models video–text similarity at the global, entity, action and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into the global, entity, action and relationship semantic levels by carefully designed spatial–temporal semantic learning structures. Then, we utilize KLDivLoss and a cross-modal parameter-sharing attribute projection layer as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video–text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video–text retrieval datasets, MSR-VTT and VATEX, demonstrate the viability of our method.
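The abstract does not give the FBCE formula, but the name suggests the standard focal-loss reweighting applied to a per-attribute binary cross-entropy. The sketch below is a minimal illustration under that assumption: the function name `focal_bce` and the hyper-parameter values (`gamma`, `alpha`) are illustrative choices, not the authors' implementation.

```python
import math

def focal_bce(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Sketch of a focal binary cross-entropy over one multi-label
    attribute vector (assumed form, following the usual focal loss).

    p: predicted attribute probabilities in (0, 1)
    y: binary ground-truth attribute labels (0 or 1)
    gamma: focusing parameter; gamma=0 reduces to plain weighted BCE
    alpha: balance weight between positive and negative labels
    """
    total = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp for numerical stability
        # Down-weight easy examples: (1 - p)^gamma for positives,
        # p^gamma for negatives, so rare/hard attributes dominate the loss.
        pos = -alpha * (1.0 - pi) ** gamma * yi * math.log(pi)
        neg = -(1.0 - alpha) * pi ** gamma * (1.0 - yi) * math.log(1.0 - pi)
        total += pos + neg
    return total / len(p)
```

Under this formulation, a confidently wrong prediction on a rare positive attribute contributes far more to the loss than an already well-classified one, which is how such a term would counteract the imbalanced attribute-label distribution the paper targets.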

Suggested Citation

  • Fudong Nian & Ling Ding & Yuxia Hu & Yanhong Gu, 2022. "Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval," Mathematics, MDPI, vol. 10(18), pages 1-19, September.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:18:p:3346-:d:915697

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/18/3346/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/18/3346/
    Download Restriction: no

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:10:y:2022:i:18:p:3346-:d:915697. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to register here. Registration allows you to link your profile to this item and to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item; you can help by adding them.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.