Printed from https://ideas.repec.org/a/gam/jmathe/v11y2023i21p4550-d1274227.html

Exploring Spatial-Based Position Encoding for Image Captioning

Author

Listed:
  • Xiaobao Yang

    (School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
    School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China)

  • Shuai He

    (School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China)

  • Junsheng Wu

    (School of Software, Northwestern Polytechnical University, Xi’an 710072, China)

  • Yang Yang

    (School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China)

  • Zhiqiang Hou

    (School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China)

  • Sugang Ma

    (School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an 710061, China)

Abstract

Image captioning has become a hot topic in artificial intelligence research and sits at the intersection of computer vision and natural language processing. Most recent image captioning models adopt an “encoder + decoder” architecture, in which the encoder extracts the visual features and the decoder generates the descriptive sentence word by word. However, the visual features must be flattened into a sequence before being passed to the decoder, which discards the 2D spatial position information of the image. This limitation is particularly pronounced in the Transformer architecture, since it is inherently not position-aware. Therefore, in this paper, we propose a simple coordinate-based spatial position encoding method (CSPE) to remedy this deficiency. CSPE first creates 2D position coordinates for each feature pixel and then encodes them by row and by column separately, via either trainable or hard encoding, effectively strengthening the position representation of the visual features and enriching the generated descriptions. In addition, to reduce the time cost, we also explore a diagonal-based spatial position encoding (DSPE) approach. Compared with CSPE, DSPE is slightly inferior in performance but faster to compute. Extensive experiments on the MS COCO 2014 dataset demonstrate that CSPE and DSPE can significantly enhance the spatial position representation of visual features. CSPE, in particular, improves the BLEU-4 and CIDEr metrics by 1.6% and 5.7%, respectively, compared with a baseline model without sequence-based position encoding, and also outperforms current sequence-based position encoding approaches by a significant margin. In addition, the robustness and plug-and-play ability of the proposed method are validated on a medical caption generation model.
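
The coordinate-based idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the row and column encodings are computed or combined, so the sketch assumes that "hard encoding" means a fixed sinusoidal table, that the two encodings are concatenated, and that the result is added to the flattened features. The names `sinusoidal_table` and `coordinate_position_encoding`, the 7 x 7 grid, and the 512-dimensional features are hypothetical choices for illustration only.

```python
import numpy as np


def sinusoidal_table(n_positions, dim):
    """Fixed (non-trainable) sinusoidal table of shape (n_positions, dim).

    Assumption: stands in for the 'hard encoding' named in the abstract.
    dim must be even.
    """
    positions = np.arange(n_positions)[:, None]                      # (n, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    table = np.zeros((n_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table


def coordinate_position_encoding(height, width, d_model):
    """Hypothetical helper: encode each feature pixel's (row, col) coordinate.

    Rows and columns are encoded separately; concatenating the two halves
    is an assumption, not something the abstract specifies.
    Returns an array of shape (height * width, d_model).
    """
    assert d_model % 4 == 0, "each half must have an even dimension"
    half = d_model // 2
    row_table = sinusoidal_table(height, half)   # one vector per row index
    col_table = sinusoidal_table(width, half)    # one vector per column index
    # 2D coordinates for every feature pixel, each of shape (height, width).
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    enc = np.concatenate([row_table[rows], col_table[cols]], axis=-1)
    # Flatten in the same row-major order used to flatten the feature map.
    return enc.reshape(height * width, d_model)


# Usage: add the encoding to a flattened feature map (random stand-in values).
features = np.random.default_rng(0).normal(size=(7, 7, 512))
encoding = coordinate_position_encoding(7, 7, 512)
encoded_sequence = features.reshape(49, 512) + encoding
```

Under these assumptions, two feature pixels in the same row share the first half of their encoding and two pixels in the same column share the second half, so the 2D layout remains recoverable after the feature map is flattened into a sequence.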

Suggested Citation

  • Xiaobao Yang & Shuai He & Junsheng Wu & Yang Yang & Zhiqiang Hou & Sugang Ma, 2023. "Exploring Spatial-Based Position Encoding for Image Captioning," Mathematics, MDPI, vol. 11(21), pages 1-16, November.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:21:p:4550-:d:1274227

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/21/4550/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/21/4550/
    Download Restriction: no
