Author
Listed:
- Xingguo Qin
(School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
These authors contributed equally to this work.)
- Ya Zhou
(School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
These authors contributed equally to this work.)
- Jun Li
(School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Guangxi Key Laboratory of Image and Graphic Intelligence Processing, Guilin University of Electronic Technology, Guilin 541004, China)
Abstract
Spatial position reasoning tests the perception and comprehension abilities of artificial intelligence, particularly in multimodal models that fuse imagery with language. Recent visual language models have made significant advances on multimodal reasoning tasks, and contrastive learning models based on the Contrastive Language-Image Pre-training (CLIP) framework have attracted substantial interest. Current contrastive learning models, however, concentrate on the nouns and verbs in image descriptions, while spatial locatives receive comparatively little attention. Prepositional spatial indicators are nevertheless pivotal: they encode the positional relations between entities in an image, which are essential to the reasoning ability of image–language models. This paper introduces a spatial position reasoning model founded on spatial locative terms. The model focuses on the spatial prepositions in image descriptions, uses them to model the positional relations between entities in images, evaluates and verifies those spatial relations, and aligns the predictions with the image–text descriptions. It extends the CLIP model, mining the semantic features of spatial prepositions and highlighting their guiding role in visual language models. Experiments on open datasets show that the proposed model effectively captures the correspondence of spatial indicators between image and textual representations. Incorporating spatial position terms raised the average predictive accuracy by approximately three percentage points.
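As an illustration of the evaluation idea described in the abstract (not the authors' implementation), the following Python sketch probes a stock CLIP checkpoint by swapping the spatial preposition in a caption and scoring each variant against an image. The checkpoint name, image path, and captions are assumptions chosen for illustration only.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Off-the-shelf CLIP checkpoint (an assumption; the paper's model
    # extends CLIP but is not this stock checkpoint).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # hypothetical image of a cat and a table
    prepositions = ["on", "under", "next to", "behind", "in front of"]
    captions = [f"a cat {p} the table" for p in prepositions]

    # Score every preposition variant of the caption against the image.
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
    probs = logits.softmax(dim=-1).squeeze(0)

    for p, score in zip(prepositions, probs.tolist()):
        print(f"{p:12s} {score:.3f}")

If the probability mass concentrates on the correct preposition, the model has encoded the spatial relation; the paper's contribution is to strengthen exactly this locative signal during contrastive training.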
Suggested Citation
Xingguo Qin & Ya Zhou & Jun Li, 2024.
"Spatial Position Reasoning of Image Entities Based on Location Words,"
Mathematics, MDPI, vol. 12(24), pages 1-14, December.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:24:p:3940-:d:1543986