Author
Listed:
- Yoonseok Heo
(Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
Work done while interning at LG AI Research.)
- Taehoon Kim
(LG AI Research, Seoul 07796, Republic of Korea)
- Seunghwan Kim
(LG AI Research, Seoul 07796, Republic of Korea)
- Jungyun Seo
(LG AI Research, Seoul 07796, Republic of Korea)
- Juae Kim
(Department of English Linguistics and Language Technology, Division of Language & AI, Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea)
Abstract
Video captioning is the task of describing the visual scene of a given video in natural language. Several lines of research have focused on developing large-scale models in a transfer learning paradigm, with a major challenge being the trade-off between scalability and performance in limited environments. To address this problem, we propose a simple yet effective encoder–decoder-based video captioning model that integrates transformers and CLIP, both of which are widely adopted in the vision and language domains, together with appropriate temporal feature embedding modules. Taking this proposal a step further, we also address the challenge of human-interactive video captioning, where the captions are tailored to the specific information desired by a human. To design a human-interactive environment, we assume that a human offers an object or action in the video as a short prompt; in turn, the system provides a detailed explanation regarding the prompt. We embed human prompts with an LSTM-based prompt encoder and leverage soft prompting to tune the model effectively. We extensively evaluated our model on benchmark datasets, demonstrating competitive results, most notably on the MSR-VTT dataset, where we achieve state-of-the-art performance with a 4% improvement. In addition, we demonstrate the potential for human-interactive video captioning through quantitative and qualitative analyses.
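The abstract describes the architecture only at a high level. The sketch below (PyTorch) shows one plausible way such a pipeline could be wired together: frozen CLIP frame features with a learned temporal position embedding feed a transformer encoder, while an LSTM prompt encoder turns a short human prompt into soft prompt vectors prepended to the decoder input. All module choices, sizes, and names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PromptedVideoCaptioner(nn.Module):
    """Minimal, illustrative sketch of a CLIP-feature transformer
    encoder-decoder captioner with an LSTM soft-prompt encoder.
    Hyperparameters and structure are assumptions, not the paper's exact model."""

    def __init__(self, d_model=512, vocab_size=30522, max_frames=512):
        super().__init__()
        # Learned temporal position embedding over sampled frame features
        # (frame features are assumed to come from a frozen CLIP image encoder).
        self.temporal_pos = nn.Embedding(max_frames, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # LSTM prompt encoder: embeds a short human prompt (an object or
        # action in the video) into soft prompt vectors for the decoder.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.prompt_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, prompt_ids, caption_ids):
        # frame_feats: (B, T, d_model) CLIP features for T sampled frames
        B, T, _ = frame_feats.shape
        pos = self.temporal_pos(torch.arange(T, device=frame_feats.device))
        memory = self.encoder(frame_feats + pos)  # video context

        # Soft prompt: LSTM hidden states over the human prompt tokens.
        soft_prompt, _ = self.prompt_lstm(self.tok_embed(prompt_ids))

        # Prepend soft prompt to the (shifted) caption embeddings.
        tgt = torch.cat([soft_prompt, self.tok_embed(caption_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)

        # Predict logits only over caption positions (skip prompt positions).
        return self.lm_head(hidden[:, soft_prompt.size(1):, :])
```

At inference time, a caption would be decoded autoregressively from this model, feeding predicted tokens back into `caption_ids`; during tuning, the soft-prompt positions are excluded from the language-model loss, which is the usual soft-prompting setup.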
Suggested Citation
Yoonseok Heo & Taehoon Kim & Seunghwan Kim & Jungyun Seo & Juae Kim, 2024.
"Towards Human-Interactive Controllable Video Captioning with Efficient Modeling,"
Mathematics, MDPI, vol. 12(13), pages 1-14, June.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:13:p:2037-:d:1426092