Author
Listed:
- Jungbeom Ko
(Department of Health Sciences and Technology, Gachon Advanced Institute for Health Sciences and Technology (GAIHST), Gachon University, Incheon 21936, Republic of Korea)
- Hyunchul Kim
(School of Information, University of California Berkeley, 102 South Hall 4600, Berkeley, CA 94720, USA)
- Jungsuk Kim
(Department of Biomedical Engineering, College of IT Convergence, Gachon University, Seongnam-si 13120, Republic of Korea)
Abstract
The advent of voice assistance technology and its integration into smart devices has facilitated many useful services, such as texting and application execution. However, most assistive technologies lack the capability to enable the system to act as a human who can localize the speaker and selectively spot meaningful keywords. Because keyword spotting (KWS) and sound source localization (SSL) are essential and must operate in real time, the efficiency of a neural network model is crucial for memory and computation. In this paper, a single neural network model for KWS and SSL is proposed to overcome the limitations of sequential KWS and SSL, which require more memory and inference time. The proposed model uses multi-task learning to utilize the limited resources of the device efficiently. A shared encoder is used as the initial layer to extract common features from the multichannel audio data. Subsequently, the task-specific parallel layers utilize these features for KWS and SSL. The proposed model was evaluated on a synthetic dataset with multiple speakers, and a 7-module shared encoder structure was identified as optimal in terms of accuracy, direction of arrival (DOA) accuracy, DOA error, and latency. It achieved a KWS accuracy of 94.51%, DOA error of 12.397°, and DOA accuracy of 89.86%. Consequently, the proposed model requires significantly less memory owing to the shared network architecture, which enhances the inference time without compromising KWS accuracy, DOA error, and DOA accuracy.
Suggested Citation
Jungbeom Ko & Hyunchul Kim & Jungsuk Kim, 2024.
"Combined Keyword Spotting and Localization Network Based on Multi-Task Learning,"
Mathematics, MDPI, vol. 12(21), pages 1-14, October.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:21:p:3309-:d:1504014
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:12:y:2024:i:21:p:3309-:d:1504014. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.