
CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network

Authors

Listed:
  • Mustaqeem

    (Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea)

  • Soonil Kwon

    (Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea)

Abstract

Artificial intelligence, deep learning, and machine learning are the dominant tools for making systems smarter. The smart speech emotion recognition (SER) system is now a basic necessity and an emerging research area of digital audio signal processing, and SER plays an important role in many applications related to human–computer interaction (HCI). However, existing state-of-the-art SER systems show quite low prediction performance, which must be improved before they become feasible for real-time commercial applications. The key reasons for the low accuracy and poor prediction rate are data scarcity and model configuration, the most challenging aspects of building a robust machine learning technique. In this paper, we address the limitations of existing SER systems and propose a unique artificial intelligence (AI)-based system structure for SER that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) with sequence learning. We design four ConvLSTM blocks, called local feature learning blocks (LFLBs), to extract local emotional features in a hierarchical correlation. The ConvLSTM layers apply convolution operations to the input-to-state and state-to-state transitions in order to extract spatial cues. The four LFLBs extract spatiotemporal cues from speech signals in hierarchical correlational form using a residual learning strategy. Furthermore, we utilize a novel sequence learning strategy to extract global information and adaptively adjust the relevant global feature weights according to the correlation of the input features. Finally, we use the center loss function together with the softmax loss to produce the class probabilities; the center loss improves the final classification results, ensures accurate prediction, and plays a conspicuous role in the whole proposed SER scheme. We tested the proposed system on two standard speech corpora, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and obtained recognition rates of 75% and 80%, respectively.
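
For readers who want a concrete picture of the pipeline described in the abstract, the snippet below is a minimal, hypothetical Keras sketch of that kind of architecture: four ConvLSTM-based local feature learning blocks, a recurrent sequence-learning stage (a bidirectional GRU is used here as a stand-in for the paper's adaptive sequence-learning module), and a softmax classifier trained jointly with an auxiliary center loss. All input shapes, layer sizes, class counts, and the center-update rule are illustrative assumptions rather than the authors' published configuration, and the residual connections mentioned in the abstract are omitted for brevity.

```python
# Hypothetical sketch of a hierarchical ConvLSTM SER model with center loss.
# Shapes, filter counts, and NUM_CLASSES are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4            # e.g. angry / happy / sad / neutral (assumption)
T, H, W, C = 8, 32, 32, 1  # sequence of spectrogram patches (assumption)
EMB_DIM = 64               # embedding size fed to the center loss (assumption)

def lflb(x, filters):
    """One 'local feature learning block': ConvLSTM + BatchNorm + spatial pooling."""
    x = layers.ConvLSTM2D(filters, kernel_size=3, padding="same",
                          return_sequences=True)(x)
    x = layers.BatchNormalization()(x)
    # Pool only the spatial dimensions; keep the time axis intact.
    x = layers.TimeDistributed(layers.MaxPooling2D(pool_size=2))(x)
    return x

inputs = layers.Input(shape=(T, H, W, C))
x = inputs
for f in (16, 32, 64, 128):                      # four hierarchical LFLBs
    x = lflb(x, f)
x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)  # (batch, T, 128)
x = layers.Bidirectional(layers.GRU(128))(x)     # global sequence learning (stand-in)
features = layers.Dense(EMB_DIM, activation="relu", name="embedding")(x)
logits = layers.Dense(NUM_CLASSES, name="logits")(features)
model = models.Model(inputs, [logits, features])

# Center loss: pull each utterance embedding toward a learnable per-class center.
centers = tf.Variable(tf.zeros((NUM_CLASSES, EMB_DIM)), trainable=False)

def center_loss(labels, embeddings, alpha=0.5):
    # labels: int32 tensor of shape (batch,); embeddings: (batch, EMB_DIM).
    picked = tf.gather(centers, labels)
    diff = embeddings - picked
    # Simple running update of the class centers (often done outside the
    # gradient step in practice; kept inline here for brevity).
    centers.scatter_nd_sub(tf.expand_dims(labels, 1), alpha * diff)
    return tf.reduce_mean(tf.reduce_sum(tf.square(diff), axis=1))

def total_loss(labels, logits, embeddings, lam=0.01):
    # Softmax cross-entropy plus a weighted center-loss penalty.
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.reduce_mean(ce) + lam * center_loss(labels, embeddings)
```

In this sketch the center-loss term plays the role the abstract attributes to it: it tightens intra-class clusters in the embedding space so that the softmax decision becomes more discriminative, while the ConvLSTM stack supplies the hierarchical spatiotemporal features that feed the sequence-learning stage.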

Suggested Citation

  • Mustaqeem & Soonil Kwon, 2020. "CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network," Mathematics, MDPI, vol. 8(12), pages 1-19, November.
  • Handle: RePEc:gam:jmathe:v:8:y:2020:i:12:p:2133-:d:454149

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/8/12/2133/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/8/12/2133/
    Download Restriction: no


    Citations

    Citations are extracted by the CitEc Project.

    Cited by:

    1. Jie He & Farong Gao & Jian Wang & Qiuxuan Wu & Qizhong Zhang & Weijie Lin, 2022. "A Method Combining Multi-Feature Fusion and Optimized Deep Belief Network for EMG-Based Human Gait Classification," Mathematics, MDPI, vol. 10(22), pages 1-20, November.
    2. Jingjing Cao & Zhipeng Wen & Liang Huang & Jinshan Dai & Hu Qin, 2024. "EFE-LSTM: A Feature Extension, Fusion and Extraction Approach Using Long Short-Term Memory for Navigation Aids State Recognition," Mathematics, MDPI, vol. 12(7), pages 1-20, March.
    3. Kai Ding & Zhangqi Niu & Jizhuang Hui & Xueliang Zhou & Felix T. S. Chan, 2022. "A Weld Surface Defect Recognition Method Based on Improved MobileNetV2 Algorithm," Mathematics, MDPI, vol. 10(19), pages 1-18, October.


      Corrections

      All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:8:y:2020:i:12:p:2133-:d:454149. See general information about how to correct material in RePEc.

      If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

      If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

      If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

      For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

      Please note that corrections may take a couple of weeks to filter through the various RePEc services.

      IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.