Audio-Visual Emotion Recognition Using K-Means Clustering and Spatio-Temporal CNN
Poster Presentation
Authors
1Department of Biomedical Engineering, Faculty of Engineering, University of Isfahan, Isfahan, Iran
2Department of Biomedical Engineering, Faculty of Engineering, University of Isfahan, Isfahan, Iran
3Department of Biomedical Engineering, Faculty of Engineering, University of Isfahan, Isfahan, Iran
Abstract
Emotion recognition is a challenging task due to the emotional gap between subjective feeling and low-level audio-visual characteristics. Thus, the development of a feasible approach for high-performance emotion recognition could enhance human-computer interaction. Deep learning methods have improved the performance of emotion recognition systems in comparison with other current methods. In this paper, a multimodal model combining a deep convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network is proposed, which fuses audio and visual cues in a single deep model. The spatial and temporal features extracted from video frames are fused with short-time Fourier transform (STFT) features extracted from the audio signals. Finally, a Softmax classifier assigns each input to one of seven classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. The proposed model is evaluated on the Surrey Audio-Visual Expressed Emotion (SAVEE) database, achieving an accuracy of 95.48%.
Our experiments show that the proposed method is more effective than existing algorithms for emotion recognition on this dataset.
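To make the described pipeline concrete, the following is a minimal PyTorch sketch of an audio-visual fusion network of the kind summarized above: a spatio-temporal (3D) CNN over video frames, a BiLSTM over the STFT spectrogram of the audio, late fusion of the two embeddings, and a softmax over the seven emotion classes. All layer sizes, the STFT settings (n_fft, hop_length), and the exact placement of the CNN and BiLSTM branches are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn


class AudioVisualEmotionNet(nn.Module):
    """Illustrative audio-visual emotion classifier (not the paper's exact model)."""

    def __init__(self, num_classes: int = 7):
        super().__init__()
        # Spatio-temporal (3D) CNN over a stack of video frames.
        self.video_cnn = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # -> (B, 32, 1, 1, 1)
        )
        # BiLSTM over the STFT magnitude spectrogram of the audio signal.
        # 201 = n_fft // 2 + 1 frequency bins for n_fft = 400 (assumed).
        self.audio_bilstm = nn.LSTM(input_size=201, hidden_size=64,
                                    batch_first=True, bidirectional=True)
        # Late fusion of the two modality embeddings, then a linear classifier.
        self.classifier = nn.Linear(32 + 2 * 64, num_classes)

    def forward(self, frames: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, T, H, W) video clip; audio: (B, samples) raw waveform.
        v = self.video_cnn(frames).flatten(1)                      # (B, 32)
        spec = torch.stft(audio, n_fft=400, hop_length=160,
                          return_complex=True).abs()               # (B, 201, frames)
        _, (h, _) = self.audio_bilstm(spec.transpose(1, 2))        # final hidden states
        a = torch.cat([h[0], h[1]], dim=1)                         # (B, 128)
        logits = self.classifier(torch.cat([v, a], dim=1))
        return logits.softmax(dim=1)   # probabilities over the 7 emotion classes


# Example: two 16-frame 112x112 clips with 1 s of 16 kHz audio each.
model = AudioVisualEmotionNet()
probs = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 16000))
print(probs.shape)  # torch.Size([2, 7])
```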
Keywords
bidirectional long short-term memory; 3D convolutional neural network; deep learning; emotion recognition; short-time Fourier transform