Audio-Visual Emotion Recognition Using K-Means Clustering and Spatio-Temporal CNN
Poster Presentation
Authors
1Department of Biomedical Engineering, Faculty of Engineering, University of Isfahan, Isfahan, Iran
2Department of Biomedical Engineering, Faculty of Engineering, University of Isfahan, Isfahan, Iran
3Department of Biomedical Engineering, Faculty of Engineering, University of Isfahan, Isfahan, Iran
Abstract
Emotion recognition is a challenging task due to the emotional gap between subjective feelings and low-level audio-visual characteristics. A feasible approach to high-performance emotion recognition could therefore enhance human-computer interaction. Deep learning methods have improved the performance of emotion recognition systems compared with other current methods. In this paper, a multimodal deep model combining a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network is proposed, which fuses audio and visual cues in a single deep architecture. Spatial and temporal features extracted from the video frames are fused with features derived from the short-time Fourier transform (STFT) of the audio signal. Finally, a Softmax classifier assigns each input to one of seven classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. The proposed model is evaluated on the Surrey Audio-Visual Expressed Emotion (SAVEE) database, achieving an accuracy of 95.48%.
Our experiments show that the proposed method outperforms existing algorithms for emotion recognition on this dataset.
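For illustration, the sketch below outlines in PyTorch the kind of fusion architecture the abstract describes: a 3D CNN followed by a BiLSTM for the spatio-temporal visual branch, a 2D CNN over the STFT magnitude spectrogram for the audio branch, and a Softmax classifier over the seven emotion classes applied to the fused embeddings. All layer sizes, frame counts, and STFT settings are illustrative assumptions rather than the authors' configuration, and the K-means frame-selection step mentioned in the title is not shown.

```python
# Minimal sketch of an audio-visual fusion network (assumed configuration).
import torch
import torch.nn as nn


class AudioVisualEmotionNet(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        # Visual branch: a small 3D CNN extracts spatio-temporal features
        # from a clip of face frames shaped (N, 3, T, H, W).
        self.video_cnn = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis
        )
        # BiLSTM models the temporal dynamics of the per-frame CNN features.
        self.bilstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        # Audio branch: a 2D CNN over the STFT magnitude spectrogram (N, 1, F, T).
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Late fusion: concatenate the two embeddings and map to class logits;
        # the Softmax is applied implicitly by CrossEntropyLoss during training.
        self.classifier = nn.Linear(2 * 64 + 32, num_classes)

    def forward(self, frames: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
        v = self.video_cnn(frames)                  # (N, 32, T, 1, 1)
        v = v.flatten(2).transpose(1, 2)            # (N, T, 32)
        _, (h, _) = self.bilstm(v)                  # h: (2, N, 64)
        v = torch.cat([h[0], h[1]], dim=1)          # (N, 128) bidirectional summary
        a = self.audio_cnn(spectrogram).flatten(1)  # (N, 32)
        return self.classifier(torch.cat([v, a], dim=1))


if __name__ == "__main__":
    # Example: two clips of 16 RGB frames (64x64) with matching STFT
    # magnitude spectrograms computed from 1-second audio at 16 kHz.
    frames = torch.randn(2, 3, 16, 64, 64)
    spec = torch.stft(torch.randn(2, 16000), n_fft=254, return_complex=True).abs()
    spec = spec.unsqueeze(1)                        # (2, 1, F, T)
    logits = AudioVisualEmotionNet()(frames, spec)
    print(logits.shape)                             # torch.Size([2, 7])
```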
Keywords
bidirectional long short-term memory, 3D convolutional neural network, deep learning, emotion recognition, short-time Fourier transform