Emotion recognition in conversation has applications within numerous domains such as affective computing and medicine. Recent methods for emotion recognition jointly utilize conversational data over several modalities including audio, video, and text. However, state-of-the-art frameworks for this task do not focus on the feature extraction and feature fusion steps of this process. This thesis aims to improve the state-of-the-art method by incorporating two components to better accomplish these steps. By doing so, we are able to produce improved representations for the text modality and better model the relationships between all modalities. This paper proposes two methods which focus on these concepts and provide improved accuracy over the state-of-the-art framework for multimodal emotion recognition in dialogue.
- Improving upon the State-of-the-Art in Multimodal Emotional Recognition in Dialogue