Theses and Dissertations
Displaying 1 - 2 of 2
Filtering by
- All Subjects: emotion recognition
- Creators: Davulcu, Hasan
- Creators: Turaga, Pavan
Description
Audio signals, such as speech and ambient sounds, convey rich information about a user’s activity, mood, or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This is challenging because such information is subjective in nature and therefore requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions, for speech-based applications involving emotion recognition and keyword detection, and for ambient-sound-based applications such as lifelogging.
The expression and perception of emotions vary across speakers and cultures; thus, features and classification methods that generalize well to different conditions are strongly desired. A method based on latent topic models is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches over multiple databases. Cross-corpus studies are conducted to determine how well these features generalize across different databases. The proposed method is also applied to derive features from facial expressions; a multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves recognition performance.
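The abstract does not spell out the feature pipeline, but a common way to build topic-model features from low-level descriptors is to quantize frames into "acoustic words" and treat each utterance as a document over that vocabulary. The sketch below uses synthetic data and illustrative codebook and topic sizes; it is not the dissertation's exact method, only the general idea with scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Synthetic stand-in for low-level acoustic descriptors:
# 20 utterances, each a variable-length sequence of 13-dim frame vectors.
utterances = [rng.normal(size=(rng.integers(50, 100), 13)) for _ in range(20)]

# 1) Quantize frames into an "acoustic word" codebook (size is illustrative).
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack(utterances))

# 2) Represent each utterance as a bag of acoustic words.
def bag_of_words(frames, km, vocab=32):
    words = km.predict(frames)
    return np.bincount(words, minlength=vocab)

counts = np.array([bag_of_words(u, codebook) for u in utterances])

# 3) Fit LDA; the per-utterance topic proportions serve as
#    supra-segmental features for a downstream emotion classifier.
lda = LatentDirichletAllocation(n_components=8, random_state=0)
features = lda.fit_transform(counts)  # one topic-proportion vector per utterance
```

The topic proportions summarize an entire utterance in a single low-dimensional vector, which is what makes them "supra-segmental" relative to the frame-level descriptors.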
Besides affecting the acoustic properties of speech, emotions have a strong influence on speech articulation kinematics. A learning approach is proposed that constrains a classifier trained on acoustic descriptors to also model articulatory data. This method requires articulatory information only during the training stage, thus overcoming the challenges inherent to large-scale data collection, while simultaneously exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.
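The dissertation's specific constrained-training algorithm is not reproduced here. As a minimal sketch of the general setting, where articulatory data exists only at training time, one can learn an acoustic-to-articulatory mapping and feed its predictions to the classifier at test time. All names, dimensions, and the toy labeling rule below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# Synthetic training data: acoustic descriptors, articulatory features
# (available only at training time), and binary emotion labels.
X_ac = rng.normal(size=(200, 10))                    # acoustic descriptors
W = rng.normal(size=(10, 4))
X_art = X_ac @ W + 0.1 * rng.normal(size=(200, 4))   # correlated articulation
y = (X_ac[:, 0] + X_art[:, 0] > 0).astype(int)       # toy emotion label

# 1) Learn an acoustic-to-articulatory mapping on the training set.
art_model = LinearRegression().fit(X_ac, X_art)

# 2) Train the classifier on acoustics augmented with *predicted*
#    articulation, so no articulatory sensors are needed at test time.
X_aug = np.hstack([X_ac, art_model.predict(X_ac)])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y)

# 3) At test time only acoustics are observed.
X_test = rng.normal(size=(5, 10))
pred = clf.predict(np.hstack([X_test, art_model.predict(X_test)]))
```

This two-stage scheme is only one simple way to exploit training-only articulatory data; the dissertation instead constrains the classifier itself during learning.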
Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation, and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. Its performance is evaluated on real-world data and accompanied by a prototypical Android-based user interface.
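As a toy illustration of only the segmentation step (the full framework also covers feature extraction and annotation), a long recording can be split wherever the log frame energy jumps. The frame length and threshold below are illustrative choices, not values from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "day-long" recording: three acoustic contexts with
# different energy levels (e.g., office, street, cafe).
signal = np.concatenate([
    0.1 * rng.normal(size=16000),   # quiet context
    1.0 * rng.normal(size=16000),   # loud context
    0.3 * rng.normal(size=16000),   # moderate context
])

def segment_by_energy(x, frame_len=400, threshold=0.5):
    """Split a long recording where the log frame energy jumps."""
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    log_e = np.log(np.mean(frames ** 2, axis=1) + 1e-10)
    jumps = np.where(np.abs(np.diff(log_e)) > threshold)[0] + 1
    # Convert frame indices back to sample offsets.
    return [int(j * frame_len) for j in jumps]

boundaries = segment_by_energy(signal)
```

Energy alone is a crude cue; a practical system would segment on richer features, but the windowed scan shown here is the shape such single-pass processing of long recordings typically takes.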
The proposed methods are also assessed in terms of computation and implementation complexity. Software and field-programmable gate array (FPGA) based implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time capabilities and low power consumption.
Contributors: Shah, Mohit (Author) / Spanias, Andreas (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
While discrete emotions such as joy, anger, and disgust are quite popular, continuous emotion dimensions such as arousal and valence are gaining popularity within the research community due to the increasing availability of datasets annotated with these dimensions. Unlike discrete emotions, continuous emotions allow modeling of subtle and complex affect dimensions but are difficult to predict.
Dimension reduction techniques form the core of emotion recognition systems and help create a new feature space that is more useful for predicting emotions. However, these techniques do not necessarily guarantee better predictive capability, as most of them are unsupervised, especially in regression learning. In the emotion recognition literature, supervised dimension reduction techniques have not been explored much, and in this work a solution is provided through probabilistic topic models. Topic models provide a strong probabilistic framework in which to embed new learning paradigms and modalities. In this thesis, the graphical structure of Latent Dirichlet Allocation (LDA) has been explored, and new models tuned to emotion recognition and change detection have been built.
In this work, it has been shown that the double mixture structure of topic models helps 1) to visualize feature patterns, and 2) to project features onto a topic simplex that is more predictive of human emotions than popular techniques like PCA and kernel PCA. Traditionally, topic models have been used on quantized features, but in this work a continuous topic model called the Dirichlet Gaussian Mixture model (DGMM) is proposed. Evaluation of the DGMM has shown that, when modeling videos, the performance of LDA models can be replicated even without quantizing the features. Until now, topic models have not been explored in a supervised context for video analysis, and thus a regularized supervised topic model (RSLDA) that models video and audio features is introduced. The RSLDA learning algorithm performs both dimension reduction and regularized linear regression simultaneously, and it has outperformed supervised dimension reduction techniques such as SPCA and correlation-based feature selection algorithms. In a first of its kind, two new topic models, the Adaptive Temporal Topic Model (ATTM) and SLDA for Change Detection (SLDACD), have been developed for predicting concept drift in time series data. These models do not assume independence of consecutive frames and outperform traditional topic models in detecting local and global changes, respectively.
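ATTM and SLDACD themselves are not reproduced here. The sketch below only illustrates the underlying intuition: a concept change in a time series of quantized features shows up as a sharp divergence between the topic distributions of consecutive frames. The synthetic counts, model sizes, and the Hellinger distance are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(3)

# Synthetic word-count "frames": vocabulary usage flips after frame 30,
# mimicking a concept change in a stream of quantized features.
p_before = np.r_[np.full(10, 0.09), np.full(10, 0.01)]
p_after = np.r_[np.full(10, 0.01), np.full(10, 0.09)]
frames = np.vstack([rng.multinomial(200, p_before, size=30),
                    rng.multinomial(200, p_after, size=30)])

lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
theta = lda.fit_transform(frames)  # per-frame topic proportions

def hellinger(p, q):
    """Distance between two probability vectors, in [0, 1]."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Flag a change where consecutive topic distributions diverge most sharply.
dists = np.array([hellinger(theta[t], theta[t + 1])
                  for t in range(len(theta) - 1)])
change_point = int(np.argmax(dists)) + 1
```

Unlike this post-hoc scan, the ATTM and SLDACD models described above build the temporal dependence between consecutive frames into the topic model itself.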
Contributors: Lade, Prasanth (Author) / Panchanathan, Sethuraman (Thesis advisor) / Davulcu, Hasan (Committee member) / Li, Baoxin (Committee member) / Balasubramanian, Vineeth N (Committee member) / Arizona State University (Publisher)
Created: 2015