Context recognition methods using audio signals for human-machine interaction

Description

Audio signals, such as speech and ambient sounds, convey rich information about a user’s activity, mood, or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. The task is challenging because such context is subjective, and it therefore requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions, for speech-based applications involving emotion recognition and keyword detection, and for ambient sound-based applications such as lifelogging.

The expression and perception of emotions vary across speakers and cultures, so features and classification methods that generalize well to different conditions are strongly desired. A method based on latent topic models is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches over multiple databases. Cross-corpus studies are conducted to determine how well these features generalize across different databases. The proposed method is also applied to derive features from facial expressions; a multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves the recognition performance.

Besides affecting the acoustic properties of speech, emotions have a strong influence over speech articulation kinematics. A learning approach is proposed that constrains a classifier trained over acoustic descriptors to also model articulatory data. This method requires articulatory information only during the training stage, thus overcoming the challenges inherent to large-scale data collection, while simultaneously exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.

Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation, and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. The performance is evaluated on real-world data and accompanied by a prototypical Android-based user interface.

The proposed methods are also assessed in terms of computation and implementation complexity. Software and field-programmable gate array-based implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time capabilities and low power consumption.

Contributors: Shah, Mohit (Author) / Spanias, Andreas (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created: 2015

LYFE: Learning Your Formed Experiences

Description

In this thesis, I explored the interconnected ways in which human experience can shape and be shaped by environments of the future, such as interactive environments and spaces embedded with sensors and enlivened by advanced algorithms for sensor data processing. I have developed an abstract representational experience into the vast and continual journey through life that shapes how we can use sensory immersion. The experimental work was housed in the iStage, an advanced black box space in the School of Arts, Media, and Engineering, which consists of video cameras, motion capture systems, spatial audio systems, and controllable lighting and projector systems. The malleable and interactive space of the iStage was transformed into a reflective tool for gaining insight into the overall shared, but very individual, emotional odyssey. Additionally, I surveyed participants after they engaged in the experience to better understand their perceptions and interpretations of it. With the participants' responses and collective reflection upon the project, I can begin to think about future iterations and how they might contain applications in health and/or wellness.

Contributors: Haagen, Jordan (Author) / Turaga, Pavan (Thesis director) / Drummond Otten, Caitlin (Committee member) / Barrett, The Honors College (Contributor) / Arts, Media and Engineering Sch T (Contributor) / School of Human Evolution & Social Change (Contributor)

Created: 2022-05

Haagen Final Project (Spring 2022)

Contributors: Haagen, Jordan (Author) / Turaga, Pavan (Thesis director) / Drummond Otten, Caitlin (Committee member) / Barrett, The Honors College (Contributor) / Arts, Media and Engineering Sch T (Contributor)

Created: 2022-05

Final Installation Showcase

Contributors: Haagen, Jordan (Author) / Turaga, Pavan (Thesis director) / Drummond Otten, Caitlin (Committee member) / Barrett, The Honors College (Contributor) / Arts, Media and Engineering Sch T (Contributor)

Created: 2022-05
