This collection includes both ASU Theses and Dissertations, submitted by graduate students, and the Barrett, Honors College theses submitted by undergraduate students. 

Description
Audio signals, such as speech and ambient sounds, convey rich information about a user’s activity, mood, or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction; it is also challenging, owing to the subjective nature of the problem, and hence requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions, for speech-based applications involving emotion recognition and keyword detection, and for ambient-sound applications such as lifelogging.

The expression and perception of emotions vary across speakers and cultures, so features and classification methods that generalize well to different conditions are strongly desired. A method based on latent topic models is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches on multiple databases. Cross-corpus studies are conducted to determine how well these features generalize across different databases. The proposed method is also applied to derive features from facial expressions; a multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves recognition performance.
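The abstract does not spell out the topic-model pipeline, but the general idea it describes, treating quantized frame-level acoustic descriptors as "words," utterances as "documents," and factorizing the resulting count matrix into a few latent topics whose per-utterance weights serve as supra-segmental features, can be sketched with non-negative matrix factorization as a simple stand-in for a probabilistic topic model. The codebook size, topic count, and toy data below are illustrative assumptions, not the dissertation's actual configuration:

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Factorize a non-negative matrix V (utterances x codewords) into
    W (utterances x topics) and H (topics x codewords) using
    Frobenius-norm multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy "bag of acoustic words": 6 utterances counted over a 10-word codebook,
# generated from 2 underlying topics so a rank-2 factorization fits well.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 10))
W, H = nmf(V, k=2)
```

Each row of W is then a low-dimensional, utterance-level feature vector summarizing which latent acoustic "topics" dominate that utterance.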

Besides affecting the acoustic properties of speech, emotions strongly influence speech articulation kinematics. A learning approach is proposed that constrains a classifier trained on acoustic descriptors to also model articulatory data. This method requires articulatory information only during the training stage, thus overcoming the challenges inherent to large-scale articulatory data collection, while exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.
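The train-time-only use of articulatory data resembles multi-task learning with privileged information: a shared representation is fit to both an emotion-classification loss and an articulatory-reconstruction loss, and at test time only the acoustic branch is used. The sketch below is a toy numpy illustration of that training pattern with fully synthetic data and arbitrary dimensions; it is not the dissertation's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: X = acoustic descriptors, A = articulatory
# trajectories (available only during training), y = emotion labels.
n, d, p, k, C = 200, 4, 2, 3, 2
y = rng.integers(0, C, n)
means = np.tile(np.where(y[:, None] == 1, 2.0, -2.0), (1, p))
X = np.hstack([means + 0.5 * rng.standard_normal((n, p)),
               rng.standard_normal((n, d - p))])
A = means                                 # articulatory targets

U = 0.1 * rng.standard_normal((d, k))     # shared representation
Wc = 0.1 * rng.standard_normal((k, C))    # emotion head (used at test time)
Wa = 0.1 * rng.standard_normal((k, p))    # articulatory head (training only)
onehot = np.eye(C)[y]
lr, lam = 0.2, 0.5

for _ in range(1000):
    h = X @ U
    logits = h @ Wc
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    dlogits = (probs - onehot) / n                 # cross-entropy gradient
    dAhat = 2.0 * lam * (h @ Wa - A) / (n * p)     # weighted MSE gradient
    dh = dlogits @ Wc.T + dAhat @ Wa.T             # both losses shape U
    Wc -= lr * (h.T @ dlogits)
    Wa -= lr * (h.T @ dAhat)
    U -= lr * (X.T @ dh)

# Inference needs only acoustics: Wa is discarded after training.
pred = np.argmax((X @ U) @ Wc, axis=1)
acc = float((pred == y).mean())
```

The key design point mirrored here is that the articulatory head influences the shared weights U during training but is never consulted at test time.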

Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation, and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. Its performance is evaluated on real-world data and accompanied by a prototypical Android-based user interface.
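A common first stage when segmenting day-long recordings is to label fixed-size windows by energy and merge runs of active windows into segments; this is only one plausible ingredient of such a framework, not the dissertation's published pipeline. A minimal sketch, with illustrative window size and threshold:

```python
import numpy as np

def segment_by_energy(x, sr, win_s=0.5, thresh_db=-30.0):
    """Label fixed-size windows as active/quiet by RMS energy (dBFS) and
    merge runs of active windows into (start, end) segments in seconds."""
    hop = int(win_s * sr)
    n = len(x) // hop
    rms = np.sqrt(np.mean(x[:n * hop].reshape(n, hop) ** 2, axis=1))
    db = 20.0 * np.log10(rms + 1e-12)
    active = db > thresh_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * win_s, i * win_s))
            start = None
    if start is not None:
        segments.append((start * win_s, n * win_s))
    return segments

# Demo: 10 s of silence with a tone between 3 s and 7 s.
sr = 8000
t = np.arange(10 * sr) / sr
x = np.zeros_like(t)
mask = (t >= 3.0) & (t < 7.0)
x[mask] = 0.5 * np.sin(2 * np.pi * 440.0 * t[mask])
segs = segment_by_energy(x, sr)
```

Fixed non-overlapping windows keep memory and computation linear in recording length, which matters for multi-hour lifelogging audio.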

The proposed methods are also assessed in terms of computational and implementation complexity. Software and field-programmable gate array (FPGA) based implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time operation and low power consumption.
Contributors: Shah, Mohit (Author) / Spanias, Andreas (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
Glottal fry is a vocal register characterized by low frequency and increased signal perturbation, and is perceptually identified by its popping, creaky quality. Recently, the glottal fry vocal register has received growing awareness and attention in popular culture and media in the United States. The creaky quality originally associated with vocal pathologies is indeed becoming “trendy,” particularly among young women across the United States. But while existing studies have defined, quantified, and attempted to explain the use of glottal fry in conversational speech, there is currently no explanation for its increasing prevalence among American women. This thesis proposes that conversational entrainment—a communication phenomenon describing the propensity to modify one’s behavior to align more closely with that of one’s communication partner—may provide a theoretical framework for explaining the growing use of glottal fry among college-aged women in the United States. Female participants (n = 30) between the ages of 18 and 29 years (M = 20.6, SD = 2.95) held conversations with two partners, one of whom used quantifiably more glottal fry than the other. The study used perceptual and quantitative acoustic measures to address the following key question: does the amount of habitual glottal fry in a conversational partner influence one’s use of glottal fry in one’s own speech? The results yielded two findings: (1) according to perceptual annotations, participants used more glottal fry when speaking with the Fry conversation partner than with the Non-Fry partner; and (2) statistically significant differences were found in the acoustics of the participants’ vocal quality depending on the conversation partner.
While the current study demonstrates that young women are indeed speaking in glottal fry in everyday conversations, and that its use can be attributed in part to conversational entrainment, we still lack a clear explanation of the deeper motivations for women to speak in a lower vocal register. The current study opens avenues for continued analysis of the sociolinguistic functions of the glottal fry register.
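The abstract characterizes glottal fry acoustically by low fundamental frequency and increased perturbation. A minimal sketch of flagging fry-like frames via autocorrelation-based F0 estimation is shown below; the 70 Hz threshold and the method itself are illustrative assumptions, not the thesis's annotation procedure, and practical creak detectors also examine jitter, shimmer, and spectral cues:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=30.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak
    within a plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def is_fry(frame, sr, f0_threshold=70.0):
    """Flag a frame as fry-like when its F0 falls below a low-pitch
    threshold (70 Hz here is an illustrative cutoff)."""
    return estimate_f0(frame, sr) < f0_threshold

# Demo frames: a very low-pitched (fry-like) tone vs. a modal-pitch tone.
sr = 16000
t = np.arange(2048) / sr
fry_like = np.sin(2 * np.pi * 50.0 * t)
modal = np.sin(2 * np.pi * 200.0 * t)
```

Running a detector like this over each participant's recording would give the per-frame fry counts that perceptual annotations are compared against.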
Contributors: Delfino, Christine R (Author) / Liss, Julie M (Thesis advisor) / Borrie, Stephanie A (Thesis advisor) / Azuma, Tamiko (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)
Created: 2015