Matching Items (5)
Description
Audio signals, such as speech and ambient sounds, convey rich information about a user's activity, mood, or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This is challenging due to its subjective nature and hence requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions for speech-based applications, namely emotion recognition and keyword detection, and for ambient-sound-based applications such as lifelogging.

The expression and perception of emotions vary across speakers and cultures; features and classification methods that generalize well to different conditions are therefore strongly desired. A latent-topic-model-based method is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches on multiple databases. Cross-corpus studies are conducted to determine how well these features generalize across databases. The proposed method is also applied to derive features from facial expressions; multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves recognition performance.
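The bag-of-acoustic-words front end underlying such latent topic features can be sketched as follows. This is a numpy-only toy on random data; the function names and the codebook size are illustrative assumptions, and it stops at the normalized count vectors on which a topic model (e.g., LDA) would then be fit to produce the supra-segmental features:

```python
import numpy as np

def build_codebook(frames, k=8, iters=10, seed=0):
    """Toy k-means codebook over frame-level acoustic descriptors
    (e.g. MFCCs). Each centroid acts as one 'acoustic word'."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((frames[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return centroids

def utterance_histogram(frames, centroids):
    """Bag-of-acoustic-words vector for one utterance; topic posteriors
    inferred from such counts would serve as the supra-segmental feature
    (this sketch stops at the normalized histogram)."""
    labels = np.argmin(((frames[:, None] - centroids) ** 2).sum(-1), axis=1)
    counts = np.bincount(labels, minlength=len(centroids)).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(1)
utterances = [rng.normal(size=(50, 13)) for _ in range(4)]  # 13-dim frames
codebook = build_codebook(np.vstack(utterances))
features = np.array([utterance_histogram(u, codebook) for u in utterances])
```

Each utterance, regardless of duration, is reduced to a fixed-length distribution over acoustic words, which is what allows a topic model to learn supra-segmental structure from variable-length recordings.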

Besides affecting the acoustic properties of speech, emotions strongly influence speech articulation kinematics. A learning approach is proposed that constrains a classifier trained on acoustic descriptors to also model articulatory data. This method requires articulatory information only during the training stage, overcoming the challenges inherent to large-scale articulatory data collection while exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.
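The training-only articulatory constraint can be illustrated as a multitask network: a shared hidden layer feeds both an emotion head and an auxiliary head that reconstructs articulatory trajectories, so the articulatory loss shapes the shared representation but only the acoustic path is needed at test time. This is a hedged numpy toy on synthetic data, not the thesis's actual model; all sizes and the loss weight `lam` are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, d_art = 200, 10, 16, 4
X = rng.normal(size=(n, d))                     # acoustic descriptors
A = np.tanh(X @ rng.normal(size=(d, d_art)))    # articulatory data (train only)
y = (X[:, 0] > 0).astype(float)                 # toy binary emotion label

W1 = rng.normal(size=(d, h)) * 0.1              # shared representation
w2 = rng.normal(size=h) * 0.1                   # emotion head
W3 = rng.normal(size=(h, d_art)) * 0.1          # articulatory head (train only)
lam, lr = 0.5, 0.1
for _ in range(500):
    H = np.maximum(X @ W1, 0.0)                 # shared hidden layer (ReLU)
    p = 1.0 / (1.0 + np.exp(-(H @ w2)))         # emotion probability
    E = H @ W3 - A                              # articulatory residual
    dH = np.outer(p - y, w2) + lam * (E @ W3.T) # both losses shape W1
    dH *= (H > 0)
    W1 -= lr * X.T @ dH / n
    w2 -= lr * H.T @ (p - y) / n
    W3 -= lr * lam * H.T @ E / n
```

At classification time only `X -> W1 -> w2` is evaluated, so articulatory measurements are never required after training, mirroring the constraint described above.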

Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation, and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. Its performance is evaluated on real-world data and accompanied by a prototypical Android-based user interface.
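As a minimal sketch of the segmentation step in such a framework (the window length, threshold, and synthetic signal below are assumptions, not the thesis's actual pipeline), long recordings can be framed into windows, scored by log energy, and merged into active segments:

```python
import numpy as np

def segment_audio(x, rate, win_s=1.0, thresh_db=-30.0):
    """Minimal energy-based segmenter for long lifelog recordings:
    frame into non-overlapping windows, compute log energy, and mark
    windows above a threshold as acoustically active."""
    win = int(win_s * rate)
    n = len(x) // win
    frames = x[: n * win].reshape(n, win)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = energy_db > thresh_db
    # merge consecutive active windows into (start_s, end_s) segments
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * win_s, i * win_s))
            start = None
    if start is not None:
        segments.append((start * win_s, n * win_s))
    return segments

# 6 s toy recording: near-silence with a 440 Hz tone from 2 s to 4 s
rate = 8000
t = np.arange(rate * 6) / rate
x = np.where((t > 2) & (t < 4), np.sin(2 * np.pi * 440 * t),
             0.001 * np.sin(t))
```

Only the active segments would then be passed to feature extraction and annotation, which is what makes day-long recordings tractable.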

The proposed methods are also assessed in terms of computational and implementation complexity. Software and field-programmable gate array (FPGA) implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time operation and low power consumption.
Contributors: Shah, Mohit (Author) / Spanias, Andreas (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
Working memory and cognitive functions contribute to speech recognition in normal hearing and hearing impaired listeners. In this study, auditory and cognitive functions are measured in young adult normal hearing, elderly normal hearing, and elderly cochlear implant subjects. The effects of age and hearing on the different measures are investigated. The correlations between auditory/cognitive functions and speech/music recognition are examined. The results may demonstrate which factors can better explain the variable performance across elderly cochlear implant users.
Contributors: Kolberg, Courtney Elizabeth (Author) / Luo, Xin (Thesis director) / Azuma, Tamiko (Committee member) / Department of Speech and Hearing Science (Contributor) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description
This thesis investigated the impact of word complexity, as measured by the Proportion of Whole-Word Proximity (PWP; Ingram, 2002), on consonant correctness, as measured by the Percentage of Consonants Correct (PCC; Shriberg & Kwiatkowski, 1980), in the spoken words of monolingual Spanish-speaking children. The effect of word complexity on consonant correctness has previously been studied in English-speaking children (Knodel, 2012); the present study extends this line of research to determine whether it can be appropriately applied to Spanish. Language samples from a previous study (Hase, 2010) were used, in which Spanish-speaking children were given two articulation assessments: the Evaluación fonológica del habla infantil (FON; Bosch Galceran, 2004) and the Spanish Test of Articulation for Children Under Three Years of Age (STAR; Bunta, 2002). It was hypothesized that word complexity would affect a Spanish-speaking child's production of correct consonants, as was seen for the English-speaking children studied. This hypothesis was supported for 10 of the 14 children. The pattern of word complexity found for Spanish was as follows: CVCV > CVCVC; trisyllables without clusters > disyllabic words with clusters.
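The two measures can be sketched directly from their published definitions; the /pato/ example and segment counts below are illustrative assumptions. PCC is the proportion of target consonants produced correctly, while PWP divides the child's phonological mean length of utterance (pMLU: one point per produced segment plus one per correct consonant, per Ingram, 2002) by the target word's pMLU:

```python
def pcc(target_consonants, produced_correct):
    """Percentage of Consonants Correct (Shriberg & Kwiatkowski, 1980):
    correct consonants over consonants attempted, as a percentage."""
    return 100.0 * produced_correct / target_consonants

def pmlu(n_segments, n_correct_consonants):
    """Phonological Mean Length of Utterance (Ingram, 2002): one point
    per segment produced plus one per correct consonant."""
    return n_segments + n_correct_consonants

def pwp(child_segments, child_correct_cons, target_segments, target_cons):
    """Proportion of Whole-Word Proximity: child pMLU over target pMLU."""
    return pmlu(child_segments, child_correct_cons) / pmlu(target_segments,
                                                           target_cons)

# Target /pato/ 'duck' (4 segments, 2 consonants) produced as [pato]:
full_credit = pwp(4, 2, 4, 2)   # 6/6 = 1.0
# Reduced to [pa] (2 segments, 1 correct consonant):
reduced = pwp(2, 1, 4, 2)       # 3/6 = 0.5
```

Because PWP penalizes both omitted segments and incorrect consonants, more complex target shapes (clusters, extra syllables) tend to lower it, which is the relationship the thesis tests against PCC.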
Contributors: Purinton, Kaitlyn Lisa (Author) / Ingram, David (Thesis director) / Dixon, Dixon (Committee member) / Barlow, Jessica (Committee member) / Barrett, The Honors College (Contributor) / Department of Speech and Hearing Science (Contributor) / School of International Letters and Cultures (Contributor)
Created: 2013-12
Description
The purpose of the present study was to determine whether an automated speech perception task yields results equivalent to a word recognition test used in audiometric evaluations. This was done by testing 51 normal-hearing adults with a traditional word recognition task (NU-6) and an automated Non-Word Detection task. Stimuli for each task were presented in quiet as well as at six signal-to-noise ratios (SNRs) increasing in 3 dB increments (+0 dB, +3 dB, +6 dB, +9 dB, +12 dB, +15 dB). A two one-sided test (TOST) procedure was used to determine the equivalency of the two tests. This approach required the scores for both tasks to be arcsine transformed and converted to z-scores in order to calculate the difference in scores across listening conditions; these values were then compared to a predetermined criterion to establish whether equivalency exists. It was expected that the TOST procedure would reveal equivalency between the traditional word recognition task and the automated Non-Word Detection task. The results confirmed that the two tasks differed by no more than two test items in any listening condition. Overall, the results indicate that the automated Non-Word Detection task could be used in addition to, or in place of, traditional word recognition tests. In addition, the features of an automated test such as the Non-Word Detection task offer benefits including rapid administration, accurate scoring, and supplemental performance data (e.g., error analyses) beyond those obtained with traditional speech perception measures.
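The transform-and-threshold skeleton of the analysis can be sketched as follows. The scores, the equivalence criterion, and the use of a z critical value are illustrative assumptions (a full TOST would use one-sided t tests on the paired, transformed differences); only the arcsine transform and the two one-sided comparisons come from the abstract:

```python
import math

def arcsine(p):
    """Variance-stabilizing arcsine transform for a proportion score."""
    return 2.0 * math.asin(math.sqrt(p))

def tost_equivalent(scores_a, scores_b, criterion):
    """Two one-sided tests sketch: transform each proportion, z-score the
    paired differences, and declare equivalence when both one-sided
    statistics clear the 5% critical value (1.645)."""
    diffs = [arcsine(a) - arcsine(b) for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = (sum((d - mean) ** 2 for d in diffs) / (n - 1)) ** 0.5
    se = sd / n ** 0.5
    z_lower = (mean + criterion) / se   # H0: difference <= -criterion
    z_upper = (mean - criterion) / se   # H0: difference >= +criterion
    return z_lower > 1.645 and z_upper < -1.645

nu6 = [0.92, 0.88, 0.96, 0.90, 0.94]  # hypothetical word-recognition scores
nwd = [0.90, 0.90, 0.94, 0.92, 0.92]  # hypothetical non-word-detection scores
```

With a loose criterion the two score sets test as equivalent; shrinking the criterion below the observed mean difference makes the upper test fail, illustrating why the predetermined criterion drives the conclusion.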
Contributors: Stahl, Amy Nicole (Author) / Pittman, Andrea (Thesis director) / Boothroyd, Arthur (Committee member) / McBride, Ingrid (Committee member) / School of Human Evolution and Social Change (Contributor) / Department of Speech and Hearing Science (Contributor) / Barrett, The Honors College (Contributor)
Created: 2017-05
Description
Speech recognition and keyword detection are becoming increasingly popular applications for mobile systems. While deep neural network (DNN) implementations of these systems perform very well, they have large memory and compute requirements, making their deployment on a mobile device quite challenging. In this thesis, techniques to reduce the memory and computation cost of keyword detection and speech recognition DNNs are presented.

The first technique is based on representing all weights and biases with a small number of bits and mapping all nodal computations into fixed-point ones with minimal degradation in accuracy. Experiments conducted on the Resource Management (RM) database show that for the keyword detection network, representing the weights with 5 bits yields a six-fold reduction in memory compared to a floating-point implementation, with very little loss in performance. Similarly, for the speech recognition network, representing the weights with 6 bits yields a five-fold reduction in memory while maintaining an error rate similar to that of a floating-point implementation. Additional memory reduction is achieved by a technique called weight pruning, in which the weights are classified as sensitive or insensitive and the sensitive weights are represented with higher precision. A combination of these two techniques reduces the memory footprint by 81-84% for the speech recognition and keyword detection networks, respectively.
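The fixed-point idea can be sketched as uniform quantization over the observed weight range; note that 32/5 ≈ 6.4, consistent with the roughly six-fold saving quoted against 32-bit floats. Everything below (matrix size, the 90th-percentile "sensitive" rule, the 8/4-bit mix) is an illustrative assumption, not the thesis's exact scheme:

```python
import numpy as np

def quantize(weights, bits):
    """Uniform fixed-point quantization: map each weight onto one of
    2**bits levels spanning the observed range, so the worst-case
    error is half a quantization step."""
    lo, hi = weights.min(), weights.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((weights - lo) / step) * step

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(64, 64))
W5 = quantize(W, bits=5)                 # 5-bit storage per weight
reduction = 32 / 5                       # ~6.4x vs. 32-bit floats

# toy version of the pruning idea: 'sensitive' (large-magnitude) weights
# keep 8 bits, the insensitive majority drop to 4 bits
sensitive = np.abs(W) > np.quantile(np.abs(W), 0.9)
W_mixed = np.where(sensitive, quantize(W, 8), quantize(W, 4))
```

Mapping nodal computations to fixed point then lets multiplies and accumulates run on narrow integer units, which is where the implementation savings beyond storage come from.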

Further reduction in memory size is achieved by judiciously dropping connections for large blocks of weights. This technique, termed coarse-grain sparsification, introduces hardware-aware sparsity during DNN training, which leads to efficient weight-memory compression and a significant reduction in the number of computations during classification without loss of accuracy. Keyword detection and speech recognition DNNs trained with 75% of the weights dropped and classified with 5-6 bit weight precision reduced the weight memory requirement by ~95% compared to a fully connected network with double precision, while showing similar keyword detection accuracy and word error rate.
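Coarse-grain sparsification can be sketched as keeping only whole tiles of a weight matrix so hardware can skip dropped blocks entirely. The tile size, keep fraction, and magnitude criterion below are assumptions for illustration, and this toy prunes a fixed matrix rather than applying the sparsity during training as the thesis does:

```python
import numpy as np

def block_sparsify(W, block=8, keep_frac=0.25, seed=0):
    """Zero out whole block x block tiles of W, keeping only the tiles
    with the largest average magnitude. Returns the sparsified matrix
    and the fraction of tiles retained."""
    rows, cols = W.shape
    br, bc = rows // block, cols // block
    tiles = W.reshape(br, block, bc, block)
    energy = np.abs(tiles).mean(axis=(1, 3))         # per-tile magnitude
    k = max(1, int(round(keep_frac * br * bc)))
    thresh = np.sort(energy.ravel())[-k]             # k-th largest energy
    mask = (energy >= thresh)[:, None, :, None]      # broadcast over tiles
    return (tiles * mask).reshape(rows, cols), mask.mean()

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_sparse, density = block_sparsify(W)    # ~75% of tiles dropped
```

Because entire tiles are zero, only the kept blocks (plus a small block index) need to be stored and multiplied, which is what makes the compression hardware-friendly compared with unstructured pruning.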
Contributors: Arunachalam, Sairam (Author) / Chakrabarti, Chaitali (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Arizona State University (Publisher)
Created: 2016