Audio processing and loudness estimation algorithms with iOS simulations

Description

The processing power and storage capacity of portable devices have improved considerably over the past decade. This has motivated the implementation of sophisticated audio and other signal processing algorithms on such mobile devices. Of particular interest in this thesis is audio/speech processing based on perceptual criteria. Specifically, estimation of parameters from human auditory models, such as auditory patterns and loudness, involves computationally intensive operations which can strain device resources. Hence, strategies for implementing computationally efficient human auditory models for loudness estimation have been studied in this thesis. Existing algorithms for reducing computations in auditory pattern and loudness estimation have been examined, and improved algorithms have been proposed to overcome limitations of these methods. In addition, real-time applications such as perceptual loudness estimation and loudness equalization using auditory models have also been implemented. A software implementation of loudness estimation on iOS devices is also reported in this thesis. In addition to the loudness estimation algorithms and software, this thesis project also created new illustrations of speech and audio processing concepts for research and education. As a result, a new suite of speech/audio DSP functions was developed and integrated as part of the award-winning educational iOS app 'iJDSP'. These functions are described in detail in this thesis. Several enhancements in the architecture of the application have also been introduced to provide the supporting framework for speech/audio processing. Frame-by-frame processing and visualization functionalities have been developed to facilitate speech/audio processing. In addition, facilities for easy sound recording, processing and audio rendering have also been developed to provide students, practitioners and researchers with an enriched DSP simulation tool. Simulations and assessments have also been developed for use in classes and training of practitioners and students.

Contributors: Kalyanasundaram, Girish (Author) / Spanias, Andreas S (Thesis advisor) / Tepedelenlioğlu, Cihan (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2013

Audiovisual perception of dysarthric speech in older adults compared to younger adults

Description

Everyday speech communication typically takes place face-to-face. Accordingly, the task of perceiving speech is a multisensory phenomenon involving both auditory and visual information. The current investigation examines how visual information influences recognition of dysarthric speech. It also explores whether the influence of visual information is dependent upon age. Forty adults participated in the study, which measured intelligibility (percent words correct) of dysarthric speech in auditory versus audiovisual conditions. Participants were then separated into two groups, older adults (age range 47 to 68) and young adults (age range 19 to 36), to examine the influence of age. Findings revealed that all participants, regardless of age, improved their ability to recognize dysarthric speech when visual speech was added to the auditory signal. The magnitude of this benefit, however, was greater for older adults when compared with younger adults. These results inform our understanding of how visual speech information influences understanding of dysarthric speech.

Contributors: Fall, Elizabeth (Author) / Liss, Julie (Thesis advisor) / Berisha, Visar (Committee member) / Gray, Shelley (Committee member) / Arizona State University (Publisher)

Created: 2014

Joint radar-communications performance bounds: data versus estimation information rates

Description

The problem of cooperative radar and communications signaling is investigated. Each system typically considers the other system a source of interference. Consequently, the tradition is to have them operate in orthogonal frequency bands. By considering the radar and communications operations to be a single joint system, performance bounds on a receiver that observes communications and radar return in the same frequency allocation are derived. Performance of the joint system is measured in terms of the data information rate for communications and the estimation information rate for the radar. Inner bounds on performance are constructed.

Contributors: Chiriyath, Alex (Author) / Bliss, Daniel W (Thesis advisor) / Kosut, Oliver (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2014

Head rotation detection in marmoset monkeys

Description

Head movement is known to improve the accuracy of sound localization for humans and animals. The marmoset is a small-bodied New World monkey species that has become an emerging model for studying auditory function. This thesis aims to detect the horizontal and vertical rotation of head movement in marmoset monkeys.

Experiments were conducted in a sound-attenuated acoustic chamber. Head movement of marmoset monkeys was studied under various auditory and visual stimulation conditions. In order of increasing complexity, these conditions are (1) idle, (2) sound alone, (3) sound and visual signals, and (4) an alert signal produced by opening and closing the chamber door. All of these conditions were tested with the house light either on or off. An infrared camera with a frame rate of 90 Hz was used to capture the head movement of the monkeys. To assist the signal detection, two circular markers were attached to the top of the monkey's head. The data analysis used an image-based marker detection scheme. Images were processed using the Computer Vision Toolbox in MATLAB. The markers and their positions were detected using blob detection techniques. Based on the frame-by-frame information of marker positions, the angular position, velocity and acceleration were extracted in the horizontal and vertical planes. Adaptive Otsu thresholding, Kalman filtering and bound setting for marker properties were used to overcome a number of challenges encountered during this analysis, such as finding the image segmentation threshold, continuously tracking markers during large head movements, and detecting false alarms.
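As a rough illustration of the last extraction step, the sketch below derives an in-plane angle from two marker centroids and a first-difference angular velocity at the stated 90 Hz frame rate. It is a minimal sketch under assumptions: the abstract does not say how the angle is defined, so taking the direction of the line joining the two markers is a guess, and the function names and sample coordinates are hypothetical. Thresholding, blob detection and Kalman filtering are not shown.

```python
import math

FRAME_RATE_HZ = 90.0  # camera frame rate stated in the abstract


def head_angle_deg(marker_a, marker_b):
    """Angle of the line from marker_a to marker_b in the image plane, in degrees.

    Assumption: head orientation is taken as the direction of the line
    joining the two marker centroids; the thesis may define it differently.
    """
    dx = marker_b[0] - marker_a[0]
    dy = marker_b[1] - marker_a[1]
    return math.degrees(math.atan2(dy, dx))


def wrap_deg(delta):
    """Wrap an angle difference into [-180, 180) so crossing +/-180 is not a jump."""
    return (delta + 180.0) % 360.0 - 180.0


def angular_velocity_deg_per_s(angles_deg, frame_rate_hz=FRAME_RATE_HZ):
    """First-difference angular velocity between consecutive frames."""
    return [wrap_deg(b - a) * frame_rate_hz
            for a, b in zip(angles_deg, angles_deg[1:])]


# Hypothetical marker centroids (x, y) in pixels for three consecutive frames.
frames = [((0.0, 0.0), (10.0, 0.0)),
          ((0.0, 0.0), (10.0, 10.0)),
          ((0.0, 0.0), (0.0, 10.0))]
angles = [head_angle_deg(a, b) for a, b in frames]
velocities = angular_velocity_deg_per_s(angles)
```

Differencing the velocity list once more in the same way would give an acceleration estimate; in practice the positions would be smoothed first, since differencing amplifies noise.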

The results show that the blob detection method together with Kalman filtering yielded better performance than other image-based techniques such as optical flow and SURF features. The median of the maximal head turn in the horizontal plane was in the range of 20 to 70 degrees, and the median of the maximal velocity in the horizontal plane was in the range of a few hundred degrees per second. In comparison, the natural alert signal - door opening and closing - evoked faster head turns than the other stimulus conditions. These results suggest that behaviorally relevant stimuli such as alert signals evoke faster head-turn responses in marmoset monkeys.

Contributors: Simhadri, Sravanthi (Author) / Zhou, Yi (Thesis advisor) / Turaga, Pavan (Thesis advisor) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2014

Spectral efficiency of MIMO ad hoc networks with partial channel state information

Description

As the number of devices with wireless capabilities and the proximity of these devices to each other increase, better ways to handle the interference they cause need to be explored. It is also important for these devices to keep up with the demand for data rates while not compromising on industry-established expectations of power consumption and mobility. Current methods of distributing the spectrum among all participants are not expected to cope with the demand in the very near future. In this thesis, the effect of employing sophisticated multiple-input, multiple-output (MIMO) systems in this regard is explored. The efficacy of systems which can make intelligent decisions on the transmission mode usage and power allocation to these modes becomes relevant in the current scenario, where the need for performance far exceeds the cost expendable on hardware. The effect of adding multiple antennas at either end will be examined, and the capacity of such systems and of networks comprising many such participants will be evaluated. Methods of simulating said networks, and ways to achieve better performance by making intelligent transmission decisions, will be proposed. Finally, a method of access control closer to the physical layer (a 'statistical MAC') and a possible metric to be used for such a MAC are suggested.

Contributors: Thontadarya, Niranjan (Author) / Bliss, Daniel W (Thesis advisor) / Berisha, Visar (Committee member) / Ying, Lei (Committee member) / Arizona State University (Publisher)

Created: 2014

Estimation of subspace occupancy

Description

The ability to identify unoccupied resources in the radio spectrum is a key capability for opportunistic users in a cognitive radio environment. This paper draws upon and extends geometrically based ideas in statistical signal processing to develop estimators for the rank and the occupied subspace in a multi-user environment from multiple temporal samples of the signal received at a single antenna. These estimators enable identification of resources, such as the orthogonal complement of the occupied subspace, that may be exploitable by an opportunistic user. This concept is supported by simulations showing the estimation of the number of users in a simple CDMA system using a maximum a posteriori (MAP) estimate for the rank. It was found that with suitable parameters, such as high SNR, a sufficient number of time epochs, and codes of appropriate length, the number of users could be correctly estimated using the MAP estimator even when the noise variance is unknown. Additionally, the process of identifying the maximum likelihood estimate of the orthogonal projector onto the unoccupied subspace is discussed.

Contributors: Beaudet, Kaitlyn (Author) / Cochran, Douglas (Thesis advisor) / Turaga, Pavan (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2014

Learning Interaction Primitives for Biomechanical Prediction

Description

This dissertation is focused on developing an algorithm to provide current state estimation and future state predictions for biomechanical human walking features. The goal is to develop a system which is capable of evaluating the current action a subject is taking while walking and then use this to predict the future states of biomechanical features.

This work focuses on the exploration and analysis of Interaction Primitives (Amor et al., 2014) and their relevance to biomechanical prediction for human walking. Built on the framework of Probabilistic Movement Primitives, Interaction Primitives utilize an EKF SLAM algorithm to localize and map a distribution over the weights of a set of basis functions. The prediction properties of Bayesian Interaction Primitives were utilized to predict real-time foot forces from 9-degree-of-freedom IMUs mounted to a subject's tibias. This method shows that real-time human biomechanical features can be predicted and have a promising link to real-time controls applications.

Contributors: Clark, Geoffrey Mitchell (Author) / Ben Amor, Heni (Thesis advisor) / Si, Jennie (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2018

Joint Optimization of Quantization and Structured Sparsity for Compressed Deep Neural Networks

Description

Deep neural networks (DNNs) have shown tremendous success in various cognitive tasks, such as image classification and speech recognition. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirements.

To overcome these challenges, recent works have extensively investigated model compression techniques such as element-wise sparsity, structured sparsity and quantization. While most of these works have applied these compression techniques in isolation, there have been very few studies on application of quantization and structured sparsity together on a DNN model.

This thesis co-optimizes structured sparsity and quantization constraints on DNN models during training. Specifically, it obtains an optimal setting of 2-bit weights and 2-bit activations coupled with 4X structured compression by performing a combined exploration of quantization and structured compression settings. The optimal DNN model achieves a 50X weight memory reduction compared to a floating-point uncompressed DNN. This memory saving is significant, since applying only structured sparsity constraints achieves 2X memory savings and applying only quantization constraints achieves 16X memory savings. The algorithm has been validated on both high- and low-capacity DNNs and on wide-sparse and deep-sparse DNN models. Experiments demonstrated that a deep-sparse DNN outperforms a shallow-dense DNN, with varying levels of memory savings depending on DNN precision and sparsity levels. This work further proposed a Pareto-optimal approach to systematically extract optimal DNN models from a huge set of sparse and dense DNN models. The resulting 11 optimal designs were further evaluated by considering overall DNN memory, which includes activation memory and weight memory. It was found that there is only a small change in the memory footprint of the optimal designs corresponding to the low-sparsity DNNs. However, activation memory cannot be ignored for high-sparsity DNNs.
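The idea of extracting Pareto-optimal designs can be sketched as follows. This is a minimal sketch, assuming each candidate model is summarized by two objectives to be minimized, memory and error; the abstract does not list the objectives actually used, and the function name and the candidate values below are hypothetical, for illustration only.

```python
def pareto_front(designs):
    """Return the designs not dominated in (memory, error); lower is better for both.

    A design is dominated if another design is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for name, mem, err in designs:
        dominated = any(
            (m <= mem and e <= err) and (m < mem or e < err)
            for _, m, e in designs
        )
        if not dominated:
            front.append((name, mem, err))
    return sorted(front, key=lambda d: d[1])


# Hypothetical (name, memory in MB, error in %) values, not taken from the thesis.
candidates = [
    ("A", 1.0, 9.0),
    ("B", 2.0, 7.0),
    ("C", 2.5, 8.0),   # dominated by B: more memory and more error
    ("D", 4.0, 6.5),
    ("E", 4.0, 7.5),   # dominated by B and by D
]
optimal = pareto_front(candidates)
```

The pairwise check is quadratic in the number of candidates, which is adequate for a sketch; a sort-based sweep would scale better to a very large set of models.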

Contributors: Srivastava, Gaurav (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2018

Interaction Analytics of Software Factory Recordings

Description

A human communications research project at Arizona State University aurally recorded the daily interactions of aware and consenting employees and their visiting clients at the Software Factory, a software engineering consulting team, over a three-year period. The resulting dataset contains valuable insights on the communication networks that the participants formed; however, it is far too vast to be processed manually by researchers. In this work, digital signal processing techniques are employed to develop a software toolkit that can aid in estimating the observable networks contained in the Software Factory recordings. A four-step process is employed that starts with parsing available metadata to initially align the recordings, followed by alignment estimation and correction. Once aligned, the recordings are processed for common signals that are detected across multiple participants’ recordings, which serve as a proxy for conversations. Lastly, visualization tools are developed to graphically encode the estimated similarity measures to efficiently convey the observable network relationships to assist in future human communications research.
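One common way to estimate the residual offset between two recordings is to find the lag that maximizes their cross-correlation. The sketch below illustrates that idea only; the abstract does not state which alignment method the toolkit uses, so the technique, the function name `best_lag`, and the sample signals are assumptions.

```python
def best_lag(reference, other, max_lag):
    """Return the shift (in samples) of `other` relative to `reference`
    that maximizes their cross-correlation within +/- max_lag.

    A positive result means `other` lags behind `reference`.
    """
    def score(lag):
        total = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                total += r * other[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)


# Hypothetical signals: `delayed` is `pulse` shifted later by 3 samples.
pulse = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0] + pulse[:-3]
lag = best_lag(pulse, delayed, max_lag=5)
```

For long recordings this direct search is slow, and an FFT-based correlation over coarse envelopes would be the usual substitute; the direct form is kept here because it needs no external libraries.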

Contributors: Pressler, Daniel (Author) / Bliss, Daniel W (Thesis advisor) / Berisha, Visar (Committee member) / Corman, Steven (Committee member) / Arizona State University (Publisher)

Created: 2018

Using Capsule Networks for Image and Speech Recognition Problems

Description

In recent years, conventional convolutional neural networks (CNNs) have achieved outstanding performance in image and speech processing applications. Unfortunately, the pooling operation in a CNN discards spatial information, which is an important attribute in many applications. The recently proposed capsule network retains spatial information and improves on the capabilities of a traditional CNN. It uses capsules to describe features in multiple dimensions and dynamic routing to increase the statistical stability of the network.
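To make the notion of a capsule output concrete, the sketch below implements the squashing nonlinearity commonly used with capsule networks, which keeps a vector's direction while mapping its length into the range from 0 to 1. It is a minimal sketch: the abstract does not specify the network's internals, so the use of this particular nonlinearity is an assumption, and the routing iterations themselves are not shown.

```python
import math


def squash(s):
    """Squash a capsule's input vector s.

    The output has the same direction as s and length
    ||s||^2 / (1 + ||s||^2), so short vectors shrink toward zero
    and long vectors approach unit length.
    """
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


# Hypothetical capsule input with length 5.
v = squash([3.0, 4.0])
v_length = math.sqrt(sum(x * x for x in v))
```

Because the output length stays below 1, it can be read as a confidence that the feature represented by the capsule is present, while the direction encodes the feature's properties.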

In this work, we first use a capsule network for the overlapping digit recognition problem. We evaluate the performance of the network with respect to recognition accuracy, convergence and training time per epoch. We show that the capsule network achieves higher accuracy when the training set size is small. When the training set size is larger, the capsule network and a conventional CNN have comparable recognition accuracy. The training time per epoch for the capsule network is longer than for a conventional CNN because of the dynamic routing algorithm. An analysis of the GPU timing shows that adjusting the capsule structure can significantly decrease the time complexity of the dynamic routing algorithm.

Next, we design a capsule network for speech recognition, specifically, overlapping word recognition. We use both a capsule network and a conventional CNN to recognize 2 overlapping words in speech files created from 5 word classes. We show that the capsule network achieves a considerably higher recognition accuracy (96.92%) compared to the conventional CNN (85.19%). Our results show that the capsule network recognizes overlapping words by recognizing each individual word in the speech. We also verify the scalability of the capsule network by increasing the number of word classes from 5 to 10. The capsule network still shows a high recognition accuracy of 95.42% in the case of 10 words, while the accuracy of the conventional CNN decreases sharply to 73.18%.

Contributors: Xiong, Yan (Author) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Thesis advisor) / Weng, Yang (Committee member) / Arizona State University (Publisher)

Created: 2018
