Matching Items (83)
Filtering by

Clear all filters

152514-Thumbnail Image.png
Description
As the size and scope of valuable datasets has exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms

As the size and scope of valuable datasets has exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms which are capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools to address the shortcomings of the previous generation. One important example of this is the improved performance in the newer tools with the large class of machine learning algorithms which are highly iterative in nature. In this thesis project, I set about to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance processing the Netflix Prize dataset. I begin by describing several approaches which I attempted, but which did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach which uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogoal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes, with various degrees of parallelism. For these experiments, I utilized the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance. In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm.
ContributorsKrouse, Brian (Author) / Ye, Jieping (Thesis advisor) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created2014
152909-Thumbnail Image.png
Description
This thesis is an initial test of the hypothesis that superficial measures suffice for measuring collaboration among pairs of students solving complex math problems, where the degree of collaboration is categorized at a high level. Data were collected

in the form of logs from students' tablets and the vocal interaction

This thesis is an initial test of the hypothesis that superficial measures suffice for measuring collaboration among pairs of students solving complex math problems, where the degree of collaboration is categorized at a high level. Data were collected

in the form of logs from students' tablets and the vocal interaction between pairs of students. Thousands of different features were defined, and then extracted computationally from the audio and log data. Human coders used richer data (several video streams) and a thorough understand of the tasks to code episodes as

collaborative, cooperative or asymmetric contribution. Machine learning was used to induce a detector, based on random forests, that outputs one of these three codes for an episode given only a characterization of the episode in terms of superficial features. An overall accuracy of 92.00% (kappa = 0.82) was obtained when

comparing the detector's codes to the humans' codes. However, due irregularities in running the study (e.g., the tablet software kept crashing), these results should be viewed as preliminary.
ContributorsViswanathan, Sree Aurovindh (Author) / VanLehn, Kurt (Thesis advisor) / T.H CHI, Michelene (Committee member) / Walker, Erin (Committee member) / Arizona State University (Publisher)
Created2014
152976-Thumbnail Image.png
Description
Research in the learning sciences suggests that students learn better by collaborating with their peers than learning individually. Students working together as a group tend to generate new ideas more frequently and exhibit a higher level of reasoning. In this internet age with the advent of massive open online courses

Research in the learning sciences suggests that students learn better by collaborating with their peers than learning individually. Students working together as a group tend to generate new ideas more frequently and exhibit a higher level of reasoning. In this internet age with the advent of massive open online courses (MOOCs), students across the world are able to access and learn material remotely. This creates a need for tools that support distant or remote collaboration. In order to build such tools we need to understand the basic elements of remote collaboration and how it differs from traditional face-to-face collaboration.

The main goal of this thesis is to explore how spoken dialogue varies in face-to-face and remote collaborative learning settings. Speech data is collected from student participants solving mathematical problems collaboratively on a tablet. Spoken dialogue is analyzed based on conversational and acoustic features in both the settings. Looking for collaborative differences of transactivity and dialogue initiative, both settings are compared in detail using machine learning classification techniques based on acoustic and prosodic features of speech. Transactivity is defined as a joint construction of knowledge by peers. The main contributions of this thesis are: a speech corpus to analyze spoken dialogue in face-to-face and remote settings and an empirical analysis of conversation, collaboration, and speech prosody in both the settings. The results from the experiments show that amount of overlap is lower in remote dialogue than in the face-to-face setting. There is a significant difference in transactivity among strangers. My research benefits the computer-supported collaborative learning community by providing an analysis that can be used to build more efficient tools for supporting remote collaborative learning.
ContributorsNelakurthi, Arun Reddy (Author) / Pon-Barry, Heather (Thesis advisor) / VanLehn, Kurt (Committee member) / Walker, Erin (Committee member) / Arizona State University (Publisher)
Created2014
153394-Thumbnail Image.png
Description
As a promising solution to the problem of acquiring and storing large amounts of image and video data, spatial-multiplexing camera architectures have received lot of attention in the recent past. Such architectures have the attractive feature of combining a two-step process of acquisition and compression of pixel measurements in a

As a promising solution to the problem of acquiring and storing large amounts of image and video data, spatial-multiplexing camera architectures have received lot of attention in the recent past. Such architectures have the attractive feature of combining a two-step process of acquisition and compression of pixel measurements in a conventional camera, into a single step. A popular variant is the single-pixel camera that obtains measurements of the scene using a pseudo-random measurement matrix. Advances in compressive sensing (CS) theory in the past decade have supplied the tools that, in theory, allow near-perfect reconstruction of an image from these measurements even for sub-Nyquist sampling rates. However, current state-of-the-art reconstruction algorithms suffer from two drawbacks -- They are (1) computationally very expensive and (2) incapable of yielding high fidelity reconstructions for high compression ratios. In computer vision, the final goal is usually to perform an inference task using the images acquired and not signal recovery. With this motivation, this thesis considers the possibility of inference directly from compressed measurements, thereby obviating the need to use expensive reconstruction algorithms. It is often the case that non-linear features are used for inference tasks in computer vision. However, currently, it is unclear how to extract such features from compressed measurements. Instead, using the theoretical basis provided by the Johnson-Lindenstrauss lemma, discriminative features using smashed correlation filters are derived and it is shown that it is indeed possible to perform reconstruction-free inference at high compression ratios with only a marginal loss in accuracy. As a specific inference problem in computer vision, face recognition is considered, mainly beyond the visible spectrum such as in the short wave infra-red region (SWIR), where sensors are expensive.
ContributorsLohit, Suhas Anand (Author) / Turaga, Pavan (Thesis advisor) / Spanias, Andreas (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created2015
150086-Thumbnail Image.png
Description
Detecting anatomical structures, such as the carina, the pulmonary trunk and the aortic arch, is an important step in designing a CAD system of detection Pulmonary Embolism. The presented CAD system gets rid of the high-level prior defined knowledge to become a system which can easily extend to detect other

Detecting anatomical structures, such as the carina, the pulmonary trunk and the aortic arch, is an important step in designing a CAD system of detection Pulmonary Embolism. The presented CAD system gets rid of the high-level prior defined knowledge to become a system which can easily extend to detect other anatomic structures. The system is based on a machine learning algorithm --- AdaBoost and a general feature --- Haar. This study emphasizes on off-line and on-line AdaBoost learning. And in on-line AdaBoost, the thesis further deals with extremely imbalanced condition. The thesis first reviews several knowledge-based detection methods, which are relied on human being's understanding of the relationship between anatomic structures. Then the thesis introduces a classic off-line AdaBoost learning. The thesis applies different cascading scheme, namely multi-exit cascading scheme. The comparison between the two methods will be provided and discussed. Both of the off-line AdaBoost methods have problems in memory usage and time consuming. Off-line AdaBoost methods need to store all the training samples and the dataset need to be set before training. The dataset cannot be enlarged dynamically. Different training dataset requires retraining the whole process. The retraining is very time consuming and even not realistic. To deal with the shortcomings of off-line learning, the study exploited on-line AdaBoost learning approach. The thesis proposed a novel pool based on-line method with Kalman filters and histogram to better represent the distribution of the samples' weight. Analysis of the performance, the stability and the computational complexity will be provided in the thesis. Furthermore, the original on-line AdaBoost performs badly in imbalanced conditions, which occur frequently in medical image processing. In image dataset, positive samples are limited and negative samples are countless. A novel Self-Adaptive Asymmetric On-line Boosting method is presented. The method utilized a new asymmetric loss criterion with self-adaptability according to the ratio of exposed positive and negative samples and it has an advanced rule to update sample's importance weight taking account of both classification result and sample's label. Compared to traditional on-line AdaBoost Learning method, the new method can achieve far more accuracy in imbalanced conditions.
ContributorsWu, Hong (Author) / Liang, Jianming (Thesis advisor) / Farin, Gerald (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)
Created2011
150181-Thumbnail Image.png
Description
Real-world environments are characterized by non-stationary and continuously evolving data. Learning a classification model on this data would require a framework that is able to adapt itself to newer circumstances. Under such circumstances, transfer learning has come to be a dependable methodology for improving classification performance with reduced training costs

Real-world environments are characterized by non-stationary and continuously evolving data. Learning a classification model on this data would require a framework that is able to adapt itself to newer circumstances. Under such circumstances, transfer learning has come to be a dependable methodology for improving classification performance with reduced training costs and without the need for explicit relearning from scratch. In this thesis, a novel instance transfer technique that adapts a "Cost-sensitive" variation of AdaBoost is presented. The method capitalizes on the theoretical and functional properties of AdaBoost to selectively reuse outdated training instances obtained from a "source" domain to effectively classify unseen instances occurring in a different, but related "target" domain. The algorithm is evaluated on real-world classification problems namely accelerometer based 3D gesture recognition, smart home activity recognition and text categorization. The performance on these datasets is analyzed and evaluated against popular boosting-based instance transfer techniques. In addition, supporting empirical studies, that investigate some of the less explored bottlenecks of boosting based instance transfer methods, are presented, to understand the suitability and effectiveness of this form of knowledge transfer.
ContributorsVenkatesan, Ashok (Author) / Panchanathan, Sethuraman (Thesis advisor) / Li, Baoxin (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)
Created2011
150190-Thumbnail Image.png
Description
Sparse learning is a technique in machine learning for feature selection and dimensionality reduction, to find a sparse set of the most relevant features. In any machine learning problem, there is a considerable amount of irrelevant information, and separating relevant information from the irrelevant information has been a topic of

Sparse learning is a technique in machine learning for feature selection and dimensionality reduction, to find a sparse set of the most relevant features. In any machine learning problem, there is a considerable amount of irrelevant information, and separating relevant information from the irrelevant information has been a topic of focus. In supervised learning like regression, the data consists of many features and only a subset of the features may be responsible for the result. Also, the features might require special structural requirements, which introduces additional complexity for feature selection. The sparse learning package, provides a set of algorithms for learning a sparse set of the most relevant features for both regression and classification problems. Structural dependencies among features which introduce additional requirements are also provided as part of the package. The features may be grouped together, and there may exist hierarchies and over- lapping groups among these, and there may be requirements for selecting the most relevant groups among them. In spite of getting sparse solutions, the solutions are not guaranteed to be robust. For the selection to be robust, there are certain techniques which provide theoretical justification of why certain features are selected. The stability selection, is a method for feature selection which allows the use of existing sparse learning methods to select the stable set of features for a given training sample. This is done by assigning probabilities for the features: by sub-sampling the training data and using a specific sparse learning technique to learn the relevant features, and repeating this a large number of times, and counting the probability as the number of times a feature is selected. Cross-validation which is used to determine the best parameter value over a range of values, further allows to select the best parameter value. This is done by selecting the parameter value which gives the maximum accuracy score. With such a combination of algorithms, with good convergence guarantees, stable feature selection properties and the inclusion of various structural dependencies among features, the sparse learning package will be a powerful tool for machine learning research. Modular structure, C implementation, ATLAS integration for fast linear algebraic subroutines, make it one of the best tool for a large sparse setting. The varied collection of algorithms, support for group sparsity, batch algorithms, are a few of the notable functionality of the SLEP package, and these features can be used in a variety of fields to infer relevant elements. The Alzheimer Disease(AD) is a neurodegenerative disease, which gradually leads to dementia. The SLEP package is used for feature selection for getting the most relevant biomarkers from the available AD dataset, and the results show that, indeed, only a subset of the features are required to gain valuable insights.
ContributorsThulasiram, Ramesh (Author) / Ye, Jieping (Thesis advisor) / Xue, Guoliang (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)
Created2011
151154-Thumbnail Image.png
Description
Alzheimer's Disease (AD) is the most common form of dementia observed in elderly patients and has significant social-economic impact. There are many initiatives which aim to capture leading causes of AD. Several genetic, imaging, and biochemical markers are being explored to monitor progression of AD and explore treatment and detection

Alzheimer's Disease (AD) is the most common form of dementia observed in elderly patients and has significant social-economic impact. There are many initiatives which aim to capture leading causes of AD. Several genetic, imaging, and biochemical markers are being explored to monitor progression of AD and explore treatment and detection options. The primary focus of this thesis is to identify key biomarkers to understand the pathogenesis and prognosis of Alzheimer's Disease. Feature selection is the process of finding a subset of relevant features to develop efficient and robust learning models. It is an active research topic in diverse areas such as computer vision, bioinformatics, information retrieval, chemical informatics, and computational finance. In this work, state of the art feature selection algorithms, such as Student's t-test, Relief-F, Information Gain, Gini Index, Chi-Square, Fisher Kernel Score, Kruskal-Wallis, Minimum Redundancy Maximum Relevance, and Sparse Logistic regression with Stability Selection have been extensively exploited to identify informative features for AD using data from Alzheimer's Disease Neuroimaging Initiative (ADNI). An integrative approach which uses blood plasma protein, Magnetic Resonance Imaging, and psychometric assessment scores biomarkers has been explored. This work also analyzes the techniques to handle unbalanced data and evaluate the efficacy of sampling techniques. Performance of feature selection algorithm is evaluated using the relevance of derived features and the predictive power of the algorithm using Random Forest and Support Vector Machine classifiers. Performance metrics such as Accuracy, Sensitivity and Specificity, and area under the Receiver Operating Characteristic curve (AUC) have been used for evaluation. The feature selection algorithms best suited to analyze AD proteomics data have been proposed. The key biomarkers distinguishing healthy and AD patients, Mild Cognitive Impairment (MCI) converters and non-converters, and healthy and MCI patients have been identified.
ContributorsDubey, Rashmi (Author) / Ye, Jieping (Thesis advisor) / Wang, Yalin (Committee member) / Wu, Tong (Committee member) / Arizona State University (Publisher)
Created2012
153988-Thumbnail Image.png
Description
With the advent of Internet, the data being added online is increasing at enormous rate. Though search engines are using IR techniques to facilitate the search requests from users, the results are not effective towards the search query of the user. The search engine user has to go through certain

With the advent of Internet, the data being added online is increasing at enormous rate. Though search engines are using IR techniques to facilitate the search requests from users, the results are not effective towards the search query of the user. The search engine user has to go through certain webpages before getting at the webpage he/she wanted. This problem of Information Overload can be solved using Automatic Text Summarization. Summarization is a process of obtaining at abridged version of documents so that user can have a quick view to understand what exactly the document is about. Email threads from W3C are used in this system. Apart from common IR features like Term Frequency, Inverse Document Frequency, Term Rank, a variation of page rank based on graph model, which can cluster the words with respective to word ambiguity, is implemented. Term Rank also considers the possibility of co-occurrence of words with the corpus and evaluates the rank of the word accordingly. Sentences of email threads are ranked as per features and summaries are generated. System implemented the concept of pyramid evaluation in content selection. The system can be considered as a framework for Unsupervised Learning in text summarization.
ContributorsNadella, Sravan (Author) / Davulcu, Hasan (Thesis advisor) / Li, Baoxin (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)
Created2015
153889-Thumbnail Image.png
Description
Robust and stable decoding of neural signals is imperative for implementing a useful neuroprosthesis capable of carrying out dexterous tasks. A nonhuman primate (NHP) was trained to perform combined flexions of the thumb, index and middle fingers in addition to individual flexions and extensions of the same digits. An array

Robust and stable decoding of neural signals is imperative for implementing a useful neuroprosthesis capable of carrying out dexterous tasks. A nonhuman primate (NHP) was trained to perform combined flexions of the thumb, index and middle fingers in addition to individual flexions and extensions of the same digits. An array of microelectrodes was implanted in the hand area of the motor cortex of the NHP and used to record action potentials during finger movements. A Support Vector Machine (SVM) was used to classify which finger movement the NHP was making based upon action potential firing rates. The effect of four feature selection techniques, Wilcoxon signed-rank test, Relative Importance, Principal Component Analysis, and Mutual Information Maximization was compared based on SVM classification performance. SVM classification was used to examine the functional parameters of (i) efficacy (ii) endurance to simulated failure and (iii) longevity of classification. The effect of using isolated-neuron and multi-unit firing rates was compared as the feature vector supplied to the SVM. The best classification performance was on post-implantation day 36, when using multi-unit firing rates the worst classification accuracy resulted from features selected with Wilcoxon signed-rank test (51.12 ± 0.65%) and the best classification accuracy resulted from Mutual Information Maximization (93.74 ± 0.32%). On this day when using single-unit firing rates, the classification accuracy from the Wilcoxon signed-rank test was 88.85 ± 0.61 % and Mutual Information Maximization was 95.60 ± 0.52% (degrees of freedom =10, level of chance =10%)
ContributorsPadmanaban, Subash (Author) / Greger, Bradley (Thesis advisor) / Santello, Marco (Thesis advisor) / Helms Tillery, Stephen (Committee member) / Arizona State University (Publisher)
Created2015