Structured sparse learning and its applications to biomedical and biological data

Description

Sparsity has become an important modeling tool in areas such as genetics, signal and audio processing, and medical image processing. Through l1-norm-based regularization, structured sparse learning algorithms can produce highly accurate models while imposing various predefined structures on the data, such as feature groups or graphs. In this thesis, I first propose to solve a sparse learning model with a general group structure, where the predefined groups may overlap with each other. Then, I present three real-world applications that can benefit from the group-structured sparse learning technique. In the first application, I study the Alzheimer's Disease diagnosis problem using multi-modality neuroimaging data. In this dataset, not every subject has all data sources available, exhibiting a unique and challenging block-wise missing pattern. In the second application, I study the automatic annotation and retrieval of fruit-fly gene expression pattern images. Combined with the spatial information, sparse learning techniques can be used to construct effective representations of the expression images. In the third application, I present a new computational approach to annotate developmental stage for Drosophila embryos in the gene expression images. In addition, it provides a stage score that enables one to more finely annotate each embryo so that they are divided into early and late periods of development within standard stage demarcations. Stage scores help us to illuminate global gene activities and changes much better, and more refined stage annotations improve our ability to interpret results when expression pattern matches are discovered between genes.
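
As an illustration of how a group-structured sparsity penalty selects or discards whole feature groups, the following is a minimal sketch of group soft-thresholding, the proximal operator of the penalty lam * sum_g ||w_g||_2. It assumes the simpler non-overlapping case, not the overlapping-group model the thesis addresses, and the names group_soft_threshold, weights, groups, and lam are hypothetical choices made for this example.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Shrink each group of weights toward zero as a block.

    Assumes the groups do not overlap (a simplification of the
    overlapping setting described in the abstract above).
    """
    out = list(weights)
    for group in groups:
        # Euclidean norm of the weights belonging to this group.
        norm = math.sqrt(sum(weights[i] ** 2 for i in group))
        # Groups with norm <= lam are zeroed out entirely;
        # the remaining groups are scaled down by a common factor.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in group:
            out[i] = scale * weights[i]
    return out


# Hypothetical data: one strong group and one weak group.
w = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(w, groups, lam=1.0)
```

In this example the first group (norm 5) is only shrunk, while the second group (norm below lam) is removed as a whole, which is the behavior that makes the penalty useful for selecting predefined feature groups.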

Contributors: Yuan, Lei (Author) / Ye, Jieping (Thesis advisor) / Wang, Yalin (Committee member) / Xue, Guoliang (Committee member) / Kumar, Sudhir (Committee member) / Arizona State University (Publisher)

Created: 2013

Alternative methods via random forest to identify interactions in a general framework and variable importance in the context of value-added models

Description

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAM teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constitutive trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model was misspecified. The second study develops two novel interaction measures. These measures could be used within but are not restricted to the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.

Contributors: Valdivia, Arturo (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Broatch, Jennifer (Committee member) / Arizona State University (Publisher)

Created: 2013

Can a vegetarian diet affect resting metabolic rate or satiety: a pilot study utilizing a metabolic cart and the SenseWear armband

Description

Dietary protein is known to increase postprandial thermogenesis more so than carbohydrates or fats, probably related to the fact that amino acids have no immediate form of storage in the body and can become toxic if not readily incorporated into body tissues or excreted. It is also well documented that subjects report greater satiety on high- versus low-protein diets and that subject compliance tends to be greater on high-protein diets, thus contributing to their popularity. What is not as well known is how a high-protein diet affects resting metabolic rate over time, and what is even less well known is if resting metabolic rate changes significantly when a person consuming an omnivorous diet suddenly adopts a vegetarian one. This pilot study sought to determine whether subjects adopting a vegetarian diet would report decreased satiety or demonstrate a decreased metabolic rate due to a change in protein intake and possible increase in carbohydrates. Further, this study sought to validate a new device called the SenseWear Armband (SWA) to determine if it might be sensitive enough to detect subtle changes in metabolic rate related to diet. Subjects were tested twice on all variables, at baseline and post-test. Independent and related samples tests revealed no significant differences between or within groups for any variable at any time point in the study. The SWA had a strong positive correlation to the Oxycon Mobile metabolic cart but due to a lack of change in metabolic rate, its sensitivity was undetermined. These data do not support the theory that adopting a vegetarian diet results in a long-term change in metabolic rate.

Contributors: Moore, Amy (Author) / Johnston, Carol (Thesis advisor) / Appel, Christy (Thesis advisor) / Gaesser, Glenn (Committee member) / Arizona State University (Publisher)

Created: 2012

Sparse methods in image understanding and computer vision

Description

Image understanding has been playing an increasingly crucial role in vision applications. Sparse models form an important component in image understanding, since the statistics of natural images reveal the presence of sparse structure. Sparse methods lead to parsimonious models, in addition to being efficient for large scale learning. In sparse modeling, data is represented as a sparse linear combination of atoms from a "dictionary" matrix. This dissertation focuses on understanding different aspects of sparse learning, thereby enhancing the use of sparse methods by incorporating tools from machine learning. With the growing need to adapt models for large scale data, it is important to design dictionaries that can model the entire data space and not just the samples considered. By exploiting the relation of dictionary learning to 1-D subspace clustering, a multilevel dictionary learning algorithm is developed, and it is shown to outperform conventional sparse models in compressed recovery and image denoising. Theoretical aspects of learning such as algorithmic stability and generalization are considered, and ensemble learning is incorporated for effective large scale learning. In addition to building strategies for efficiently implementing 1-D subspace clustering, a discriminative clustering approach is designed to estimate the unknown mixing process in blind source separation. By exploiting the non-linear relation between the image descriptors, and allowing the use of multiple features, sparse methods can be made more effective in recognition problems. The idea of multiple kernel sparse representations is developed, and algorithms for learning dictionaries in the feature space are presented. Using object recognition experiments on standard datasets, it is shown that the proposed approaches outperform other sparse coding-based recognition frameworks. Furthermore, a segmentation technique based on multiple kernel sparse representations is developed and successfully applied to automated brain tumor identification. Using sparse codes to define the relation between data samples can lead to a more robust graph embedding for unsupervised clustering. By performing discriminative embedding using sparse coding-based graphs, an algorithm for measuring the glomerular number in kidney MRI images is developed. Finally, approaches to build dictionaries for local sparse coding of image descriptors are presented, and applied to object recognition and image retrieval.
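
To make concrete the statement that data is represented as a sparse linear combination of dictionary atoms, here is a minimal sketch using plain matching pursuit, a standard greedy sparse coding method. It is not the multilevel dictionary learning or kernel method developed in the dissertation. The function name matching_pursuit, the toy dictionary, and the assumption that every atom has unit norm are all choices made for this illustration.

```python
def matching_pursuit(signal, atoms, n_nonzero):
    """Greedy sparse coding of a signal over unit-norm atoms.

    Returns the coefficient list (one entry per atom) and the residual.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # Correlation of every atom with the current residual.
        corrs = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        # Pick the atom that explains the most residual energy.
        best = max(range(len(atoms)), key=lambda k: abs(corrs[k]))
        coeffs[best] += corrs[best]
        # Remove that atom's contribution from the residual.
        residual = [r - corrs[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual


# Hypothetical dictionary: three axis-aligned atoms plus one oblique atom.
atoms = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
]
# Signal built as 2.0 * atoms[3] + 0.5 * atoms[2].
signal = [1.2, 1.6, 0.5]
coeffs, residual = matching_pursuit(signal, atoms, n_nonzero=2)
```

With two nonzero coefficients allowed, the sketch recovers the two atoms used to build the signal, whereas a representation over the axis-aligned atoms alone would need three.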

Contributors: Jayaraman Thiagarajan, Jayaraman (Author) / Spanias, Andreas (Thesis advisor) / Frakes, David (Committee member) / Tepedelenlioğlu, Cihan (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created: 2013

Learning from asymmetric models and matched pairs

Description

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus, knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also studied variable selection of matched data sets and proposed a solution when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.

Contributors: Koh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)

Created: 2013

New directions in sparse models for image analysis and restoration

Description

Effective modeling of high dimensional data is crucial in information processing and machine learning. Classical subspace methods have been very effective in such applications. However, over the past few decades, there has been considerable research towards the development of new modeling paradigms that go beyond subspace methods. This dissertation focuses on the study of sparse models and their interplay with modern machine learning techniques such as manifold, ensemble and graph-based methods, along with their applications in image analysis and recovery. By considering graph relations between data samples while learning sparse models, graph-embedded codes can be obtained for use in unsupervised, supervised and semi-supervised problems. Using experiments on standard datasets, it is demonstrated that the codes obtained from the proposed methods outperform several baseline algorithms. In order to facilitate sparse learning with large scale data, the paradigm of ensemble sparse coding is proposed, and different strategies for constructing weak base models are developed. Experiments with image recovery and clustering demonstrate that these ensemble models perform better when compared to conventional sparse coding frameworks. When examples from the data manifold are available, manifold constraints can be incorporated with sparse models and two approaches are proposed to combine sparse coding with manifold projection. The improved performance of the proposed techniques in comparison to sparse coding approaches is demonstrated using several image recovery experiments. In addition to these approaches, it might be required in some applications to combine multiple sparse models with different regularizations. In particular, combining an unconstrained sparse model with non-negative sparse coding is important in image analysis, and it poses several algorithmic and theoretical challenges. A convex algorithm and an efficient greedy algorithm for recovering combined representations are proposed. Theoretical guarantees on sparsity thresholds for exact recovery using these algorithms are derived and recovery performance is also demonstrated using simulations on synthetic data. Finally, the problem of non-linear compressive sensing, where the measurement process is carried out in feature space obtained using non-linear transformations, is considered. An optimized non-linear measurement system is proposed, and improvements in recovery performance are demonstrated in comparison to using random measurements as well as optimized linear measurements.

Contributors: Natesan Ramamurthy, Karthikeyan (Author) / Spanias, Andreas (Thesis advisor) / Tsakalis, Konstantinos (Committee member) / Karam, Lina (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created: 2013

The evolution of addiction: a case study of nicotine dependence

Description

A variety of studies have shown that the tendency toward nicotine dependence has a genetic component. The work described in this thesis addresses three separate questions: i) are there unidentified SNPs in the nicotinic receptors or other genes that contribute to the risk for nicotine dependence; ii) is there evidence of ongoing selection at nicotinic receptor loci; and iii) since nicotine dependence is unlikely to be the phenotype undergoing selection, is a positive effect on memory or cognition the selected phenotype. I first undertook a genome-wide association scan of imputed data using samples from the Collaborative Study of the Genetics of Nicotine Dependence (COGEND). A novel association was found between nicotine dependence and SNPs at 13q31. The genes at this newly associated locus on chromosome 13 encode a group of micro-RNAs and a member of the glypican gene family. These are among the first findings to implicate a non-candidate gene in risk for nicotine dependence. I applied several complementary methods to sequence data from the 1000 Genomes Project to test for evidence of selection at the nicotinic receptor loci. I found strong evidence for selection for alleles in the nicotinic receptor cluster on chromosome 8 that confer risk of nicotine dependence. I then used the dataset from the Collaborative Studies on the Genetics of Alcoholism (COGA) and looked for an association between neuropsychological phenotypes and SNPs conferring risk of nicotine dependence. One SNP passed multiple test correction for association with WAIS digit symbol score. This SNP is not itself associated with nicotine dependence but is in reasonable (r² = 0.75) LD with SNPs that are associated with nicotine dependence. These data suggest, at best, a weak correlation between nicotine dependence and any of the tested cognitive phenotypes. Given the reproducible finding of an inverse relationship between SNPs associated with risk for nicotine dependence and cocaine dependence, I hypothesize that the apparently detrimental phenotype of nicotine dependence may confer decreased risk for cocaine dependence. As cocaine use impairs the positive rewards associated with social interactions, reducing the risk of cocaine addiction may be beneficial to both the individual and the group.

Contributors: Sadler, Brooke (Author) / Hurtado, Ana Magdalena (Thesis advisor) / Goate, Alison (Thesis advisor) / Hill, Kim (Committee member) / Nagoshi, Craig (Committee member) / Arizona State University (Publisher)

Created: 2014

Informatics approach to improving surgical skills training

Description

Surgery as a profession requires significant training to improve both clinical decision making and psychomotor proficiency. In the medical knowledge domain, tools have been developed, validated, and accepted for evaluation of surgeons' competencies. However, assessment of the psychomotor skills still relies on the Halstedian model of apprenticeship, wherein surgeons are observed during residency for judgment of their skills. Although the value of this method of skills assessment cannot be ignored, novel methodologies of objective skills assessment need to be designed, developed, and evaluated that augment the traditional approach. Several sensor-based systems have been developed to measure a user's skill quantitatively, but use of sensors could interfere with skill execution and thus limit the potential for evaluating real-life surgery. However, having a method to judge skills automatically in real-life conditions should be the ultimate goal, since only with such features would a system be widely adopted. This research proposes a novel video-based approach for observing surgeons' hand and surgical tool movements in minimally invasive surgical training exercises as well as during laparoscopic surgery. Because our system does not require surgeons to wear special sensors, it has the distinct advantage over alternatives of offering skills assessment in both learning and real-life environments. The system automatically detects major skill-measuring features from surgical task videos using a computing system composed of a series of computer vision algorithms and provides on-screen real-time performance feedback for more efficient skill learning. Finally, the machine-learning approach is used to develop an observer-independent composite scoring model through objective and quantitative measurement of surgical skills. To increase effectiveness and usability of the developed system, it is integrated with a cloud-based tool, which automatically assesses surgical videos uploaded to the cloud.

Contributors: Islam, Gazi (Author) / Li, Baoxin (Thesis advisor) / Liang, Jianming (Thesis advisor) / Dinu, Valentin (Committee member) / Greenes, Robert (Committee member) / Smith, Marshall (Committee member) / Kahol, Kanav (Committee member) / Patel, Vimla L. (Committee member) / Arizona State University (Publisher)

Created: 2013

Metabolic engineering for the biosynthesis of styrene and its derivatives

Description

Metabolic engineering is an extremely useful tool enabling the biosynthetic production of commodity chemicals (typically derived from petroleum) from renewable resources. In this work, a pathway for the biosynthesis of styrene (a plastics monomer) has been engineered in Escherichia coli from glucose by utilizing the pathway for the naturally occurring amino acid phenylalanine, the precursor to styrene. Styrene production was accomplished using an E. coli phenylalanine overproducer, E. coli NST74, and over-expression of PAL2 from Arabidopsis thaliana and FDC1 from Saccharomyces cerevisiae. The styrene pathway was then extended by just one enzyme to either (S)-styrene oxide (StyAB from Pseudomonas putida S12) or (R)-1,2-phenylethanediol (NahAaAbAcAd from Pseudomonas sp. NCIB 9816-4), which are both used in pharmaceutical production. Overall, these pathways suffered from limitations due to product toxicity as well as limited precursor availability. In an effort to overcome the toxicity threshold, the styrene pathway was transferred to a yeast host with a higher toxicity limit. First, Saccharomyces cerevisiae BY4741 was engineered to overproduce phenylalanine. Next, PAL2 (the only enzyme needed to complete the styrene pathway) was expressed in the BY4741 phenylalanine overproducer. Further strain improvements included the deletion of the phenylpyruvate decarboxylase (ARO10) and expression of a feedback-resistant chorismate mutase (ARO4K229L). These works have successfully demonstrated the possibility of utilizing microorganisms as cellular factories for the production of styrene, (S)-styrene oxide, and (R)-1,2-phenylethanediol.

Contributors: McKenna, Rebekah (Author) / Nielsen, David R (Thesis advisor) / Torres, Cesar (Committee member) / Caplan, Michael (Committee member) / Jarboe, Laura (Committee member) / Haynes, Karmella (Committee member) / Arizona State University (Publisher)

Created: 2014

Large-scale matrix completion using orthogonal rank-one matrix pursuit, divide-factor-combine, and apache spark

Description

As the size and scope of valuable datasets have exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms which are capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools to address the shortcomings of the previous generation. One important example of this is the improved performance in the newer tools with the large class of machine learning algorithms which are highly iterative in nature. In this thesis project, I set about to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance processing the Netflix Prize dataset. I begin by describing several approaches which I attempted, but which did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach which uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogonal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes, with various degrees of parallelism. For these experiments, I utilized the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance. In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm.
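
To give a feel for why rank-one pursuit is highly iterative, the following is a rough single-machine sketch of a simplified greedy rank-one pursuit for matrix completion. It is not the thesis implementation: it omits the DFC parallelization and Spark entirely, and it fits only the weight of the newest rank-one basis at each step, whereas OR1MP, as I understand it, refits the weights of all bases by least squares. The names rank_one_pursuit and top_singular_pair and the toy matrix are hypothetical.

```python
import math


def top_singular_pair(R, n_iter=50):
    """Approximate the leading singular vectors of R by power iteration."""
    m, n = len(R), len(R[0])
    v = [1.0 / math.sqrt(n)] * n
    u = [0.0] * m
    for _ in range(n_iter):
        u = [sum(R[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        v = [sum(R[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v == 0.0:
            break
        v = [x / norm_v for x in v]
    return u, v


def rank_one_pursuit(observed, shape, n_bases=5):
    """Greedily add rank-one matrices that best fit the observed entries.

    observed maps (row, col) to a known value; other entries are unknown.
    """
    m, n = shape
    X = [[0.0] * n for _ in range(m)]
    for _ in range(n_bases):
        # Residual on observed entries only (zero elsewhere).
        R = [[0.0] * n for _ in range(m)]
        for (i, j), value in observed.items():
            R[i][j] = value - X[i][j]
        u, v = top_singular_pair(R)
        # Least-squares weight for the new basis u v^T on observed entries.
        num = sum(R[i][j] * u[i] * v[j] for (i, j) in observed)
        den = sum((u[i] * v[j]) ** 2 for (i, j) in observed)
        if den == 0.0:
            break
        theta = num / den
        for i in range(m):
            for j in range(n):
                X[i][j] += theta * u[i] * v[j]
    return X


# Hypothetical rank-one matrix [[1, 2, 3], [2, 4, 6]] with entry (1, 2) hidden.
observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
completed = rank_one_pursuit(observed, shape=(2, 3), n_bases=3)
fit_error = sum((completed[i][j] - value) ** 2 for (i, j), value in observed.items())
```

Each pass needs the leading singular pair of the current residual, which is itself an iterative computation over all observed entries. That nested iteration is the kind of workload the abstract describes as poorly suited to the earlier MapReduce-based tools.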

Contributors: Krouse, Brian (Author) / Ye, Jieping (Thesis advisor) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created: 2014