Structured sparse learning and its applications to biomedical and biological data

Description

Sparsity has become an important modeling tool in areas such as genetics, signal and audio processing, and medical image processing. Through l-1 norm based regularization, structured sparse learning algorithms can produce highly accurate models while imposing various predefined structures on the data, such as feature groups or graphs. In this thesis, I first propose to solve a sparse learning model with a general group structure, where the predefined groups may overlap with each other. Then, I present three real-world applications that can benefit from the group structured sparse learning technique. In the first application, I study the Alzheimer's Disease diagnosis problem using multi-modality neuroimaging data. In this dataset, not every subject has all data sources available, exhibiting a unique and challenging block-wise missing pattern. In the second application, I study the automatic annotation and retrieval of fruit-fly gene expression pattern images. Combined with spatial information, sparse learning techniques can be used to construct effective representations of the expression images. In the third application, I present a new computational approach to annotate the developmental stage of Drosophila embryos in gene expression images. In addition, it provides a stage score that enables one to annotate each embryo more finely, so that embryos are divided into early and late periods of development within standard stage demarcations. Stage scores help to illuminate global gene activities and changes much better, and more refined stage annotations improve our ability to interpret results when expression pattern matches are discovered between genes.
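
For orientation, a sparse learning model with overlapping groups is commonly written in the following form. This is a generic textbook-style formulation and an assumption about notation, not a formula quoted from the thesis: the loss \ell, the weights \lambda_1, \lambda_2, \omega_g, and the index sets \mathcal{G}_g are illustrative symbols.

```latex
\min_{w \in \mathbb{R}^{p}} \; \ell(w)
  \;+\; \lambda_1 \lVert w \rVert_1
  \;+\; \lambda_2 \sum_{g=1}^{G} \omega_g \, \lVert w_{\mathcal{G}_g} \rVert_2
```

Here w_{\mathcal{G}_g} denotes the coefficients of w indexed by the g-th predefined group, and the groups \mathcal{G}_1, ..., \mathcal{G}_G are allowed to share features. The l-1 term encourages sparsity in individual features, while the group term encourages whole groups to be selected or discarded together.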

Contributors: Yuan, Lei (Author) / Ye, Jieping (Thesis advisor) / Wang, Yalin (Committee member) / Xue, Guoliang (Committee member) / Kumar, Sudhir (Committee member) / Arizona State University (Publisher)

Created: 2013

Alternative methods via random forest to identify interactions in a general framework and variable importance in the context of value-added models

Description

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAM teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constitutive trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model was misspecified. The second study develops two novel interaction measures. These measures could be used within, but are not restricted to, the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.

Contributors: Valdivia, Arturo (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Broatch, Jennifer (Committee member) / Arizona State University (Publisher)

Created: 2013

Niche differentiation of ammonia-oxidizing microbial communities in arid land soils

Description

Human activity has increased loading of reactive nitrogen (N) in the environment, with important and often deleterious impacts on biodiversity, climate, and human health. Since the fate of N in the ecosystem is mainly controlled by microorganisms, understanding the factors that shape microbial communities becomes relevant and urgent. In arid land soils, these microbial communities and factors are not well understood. I aimed to study the role of N-cycling microbes, such as the ammonia-oxidizing bacteria (AOB), the recently discovered ammonia-oxidizing archaea (AOA), and various fungal groups, in soils of arid lands. I also tested whether niche differentiation among microbial populations is a driver of differential biogeochemical outcomes. I found that N-cycling microbial communities in arid lands are structured by environmental factors to a stronger degree than is generally observed in mesic systems. For example, in biological soil crusts, temperature selected for AOA in warmer deserts and for AOB in colder deserts. Land-use change also affects niche differentiation, with fungi being the major agents of N2O production in natural arid lands, whereas emissions could be attributed to bacteria in mesic urban lawns. By contrast, NO3- production in the native desert and managed soils was mainly controlled by autotrophic microbes (i.e., AOB and AOA) rather than by heterotrophic fungi. I could also determine that AOA surprisingly responded positively to inorganic N availability in both short-term (one month) and long-term (seven years) experimental manipulations in an arid land soil, while environmental N enrichment in other ecosystem types is known to favor AOB over AOA. This work improves our predictions of ecosystem response to anthropogenic N increase and shows that paradigms derived from mesic systems are not always applicable to arid lands. My dissertation also highlights the unique ecology of ammonia oxidizers and draws attention to the importance of N cycling in desert soils.

Contributors: Marusenko, Yevgeniy (Author) / Hall, Sharon J (Thesis advisor) / Garcia-Pichel, Ferran (Thesis advisor) / Mclain, Jean E (Committee member) / Schwartz, Egbert (Committee member) / Arizona State University (Publisher)

Created: 2013

The ancient agroecology of Perry Mesa: integrating runoff, nutrients, and climate

Description

Understanding agricultural land use requires the integration of natural factors, such as climate and nutrients, as well as human factors, such as agricultural intensification. Employing an agroecological framework, I use the Perry Mesa landscape, located in central Arizona, as a case study to explore the intersection of these factors and to investigate prehistoric agriculture from A.D. 1275-1450. Ancient Perry Mesa farmers used a runoff agricultural strategy and constructed extensive alignments, or terraces, on gentle hillslopes to slow and capture nutrient-rich surface runoff generated from intense rainfall. I investigate how the construction of agricultural terraces altered key parameters (water and nutrients) necessary for successful agriculture in this arid region. Building upon past work focused on agricultural terraces in general, I gathered empirical data pertaining to nutrient renewal and water retention from one ancient runoff field. I developed a long-term model of maize growth and soil nutrient dynamics parameterized using nutrient analyses of runoff collected from the sample prehistoric field. This model resulted in an estimate of ideal field use and fallow periods for maintaining long-term soil fertility under different climatic regimes. The results of the model were integrated with estimates of prehistoric population distribution and geographical characterizations of the arable lands to evaluate the places and periods when sufficient arable land was available for the type of cropping and fallowing systems suggested by the model (given the known climatic trends and land use requirements). Results indicate that dry climatic periods not only put stress on crops through reduced precipitation but also reduce nutrient renewal, because fewer runoff events occur. This reduction lengthens estimated fallow cycles and probably would have increased the amount of land necessary to maintain sustainable agricultural production. While the overall Perry Mesa area was not limited in terms of arable land, this analysis demonstrates the likely presence of arable land pressures in the immediate vicinity of some communities. Anthropological understandings of agricultural land use combined with ecological tools for investigating nutrient dynamics provide a comprehensive understanding of ancient land use in arid regions.

Contributors: Kruse-Peeples, Melissa R (Author) / Spielmann, Katherine A. (Thesis advisor) / Abbott, David R. (Committee member) / Hall, Sharon J. (Committee member) / Kintigh, Keith W. (Committee member) / Arizona State University (Publisher)

Created: 2013

Sparse methods in image understanding and computer vision

Description

Image understanding has been playing an increasingly crucial role in vision applications. Sparse models form an important component in image understanding, since the statistics of natural images reveal the presence of sparse structure. Sparse methods lead to parsimonious models, in addition to being efficient for large scale learning. In sparse modeling, data is represented as a sparse linear combination of atoms from a "dictionary" matrix. This dissertation focuses on understanding different aspects of sparse learning, thereby enhancing the use of sparse methods by incorporating tools from machine learning. With the growing need to adapt models for large scale data, it is important to design dictionaries that can model the entire data space and not just the samples considered. By exploiting the relation of dictionary learning to 1-D subspace clustering, a multilevel dictionary learning algorithm is developed, and it is shown to outperform conventional sparse models in compressed recovery and image denoising. Theoretical aspects of learning such as algorithmic stability and generalization are considered, and ensemble learning is incorporated for effective large scale learning. In addition to building strategies for efficiently implementing 1-D subspace clustering, a discriminative clustering approach is designed to estimate the unknown mixing process in blind source separation. By exploiting the non-linear relation between the image descriptors, and allowing the use of multiple features, sparse methods can be made more effective in recognition problems. The idea of multiple kernel sparse representations is developed, and algorithms for learning dictionaries in the feature space are presented. Using object recognition experiments on standard datasets, it is shown that the proposed approaches outperform other sparse coding-based recognition frameworks. Furthermore, a segmentation technique based on multiple kernel sparse representations is developed and successfully applied to automated brain tumor identification. Using sparse codes to define the relation between data samples can lead to a more robust graph embedding for unsupervised clustering. By performing discriminative embedding using sparse coding-based graphs, an algorithm for measuring the glomerular number in kidney MRI images is developed. Finally, approaches to build dictionaries for local sparse coding of image descriptors are presented and applied to object recognition and image retrieval.
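
The statement that data is represented as a sparse linear combination of dictionary atoms can be illustrated with a small sketch. The code below uses orthogonal matching pursuit, a standard greedy sparse coding method chosen here for illustration; it is not taken from the dissertation, and the function name `omp`, the toy dictionary `D`, and the signal `x` are all hypothetical.

```python
import numpy as np


def omp(D, x, k):
    """Greedy sparse coding via orthogonal matching pursuit (OMP).

    D: (n_features, n_atoms) dictionary with unit-norm columns.
    x: (n_features,) signal to encode.
    k: maximum number of atoms to select.
    """
    residual = x.astype(float).copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = x - D @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    return coef


# Toy dictionary (hypothetical data): 4 unit-norm atoms in 3 dimensions.
D = np.array([
    [1.0, 0.0, 0.0, 0.6],
    [0.0, 1.0, 0.0, 0.8],
    [0.0, 0.0, 1.0, 0.0],
])
x = 2.0 * D[:, 0] - 3.0 * D[:, 2]  # signal built from atoms 0 and 2 only
code = omp(D, x, k=2)
```

In this toy case the returned code has nonzero entries only for the two atoms that generated the signal. Learning the dictionary itself, which is the focus of the abstract, is a separate step that this sketch does not cover.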

Contributors: Jayaraman Thiagarajan, Jayaraman (Author) / Spanias, Andreas (Thesis advisor) / Frakes, David (Committee member) / Tepedelenlioğlu, Cihan (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created: 2013

Learning from asymmetric models and matched pairs

Description

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus, knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also study variable selection for matched data sets and propose a solution for when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.

Contributors: Koh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)

Created: 2013

Impact of restoration practices on mycorrhizal inoculum potential in a semi-arid riparian ecosystem

Description

Mycorrhizal fungi form symbiotic relationships with plant roots, increasing nutrient and water availability to plants and improving soil stability. Mechanical disturbance of soil has been found to reduce mycorrhizal inoculum in soils, but findings have been inconsistent. To examine the impact of restoration practices on riparian mycorrhizal inoculum potential, soil samples were collected at the Tres Rios Ecosystem Restoration and Flood Control Project located at the confluence of the Salt, Gila, and Agua Fria rivers in central Arizona. The project involved the mechanical removal of invasive Tamarix spp. (tamarisk, salt cedar) and grading prior to revegetation. Soil samples were collected from three stages of restoration: pre-restoration areas, soil banks with chipped vegetation, and areas that had been graded in preparation for revegetation. Bioassay plants were grown in the soil samples and roots were analyzed for arbuscular mycorrhizal (AM) and ectomycorrhizal (EM) infection percentages. Vegetation measurements were also taken for woody vegetation at the site. The mean number of AM and EM fungal propagules did not differ among the three treatment areas, but inoculum levels did differ between AM and EM fungi, with AM fungal propagules detected at moderate levels and EM fungi at very low levels. These differences may have been related to availability of host plants, since AM fungi form associations with a variety of desert riparian forbs and grasses, whereas EM fungi only form associations with Populus spp. and Salix spp., which were present at the site but at low density and canopy cover. Prior studies have also found that EM fungi may be more affected by tamarisk invasions than AM fungi. Our results for AM fungi were similar to those of other restoration projects, suggesting that it may not be necessary to add AM fungi to soil prior to planting native vegetation, because of the moderate presence of AM fungi even in soils dominated by tamarisk and exposed to soil disturbance during the restoration process. In contrast, when planting trees that form EM associations, it may be beneficial to augment soil with EM fungi collected from riparian areas or to pre-inoculate plants prior to planting.

Contributors: Arnold, Susanne (Author) / Stutz, Jean (Thesis advisor) / Alford, Eddie (Committee member) / Green, Douglas (Committee member) / Arizona State University (Publisher)

Created: 2012

New directions in sparse models for image analysis and restoration

Description

Effective modeling of high dimensional data is crucial in information processing and machine learning. Classical subspace methods have been very effective in such applications. However, over the past few decades, there has been considerable research towards the development of new modeling paradigms that go beyond subspace methods. This dissertation focuses on the study of sparse models and their interplay with modern machine learning techniques such as manifold, ensemble and graph-based methods, along with their applications in image analysis and recovery. By considering graph relations between data samples while learning sparse models, graph-embedded codes can be obtained for use in unsupervised, supervised and semi-supervised problems. Using experiments on standard datasets, it is demonstrated that the codes obtained from the proposed methods outperform several baseline algorithms. In order to facilitate sparse learning with large scale data, the paradigm of ensemble sparse coding is proposed, and different strategies for constructing weak base models are developed. Experiments with image recovery and clustering demonstrate that these ensemble models perform better than conventional sparse coding frameworks. When examples from the data manifold are available, manifold constraints can be incorporated with sparse models, and two approaches are proposed to combine sparse coding with manifold projection. The improved performance of the proposed techniques in comparison to sparse coding approaches is demonstrated using several image recovery experiments. In addition to these approaches, it might be required in some applications to combine multiple sparse models with different regularizations. In particular, combining an unconstrained sparse model with non-negative sparse coding is important in image analysis, and it poses several algorithmic and theoretical challenges. A convex algorithm and an efficient greedy algorithm for recovering combined representations are proposed. Theoretical guarantees on sparsity thresholds for exact recovery using these algorithms are derived, and recovery performance is also demonstrated using simulations on synthetic data. Finally, the problem of non-linear compressive sensing, where the measurement process is carried out in a feature space obtained using non-linear transformations, is considered. An optimized non-linear measurement system is proposed, and improvements in recovery performance are demonstrated in comparison to using random measurements as well as optimized linear measurements.

Contributors: Natesan Ramamurthy, Karthikeyan (Author) / Spanias, Andreas (Thesis advisor) / Tsakalis, Konstantinos (Committee member) / Karam, Lina (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created: 2013

Informatics approach to improving surgical skills training

Description

Surgery as a profession requires significant training to improve both clinical decision making and psychomotor proficiency. In the medical knowledge domain, tools have been developed, validated, and accepted for evaluation of surgeons' competencies. However, assessment of psychomotor skills still relies on the Halstedian model of apprenticeship, wherein surgeons are observed during residency for judgment of their skills. Although the value of this method of skills assessment cannot be ignored, novel methodologies of objective skills assessment that augment the traditional approach need to be designed, developed, and evaluated. Several sensor-based systems have been developed to measure a user's skill quantitatively, but the use of sensors could interfere with skill execution and thus limit the potential for evaluating real-life surgery. However, having a method to judge skills automatically in real-life conditions should be the ultimate goal, since only with such features would a system be widely adopted. This research proposes a novel video-based approach for observing surgeons' hand and surgical tool movements in minimally invasive surgical training exercises as well as during laparoscopic surgery. Because our system does not require surgeons to wear special sensors, it has the distinct advantage over alternatives of offering skills assessment in both learning and real-life environments. The system automatically detects major skill-measuring features from surgical task videos using a computing system composed of a series of computer vision algorithms and provides on-screen real-time performance feedback for more efficient skill learning. Finally, a machine-learning approach is used to develop an observer-independent composite scoring model through objective and quantitative measurement of surgical skills. To increase the effectiveness and usability of the developed system, it is integrated with a cloud-based tool, which automatically assesses surgical videos uploaded to the cloud.

Contributors: Islam, Gazi (Author) / Li, Baoxin (Thesis advisor) / Liang, Jianming (Thesis advisor) / Dinu, Valentin (Committee member) / Greenes, Robert (Committee member) / Smith, Marshall (Committee member) / Kahol, Kanav (Committee member) / Patel, Vimla L. (Committee member) / Arizona State University (Publisher)

Created: 2013

Large-scale matrix completion using orthogonal rank-one matrix pursuit, divide-factor-combine, and Apache Spark

Description

As the size and scope of valuable datasets have exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms that are capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools to address the shortcomings of the previous generation. One important example of this is the improved performance of the newer tools on the large class of machine learning algorithms that are highly iterative in nature. In this thesis project, I set out to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance in processing the Netflix Prize dataset. I begin by describing several approaches that I attempted but that did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach that uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogonal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes, with various degrees of parallelism. For these experiments, I utilized the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance. In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm.
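
To make the iterative nature of rank-one matrix pursuit concrete, here is a minimal single-machine sketch in NumPy. It is an illustrative simplification written under assumptions: each step adds the top singular vector pair of the residual on the observed entries and then refits all weights by least squares. It is not the thesis's Spark DFC implementation, and the function name `or1mp` and the toy matrix are hypothetical.

```python
import numpy as np


def or1mp(Y, mask, rank):
    """Simplified rank-one matrix pursuit for matrix completion.

    Y:    (m, n) array; only entries where mask is True are used.
    mask: (m, n) boolean array marking observed entries.
    rank: number of rank-one terms to add.
    """
    y_obs = Y[mask]
    bases = []  # rank-one matrices u v^T found so far
    X = np.zeros_like(Y, dtype=float)
    for _ in range(rank):
        # Residual on observed entries, zero elsewhere.
        R = np.where(mask, Y - X, 0.0)
        if np.linalg.norm(R) < 1e-12:
            break
        # Top singular vector pair of the residual.
        U, _, Vt = np.linalg.svd(R, full_matrices=False)
        bases.append(np.outer(U[:, 0], Vt[0]))
        # Refit the weights of all bases on the observed entries.
        A = np.column_stack([B[mask] for B in bases])
        theta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
        X = sum(t * B for t, B in zip(theta, bases))
    return X


# Toy rank-one matrix (hypothetical data).
Y = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])
mask_full = np.ones_like(Y, dtype=bool)
X_full = or1mp(Y, mask_full, rank=1)

# Hide one entry and fit using only the observed ones.
mask_part = mask_full.copy()
mask_part[2, 1] = False
X_part = or1mp(Y, mask_part, rank=2)
```

With every entry observed, a single rank-one term reproduces this toy matrix. With a hidden entry, the sketch only guarantees that the fit on the observed entries improves with each added term; how well hidden entries are predicted depends on the data and the number of terms.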

Contributors: Krouse, Brian (Author) / Ye, Jieping (Thesis advisor) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created: 2014
