Structured sparse learning and its applications to biomedical and biological data

Description

Sparsity has become an important modeling tool in areas such as genetics, signal and audio processing, and medical image processing. Through l-1 norm based regularization, structured sparse learning algorithms can produce highly accurate models while imposing various predefined structures on the data, such as feature groups or graphs. In this thesis, I first propose to solve a sparse learning model with a general group structure, where the predefined groups may overlap with each other. Then, I present three real-world applications that can benefit from the group structured sparse learning technique. In the first application, I study the Alzheimer's Disease diagnosis problem using multi-modality neuroimaging data. In this dataset, not every subject has all data sources available, which produces a unique and challenging block-wise missing pattern. In the second application, I study the automatic annotation and retrieval of fruit-fly gene expression pattern images. Combined with spatial information, sparse learning techniques can be used to construct effective representations of the expression images. In the third application, I present a new computational approach to annotate the developmental stage of Drosophila embryos in gene expression images. In addition, the approach provides a stage score that enables one to annotate each embryo more finely, so that embryos are divided into early and late periods of development within standard stage demarcations. Stage scores help to illuminate global gene activities and changes much better, and more refined stage annotations improve our ability to interpret results when expression pattern matches are discovered between genes.
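
As a rough illustration of the group-structured sparsity described in this abstract, the sketch below applies block soft-thresholding, the standard proximal step for a group lasso penalty. It assumes non-overlapping groups, whereas the thesis addresses the harder overlapping case; the function name `group_soft_threshold` and the sample values are hypothetical and are not taken from the thesis.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 for non-overlapping groups.

    weights: list of floats (the coefficient vector)
    groups:  list of index lists, each index appearing in at most one group
    lam:     non-negative regularization strength
    """
    result = list(weights)
    for group in groups:
        # Euclidean norm of the coefficients in this group.
        norm = math.sqrt(sum(weights[i] ** 2 for i in group))
        # Shrink the whole group toward zero; zero it out if its norm is <= lam.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in group:
            result[i] = scale * weights[i]
    return result


# Illustrative values only: the first group is shrunk, the second is removed entirely.
example = group_soft_threshold([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], lam=1.0)
```

Because the penalty acts on each group's norm rather than on individual coefficients, a weak group is discarded as a unit, which is what makes the selected features respect the predefined structure.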

Contributors: Yuan, Lei (Author) / Ye, Jieping (Thesis advisor) / Wang, Yalin (Committee member) / Xue, Guoliang (Committee member) / Kumar, Sudhir (Committee member) / Arizona State University (Publisher)

Created: 2013

Spatial and temporal patterns of population genetic diversity in the fynbos plant, Leucadendron salignum, in the Cape Floral Region of South Africa

Description

The Cape Floral Region (CFR) in southwestern South Africa is one of the most diverse in the world, with >9,000 plant species, 70% of which are endemic, in an area of only ~90,000 km². Many have suggested that the CFR's heterogeneous environment, with respect to landscape gradients, vegetation, rainfall, elevation, and soil fertility, is responsible for the origin and maintenance of this biodiversity. While studies have struggled to link species diversity with these features, no study has attempted to associate patterns of gene flow with environmental data to determine how CFR biodiversity evolves on different scales. Here, a molecular population genetic dataset is presented for a widespread CFR plant, Leucadendron salignum, sampled across 51 locations, with 5 kb of chloroplast (cpDNA) and 6 kb of unlinked nuclear (nuDNA) DNA sequences from 305 individuals. In the cpDNA dataset, significant genetic structure was found to vary on temporal and spatial scales, separating the Western and Eastern Capes (the latter of which appears to be recently derived from the former), with the highest diversity in a central region at the heart of the CFR. A second study applied a statistical model using vegetation and soil composition and found that fine-scale genetic divergence is better explained by this landscape resistance model than by a geographic distance model. Finally, a third analysis contrasted the cpDNA and nuDNA datasets and revealed very little geographic structure in the latter, suggesting that seed and pollen dispersal can have different evolutionary genetic histories of gene flow even on small CFR scales. Together, these three studies caution that different genomic markers need to be considered when modeling the geographic and temporal origin of CFR groups. From a broader perspective, the results are consistent with the hypothesis that landscape heterogeneity is one driving influence in limiting gene flow across the CFR, which can lead to species diversity on fine scales. Nonetheless, while this pattern may be true of the widespread L. salignum, extending this approach to other CFR species with varying ranges and dispersal mechanisms is now warranted to determine how universal these patterns of landscape genetic diversity are.

Contributors: Tassone, Erica (Author) / Verrelli, Brian C (Thesis advisor) / Dowling, Thomas (Committee member) / Cartwright, Reed (Committee member) / Rosenberg, Michael S. (Committee member) / Wojciechowski, Martin (Committee member) / Arizona State University (Publisher)

Created: 2013

HIV evolution: biogeography and intra-individual dynamics

Description

The entire history of HIV-1 is hidden in its ten thousand bases, where information regarding its evolutionary traversal through the human population can only be unlocked with fine-scale sequence analysis. Measurable footprints of mutation and recombination have given us a wealth of knowledge, from multiple chimpanzee-to-human transmissions to patterns of neutralizing antibody and drug resistance. Extracting maximum understanding from such diverse data can only be accomplished by analyzing the viral population from many angles. This body of work explores two primary aspects of HIV sequence evolution, point mutation and recombination, through cross-sectional (inter-individual) and longitudinal (intra-individual) investigations, respectively. Cross-sectional Analysis: The role of Haiti in the subtype B pandemic has been hotly debated for years; although there have been many studies, until now no one has incorporated the well-known mechanism of retroviral recombination into their biological model. Prior to the use of recombination detection, multiple analyses produced trees in which subtype B appears to have first entered Haiti, followed by a jump into the rest of the world. The results presented here contest the Haiti-first theory of the pandemic and instead suggest simultaneous entries of subtype B into Haiti and the rest of the world. Longitudinal Analysis: Potential N-linked glycosylation sites (PNGS) are the most evolutionarily dynamic component of one of the most evolutionarily dynamic proteins known to date. While the number of mutations associated with the increase or decrease of PNGS frequency over time is high, there is a set of relatively stable sites that persist within and between longitudinally sampled individuals. Here, I identify the most conserved stable PNGSs and suggest their potential roles in host-virus interplay. In addition, I have identified, for the first time, what may be a gp-120-based environmental preference for N-linked glycosylation sites.

Contributors: Hepp, Crystal Marie, 1981- (Author) / Rosenberg, Michael S. (Thesis advisor) / Hedrick, Philip (Committee member) / Escalante, Ananias (Committee member) / Kumar, Sudhir (Committee member) / Arizona State University (Publisher)

Created: 2013

A phylogenetic revision of Minyomerus Horn, 1876 and Piscatopus Sleeper, 1960 (Curculionidae: Entiminae: Tanymecini: Tanymecina)

Description

A phylogenetic revision of the broad-nosed weevil genera Minyomerus Horn, 1876, and Piscatopus Sleeper, 1960 (Entiminae: Tanymecini) is presented. These genera are distributed throughout western North America, from Canada to Mexico and Baja California, primarily in arid and desert habitats, and feed on shrubs such as creosote (Larrea tridentata (DC.) Coville: Zygophyllaceae) and several Asteraceae. Piscatopus was considered monotypic, comprising solely P. griseus Sleeper, 1960, whereas Minyomerus formerly comprised seven species: M. innocuus Horn, 1876 (designated as the type species for Minyomerus in Pierce, 1913), M. caseyi (Sharp, 1891), M. conicollis Green, 1920, M. constrictus (Casey, 1888), M. languidus Horn, 1876, M. laticeps (Casey, 1888), and M. microps (Say, 1831). This revision includes comprehensive redescriptions of the previously described species in these genera and descriptions of ten new species: M. imberbus sp. nov., M. caponei sp. nov., M. reburrus sp. nov., M. cracens sp. nov., M. trisetosus sp. nov., M. puticulatus sp. nov., M. bulbifrons sp. nov., M. politus sp. nov., M. gravivultus sp. nov., and M. rutellirostris sp. nov. A cladistic analysis using 46 morphological characters of 22 terminal taxa (5 outgroup, 17 ingroup) was carried out in WinClada and yielded a single most-parsimonious cladogram (length = 82, consistency index = 65, retention index = 82). The monophyly of Minyomerus is supported by the preferred cladogram. The results of the cladistic analysis place Piscatopus griseus within the genus Minyomerus as sister to M. rutellirostris. Therefore, Piscatopus is demoted to a junior synonym of Minyomerus, and its sole member, P. griseus, is moved to Minyomerus as M. griseus (Sleeper), new combination. Additionally, the species M. innocuus Horn, 1876 is demoted to a junior synonym of M. microps (Say, 1831), based on the principle of priority, and M. microps is elevated to the rank of type for the genus. The species M. languidus, M. microps, and M. trisetosus are putatively considered parthenogenetic, as male specimens are lacking over a broad range of sampling events. The diversity in exterior and genitalic morphology, range of host plants, overlapping species distributions, and geographic extent suggests an origin during the Miocene (~15 mya).

Contributors: Jansen, Michael Andrew (Author) / Franz, Nico M (Thesis advisor) / Wojciechowski, Martin (Committee member) / Rosenberg, Michael (Committee member) / Arizona State University (Publisher)

Created: 2014

Pathogen origins and evolution in the New World: a molecular and bioarchaeological approach to tuberculosis and leishmaniasis

Description

Studies of ancient pathogens are moving beyond simple confirmatory analysis of diseased bone; bioarchaeologists and ancient geneticists are posing nuanced questions and utilizing novel methods capable of confronting the debates surrounding pathogen origins and evolution, and the relationships between humans and disease in the past. This dissertation examines two ancient human diseases through molecular and bioarchaeological lines of evidence, relying on techniques in paleogenetics and phylogenetics to detect, isolate, sequence, and analyze ancient and modern pathogen DNA within an evolutionary framework. Specifically, this research addresses outstanding issues regarding a) the evolution, origin, and phylogenetic placement of the pathogen causing skeletal tuberculosis in the New World prior to European contact, and b) the phylogeny and origins of the parasite causing the human leishmaniasis disease complex. An additional chapter presents a review of the major technological and theoretical advances in ancient pathogen genomics to frame the contributions of this work within a rapidly developing field. This overview emphasizes that understanding the evolution of human disease is critical to contextualizing relationships between humans and pathogens, and the epidemiological shifts observed both in the past and in the present era of (re)emerging infectious diseases. These questions continue to be at the forefront of not only pathogen research, but also bioarchaeological and paleopathological scholarship.

Contributors: Harkins, Kelly M (Author) / Buikstra, Jane E. (Thesis advisor) / Stone, Anne C (Thesis advisor) / Knudson, Kelly (Committee member) / Kumar, Sudhir (Committee member) / Krause, Johannes (Committee member) / Arizona State University (Publisher)

Created: 2014

The role of mutations in protein structural dynamics and function: a multi-scale computational approach

Description

Proteins are a fundamental unit in biology. Although proteins have been extensively studied, there is still much to investigate. The mechanism by which proteins fold into their native state, how evolution shapes structural dynamics, and the dynamic mechanisms of many diseases are not well understood. In this thesis, protein folding is explored using a multi-scale modeling method including (i) geometric constraint based simulations that efficiently search for native-like topologies and (ii) reservoir replica exchange molecular dynamics, which identifies the low free energy structures and refines these structures toward the native conformation. A test set of eight proteins and three ancestral steroid receptor proteins are folded to 2.7 Å all-atom RMSD from their experimental crystal structures. Protein evolution and disease associated mutations (DAMs) are most commonly studied by in silico multiple sequence alignment methods. Here, however, the structural dynamics are incorporated to give insight into the evolution of three ancestral proteins and the mechanism of several diseases in the human ferritin protein. The differences in conformational dynamics of these evolutionarily related, functionally diverged ancestral steroid receptor proteins are investigated by obtaining the most collective motion through essential dynamics. Strikingly, this analysis shows that evolutionarily diverged proteins of the same family do not share the same dynamic subspace. Rather, those sharing the same function are clustered together and are distant from their functionally diverged homologs. This dynamics analysis also identifies 77% of the mutations (functional and permissive) necessary to evolve new function. In silico methods for prediction of DAMs rely on differences in evolution rate due to purifying selection, and therefore the accuracy of DAM prediction decreases at fast and slow evolvable sites. Here, we investigate structural dynamics by computing the contribution of each residue to the biologically relevant fluctuations and from this define a metric: the dynamic stability index (DSI). Using DSI, we study the mechanism for three diseases observed in the human ferritin protein. The T30I and R40G DAMs show a loss of dynamic stability at the C-terminus helix and nearby regulatory loop, agreeing with experimental results implicating the same regulatory loop as a cause in cataracts syndrome.

Contributors: Glembo, Tyler J (Author) / Ozkan, Sefika B (Thesis advisor) / Thorpe, Michael F (Committee member) / Ros, Robert (Committee member) / Kumar, Sudhir (Committee member) / Shumway, John (Committee member) / Arizona State University (Publisher)

Created: 2011

Multi-task learning via structured regularization: formulations, algorithms, and applications

Description

Multi-task learning (MTL) aims to improve the generalization performance (of the resulting classifiers) by learning multiple related tasks simultaneously. Specifically, MTL exploits the intrinsic task relatedness, based on which the informative domain knowledge from each task can be shared across multiple tasks and thus facilitate the individual task learning. It is particularly desirable to share the domain knowledge (among the tasks) when there are a number of related tasks but only limited training data is available for each task. Modeling the relationship of multiple tasks is critical to the generalization performance of MTL algorithms. In this dissertation, I propose a series of MTL approaches which assume that multiple tasks are intrinsically related via a shared low-dimensional feature space. The proposed MTL approaches are developed to deal with different scenarios and settings; each is formulated as a mathematical optimization problem that minimizes the empirical loss regularized by a different structure. For all proposed MTL formulations, I develop the associated optimization algorithms to find their globally optimal solutions efficiently. I also conduct theoretical analysis for certain MTL approaches by deriving the globally optimal solution recovery condition and the performance bound. To demonstrate the practical performance, I apply the proposed MTL approaches to different real-world applications: (1) automated annotation of Drosophila gene expression pattern images; (2) categorization of Yahoo web pages. The experimental results demonstrate the efficiency and effectiveness of the proposed algorithms.
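
The regularized formulation described in this abstract can be written generically as follows. All symbols are assumptions introduced here for illustration: $W$ stacks the weight vectors $w_t$ of the $T$ tasks, $n_t$ is the number of training samples $(x_{t,i}, y_{t,i})$ for task $t$, $\ell$ is a loss function, and $\Omega$ is a structure-inducing regularizer with strength $\lambda$. The trace norm is shown only as one common choice that encourages a shared low-dimensional subspace; it is not necessarily the regularizer used in the dissertation.

```latex
\min_{W = [w_1, \dots, w_T]} \;
  \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t}
    \ell\bigl(w_t^{\top} x_{t,i},\, y_{t,i}\bigr)
  \;+\; \lambda\, \Omega(W),
\qquad \text{e.g. } \Omega(W) = \lVert W \rVert_{*} = \sum_{k} \sigma_k(W)
```

Here $\sigma_k(W)$ denotes the $k$-th singular value of $W$; penalizing their sum favors a low-rank $W$, so the task weight vectors are pushed toward a common low-dimensional subspace.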

Contributors: Chen, Jianhui (Author) / Ye, Jieping (Thesis advisor) / Kumar, Sudhir (Committee member) / Liu, Huan (Committee member) / Xue, Guoliang (Committee member) / Arizona State University (Publisher)

Created: 2011

Protist and cyanobacterial contributions to particle flux in oligotrophic ocean regions

Description

The oceans play an essential role in global biogeochemical cycles and in regulating climate. The biological carbon pump, the photosynthetic fixation of carbon dioxide by phytoplankton and subsequent sequestration of organic carbon into deep water, combined with the physical carbon pump, makes the oceans the only long-term net sink for anthropogenic carbon dioxide. A full understanding of the workings of the biological carbon pump requires knowledge of the contribution of different taxonomic groups of phytoplankton (protists and cyanobacteria) to organic carbon export. However, this has been difficult to obtain because of the degraded nature of particles sinking into particle traps, the main tools employed by oceanographers to collect sinking particulate matter in the ocean. In this study, DNA-based molecular methods, including denaturing gradient gel electrophoresis, cloning and sequencing, and taxon-specific quantitative PCR, allowed, for the first time, identification of the protists and cyanobacteria that contributed to the material collected by the traps in relation to their presence in the euphotic zone. I conducted this study at two time-series stations in the subtropical North Atlantic Ocean, one north of the Canary Islands and one south of Bermuda. The Bermuda study allowed me to investigate seasonal and interannual changes in the contribution of the plankton community to particle flux. I could also show that small unarmored taxa, including representatives of prasinophytes and cyanobacteria, constituted a significant fraction of sequences recovered from sediment trap material. Prasinophyte sequences alone could account for up to 13% of the clone library sequences of trap material during bloom periods. These observations contradict a long-standing paradigm in biological oceanography that only large taxa with mineral shells are capable of sinking, while smaller, unarmored cells are recycled in the euphotic zone through the microbial loop. Climate change and a subsequent warming of the surface ocean may lead to a shift in the protist community toward smaller cell size in the future, but in light of these findings such changes may not necessarily reduce the strength of the biological carbon pump.

Contributors: Amacher, Jessica (Author) / Neuer, Susanne (Thesis advisor) / Garcia-Pichel, Ferran (Committee member) / Lomas, Michael (Committee member) / Wojciechowski, Martin (Committee member) / Stout, Valerie (Committee member) / Arizona State University (Publisher)

Created: 2011

Identification of neo-antigens for a cancer vaccine by transcriptome analysis

Description

We propose a novel solution to prevent cancer by developing a prophylactic cancer vaccine. Several sources of antigens for cancer vaccines have been published. Among these, antigens that contain a frame-shift (FS) peptide or viral peptide are quite attractive for a variety of reasons. FS sequences, arising from errors in either RNA processing or genomic DNA, may lead to the generation of neo-peptides that are foreign to the immune system. Viral peptides presumably would originate from exogenous but integrated viral nucleic acid sequences. Both are non-self and therefore lessen concerns about the development of autoimmunity. I have developed a bioinformatic approach to identify these aberrant transcripts in the cancer transcriptome. Their suitability for use in a vaccine is evaluated by establishing their frequencies and predicting possible epitopes along with their population coverage according to the prevalence of major histocompatibility complex (MHC) types. Viral transcripts and transcripts with FS mutations from gene fusion, insertion/deletion at coding microsatellite DNA, and alternative splicing were identified in the NCBI Expressed Sequence Tag (EST) database. Forty-eight FS chimeric transcripts were validated by RT-PCR and sequencing in 50 breast cell lines and 68 primary breast tumor samples, with frequencies ranging from 4% to 98%. These 48 FS peptides, if translated and presented, could be used to protect more than 90% of the population in Northern America based on the prediction of epitopes derived from them. Furthermore, we synthesized 150 peptides corresponding to FS and viral peptides that we predicted would exist in tumor patients, and we tested over 200 different cancer patient sera. We found a number of serologically reactive peptide sequences in cancer patients that had little to no reactivity in healthy controls, providing strong support for our bioinformatic approach. This study describes a process for identifying aberrant transcripts that yield a new source of antigens that can be tested and used in a prophylactic cancer vaccine. The vast amount of transcriptome data on various cancers from the Cancer Genome Atlas (TCGA) project will enhance our ability to select better cancer antigen candidates.

Contributors: Lee, HoJoon (Author) / Johnston, Stephen A. (Thesis advisor) / Kumar, Sudhir (Committee member) / Miller, Laurence (Committee member) / Stafford, Phillip (Committee member) / Sykes, Kathryn (Committee member) / Arizona State University (Publisher)

Created: 2012

Developing a virtual flora portal for vascular plants of Saudi Arabia

Description

A floristic analysis is essential to understanding the current diversity and structure of community associations of plants in a region. A region's floristic analysis is also key not only to investigating the plants' geographical origin(s) but also to their management and protection as a reservoir of greater biodiversity. With an area of 2,250,000 square kilometers, the country of Saudi Arabia covers almost four-fifths of the Arabian Peninsula. Efforts to document information on the flora of Saudi Arabia began in the 1700s and have resulted in several comprehensive publications over the last 25 years. There is no doubt that these studies have helped both the scientific research community and the public to gain knowledge about the number of species, types of plants, and their distribution in Saudi Arabia. However, there has been no effort to use digital technology to make the data contained in various Saudi herbarium collections easily accessible online for research and teaching purposes. This research project aims to develop a "virtual flora" portal for the vascular plants of Saudi Arabia. Based on SEINet and the Symbiota software used to power it, a preliminary website portal was established to begin an effort to make information on Saudi Arabia's flora available on the World Wide Web. Data comprising a total of 12,834 specimens representing 175 families were acquired from different organizations and used to create a database for the designed website. After analyzing the data, the Fabaceae family ("legumes") was identified as the largest family and chosen for further analysis. This study helps scientific researchers, government workers, and the general public to have easy, unlimited access to the plant information for a variety of purposes.

Contributors: Albediwi, Albatool (Author) / Wojciechowski, Martin (Thesis advisor) / Franz, Nico (Committee member) / Makings, Elizabeth (Committee member) / Arizona State University (Publisher)

Created: 2017