Matching Items (17)
Filtering by

Clear all filters

152309-Thumbnail Image.png
Description
Vertebrate genomes demonstrate a remarkable range of sizes from 0.3 to 133 gigabase pairs. The proliferation of repeat elements are a major genomic expansion. In particular, long interspersed nuclear elements (LINES) are autonomous retrotransposons that have the ability to "cut and paste" themselves into a host genome through a mechanism

Vertebrate genomes demonstrate a remarkable range of sizes from 0.3 to 133 gigabase pairs. The proliferation of repeat elements are a major genomic expansion. In particular, long interspersed nuclear elements (LINES) are autonomous retrotransposons that have the ability to "cut and paste" themselves into a host genome through a mechanism called target-primed reverse transcription. LINES have been called "junk DNA," "viral DNA," and "selfish" DNA, and were once thought to be parasitic elements. However, LINES, which diversified before the emergence of many early vertebrates, has strongly shaped the evolution of eukaryotic genomes. This thesis will evaluate LINE abundance, diversity and activity in four anole lizards. An intrageneric analysis will be conducted using comparative phylogenetics and bioinformatics. Comparisons within the Anolis genus, which derives from a single lineage of an adaptive radiation, will be conducted to explore the relationship between LINE retrotransposon activity and causal changes in genomic size and composition.
ContributorsMay, Catherine (Author) / Kusumi, Kenro (Thesis advisor) / Gadau, Juergen (Committee member) / Rawls, Jeffery A (Committee member) / Arizona State University (Publisher)
Created2013
153508-Thumbnail Image.png
Description
Telomerase enzyme is a truly remarkable enzyme specialized for the addition of short, highly repetitive DNA sequences onto linear eukaryotic chromosome ends. The telomerase enzyme functions as a ribonucleoprotein, minimally composed of the highly conserved catalytic telomerase reverse transcriptase and essential telomerase RNA component containing an internalized short template

Telomerase enzyme is a truly remarkable enzyme specialized for the addition of short, highly repetitive DNA sequences onto linear eukaryotic chromosome ends. The telomerase enzyme functions as a ribonucleoprotein, minimally composed of the highly conserved catalytic telomerase reverse transcriptase and essential telomerase RNA component containing an internalized short template region within the vastly larger non-coding RNA. Even among closely related groups of species, telomerase RNA is astonishingly divergent in sequence, length, and secondary structure. This massive disparity is highly prohibitive for telomerase RNA identification from previously unexplored groups of species, which is fundamental for secondary structure determination. Combined biochemical enrichment and computational screening methods were employed for the discovery of numerous telomerase RNAs from the poorly characterized echinoderm lineage. This resulted in the revelation that--while closely related to the vertebrate lineage and grossly resembling vertebrate telomerase RNA--the echinoderm telomerase RNA central domain varies extensively in structure and sequence, diverging even within echinoderms amongst sea urchins and brittle stars. Furthermore, the origins of telomerase RNA within the eukaryotic lineage have remained a persistent mystery. The ancient Trypanosoma telomerase RNA was previously identified, however, a functionally verified secondary structure remained elusive. Synthetic Trypanosoma telomerase was generated for molecular dissection of Trypanosoma telomerase RNA revealing two RNA domains functionally equivalent to those found in known telomerase RNAs, yet structurally distinct. This work demonstrates that telomerase RNA is uncommonly divergent in gross architecture, while retaining critical universal elements.
ContributorsPodlevsky, Joshua (Author) / Chen, Julian (Thesis advisor) / Mangone, Marco (Committee member) / Kusumi, Kenro (Committee member) / Wilson-Rawls, Norma (Committee member) / Arizona State University (Publisher)
Created2015
149928-Thumbnail Image.png
Description
The technology expansion seen in the last decade for genomics research has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these data types are essential to unveil relevant disease mechanisms. Towards

The technology expansion seen in the last decade for genomics research has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these data types are essential to unveil relevant disease mechanisms. Towards these objectives, this research focuses on data integration within two scenarios: (1) transcriptomic, proteomic and functional information and (2) real-time sensor-based measurements motivated by single-cell technology. To assess relationships between protein abundance, transcriptomic and functional data, a nonlinear model was explored at static and temporal levels. The successful integration of these heterogeneous data sources through the stochastic gradient boosted tree approach and its improved predictability are some highlights of this work. Through the development of an innovative validation subroutine based on a permutation approach and the use of external information (i.e., operons), lack of a priori knowledge for undetected proteins was overcome. The integrative methodologies allowed for the identification of undetected proteins for Desulfovibrio vulgaris and Shewanella oneidensis for further biological exploration in laboratories towards finding functional relationships. In an effort to better understand diseases such as cancer at different developmental stages, the Microscale Life Science Center headquartered at the Arizona State University is pursuing single-cell studies by developing novel technologies. This research arranged and applied a statistical framework that tackled the following challenges: random noise, heterogeneous dynamic systems with multiple states, and understanding cell behavior within and across different Barrett's esophageal epithelial cell lines using oxygen consumption curves. These curves were characterized with good empirical fit using nonlinear models with simple structures which allowed extraction of a large number of features. Application of a supervised classification model to these features and the integration of experimental factors allowed for identification of subtle patterns among different cell types visualized through multidimensional scaling. Motivated by the challenges of analyzing real-time measurements, we further explored a unique two-dimensional representation of multiple time series using a wavelet approach which showcased promising results towards less complex approximations. Also, the benefits of external information were explored to improve the image representation.
ContributorsTorres Garcia, Wandaliz (Author) / Meldrum, Deirdre R. (Thesis advisor) / Runger, George C. (Thesis advisor) / Gel, Esma S. (Committee member) / Li, Jing (Committee member) / Zhang, Weiwen (Committee member) / Arizona State University (Publisher)
Created2011
151176-Thumbnail Image.png
Description
Rapid advance in sensor and information technology has resulted in both spatially and temporally data-rich environment, which creates a pressing need for us to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns from these massive datasets. The statistical challenges for addressing these

Rapid advance in sensor and information technology has resulted in both spatially and temporally data-rich environment, which creates a pressing need for us to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns from these massive datasets. The statistical challenges for addressing these massive datasets lay in their complex structures, such as high-dimensionality, hierarchy, multi-modality, heterogeneity and data uncertainty. Besides the statistical challenges, the associated computational approaches are also considered essential in achieving efficiency, effectiveness, as well as the numerical stability in practice. On the other hand, some recent developments in statistics and machine learning, such as sparse learning, transfer learning, and some traditional methodologies which still hold potential, such as multi-level models, all shed lights on addressing these complex datasets in a statistically powerful and computationally efficient way. In this dissertation, we identify four kinds of general complex datasets, including "high-dimensional datasets", "hierarchically-structured datasets", "multimodality datasets" and "data uncertainties", which are ubiquitous in many domains, such as biology, medicine, neuroscience, health care delivery, manufacturing, etc. We depict the development of novel statistical models to analyze complex datasets which fall under these four categories, and we show how these models can be applied to some real-world applications, such as Alzheimer's disease research, nursing care process, and manufacturing.
ContributorsHuang, Shuai (Author) / Li, Jing (Thesis advisor) / Askin, Ronald (Committee member) / Ye, Jieping (Committee member) / Runger, George C. (Committee member) / Arizona State University (Publisher)
Created2012
156679-Thumbnail Image.png
Description
The recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of the high-dimensional biomedical data creates the needs of new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to hel

The recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of the high-dimensional biomedical data creates the needs of new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover the patterns and improve the decision making. All the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on the data clustering. Data clustering is often the first step for analyzing a dataset without the label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging, yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both the accuracy and the interpretation. GCRNN contains a convolutional network component to extract high-level features, and a recurrent network component to enhance the modeling of the temporal characteristics. A feed-forward fully connected network with the sparse group lasso regularization is used to generate the final classification and provide good interpretability.

The last topic centers around the dimensionality reduction methods for time series data. A good dimensionality reduction method is important for the storage, decision making and pattern visualization for time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error, but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
ContributorsLin, Sangdi (Author) / Runger, George C. (Thesis advisor) / Kocher, Jean-Pierre A (Committee member) / Pan, Rong (Committee member) / Escobedo, Adolfo R. (Committee member) / Arizona State University (Publisher)
Created2018
135454-Thumbnail Image.png
Description
Mammary gland development in humans during puberty involves the enlargement of breast tissue, but this is not true in non-human primates. To identify potential causes of this difference, I examined variation in substitution rates across genes related to mammary development. Genes undergoing purifying selection show slower-than-average substitution rates, while genes

Mammary gland development in humans during puberty involves the enlargement of breast tissue, but this is not true in non-human primates. To identify potential causes of this difference, I examined variation in substitution rates across genes related to mammary development. Genes undergoing purifying selection show slower-than-average substitution rates, while genes undergoing positive selection show faster rates. These may be related to the difference between humans and other primates. Three genes were found to be accelerated were FOXF1, IGFBP5, and ATP2B2, but only the latter one was found in humans and it seems unlikely that it would be related to the differences between mammary gland development at puberty between humans and non-human primates.
ContributorsArroyo, Diana (Author) / Cartwright, Reed (Thesis director) / Wilson Sayres, Melissa (Committee member) / Schwartz, Rachel (Committee member) / School of Life Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created2016-05
134524-Thumbnail Image.png
Description
With the rising data output and falling costs of Next Generation Sequencing technologies, research into data compression is crucial to maintaining storage efficiency and costs. High throughput sequencers such as the HiSeqX Ten can produce up to 1.8 terabases of data per run, and such large storage demands are even

With the rising data output and falling costs of Next Generation Sequencing technologies, research into data compression is crucial to maintaining storage efficiency and costs. High throughput sequencers such as the HiSeqX Ten can produce up to 1.8 terabases of data per run, and such large storage demands are even more important to consider for institutions that rely on their own servers rather than large data centers (cloud storage)1. Compression algorithms aim to reduce the amount of space taken up by large genomic datasets by encoding the most frequently occurring symbols with the shortest bit codewords and by changing the order of the data to make it easier to encode. Depending on the probability distribution of the symbols in the dataset or the structure of the data, choosing the wrong algorithm could result in a compressed file larger than the original or a poorly compressed file that results in a waste of time and space2. To test efficiency among compression algorithms for each file type, 37 open-source compression algorithms were used to compress six types of genomic datasets (FASTA, VCF, BCF, GFF, GTF, and SAM) and evaluated on compression speed, decompression speed, compression ratio, and file size using the benchmark test lzbench. Compressors that outpreformed the popular bioinformatics compressor Gzip (zlib -6) were evaluated against one another by ratio and speed for each file type and across the geometric means of all file types. Compressors that exhibited fast compression and decompression speeds were also evaluated by transmission time through variable speed internet pipes in scenarios where the file was compressed only once or compressed multiple times.
ContributorsHowell, Abigail (Author) / Cartwright, Reed (Thesis director) / Wilson Sayres, Melissa (Committee member) / Taylor, Jay (Committee member) / Barrett, The Honors College (Contributor)
Created2017-05
153689-Thumbnail Image.png
Description
Damage to the central nervous system due to spinal cord or traumatic brain injury, as well as degenerative musculoskeletal disorders such as arthritis, drastically impact the quality of life. Regeneration of complex structures is quite limited in mammals, though other vertebrates possess this ability. Lizards are the most closely related

Damage to the central nervous system due to spinal cord or traumatic brain injury, as well as degenerative musculoskeletal disorders such as arthritis, drastically impact the quality of life. Regeneration of complex structures is quite limited in mammals, though other vertebrates possess this ability. Lizards are the most closely related organism to humans that can regenerate de novo skeletal muscle, hyaline cartilage, spinal cord, vasculature, and skin. Progress in studying the cellular and molecular mechanisms of lizard regeneration has previously been limited by a lack of genomic resources. Building on the release of the genome of the green anole, Anolis carolinensis, we developed a second generation, robust RNA-Seq-based genome annotation, and performed the first transcriptomic analysis of tail regeneration in this species. In order to investigate gene expression in regenerating tissue, we performed whole transcriptome and microRNA transcriptome analysis of regenerating tail tip and base and associated tissues, identifying key genetic targets in the regenerative process. These studies have identified components of a genetic program for regeneration in the lizard that includes both developmental and adult repair mechanisms shared with mammals, indicating value in the translation of these findings to future regenerative therapies.
ContributorsHutchins, Elizabeth (Author) / Kusumi, Kenro (Thesis advisor) / Rawls, Jeffrey A. (Committee member) / Denardo, Dale F. (Committee member) / Huentelman, Matthew J. (Committee member) / Arizona State University (Publisher)
Created2015
155019-Thumbnail Image.png
Description
In species with highly heteromorphic sex chromosomes, the degradation of one of the sex chromosomes can result in unequal gene expression between the sexes (e.g., between XX females and XY males) and between the sex chromosomes and the autosomes. Dosage compensation is a process whereby genes on the sex chromosomes

In species with highly heteromorphic sex chromosomes, the degradation of one of the sex chromosomes can result in unequal gene expression between the sexes (e.g., between XX females and XY males) and between the sex chromosomes and the autosomes. Dosage compensation is a process whereby genes on the sex chromosomes achieve equal gene expression which prevents deleterious side effects from having too much or too little expression of genes on sex chromsomes. The green anole is part of a group of species that recently underwent an adaptive radiation. The green anole has XX/XY sex determination, but the content of the X chromosome and its evolution have not been described. Given its status as a model species, better understanding the green anole genome could reveal insights into other species. Genomic analyses are crucial for a comprehensive picture of sex chromosome differentiation and dosage compensation, in addition to understanding speciation.

In order to address this, multiple comparative genomics and bioinformatics analyses were conducted to elucidate patterns of evolution in the green anole and across multiple anole species. Comparative genomics analyses were used to infer additional X-linked loci in the green anole, RNAseq data from male and female samples were anayzed to quantify patterns of sex-biased gene expression across the genome, and the extent of dosage compensation on the anole X chromosome was characterized, providing evidence that the sex chromosomes in the green anole are dosage compensated.

In addition, X-linked genes have a lower ratio of nonsynonymous to synonymous substitution rates than the autosomes when compared to other Anolis species, and pairwise rates of evolution in genes across the anole genome were analyzed. To conduct this analysis a new pipeline was created for filtering alignments and performing batch calculations for whole genome coding sequences. This pipeline has been made publicly available.
ContributorsRupp, Shawn Michael (Author) / Wilson Sayres, Melissa A (Thesis advisor) / Kusumi, Kenro (Committee member) / DeNardo, Dale (Committee member) / Arizona State University (Publisher)
Created2016
Description

Agassiz’s desert tortoise (Gopherus agassizii) is a long-lived species native to the Mojave Desert and is listed as threatened under the US Endangered Species Act. To aid conservation efforts for preserving the genetic diversity of this species, we generated a whole genome reference sequence with an annotation based on dee

Agassiz’s desert tortoise (Gopherus agassizii) is a long-lived species native to the Mojave Desert and is listed as threatened under the US Endangered Species Act. To aid conservation efforts for preserving the genetic diversity of this species, we generated a whole genome reference sequence with an annotation based on deep transcriptome sequences of adult skeletal muscle, lung, brain, and blood. The draft genome assembly for G. agassizii has a scaffold N50 length of 252 kbp and a total length of 2.4 Gbp. Genome annotation reveals 20,172 protein-coding genes in the G. agassizii assembly, and that gene structure is more similar to chicken than other turtles. We provide a series of comparative analyses demonstrating (1) that turtles are among the slowest-evolving genome-enabled reptiles, (2) amino acid changes in genes controlling desert tortoise traits such as shell development, longevity and osmoregulation, and (3) fixed variants across the Gopherus species complex in genes related to desert adaptations, including circadian rhythm and innate immune response. This G. agassizii genome reference and annotation is the first such resource for any tortoise, and will serve as a foundation for future analysis of the genetic basis of adaptations to the desert environment, allow for investigation into genomic factors affecting tortoise health, disease and longevity, and serve as a valuable resource for additional studies in this species complex.

Data Availability: All genomic and transcriptomic sequence files are available from the NIH-NCBI BioProject database (accession numbers PRJNA352725, PRJNA352726, and PRJNA281763). All genome assembly, transcriptome assembly, predicted protein, transcript, genome annotation, repeatmasker, phylogenetic trees, .vcf and GO enrichment files are available on Harvard Dataverse (doi:10.7910/DVN/EH2S9K).

ContributorsTollis, Marc (Author) / DeNardo, Dale F (Author) / Cornelius, John A (Author) / Dolby, Greer A (Author) / Edwards, Taylor (Author) / Henen, Brian T. (Author) / Karl, Alice E. (Author) / Murphy, Robert W. (Author) / Kusumi, Kenro (Author)
Created2017-05-31