Matching Items (7)

Filtering by

Clear all filters

149928-Thumbnail Image.png

Integrative analyses of diverse biological data sources

Description

The technology expansion seen in the last decade for genomics research has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these

The technology expansion seen in the last decade for genomics research has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these data types are essential to unveil relevant disease mechanisms. Towards these objectives, this research focuses on data integration within two scenarios: (1) transcriptomic, proteomic and functional information and (2) real-time sensor-based measurements motivated by single-cell technology. To assess relationships between protein abundance, transcriptomic and functional data, a nonlinear model was explored at static and temporal levels. The successful integration of these heterogeneous data sources through the stochastic gradient boosted tree approach and its improved predictability are some highlights of this work. Through the development of an innovative validation subroutine based on a permutation approach and the use of external information (i.e., operons), lack of a priori knowledge for undetected proteins was overcome. The integrative methodologies allowed for the identification of undetected proteins for Desulfovibrio vulgaris and Shewanella oneidensis for further biological exploration in laboratories towards finding functional relationships. In an effort to better understand diseases such as cancer at different developmental stages, the Microscale Life Science Center headquartered at the Arizona State University is pursuing single-cell studies by developing novel technologies. This research arranged and applied a statistical framework that tackled the following challenges: random noise, heterogeneous dynamic systems with multiple states, and understanding cell behavior within and across different Barrett's esophageal epithelial cell lines using oxygen consumption curves. These curves were characterized with good empirical fit using nonlinear models with simple structures which allowed extraction of a large number of features. Application of a supervised classification model to these features and the integration of experimental factors allowed for identification of subtle patterns among different cell types visualized through multidimensional scaling. Motivated by the challenges of analyzing real-time measurements, we further explored a unique two-dimensional representation of multiple time series using a wavelet approach which showcased promising results towards less complex approximations. Also, the benefits of external information were explored to improve the image representation.

Contributors

Agent

Created

Date Created
2011

151176-Thumbnail Image.png

Novel statistical models for complex data structures

Description

Rapid advance in sensor and information technology has resulted in both spatially and temporally data-rich environment, which creates a pressing need for us to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns

Rapid advance in sensor and information technology has resulted in both spatially and temporally data-rich environment, which creates a pressing need for us to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns from these massive datasets. The statistical challenges for addressing these massive datasets lay in their complex structures, such as high-dimensionality, hierarchy, multi-modality, heterogeneity and data uncertainty. Besides the statistical challenges, the associated computational approaches are also considered essential in achieving efficiency, effectiveness, as well as the numerical stability in practice. On the other hand, some recent developments in statistics and machine learning, such as sparse learning, transfer learning, and some traditional methodologies which still hold potential, such as multi-level models, all shed lights on addressing these complex datasets in a statistically powerful and computationally efficient way. In this dissertation, we identify four kinds of general complex datasets, including "high-dimensional datasets", "hierarchically-structured datasets", "multimodality datasets" and "data uncertainties", which are ubiquitous in many domains, such as biology, medicine, neuroscience, health care delivery, manufacturing, etc. We depict the development of novel statistical models to analyze complex datasets which fall under these four categories, and we show how these models can be applied to some real-world applications, such as Alzheimer's disease research, nursing care process, and manufacturing.

Contributors

Agent

Created

Date Created
2012

156520-Thumbnail Image.png

A Novel Approach to the Comparative Genomic Analysis of Canine and Human Cancers

Description

Study of canine cancer’s molecular underpinnings holds great potential for informing veterinary and human oncology. Sporadic canine cancers are highly abundant (~4 million diagnoses/year in the United States) and the dog’s unique genomic architecture due to selective inbreeding, alongside the

Study of canine cancer’s molecular underpinnings holds great potential for informing veterinary and human oncology. Sporadic canine cancers are highly abundant (~4 million diagnoses/year in the United States) and the dog’s unique genomic architecture due to selective inbreeding, alongside the high similarity between dog and human genomes both confer power for improving understanding of cancer genes. However, characterization of canine cancer genome landscapes has been limited. It is hindered by lack of canine-specific tools and resources. To enable robust and reproducible comparative genomic analysis of canine cancers, I have developed a workflow for somatic and germline variant calling in canine cancer genomic data. I have first adapted a human cancer genomics pipeline to create a semi-automated canine pipeline used to map genomic landscapes of canine melanoma, lung adenocarcinoma, osteosarcoma and lymphoma. This pipeline also forms the backbone of my novel comparative genomics workflow.

Practical impediments to comparative genomic analysis of dog and human include challenges identifying similarities in mutation type and function across species. For example, canine genes could have evolved different functions and their human orthologs may perform different functions. Hence, I undertook a systematic statistical evaluation of dog and human cancer genes and assessed functional similarities and differences between orthologs to improve understanding of the roles of these genes in cancer across species. I tested this pipeline canine and human Diffuse Large B-Cell Lymphoma (DLBCL), given that canine DLBCL is the most comprehensively genomically characterized canine cancer. Logistic regression with genes bearing somatic coding mutations in each cancer was used to determine if conservation metrics (sequence identity, network placement, etc.) could explain co-mutation of genes in both species. Using this model, I identified 25 co-mutated and evolutionarily similar genes that may be compelling cross-species cancer genes. For example, PCLO was identified as a co-mutated conserved gene with PCLO having been previously identified as recurrently mutated in human DLBCL, but with an unclear role in oncogenesis. Further investigation of these genes might shed new light on the biology of lymphoma in dogs and human and this approach may more broadly serve to prioritize new genes for comparative cancer biology studies.

Contributors

Agent

Created

Date Created
2018

157879-Thumbnail Image.png

Knowledge-driven methods for geographic information extraction in the biomedical domain

Description

Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such

Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of infected host (LOIH) of viruses from public sequence databases such as GenBank and any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed for this task, along with the ambiguity of geographic locations, make it especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for conducting in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis, are also challenging tasks, often requiring expert domain knowledge. Therefore, automated information extraction (IE) methods could help significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted the use of IE methods in this domain.

This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods were found to have a high accuracy and shown to be adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks.

Contributors

Agent

Created

Date Created
2019

155019-Thumbnail Image.png

Comparative genomics and novel bioinformatics methodology applied to the green anole reveal unique sex chromosome evolution

Description

In species with highly heteromorphic sex chromosomes, the degradation of one of the sex chromosomes can result in unequal gene expression between the sexes (e.g., between XX females and XY males) and between the sex chromosomes and the autosomes. Dosage

In species with highly heteromorphic sex chromosomes, the degradation of one of the sex chromosomes can result in unequal gene expression between the sexes (e.g., between XX females and XY males) and between the sex chromosomes and the autosomes. Dosage compensation is a process whereby genes on the sex chromosomes achieve equal gene expression which prevents deleterious side effects from having too much or too little expression of genes on sex chromsomes. The green anole is part of a group of species that recently underwent an adaptive radiation. The green anole has XX/XY sex determination, but the content of the X chromosome and its evolution have not been described. Given its status as a model species, better understanding the green anole genome could reveal insights into other species. Genomic analyses are crucial for a comprehensive picture of sex chromosome differentiation and dosage compensation, in addition to understanding speciation.

In order to address this, multiple comparative genomics and bioinformatics analyses were conducted to elucidate patterns of evolution in the green anole and across multiple anole species. Comparative genomics analyses were used to infer additional X-linked loci in the green anole, RNAseq data from male and female samples were anayzed to quantify patterns of sex-biased gene expression across the genome, and the extent of dosage compensation on the anole X chromosome was characterized, providing evidence that the sex chromosomes in the green anole are dosage compensated.

In addition, X-linked genes have a lower ratio of nonsynonymous to synonymous substitution rates than the autosomes when compared to other Anolis species, and pairwise rates of evolution in genes across the anole genome were analyzed. To conduct this analysis a new pipeline was created for filtering alignments and performing batch calculations for whole genome coding sequences. This pipeline has been made publicly available.

Contributors

Agent

Created

Date Created
2016

156679-Thumbnail Image.png

Machine Learning Models for High-dimensional Biomedical Data

Description

The recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of the high-dimensional biomedical data creates the needs of new machine learning models for effective data analysis and knowledge discovery.

The recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of the high-dimensional biomedical data creates the needs of new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover the patterns and improve the decision making. All the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on the data clustering. Data clustering is often the first step for analyzing a dataset without the label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging, yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both the accuracy and the interpretation. GCRNN contains a convolutional network component to extract high-level features, and a recurrent network component to enhance the modeling of the temporal characteristics. A feed-forward fully connected network with the sparse group lasso regularization is used to generate the final classification and provide good interpretability.

The last topic centers around the dimensionality reduction methods for time series data. A good dimensionality reduction method is important for the storage, decision making and pattern visualization for time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error, but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.

Contributors

Agent

Created

Date Created
2018

158771-Thumbnail Image.png

Fine Mapping Functional Noncoding Genetic Elements Via Machine Learning

Description

All biological processes like cell growth, cell differentiation, development, and aging requires a series of steps which are characterized by gene regulation. Studies have shown that gene regulation is the key to various traits and diseases. Various factors affect the

All biological processes like cell growth, cell differentiation, development, and aging requires a series of steps which are characterized by gene regulation. Studies have shown that gene regulation is the key to various traits and diseases. Various factors affect the gene regulation which includes genetic signals, epigenetic tracks, genetic variants, etc. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two different approaches to identifying these elements: TreeMap and DeepCORE. The first approach involves identifying putative causal genetic variants in cis-eQTL accounting for multisite effects and genetic linkage at a locus. TreeMap performs an organized search for individual and multiple causal variants using a tree guided nested machine learning method. DeepCORE on the other hand explores novel deep learning techniques that models the relationship between genetic, epigenetic and transcriptional patterns across tissues and cell lines and identifies co-operative regulatory elements that affect gene regulation. These two methods are believed to be the link for genotype-phenotype association and a necessary step to explaining various complex diseases and missing heritability.

Contributors

Agent

Created

Date Created
2020