Integrative analyses of diverse biological data sources

Description

The technology expansion seen in the last decade for genomics research has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these data types are essential to unveil relevant disease mechanisms. Towards these objectives, this research focuses on data integration within two scenarios: (1) transcriptomic, proteomic and functional information and (2) real-time sensor-based measurements motivated by single-cell technology. To assess relationships between protein abundance, transcriptomic and functional data, a nonlinear model was explored at static and temporal levels. The successful integration of these heterogeneous data sources through the stochastic gradient boosted tree approach and its improved predictability are some highlights of this work. Through the development of an innovative validation subroutine based on a permutation approach and the use of external information (i.e., operons), lack of a priori knowledge for undetected proteins was overcome. The integrative methodologies allowed for the identification of undetected proteins for Desulfovibrio vulgaris and Shewanella oneidensis for further biological exploration in laboratories towards finding functional relationships. In an effort to better understand diseases such as cancer at different developmental stages, the Microscale Life Science Center headquartered at the Arizona State University is pursuing single-cell studies by developing novel technologies. This research arranged and applied a statistical framework that tackled the following challenges: random noise, heterogeneous dynamic systems with multiple states, and understanding cell behavior within and across different Barrett's esophageal epithelial cell lines using oxygen consumption curves. 
These curves were characterized with good empirical fit using nonlinear models with simple structures which allowed extraction of a large number of features. Application of a supervised classification model to these features and the integration of experimental factors allowed for identification of subtle patterns among different cell types visualized through multidimensional scaling. Motivated by the challenges of analyzing real-time measurements, we further explored a unique two-dimensional representation of multiple time series using a wavelet approach which showcased promising results towards less complex approximations. Also, the benefits of external information were explored to improve the image representation.

Contributors: Torres Garcia, Wandaliz (Author) / Meldrum, Deirdre R. (Thesis advisor) / Runger, George C. (Thesis advisor) / Gel, Esma S. (Committee member) / Li, Jing (Committee member) / Zhang, Weiwen (Committee member) / Arizona State University (Publisher)

Created: 2011

Novel statistical models for complex data structures

Description

Rapid advances in sensor and information technology have produced environments that are data-rich both spatially and temporally, creating a pressing need for novel statistical methods and associated computational tools to extract knowledge and informative patterns from these massive datasets. The statistical challenges in addressing these massive datasets lie in their complex structures, such as high dimensionality, hierarchy, multi-modality, heterogeneity, and data uncertainty. Beyond the statistical challenges, the associated computational approaches are also essential for achieving efficiency, effectiveness, and numerical stability in practice. On the other hand, some recent developments in statistics and machine learning, such as sparse learning and transfer learning, and some traditional methodologies that still hold potential, such as multi-level models, all shed light on addressing these complex datasets in a statistically powerful and computationally efficient way. In this dissertation, we identify four general kinds of complex datasets, namely "high-dimensional datasets", "hierarchically-structured datasets", "multimodality datasets" and "data uncertainties", which are ubiquitous in many domains, such as biology, medicine, neuroscience, health care delivery, and manufacturing. We describe the development of novel statistical models to analyze complex datasets that fall under these four categories, and we show how these models can be applied to real-world applications, such as Alzheimer's disease research, the nursing care process, and manufacturing.

Contributors: Huang, Shuai (Author) / Li, Jing (Thesis advisor) / Askin, Ronald (Committee member) / Ye, Jieping (Committee member) / Runger, George C. (Committee member) / Arizona State University (Publisher)

Created: 2012

Use of large, immunosignature databases to pose new questions about infection and health status

Description

Immunosignature is a technology that retrieves information from the immune system. The technology is based on microarrays with peptides chosen from random sequence space. My thesis focuses on improving the Immunosignature platform and using Immunosignatures to improve the diagnosis of diseases. I first contributed to the optimization of the immunosignature platform by introducing scoring metrics to select optimal parameters, considering performance as well as practicality. Next, I primarily worked on identifying a signature shared across various pathogens that can distinguish them from the healthy population. I further retrieved consensus epitopes from the disease common signature and, by studying the enrichment of the common signature in the pathogen proteomes, proposed that most pathogens could share the signature. Following this, I studied cancer samples from different stages and correlated the immune response with whether the epitope presented by the tumor is similar to the pathogen proteome. An effective immune response is defined as an antibody titer that increases and then decreases, suggesting elimination of the epitope. I found that an effective immune response usually correlates with epitopes that are more similar to pathogens. This suggests that the immune system might occupy a limited space and can be effective against only certain epitopes that have similarity with pathogens. I then participated in the attempt to address the antibiotic resistance problem by developing a classification algorithm that can distinguish bacterial from viral infection. This algorithm outperforms other currently available classification methods. Finally, I worked on the concept of deriving a single number to represent all the data on the immunosignature platform. This resembles the concept of temperature, which is an approximate measurement of whether an individual is healthy. The measure of Immune Entropy was found to work best as a single measurement to describe the immune system information derived from the immunosignature. Entropy is relatively invariant in the healthy population, but shows significant differences when comparing healthy donors with patients who are either infected with a pathogen or have cancer.
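
As a rough illustration of how a whole array could be collapsed into one number, the sketch below computes Shannon entropy over normalized peptide intensities. The abstract does not state the exact definition of Immune Entropy, so the use of Shannon entropy, the function name `immune_entropy`, and the toy intensity values are all assumptions made for illustration.

```python
import math

def immune_entropy(intensities):
    """Shannon entropy (in bits) of non-negative peptide intensities.

    Hypothetical helper: intensities are normalized to sum to 1 and
    treated as a probability distribution over peptides.
    """
    total = sum(intensities)
    if total <= 0:
        raise ValueError("intensities must contain at least one positive value")
    # Zero-intensity peptides contribute nothing to the entropy.
    probs = [x / total for x in intensities if x > 0]
    return -sum(p * math.log2(p) for p in probs)

# Toy data (made up): signal spread evenly vs. dominated by one peptide.
flat = immune_entropy([1.0, 1.0, 1.0, 1.0])
peaked = immune_entropy([97.0, 1.0, 1.0, 1.0])
```

Under this assumed definition, an evenly spread signal gives the maximum entropy for the array size, while a signal dominated by a few peptides gives a lower value.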

Contributors: Wang, Lu (Author) / Johnston, Stephen (Thesis advisor) / Stafford, Phillip (Committee member) / Buetow, Kenneth (Committee member) / McFadden, Grant (Committee member) / Arizona State University (Publisher)

Created: 2018

Systematic Analysis of the Factors Contributing to the Variation and Change of the Microbiome

Description

Understanding changes and trends in biomedical knowledge is crucial for individuals, groups, and institutions as biomedicine improves people’s lives, supports national economies, and facilitates innovation. However, as knowledge changes, what evidence illustrates those changes? In the case of the microbiome, a multi-dimensional concept from biomedicine, there are significant increases in publications, citations, funding, collaborations, and other explanatory variables or contextual factors. What is observed in the microbiome, or in any historical evolution of a scientific field or scientific knowledge, is that these changes are related to changes in knowledge, but what is not understood is how to measure and track changes in knowledge. This investigation highlights how contextual factors from the language and social context of the microbiome are related to changes in the usage, meaning, and scientific knowledge of the microbiome. Two interconnected studies that integrate qualitative and quantitative evidence to examine the variation and change of the microbiome are presented. First, the concepts microbiome, metagenome, and metabolome are compared to determine the boundaries of the microbiome concept in relation to other concepts whose conceptual boundaries have been cited as overlapping. A collection of publications, or corpus, for each concept is presented, with a focus on how to create, collect, curate, and analyze large data collections. This study concludes with suggestions on how to analyze biomedical concepts using a hybrid approach that combines results from the larger language context and individual words. Second, the results of a systematic review that describes the variation and change of microbiome research, funding, and knowledge are examined. A corpus of approximately 28,000 articles on the microbiome is characterized, and a spectrum of microbiome interpretations is suggested based on differences related to context. The collective results suggest the microbiome is a separate concept from the metagenome and metabolome, and that the variation and change of the microbiome concept were influenced by contextual factors. These results provide insight into how concepts with extensive resources behave within biomedicine and suggest the microbiome is possibly representative of conceptual change or a preview of new dynamics within science that are expected in the future.

Contributors: Aiello, Kenneth (Author) / Laubichler, Manfred D (Thesis advisor) / Simeone, Michael (Committee member) / Buetow, Kenneth (Committee member) / Walker, Sara I (Committee member) / Arizona State University (Publisher)

Created: 2018

A search for parent-of-origin effects in the parasitoid jewel wasp Nasonia vitripennis

Description

In most diploid cells, autosomal genes are equally expressed from the paternal and maternal alleles, resulting in biallelic expression. However, as an exception, there exists a small number of genes that show a pattern of monoallelic or biased-allele expression based on the allele’s parent-of-origin. This phenomenon is termed genomic imprinting and is an evolutionary paradox. The best explanation for imprinting is David Haig's kinship theory, which hypothesizes that monoallelic gene expression is largely the result of evolutionary conflict between males and females over maternal involvement in their offspring. One previous RNAseq study has investigated the presence of parent-of-origin effects, or imprinting, in the parasitic jewel wasp Nasonia vitripennis (N. vitripennis) and its sister species Nasonia giraulti (N. giraulti) to test the predictions of kinship theory in a non-eusocial species for comparison to a eusocial one. In order to continue to tease apart the connection between social and eusocial Hymenoptera, this study proposed a similar RNAseq study that attempted to reproduce these results in unique samples of reciprocal F1 Nasonia hybrids. After a pseudo N. giraulti reference genome was built, differences were observed between aligning RNAseq reads to a N. vitripennis reference genome and aligning reads to the pseudo N. giraulti reference. In addition, no evidence for parent-of-origin or imprinting patterns in adult Nasonia was found. These results demonstrated a species-of-origin effect. Importantly, the study continued to build a repository of support with the aim of elucidating the mechanisms behind imprinting in an excellent epigenetic model species, which can also help with understanding the phenomenon of imprinting in complex human diseases.

Contributors: Underwood, Avery Elizabeth (Author) / Wilson, Melissa (Thesis advisor) / Buetow, Kenneth (Committee member) / Gile, Gillian (Committee member) / Arizona State University (Publisher)

Created: 2019

Patterns of Sex-Biased Gene Expression in the Human Brain

Description

Schizophrenia is a disease that affects 15.2/100,000 US citizens, with about 0.6-1.9% of the total population being afflicted with some range of severity of the disease. Much research has been done on the progression of the disease and its differences between males and females; however, the true underlying cause of the disease remains unknown. The literature, however, offers considerable indication that the disorder is primarily genetic in origin. In order to establish a foundation in differential gene expression and isoform expression between males and females, we utilized the Genotype-Tissue Expression Project data set (which contains samples from healthy individuals at their time of death) for the amygdala, anterior cingulate cortex, and frontal cortex. We performed quality control on the data with Trimmomatic and visualized it with FastQC and MultiQC. We then aligned to a sex-specific reference genome with Hisat2. Finally, we performed a differential expression analysis through the limma/voom package with inputs from featureCounts. An isoform-level analysis was run on the anterior cingulate cortex with the IsoformSwitchAnalyzeR package. We were able to identify a few differentially expressed genes in the three tissue sites, which included XIST and other highly conserved, Y-linked genes. As for the isoform-level analysis, we were able to identify 13 genes with significant levels of differential isoform usage and expression, two of which have clinical relevance (DAB1 and PACRG). These findings will allow future studies to make comparisons with gene expression in brain tissue samples from patients who had been diagnosed with schizophrenia during their lifetime. By identifying any unique genes in these patients, gene therapies can be developed to target and correct any misexpression that may be occurring.

Contributors: Evanovich, Austin Phillip (Author) / Wilson, Melissa (Thesis director) / Buetow, Kenneth (Committee member) / Natri, Heini Maaret (Committee member) / School of Life Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created: 2019-05

Machine Learning Methods for Diagnosis, Prognosis and Prediction of Long-term Treatment Outcome of Major Depression

Description

Major Depression, clinically called Major Depressive Disorder, is a mood disorder that affects about one eighth of the population in the US and is projected to be the second leading cause of disability in the world by the year 2020. Recent advances in biotechnology have enabled us to collect a great variety of data which could potentially offer us a deeper understanding of the disorder as well as advance personalized medicine.

This dissertation focuses on developing methods for three different aspects of predictive analytics related to the disorder: automatic diagnosis, prognosis, and prediction of long-term treatment outcome. The data used for each task have their specific characteristics and demonstrate unique problems. Automatic diagnosis of melancholic depression is made on the basis of metabolic profiles and micro-array gene expression profiles where the presence of missing values and strong empirical correlation between the variables is not unusual. To deal with these problems, a method of generating a representative set of features is proposed. Prognosis is made on data collected from rating scales and questionnaires which consist mainly of categorical and ordinal variables and thus favor decision tree based predictive models. Decision tree models are known for the notorious problem of overfitting. A decision tree pruning method that overcomes the shortcomings of a greedy nature and reliance on heuristics inherent in traditional decision tree pruning approaches is proposed. The method is further extended to prune Gradient Boosting Decision Tree and tested on the task of prognosis of treatment outcome. Follow-up studies evaluating the long-term effect of the treatments on patients usually measure patients' depressive symptom severity monthly, resulting in the actual time of relapse upper bounded by the observed time of relapse. To resolve such uncertainty in response, a general loss function where the hypothesis could take different forms is proposed to predict the risk of relapse in situations where only an interval for time of relapse can be derived from the observed data.
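
To make the interval-censoring idea concrete, the sketch below scores a relapse-time model when each relapse is only known to fall between two follow-up visits. It is not the dissertation's general loss function: it swaps in a simple exponential relapse-time model as a stand-in, and the function name `interval_censored_nll`, the `rate` parameter, and the example intervals are hypothetical.

```python
import math

def interval_censored_nll(rate, intervals):
    """Negative log-likelihood of an exponential relapse-time model
    (an assumed stand-in model) under interval censoring.

    intervals: (left, right) pairs with left < right, meaning relapse
    happened somewhere in (left, right]. Use right=math.inf when no
    relapse was observed by the last visit at time `left`.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    nll = 0.0
    for left, right in intervals:
        # P(left < T <= right) = S(left) - S(right), with S(t) = exp(-rate * t).
        prob = math.exp(-rate * left) - math.exp(-rate * right)
        nll -= math.log(prob)
    return nll

# Made-up monthly follow-up data: relapse seen between months 2-3 and 5-6,
# plus one patient still relapse-free at month 12.
visits = [(2.0, 3.0), (5.0, 6.0), (12.0, math.inf)]
loss_slow = interval_censored_nll(0.25, visits)
loss_fast = interval_censored_nll(2.0, visits)
```

A lower value indicates a rate more consistent with the observed intervals; in practice the rate (or a richer hypothesis in its place) would be chosen by minimizing this loss.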

Contributors: Nie, Zhi (Author) / Ye, Jieping (Thesis advisor) / He, Jingrui (Thesis advisor) / Li, Baoxin (Committee member) / Xue, Guoliang (Committee member) / Li, Jing (Committee member) / Arizona State University (Publisher)

Created: 2017

Towards an Asynchronous Course-based Undergraduate Research Experience (CURE) Framework: A Pilot Case Study in Remote Genomics Research

Description

Course-based undergraduate research experiences (CUREs) are strategically designed to advance novel research and integrate future professionals into the scientific community by making relevant discoveries through iteration, communication, and collaboration. With universities also expanding online undergraduate degree programs that incorporate students who are otherwise unable to attend college, there is a demand for online asynchronous courses to train online students in authentic research, thereby leading to a more skilled, diverse, and inclusive workforce. In this case study, a pilot CURE leveraging the data-intensive field of genomics was presented as an inclusive opportunity for asynchronous, online students to increase their research experience without having to commit to in-person or extra-curricular assignments. This online CURE was designed to investigate the effects of trimming software on high-throughput sequencing data when analyzing sex differential gene expression. Project-based objectives were developed to asynchronously teach (1) the biology behind the research, (2) the coding needed to conduct the research, and (3) professional development tools to communicate research findings. Course effectiveness was evaluated qualitatively and quantitatively using weekly, open-response progress reports and an assessment administered before and after term completion. This pilot study showed that students can be successful in remote research experiences that incorporate channels for communication, bespoke and accessible learning materials, and open-response reports to monitor challenges and coping strategies. In this iteration, remote students demonstrated improved learning outcomes and self-reported improved confidence as researchers. In addition, students gained more realistic expectations for self-assessing computational research skill levels and self-identified adaptive coping strategies that are transferable to future research projects. Overall, this framework for an online asynchronous CURE effectively taught students computational skills to conduct genomics research in addition to professional skills to transition to and thrive in the workforce.

Contributors: Alarid, Danielle Olga (Author) / Wilson, Melissa A (Thesis advisor) / Buetow, Kenneth (Committee member) / Cooper, Katelyn (Committee member) / Arizona State University (Publisher)

Created: 2023

Novel Semi-Supervised Learning Models to Balance Data Inclusivity and Usability in Healthcare Applications

Description

Semi-supervised learning (SSL) is a subfield of statistical machine learning that is useful for problems with only a few labeled instances, which carry both predictor (X) and target (Y) information, and an abundance of unlabeled instances, which carry only predictor (X) information. SSL harnesses the target information available in the limited labeled data, as well as the information in the abundant unlabeled data, to build strong predictive models. However, not all the included information is useful. For example, some features may correspond to noise, and including them will hurt the predictive model performance. Additionally, some instances may not be as relevant to model building, and their inclusion will increase training time and potentially hurt the model performance. The objective of this research is to develop novel SSL models to balance data inclusivity and usability. My dissertation research focuses on applications of SSL in healthcare, driven by problems in brain cancer radiomics, migraine imaging, and Parkinson’s Disease telemonitoring.

The first topic introduces an integration of machine learning (ML) and a mechanistic model (PI) to develop an SSL model applied to predicting cell density of glioblastoma brain cancer using multi-parametric medical images. The proposed ML-PI hybrid model integrates imaging information from unbiopsied regions of the brain as well as underlying biological knowledge from the mechanistic model to predict spatial tumor density in the brain.

The second topic develops a multi-modality imaging-based diagnostic decision support system (MMI-DDS). MMI-DDS consists of modality-wise principal components analysis to incorporate imaging features at different aggregation levels (e.g., voxel-wise, connectivity-based, etc.), a constrained particle swarm optimization (cPSO) feature selection algorithm, and a clinical utility engine that utilizes inverse operators on chosen principal components for white-box classification models.

The final topic develops a new SSL regression model with integrated feature and instance selection called s2SSL (with “s2” referring to selection in two different ways: feature and instance). s2SSL integrates cPSO feature selection and graph-based instance selection to simultaneously choose the optimal features and instances and build accurate models for continuous prediction. s2SSL was applied to smartphone-based telemonitoring of Parkinson’s Disease patients.

Contributors: Gaw, Nathan (Author) / Li, Jing (Thesis advisor) / Wu, Teresa (Committee member) / Yan, Hao (Committee member) / Hu, Leland (Committee member) / Arizona State University (Publisher)

Created: 2019

Pathway Analysis Reveals Sex Differences in Human Hepatocellular Carcinoma

Description

Hepatocellular carcinoma (HCC) is the third leading cause of cancer death worldwide and exhibits a male bias in occurrence and mortality. Previous studies have provided insight into the role of inherited genetic regulation of transcription in modulating sex differences in HCC etiology and mortality. This study uses pathway analysis to add insight into the biological processes that drive sex differences in HCC etiology as well as to provide an additional framework for future studies on sex-biased cancers. Gene expression data from normal, tumor-adjacent, and HCC liver tissue were used to calculate pathway scores using a tool called PathOlogist, which takes into consideration not only the molecules in a biological pathway but also the interaction type and directionality of the signaling pathways. Analysis of the pathway scores uncovered etiologically relevant pathways differentiating male and female HCC. In normal and tumor-adjacent liver tissue, males showed higher activity of pathways related to translation factors and signaling. Females did not show higher activity of any pathways compared to males in normal and tumor-adjacent liver tissue. This work suggests biological processes that underlie sex biases in HCC occurrence and mortality. Males and females differed in the activation of pathways related to apoptosis, cell cycle, signaling, and metabolism in HCC. These results identify clinically relevant pathways for future research and therapeutic targeting.

Contributors: Rehling, Thomas E (Author) / Buetow, Kenneth (Thesis advisor) / Wilson, Melissa (Committee member) / Maley, Carlo (Committee member) / Arizona State University (Publisher)

Created: 2021