Informatics approaches for integrative analysis of disparate high-throughput genomic datasets in cancer

Description

The processes of a human somatic cell are very complex, with various genetic mechanisms governing its fate. Such cells undergo various genetic mutations, which translate to the genetic aberrations that we see in cancer. There are more than 100 types of cancer, each having many more subtypes, with aberrations being unique to each. In the past two decades, the widespread application of high-throughput genomic technologies, such as microarrays and next-generation sequencing, has revealed many such aberrations. Known types and subtypes can be readily identified using gene-expression profiling and, more importantly, high-throughput genomic datasets have helped identify novel subtypes with distinct signatures. Recent studies showing the use of gene-expression profiling in clinical decision making for breast cancer patients underscore the utility of high-throughput datasets. Beyond prognosis, understanding the underlying cellular processes is essential for effective cancer treatment. Various high-throughput techniques are now available to look at a particular aspect of a genetic mechanism in cancer tissue. To look at these mechanisms individually is akin to looking at a broken watch: taking apart each of its parts, looking at them individually and finally making a list of all the faulty ones. Integrative approaches are needed to transform one-dimensional cancer signatures into multi-dimensional interaction and regulatory networks, consequently bettering our understanding of cellular processes in cancer. Here, I attempt to (i) address ways to effectively identify high-quality variants when multiple assays on the same sample are available, through two novel tools, snpSniffer and NGSPE; and (ii) glean new biological insight into multiple myeloma through two novel integrative analysis approaches that make use of disparate high-throughput datasets. While these methods focus on multiple myeloma datasets, the informatics approaches are applicable to all cancer datasets and will thus help advance cancer genomics.

Contributors: Yellapantula, Venkata (Author) / Dinu, Valentin (Thesis advisor) / Scotch, Matthew (Committee member) / Wallstrom, Garrick (Committee member) / Keats, Jonathan (Committee member) / Arizona State University (Publisher)

Created: 2014

Structural variant detection: a novel approach

Description

Genomic structural variation (SV) is defined as gross alterations in the genome, broadly classified as insertions/duplications, deletions, inversions and translocations. DNA sequencing ushered structural variant discovery beyond laboratory detection techniques to high-resolution informatics approaches. Bioinformatics tools for computational discovery of SVs, however, are still missing variants in the complex cancer genome. This study aimed to define the genomic context leading to tool failure and to design a novel algorithm addressing this context. Methods: The study tested the widely held but unproven hypothesis that tools fail to detect variants which lie in repeat regions. The publicly available 1000 Genomes dataset with experimentally validated variants was tested with the SVDetect tool for the presence of true positive (TP) SVs versus false negative (FN) SVs, expecting that FNs would be overrepresented in repeat regions. Further, the novel algorithm, designed to informatically capture the biological etiology of translocations (non-allelic homologous recombination and 3-D placement of chromosomes in cells, i.e. the context), was tested using a simulated dataset. Translocations were created in known translocation hotspots and the novel-algorithm tool was compared with SVDetect and BreakDancer. Results: 53% of false negative (FN) deletions were within repeat structure compared to 81% of true positive (TP) deletions. Similarly, 33% of FN insertions versus 42% of TP, 26% of FN duplications versus 57% of TP, and 54% of FN novel sequences versus 62% of TP were within repeats. Repeat structure was not driving the tool's inability to detect variants and could not be used as context. The novel algorithm with a redefined context, when tested against SVDetect and BreakDancer, was able to detect 10/10 simulated translocations with a 30X coverage dataset and 100% allele frequency, while SVDetect captured 4/10 and BreakDancer detected 6/10. For the 15X coverage dataset with 100% allele frequency, the novel algorithm was able to detect all ten translocations, albeit with fewer supporting reads. BreakDancer detected 4/10 and SVDetect detected 2/10. Conclusion: This study showed that the presence of repetitive elements in general within a structural variant did not influence the tool's ability to capture it. This context-based algorithm proved better than current tools even with half the genome coverage of the accepted protocol and provides an important first step for novel translocation discovery in the cancer genome.
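
The repeat-region comparison described in the Methods can be sketched in a few lines of Python. This is only an illustration of the idea, not the study's pipeline: the interval coordinates are invented, intervals are assumed to be half-open and on a single chromosome, and the helper names `overlaps_any` and `fraction_in_repeats` are hypothetical.

```python
def overlaps_any(variant, repeats):
    """Return True if the half-open interval `variant` overlaps any repeat interval."""
    start, end = variant
    return any(start < r_end and r_start < end for r_start, r_end in repeats)


def fraction_in_repeats(variants, repeats):
    """Fraction of variants that overlap at least one repeat region."""
    if not variants:
        return 0.0
    hits = sum(1 for v in variants if overlaps_any(v, repeats))
    return hits / len(variants)


# Invented toy coordinates (assumption): two repeat regions on one chromosome.
repeats = [(100, 200), (500, 650)]

# Invented toy call sets (assumption): variants the tool found (TP) and missed (FN).
true_positives = [(120, 180), (190, 260), (700, 800), (510, 520)]
false_negatives = [(210, 300), (640, 660), (900, 950), (10, 50)]

tp_fraction = fraction_in_repeats(true_positives, repeats)
fn_fraction = fraction_in_repeats(false_negatives, repeats)
print(f"TP in repeats: {tp_fraction:.2f}, FN in repeats: {fn_fraction:.2f}")
```

If missed variants were concentrated in repeats, the FN fraction would be the larger of the two; the abstract reports the opposite pattern for every variant class it tested.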

Contributors: Shetty, Sheetal (Author) / Dinu, Valentin (Thesis advisor) / Bussey, Kimberly (Committee member) / Scotch, Matthew (Committee member) / Wallstrom, Garrick (Committee member) / Arizona State University (Publisher)

Created: 2014

Novel methods of biomarker discovery and predictive modeling using Random Forest

Description

Random forest (RF) is a popular and powerful technique. It can be used for classification, regression and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is used as a predictive model to generate predictions for new observations. Recent research has proposed several methods based on RF for feature selection and for generating prediction intervals. However, they are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset, and used as the basis for two novel methods for biomarker discovery and for generating prediction intervals.

Firstly, a biodosimetry is developed using RF to determine absorbed radiation dose from gene expression measured from blood samples of potentially exposed individuals. To improve the prediction accuracy of the biodosimetry, day-specific models were built to deal with day interaction effect and a technique of nested modeling was proposed. The nested models can fit this complex data of large variability and non-linear relationships.

Secondly, a panel of biomarkers was selected using a data-driven feature selection method as well as hand-picking, considering prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method can incorporate domain knowledge as a penalty term to regulate the selection of candidate features in RF. It adds more flexibility to data-driven feature selection and can improve the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide selection of biomarkers. In simulated datasets, the method can also compete with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge.

Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is widely applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets.

Contributors: Guan, Xin (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Thesis advisor) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2017

From conflict to common ground: establishing Religious Cultural Competence in Evolution Education (ReCCEE)

Description

Evolution is the foundation of biology, yet it remains controversial even among college biology students. Acceptance of evolution is important for students if we want them to incorporate evolution into their scientific thinking. However, students’ religious beliefs are a consistent barrier to their acceptance of evolution due to a perceived conflict between religion and evolution. Using pre-post instructional surveys of students in introductory college biology, Study 1 establishes instructional strategies that can be effective for reducing students' perceived conflict between religion and evolution. Through interviews and qualitative analyses, Study 2 documents how instructors teaching evolution at public universities may be resistant towards implementing strategies that can reduce students' perceived conflict, perhaps because of their own lack of religious beliefs and lack of training and awareness about students' conflict with evolution. Interviews with religious students in Study 3 reveal that religious college biology students can perceive their instructors as unfriendly towards religion, which can negatively impact these students' perceived conflict between religion and evolution. Study 4 explores how instructors at Christian universities, who share the same Christian backgrounds as their students, do not struggle with implementing strategies that reduce students' perceived conflict between religion and evolution. Cumulatively, these studies reveal a need for a new instructional framework for evolution education that takes into account the religious cultural difference between instructors who are teaching evolution and students who are learning evolution. As such, a new instructional framework is then described, Religious Cultural Competence in Evolution Education (ReCCEE), that can help instructors teach evolution in a way that can reduce students' perceived conflict between religion and evolution, increase student acceptance of evolution, and create more inclusive college biology classrooms for religious students.

Contributors: Barnes, Maryann Elizabeth (Author) / Brownell, Sara (Thesis advisor) / Nesse, Randolph (Committee member) / Collins, James (Committee member) / Husman, Jenefer (Committee member) / Maienschein, Jane (Committee member) / Arizona State University (Publisher)

Created: 2018

Topological analysis of biological pathways : genes, microRNAs and pathways involved in hepatocellular carcinoma

Description

Rewired biological pathways and/or rewired microRNA (miRNA)-mRNA interactions might also influence the activity of biological pathways. Here, rewired biological pathways are defined as the differential (rewiring) effects of genes on the topology of biological pathways between controls and cases. Similarly, rewired miRNA-mRNA interactions are defined as the differential (rewiring) effects of miRNAs on the topology of biological pathways between controls and cases. This dissertation discusses how rewired biological pathways (Chapter 1) and/or rewired miRNA-mRNA interactions (Chapter 2) aberrantly influence the activity of biological pathways and their association with disease.

This dissertation proposes two PageRank-based analytical methods, Pathways of Topological Rank Analysis (PoTRA) and miR2Pathway, discussed in Chapter 1 and Chapter 2, respectively. PoTRA focuses on detecting pathways in which the number of hub genes differs between two phenotypes. The basis for PoTRA is that the loss of connectivity is a common topological trait of cancer networks, as well as the prior knowledge that a normal biological network is a scale-free network whose degree distribution follows a power law, where a small number of nodes are hubs and a large number of nodes are non-hubs. However, from normal to cancer, the process of the network losing connectivity might be the process of disrupting the scale-free structure of the network; namely, the number of hub genes might be altered in cancer compared to that in normal samples. Hence, it is hypothesized that if the number of hub genes in a pathway differs between normal and cancer, this pathway might be involved in cancer. MiR2Pathway focuses on quantifying the differential effects of miRNAs on the activity of a biological pathway when miRNA-mRNA connections are altered from normal to disease, and on ranking the disease risk of rewired miRNA-mediated biological pathways. This dissertation explores how rewired gene-gene interactions and rewired miRNA-mRNA interactions lead to aberrant activity of biological pathways, and ranks pathways by their disease risk. The two methods proposed here can be used to complement existing genomics analysis methods to facilitate the study of biological mechanisms behind disease at the systems level.
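
The hub-counting idea behind PoTRA can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the toy networks, the gene names, the damping factor of 0.85, and the rule that a hub is a node whose PageRank exceeds twice the network mean are all assumptions made here for illustration.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an undirected graph given as {node: set_of_neighbors}."""
    nodes = sorted(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            neighbors = adj[v]
            if neighbors:
                share = damping * rank[v] / len(neighbors)
                for u in neighbors:
                    new[u] += share
            else:
                # Isolated node: spread its rank uniformly over all nodes.
                share = damping * rank[v] / n
                for u in nodes:
                    new[u] += share
        rank = new
    return rank


def count_hubs(adj, factor=2.0):
    """Count nodes whose PageRank exceeds `factor` times the mean (assumed hub rule)."""
    rank = pagerank(adj)
    mean = 1.0 / len(rank)
    return sum(1 for r in rank.values() if r > factor * mean)


def undirected(edges, nodes):
    """Build an undirected adjacency map from an edge list."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


# Hypothetical six-gene pathway: gene A is a hub in normal tissue.
GENES = ["A", "B", "C", "D", "E", "F"]
normal = undirected([("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F")], GENES)
# In the hypothetical cancer network, A loses three of its connections.
cancer = undirected([("A", "B"), ("A", "C"), ("D", "E"), ("E", "F")], GENES)

print(count_hubs(normal), count_hubs(cancer))  # prints "1 0"
```

A pathway whose hub count changes between the two conditions, as in this toy case, would be a candidate for further study under the hypothesis stated above; a real analysis would also need a statistical test of the difference.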

Contributors: Li, Chaoxing (Author) / Dinu, Valentin (Thesis advisor) / Kuang, Yang (Thesis advisor) / Liu, Li (Committee member) / Wang, Xiao (Committee member) / Arizona State University (Publisher)

Created: 2017

Discovering subclones and their driver genes in tumors sequenced at standard depths

Description

Understanding intratumor heterogeneity and its driver genes is critical to designing personalized treatments and improving clinical outcomes of cancers. Such investigations require accurate delineation of the subclonal composition of a tumor, which to date can only be reliably inferred from deep-sequencing data (>300x depth). The algorithm resulting from the work presented here incorporates an adaptive error model into statistical decomposition of mixed populations, which corrects the mean-variance dependency of sequencing data at the subclonal level and enables accurate subclonal discovery in tumors sequenced at standard depths (30-50x). Tested on extensive computer simulations and real-world data, this new method, named model-based adaptive grouping of subclones (MAGOS), consistently outperforms existing methods on minimum sequencing depth, decomposition accuracy and computation efficiency. MAGOS supports subclone analysis using single nucleotide variants and copy number variants from one or more samples of an individual tumor. The GUST algorithm, on the other hand, is a novel method for detecting cancer-type-specific driver genes. Combining MAGOS and GUST results can provide insights into cancer progression. Applications of MAGOS and GUST to whole-exome sequencing data from samples of 33 different cancer types discovered a significant association between subclonal diversity and their drivers and patient overall survival.
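
To see why sequencing depth and the mean-variance dependency matter for subclone discovery, consider a short Python sketch. It assumes a plain binomial sampling model for read counts, which is a textbook simplification and not the adaptive error model used by MAGOS; the function name `vaf_std` and the example frequencies are hypothetical.

```python
import math


def vaf_std(true_vaf, depth):
    """Standard deviation of an observed variant allele frequency (VAF)
    when `depth` reads are drawn binomially with success probability `true_vaf`."""
    return math.sqrt(true_vaf * (1.0 - true_vaf) / depth)


# The spread of the observed VAF shrinks with depth and also depends on the
# mean itself, which is the mean-variance dependency in its simplest form.
for depth in (30, 50, 300):
    for vaf in (0.10, 0.25, 0.50):
        print(f"depth={depth:>3}  VAF={vaf:.2f}  std={vaf_std(vaf, depth):.3f}")
```

Under this simplified model, two subclones with nearby VAFs are much harder to separate at 30x than at 300x, because the noise around each VAF is roughly three times larger.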

Contributors: Ahmadinejad, Navid (Author) / Liu, Li (Thesis advisor) / Maley, Carlo (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2019

Fine Mapping Functional Noncoding Genetic Elements Via Machine Learning

Description

All biological processes, such as cell growth, cell differentiation, development, and aging, require a series of steps that are characterized by gene regulation. Studies have shown that gene regulation is the key to various traits and diseases. Various factors affect gene regulation, including genetic signals, epigenetic tracks, genetic variants, etc. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two different approaches to identifying these elements: TreeMap and DeepCORE. The first approach involves identifying putative causal genetic variants in cis-eQTL, accounting for multisite effects and genetic linkage at a locus. TreeMap performs an organized search for individual and multiple causal variants using a tree-guided nested machine learning method. DeepCORE, on the other hand, explores novel deep learning techniques that model the relationship between genetic, epigenetic and transcriptional patterns across tissues and cell lines and identify co-operative regulatory elements that affect gene regulation. These two methods are believed to be the link for genotype-phenotype association and a necessary step toward explaining various complex diseases and missing heritability.

Contributors: Chandrashekar, Pramod Bharadwaj (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2020

Statistical Methods for Analysis of Genomic Data with Applications in Oncology

Description

This dissertation presents three novel algorithms with real-world applications to genomic oncology. While the methodologies presented here were all developed to overcome various challenges associated with the adoption of high-throughput genomic data in clinical oncology, they can be used in other domains as well. First, a network-informed feature ranking algorithm is presented, which shows a significant increase in the ability to select true predictive features from simulated datasets when compared to other state-of-the-art graphical feature ranking methods. The methodology also shows an increased ability to predict pathological complete response to preoperative chemotherapy from genomic sequencing data of breast cancer patients, utilizing domain knowledge from protein-protein interaction networks. Second, an algorithm is presented to classify microsatellite instability (MSI) status from next-generation sequencing (NGS) data; it overcomes population biases inherent in the use of a human reference genome developed primarily from European populations. The methodology significantly increases the accuracy of MSI status prediction in African and African American ancestries. Finally, a single-variable model is presented to capture the bimodality inherent in genomic data stemming from heterogeneous diseases. This model shows improvements over other parametric models in the measurement of receiver operating characteristic (ROC) curves for bimodal data. The model is used to estimate ROC curves for heterogeneous biomarkers in a dataset containing breast cancer and cancer-free specimens.
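
For readers unfamiliar with ROC analysis of a bimodal biomarker, the following Python sketch computes an empirical area under the ROC curve (AUC). It uses the standard pairwise (Mann-Whitney) estimate rather than the dissertation's parametric model, and the biomarker values are invented: the hypothetical cases fall into a low mode that overlaps the controls and a high mode that does not.

```python
def empirical_auc(cases, controls):
    """Empirical AUC: probability that a random case scores higher than a
    random control, counting ties as one half."""
    total = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                total += 1.0
            elif c == k:
                total += 0.5
    return total / (len(cases) * len(controls))


# Invented biomarker values (assumption): cases are bimodal.
controls = [1.0, 2.0, 3.0, 4.0]
cases = [2.5, 3.5, 9.0, 10.0]

auc = empirical_auc(cases, controls)
print(f"AUC = {auc:.4f}")  # prints "AUC = 0.8125"
```

The low-mode cases are what pull the AUC below 1, so a model that assumes a single mode for the cases can misjudge the shape of the curve.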

Contributors: Saul, Michelle (Author) / Dinu, Valentin (Thesis advisor) / Liu, Li (Committee member) / Wang, Junwen (Committee member) / Arizona State University (Publisher)

Created: 2021

BioCore Guide: A Tool for Interpreting the Core Concepts of Vision and Change for Biology Majors

Description

Vision and Change in Undergraduate Biology Education outlined five core concepts intended to guide undergraduate biology education: 1) evolution; 2) structure and function; 3) information flow, exchange, and storage; 4) pathways and transformations of energy and matter; and 5) systems. We have taken these general recommendations and created a Vision and Change BioCore Guide—a set of general principles and specific statements that expand upon the core concepts, creating a framework that biology departments can use to align with the goals of Vision and Change. We used a grassroots approach to generate the BioCore Guide, beginning with faculty ideas as the basis for an iterative process that incorporated feedback from more than 240 biologists and biology educators at a diverse range of academic institutions throughout the United States. The final validation step in this process demonstrated strong national consensus, with more than 90% of respondents agreeing with the importance and scientific accuracy of the statements. It is our hope that the BioCore Guide will serve as an agent of change for biology departments as we move toward transforming undergraduate biology education.

Contributors: Brownell, Sara (Author) / Freeman, Scott (Author) / Wenderoth, Mary Pat (Author) / Crowe, Alison J. (Author) / College of Liberal Arts and Sciences (Contributor)

Created: 2014-06-01

Course-Based Undergraduate Research Experiences Can Make Scientific Research More Inclusive

Description

The U.S. scientific research community does not reflect America's diversity. Hispanics, African Americans, and Native Americans made up 31% of the general population in 2010, but they represented only 18 and 7% of science, technology, engineering, and mathematics (STEM) bachelor's and doctoral degrees, respectively, and 6% of STEM faculty members (National Science Foundation [NSF], 2013). Equity in the scientific research community is important for a variety of reasons; a diverse community of researchers can minimize the negative influence of bias in scientific reasoning, because people from different backgrounds approach a problem from different perspectives and can raise awareness regarding biases (Intemann, 2009). Additionally, by failing to be attentive to equity, we may exclude some of the best and brightest scientific minds and limit the pool of possible scientists (Intemann, 2009). Given this need for equity, how can our scientific research community become more inclusive?

Contributors: Bangera, Gita (Author) / Brownell, Sara (Author) / College of Liberal Arts and Sciences (Contributor)

Created: 2014-12-01
