A computational framework to model and learn context-specific gene regulatory networks from multi-source data

Description

Reverse engineering gene regulatory networks (GRNs) is an important problem in the domain of Systems Biology. Learning GRNs is challenging due to the inherent complexity of real regulatory networks and the heterogeneity of samples in available biomedical data. Real-world biological data are commonly collected from broad surveys (profiling studies) and aggregate highly heterogeneous biological samples. Popular methods to learn GRNs simplistically assume a single universal regulatory network corresponding to the available data. They neglect regulatory network adaptation due to changes in underlying conditions, cellular phenotype, or both. This dissertation presents a novel computational framework to learn common regulatory interactions and networks underlying the different sets of relatively homogeneous samples from real-world biological data. The characteristic set of samples/conditions and corresponding regulatory interactions defines the cellular context (context). Context, in this dissertation, represents the deterministic transcriptional activity within the specific cellular regulatory mechanism. The major contributions of this framework include: modeling and learning context-specific GRNs; associating enriched samples with contexts to interpret contextual interactions using biological knowledge; pruning extraneous edges from the context-specific GRN to improve the precision of the final GRNs; integrating multisource data to learn inter- and intra-domain interactions and increase confidence in the obtained GRNs; and finally, learning combinatorial conditioning factors from the data to identify regulatory cofactors. The framework, Expattern, was applied to both real-world and synthetic data. Analysis of NCI60 drug activity and gene expression data yielded interesting insights into the mechanisms of action of drugs.
Application to refractory cancer data and Glioblastoma multiforme yielded GRNs that were readily annotated with context-specific phenotypic information. Refractory cancer GRNs also displayed associations between distinct cancers that were not observed through clustering alone. Performance comparisons on multi-context synthetic data show that Expattern performs better than other comparable methods.

Contributors: Sen, Ina (Author) / Kim, Seungchan (Thesis advisor) / Baral, Chitta (Committee member) / Bittner, Michael (Committee member) / Konjevod, Goran (Committee member) / Arizona State University (Publisher)

Created: 2011

Gene Families in Cancer: Using phylogenetic data to examine an atavistic model of cancer

Description

Despite the 40-year war on cancer, very limited progress has been made in developing a cure for the disease. This failure has prompted the reevaluation of the causes and development of cancer. One resulting model, coined the atavistic model of cancer, posits that cancer is a default phenotype of the cells of multicellular organisms which arises when the cell is subjected to an unusual amount of stress. Since this default phenotype is similar across cell types and even organisms, it seems it must be an evolutionarily ancestral phenotype. We take a phylostratigraphical approach, but systematically add species divergence time data to estimate gene ages numerically and use these ages to investigate the ages of genes involved in cancer. We find that ancient disease-recessive cancer genes are significantly enriched for DNA repair and SOS activity, which seems to imply that a core component of cancer development is not the regulation of growth, but the regulation of mutation. Verification of this finding could drastically improve cancer treatment and prevention.

Contributors: Orr, Adam James (Author) / Davies, Paul (Thesis director) / Bussey, Kimberly (Committee member) / Barrett, The Honors College (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Department of Chemistry and Biochemistry (Contributor) / School of Life Sciences (Contributor)

Created: 2015-05

Standard mapping protocols misestimate sex-biased gene expression

Description

There are several challenges to accurately inferring levels of transcription using RNA-sequencing (RNA-seq) data, including detecting and correcting for reference genome mapping bias. One potential confounder of RNA-seq analysis results from the application of a standardized pipeline to samples of different sexes in species with chromosomal sex determination. The homology between the human X and Y chromosomes will routinely cause mismapping to occur, artificially biasing estimates of sex-biased gene transcription. For this reason, we tested sex-specific mapping scenarios in humans on RNA-seq samples from the brains of 5 genetic females and 5 genetic males to assess how inferences of differential gene expression patterns change depending on the reference genome. We first applied a mapping protocol in which we mapped all individuals to the entire human reference genome (complete), including the X and Y chromosomes, and computed differential expression between the sets of genetic male and genetic female samples. For the two sex-specific mappings, we next mapped the genetic female samples (46,XX) to the human reference genome with the Y chromosome removed (Y-excluded) and the genetic male samples (46,XY) to the human reference genome including the Y chromosome, but with the pseudoautosomal regions of the Y chromosome hard-masked (YPARs-masked). Using the complete and sex-specific mapping protocols, we compared the differential expression measurements of genetic males and genetic females from Cuffdiff outputs. The sex-specific strategy called 33 additional genes as differentially expressed between the two sexes when compared to the complete mapping protocol. This research provides a framework for a new standard of reference genome mappings to correct for sex-biased gene expression estimates that can be used in future studies.
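The two sex-specific references described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy sequences, and the pseudoautosomal coordinates are all hypothetical, and a real analysis would operate on FASTA files with the actual PAR coordinates for the chosen genome build.

```python
def hard_mask(sequence, regions):
    """Return `sequence` with each 0-based, half-open (start, end) region replaced by 'N'."""
    masked = list(sequence)
    for start, end in regions:
        for i in range(start, min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


def sex_specific_reference(reference, karyotype, y_par_regions):
    """Build a per-sample reference from a dict of chromosome name -> sequence.

    karyotype "XX": Y-excluded (the Y chromosome is dropped entirely).
    karyotype "XY": YPARs-masked (Y is kept, its pseudoautosomal regions become 'N').
    """
    if karyotype == "XX":
        return {name: seq for name, seq in reference.items() if name != "chrY"}
    out = dict(reference)  # shallow copy so the input reference is left unchanged
    out["chrY"] = hard_mask(reference["chrY"], y_par_regions)
    return out


# Toy data only: sequences and PAR coordinates are hypothetical placeholders.
toy_reference = {"chr1": "ACGTACGT", "chrX": "GGCCTTAA", "chrY": "GGCCAAAATT"}
toy_y_pars = [(0, 4), (8, 10)]

female_ref = sex_specific_reference(toy_reference, "XX", toy_y_pars)
male_ref = sex_specific_reference(toy_reference, "XY", toy_y_pars)
```

Hard-masking keeps the Y chromosome's coordinate system intact while preventing reads from aligning to the masked bases, so reads from the pseudoautosomal regions can only map to the X copy.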

Contributors: Brotman, Sarah Marie (Author) / Wilson Sayres, Melissa (Thesis director) / Crook, Sharon (Committee member) / Webster, Timothy (Committee member) / School of Life Sciences (Contributor) / School of Mathematical and Natural Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created: 2017-05

Genetic diversity across the pseudoautosomal boundary varies across human populations

Description

Unlike the autosomes, recombination on the sex chromosomes is limited to the pseudoautosomal regions (PARs) at each end of the chromosome. PAR1 spans approximately 2.7 Mb from the tip of the proximal arm of each sex chromosome, and a pseudoautosomal boundary between the PAR1 and non-PAR region is thought to have evolved from a Y-specific inversion that suppressed recombination across the boundary. In addition to the two PARs, there is also a human-specific X-transposed region (XTR) that was duplicated from the X to the Y chromosome. Genetic diversity is expected to be higher in recombining than in nonrecombining regions, particularly because recombination reduces the effects of linked selection, allowing neutral variation to accumulate. We previously showed that diversity decreases linearly across the previously defined pseudoautosomal boundary (rather than dropping suddenly at the boundary), suggesting that the pseudoautosomal boundary may not be as strict as previously thought. In this study, we analyzed data from 1271 genetic females to explore the extent to which the pseudoautosomal boundary varies among human populations (broadly, African, European, South Asian, East Asian, and the Americas). We found that, in all populations, genetic diversity was significantly higher in the PAR1 and XTR than in the non-PAR regions, and that diversity decreased linearly from the PAR1, finally reaching a non-PAR value well past the pseudoautosomal boundary. However, we also found that the location at which diversity changes from reflecting the higher PAR1 diversity to the lower non-PAR diversity varied by as much as 500 kb among populations.
The lack of genetic evidence for a strict pseudoautosomal boundary and the variability in patterns of diversity across the pseudoautosomal boundary are consistent with two potential explanations: (1) the boundary itself may vary across populations, or (2) population-specific demographic histories may have shaped diversity across the pseudoautosomal boundary.

Contributors: Cotter, Daniel Juetten (Author) / Wilson Sayres, Melissa (Thesis director) / Stone, Anne (Committee member) / Webster, Timothy (Committee member) / School of Life Sciences (Contributor) / School of International Letters and Cultures (Contributor) / Barrett, The Honors College (Contributor)

Created: 2016-12

Generalized linear models in Bayesian phylogeography

Description

Bayesian phylogeography is a framework that has enabled researchers to model the spatiotemporal diffusion of pathogens. In general, the framework assumes that discrete geographic sampling traits follow a continuous-time Markov chain process along the branches of an unknown phylogeny that is informed through nucleotide sequence data. Recently, this framework has been extended to model the transition rate matrix between discrete states as a generalized linear model (GLM) of predictors of interest to the pathogen. In this dissertation, I focus on these GLMs, describe their capabilities and limitations, and introduce a pipeline that may enable more researchers to utilize this framework.
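As a rough illustration of what modeling a transition rate matrix as a GLM of predictors can look like, the Python sketch below assumes a log-linear form in which the log of each off-diagonal rate is a sum of coefficient times inclusion indicator times predictor value. That specific form, the 0/1 inclusion indicators, and all names and numbers here are assumptions made for illustration; the abstract does not state the parameterization used in the dissertation.

```python
import math


def glm_rate_matrix(predictors, coefficients, indicators):
    """Build a continuous-time Markov chain rate matrix from pairwise predictors.

    predictors:   list of K square matrices, predictors[k][i][j] = value for i -> j
    coefficients: list of K effect sizes (assumed log-linear effects)
    indicators:   list of K values in {0, 1}; 0 switches a predictor off
    """
    n = len(predictors[0])
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            log_rate = sum(
                beta * delta * x[i][j]
                for x, beta, delta in zip(predictors, coefficients, indicators)
            )
            rates[i][j] = math.exp(log_rate)
        # The diagonal is still 0.0 here, so this makes each row sum to zero.
        rates[i][i] = -sum(rates[i])
    return rates


# Hypothetical standardized predictors for three discrete locations.
toy_distance = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
toy_population = [[0, 3, 1], [2, 0, 1], [2, 3, 0]]

toy_rates = glm_rate_matrix(
    predictors=[toy_distance, toy_population],
    coefficients=[-0.5, 0.8],
    indicators=[1, 0],  # only the distance predictor is included
)
```

In this toy setup the second predictor has no effect because its indicator is zero, so every off-diagonal rate depends on distance alone. In a Bayesian analysis the coefficients and indicators would be sampled rather than fixed, and the posterior frequency of a nonzero indicator is one way support for a predictor could be summarized.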

I first demonstrate how a GLM can be employed and how the support for the predictors can be measured, using influenza A/H5N1 in Egypt as an example. Second, I compare the GLM framework to two alternative frameworks of Bayesian phylogeography: one that uses an advanced computational technique and one that does not. For this assessment, I model the diffusion of influenza A/H3N2 in the United States during the 2014-15 flu season with five methods encapsulated by the three frameworks. I summarize metrics of the phylogenies created by each and demonstrate their reproducibility by performing analyses on several random sequence samples under a variety of population growth scenarios. Next, I demonstrate how discretization of the location trait for a given sequence set can influence phylogenies and support for predictors. That is, I perform several GLM analyses on a set of sequences and change how the sequences are pooled, then show how aggregating predictors at four levels of spatial resolution alters posterior support. Finally, I provide a solution for researchers who wish to use the GLM framework but may be deterred by the tedious file-manipulation requirements that must be completed to do so. My pipeline, which is publicly available, should alleviate concerns pertaining to the difficulty and time-consuming nature of creating the files necessary to perform GLM analyses. This dissertation expands the knowledge of Bayesian phylogeographic GLMs and will facilitate the use of this framework, which may ultimately reveal the variables that drive the spread of pathogens.

Contributors: Magee, Daniel (Author) / Scotch, Matthew L (Thesis advisor) / Gonzalez, Graciela H (Committee member) / Taylor, Jesse E (Committee member) / Arizona State University (Publisher)

Created: 2017

Big data analysis of bacterial inhibitors in parallelized cellomics: a machine learning approach

Description

Identifying chemical compounds that inhibit bacterial infection has recently gained a considerable amount of attention given the increased number of highly resistant bacteria and the serious health threat they pose around the world. With the development of automated microscopy and image analysis systems, the process of identifying novel therapeutic drugs can generate an immense amount of data, easily reaching terabytes' worth of information. Despite the vast and increasing amount of data that is currently generated, traditional analytical methods have not increased the overall success rate of identifying active chemical compounds that eventually become novel therapeutic drugs. Moreover, multispectral imaging has become ubiquitous in drug discovery due to its ability to provide valuable information on cellular and sub-cellular processes using fluorescent reagents. These reagents are often costly and toxic to cells over an extended period of time, which limits experimental design. Thus, there is a significant need to develop a more efficient process of identifying active chemical compounds.

This dissertation introduces novel machine learning methods based on parallelized cellomics to analyze interactions between cells, bacteria, and chemical compounds while reducing the use of fluorescent reagents. Machine learning analysis using image-based high-content screening (HCS) data is compartmentalized into three primary components: (1) Image Analytics, (2) Phenotypic Analytics, and (3) Compound Analytics. A novel software analytics tool called the Insights project is also introduced. The Insights project fully incorporates distributed processing, high performance computing, and database management that can rapidly and effectively utilize and store massive amounts of data generated using HCS biological assessments (bioassays). It is ideally suited for parallelized cellomics in high dimensional space.

Results demonstrate that a parallelized cellomics approach increases the quality of a bioassay while vastly decreasing the need for control data. The reduction in control data leads to less fluorescent reagent consumption. Furthermore, a novel proposed method that uses single-cell data points is proven to identify known active chemical compounds with a high degree of accuracy, despite traditional quality control measurements indicating the bioassay to be of poor quality. This, ultimately, decreases the time and resources needed in optimizing bioassays while still accurately identifying active compounds.

Contributors: Trevino, Robert (Author) / Liu, Huan (Thesis advisor) / Lamkin, Thomas J (Committee member) / He, Jingrui (Committee member) / Lee, Joohyung (Committee member) / Arizona State University (Publisher)

Created: 2016

Transcriptome gene expression analysis of breast cancer using RNA-Seq

Description

Background: Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer deaths in females worldwide, accounting for 23% of all new cancer cases and 14% of all cancer deaths in 2008. Five tumor-normal pairs of primary breast epithelial cells were treated for infinite proliferation by using a ROCK inhibitor and mouse feeder cells. Methods: Raw paired-end, 100x coverage RNA-Seq data were aligned to the Human Reference Genome Version 19 using BWA and Tophat. Gene differential expression analysis was completed using Cufflinks and Cuffdiff. Interactive Genome Viewer was used for data visualization. Results: 15 genes were found to be down-regulated by at least one log-fold change in 4/5 of the tumor samples. 75 genes were found to be down-regulated in 3/5 of the tumor samples by at least one log-fold change. 11 genes were found to be up-regulated in 4/5 of the tumor samples, and 68 genes were identified to be up-regulated in 3/5 of the tumor samples by at least one-fold change. Conclusion: Expression changes in genes such as AZGP1, AGER, ALG11, and S1007 suggest a disruption in the glycosylation pathway. No correlation was found between Cufflinks Her2 gene expression and DAKO score classification.
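The kind of tally reported in the Results can be sketched in Python as follows. This is a simplified stand-in, not Cuffdiff's statistical test: the pseudocount, the reading of "in 4/5 of the tumor samples" as "in at least 4 of 5 tumor-normal pairs", the gene names, and the expression values are all assumptions chosen for illustration.

```python
import math


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of tumor to normal expression; the pseudocount avoids division by zero."""
    return math.log2((tumor + pseudocount) / (normal + pseudocount))


def recurrent_genes(pairs_by_gene, min_pairs, direction, threshold=1.0):
    """Return genes changed by at least `threshold` log2 units in >= `min_pairs` pairs.

    pairs_by_gene: dict of gene -> list of (tumor_expression, normal_expression)
    direction:     "down" for tumor < normal, "up" for tumor > normal
    """
    hits = []
    for gene, pairs in pairs_by_gene.items():
        changes = [log2_fold_change(t, n) for t, n in pairs]
        if direction == "down":
            count = sum(c <= -threshold for c in changes)
        else:
            count = sum(c >= threshold for c in changes)
        if count >= min_pairs:
            hits.append(gene)
    return sorted(hits)


# Hypothetical expression values for five tumor-normal pairs per gene.
toy_expression = {
    "GENE_A": [(1, 7), (0, 3), (3, 15), (1, 3), (9, 9)],  # down in 4 of 5 pairs
    "GENE_B": [(7, 1), (7, 1), (7, 1), (3, 3), (3, 3)],   # up in 3 of 5 pairs
}

down_in_4 = recurrent_genes(toy_expression, min_pairs=4, direction="down")
up_in_3 = recurrent_genes(toy_expression, min_pairs=3, direction="up")
```

A one-unit change on the log2 scale corresponds to a doubling or halving of expression, which is why a threshold of 1.0 is used in the sketch.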

Contributors: Hernandez, Fernando (Author) / Anderson, Karen (Thesis director) / Mangone, Marco (Committee member) / Park, Jin (Committee member) / Barrett, The Honors College (Contributor) / Department of Information Systems (Contributor)

Created: 2013-05

BioEve: user interface framework bridging IE and IR

Description

Continuous advancements in biomedical research have resulted in the production of vast amounts of scientific data and literature discussing them. The ultimate goal of computational biology is to translate these large amounts of data into actual knowledge of complex biological processes and accurate life science models. The ability to rapidly and effectively survey the literature is necessary for the creation of large-scale models of the relationships among biomedical entities as well as for hypothesis generation to guide biomedical research. To reduce the effort and time spent in performing these activities, an intelligent search system is required. Even though many systems aid in navigating through this wide collection of documents, the vastness and depth of the information can be overwhelming. An automated extraction system coupled with a cognitive search and navigation service over these document collections would not only save time and effort, but also facilitate discovery of unknown information implicitly conveyed in the texts. This thesis presents the different approaches used for large-scale biomedical named entity recognition, and the challenges faced in each. It also proposes BioEve: an integrative framework that fuses faceted search with information extraction to provide a search service that addresses the user's desire for "completeness" of the query results, not just the top-ranked ones. This information extraction system enables discovery of important semantic relationships between entities such as genes, diseases, drugs, and cell lines, as well as events, from biomedical text on MEDLINE, which is the largest publicly available database of the world's biomedical journal literature. It is an innovative search and discovery service that makes it easier to search, navigate, and discover knowledge hidden in life sciences literature. To demonstrate the utility of this system, this thesis also details a prototype enterprise-quality search and discovery service that helps researchers with guided step-by-step query refinement by suggesting concepts enriched in intermediate results, thereby facilitating the "discover more as you search" paradigm.

Contributors: Kanwar, Pradeep (Author) / Davulcu, Hasan (Thesis advisor) / Dinu, Valentin (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)

Created: 2010

Spatial Transcriptomics Reveals Changes in Cell Type Proportions within Diseased Lung Tissue

Description

Idiopathic pulmonary fibrosis (IPF) is an interstitial lung disease (ILD) that results in the permanent scarring and damage of lung tissue. Currently, there is no known cause or viable treatment for this disease, and the majority of patients either receive a lung transplant or succumb to the disease within five years of diagnosis. This project centers on studying IPF by analyzing gene expression patterns in healthy vs. diseased lung tissue via spatial transcriptomics. Spatial transcriptomics is the study of individual RNA transcripts within cells on a spatial level. With the novel technology MERFISH, we can detect gene expression in a spatial context with single-cell resolution, allowing us to make inferences about certain patterns of gene expression that are solely driven by the pathology of the disease. A total of 120 cells were selected from 21 different lung samples (6 healthy; 15 ILD). Within those lung samples, cells were selected from 4 different tissue features: control, less fibrotic, more fibrotic, and cystic. We built an analysis pipeline in R to analyze cell type composition around these features at different distances from the center cell (0-75, 76-150, and 150-225 μm). Cell types were annotated at both a broad (less specific) and fine (more specific) level. Upon analyzing the relationship between the proportions of various cell types and distance from tissue features, we found that, at the broad cell type annotation level, the proportion of airway epithelium cells had a negative relationship with distance that was statistically significant in linear regression models. At the fine cell type annotation level, ciliated/secretory cells displayed the same trend. These results support our current understanding of cystic tissue in the lung and provide a foundation for understanding disease pathology as a whole.
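The analysis described above (cell type proportions in distance bins, then a linear fit against distance) can be sketched as follows. The authors' pipeline was written in R and is not reproduced here; this Python sketch uses hypothetical function names and toy data, treats the distance bins as half-open intervals, and computes only the least-squares slope, not its statistical significance.

```python
def proportion_by_bin(neighbors, cell_type, bin_edges):
    """Proportion of `cell_type` among neighbors in each half-open bin [lo, hi).

    neighbors: list of (distance_um, cell_type_label) around one center cell
    bin_edges: ascending distances, e.g. [0, 75, 150, 225]
    """
    proportions = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        labels = [label for dist, label in neighbors if lo <= dist < hi]
        if labels:
            proportions.append(sum(label == cell_type for label in labels) / len(labels))
        else:
            proportions.append(float("nan"))  # no cells fell in this bin
    return proportions


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical neighbors of one center cell: (distance in micrometers, broad cell type).
toy_neighbors = [
    (10, "airway epithelium"), (20, "airway epithelium"), (40, "airway epithelium"), (60, "immune"),
    (80, "airway epithelium"), (100, "airway epithelium"), (120, "immune"), (140, "mesenchymal"),
    (160, "airway epithelium"), (180, "immune"), (200, "mesenchymal"), (220, "immune"),
]
toy_edges = [0, 75, 150, 225]

toy_proportions = proportion_by_bin(toy_neighbors, "airway epithelium", toy_edges)
toy_midpoints = [(lo + hi) / 2 for lo, hi in zip(toy_edges[:-1], toy_edges[1:])]
toy_slope = ols_slope(toy_midpoints, toy_proportions)
```

A negative slope in this toy example means the cell type makes up a smaller share of the neighborhood as distance from the center cell increases, which is the direction of the relationship reported for airway epithelium cells.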

Contributors: Mallapragada, Saahithi (Author) / Wilson, Melissa (Thesis director) / Banovich, Nick (Thesis director) / Vannan, Annika (Committee member) / Barrett, The Honors College (Contributor) / College of Health Solutions (Contributor) / School of Life Sciences (Contributor)

Created: 2023-05

Towards an Asynchronous Course-based Undergraduate Research Experience (CURE) Framework: A Pilot Case Study in Remote Genomics Research

Description

Course-based undergraduate research experiences (CUREs) are strategically designed to advance novel research and integrate future professionals into the scientific community by making relevant discoveries through iteration, communication, and collaboration. With universities also expanding online undergraduate degree programs that incorporate students who are otherwise unable to attend college, there is a demand for online asynchronous courses to train online students in authentic research, thereby leading to a more skilled, diverse, and inclusive workforce. In this case study, a pilot CURE leveraging the data-intensive field of genomics was presented as an inclusive opportunity for asynchronous, online students to increase their research experience without having to commit to in-person or extracurricular assignments. This online CURE was designed to investigate the effects of trimming software on high-throughput sequencing data when analyzing sex-differential gene expression. Project-based objectives were developed to asynchronously teach (1) the biology behind the research, (2) the coding needed to conduct the research, and (3) professional development tools to communicate research findings. Course effectiveness was evaluated qualitatively and quantitatively using weekly, open-response progress reports and an assessment administered before and after term completion. This pilot study showed that students can be successful in remote research experiences that incorporate channels for communication, bespoke and accessible learning materials, and open-response reports to monitor challenges and coping strategies. In this iteration, remote students demonstrated improved learning outcomes and self-reported improved confidence as researchers. In addition, students gained more realistic expectations with which to self-assess computational research skill levels and self-identified adaptive coping strategies that are transferable to future research projects.
Overall, this framework for an online asynchronous CURE effectively taught students computational skills to conduct genomics research in addition to professional skills to transition to and thrive in the workforce.

Contributors: Alarid, Danielle Olga (Author) / Wilson, Melissa A (Thesis advisor) / Buetow, Kenneth (Committee member) / Cooper, Katelyn (Committee member) / Arizona State University (Publisher)

Created: 2023
