Matching Items (12)
Description
Genes have widely different pertinences to the etiology and pathology of diseases. Thus, they can be ranked according to their disease-significance on a genomic scale, which is the subject of gene prioritization. Given a set of genes known to be related to a disease, it is reasonable to use them as a basis to determine the significance of other candidate genes, which will then be ranked based on the association they exhibit with respect to the given set of known genes. Experimental and computational data of various kinds have different reliability and relevance to a disease under study. This work presents a gene prioritization method based on integrated biological networks that incorporates and models the various levels of relevance and reliability of diverse sources. The method is shown to achieve significantly higher performance as compared to two well-known gene prioritization algorithms. Essentially, no bias in the performance was seen as it was applied to diseases of diverse etiology, e.g., monogenic, polygenic and cancer. The method was highly stable and robust against significant levels of noise in the data. Biological networks are often sparse, which can impede the operation of association-based gene prioritization algorithms such as the one presented here from a computational perspective. As a potential approach to overcome this limitation, we explore the value that transcription factor binding sites can have in elucidating suitable targets. Transcription factors are needed for the expression of most genes, especially in higher organisms, and hence genes can be associated via their genetic regulatory properties. While each transcription factor recognizes specific DNA sequence patterns, such patterns are mostly unknown for many transcription factors. Even those that are known are inconsistently reported in the literature, implying a potentially high level of inaccuracy. We developed computational methods for prediction and improvement of transcription factor binding patterns. Tests performed on the improvement method by employing synthetic patterns under various conditions showed that the method is very robust and the patterns produced invariably converge to nearly identical series of patterns. Preliminary tests were conducted to incorporate knowledge from transcription factor binding sites into our network-based model for prioritization, with encouraging results. To validate these approaches in a disease-specific context, we built a schizophrenia-specific network based on the inferred associations and performed a comprehensive prioritization of human genes with respect to the disease. These results are expected to be validated empirically, but computational validation using known targets is very positive.
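The idea of ranking candidates by their association with known disease genes across sources of differing reliability can be illustrated with a minimal sketch. This is not the method developed in the thesis: the scoring rule (summing reliability-weighted edge weights to the known genes), the function name `prioritize`, and the data layout are assumptions made purely for illustration.

```python
from collections import defaultdict


def prioritize(networks, reliability, seeds, candidates):
    """Rank candidate genes by reliability-weighted association with seed genes.

    networks:    dict mapping a source name to a list of (gene_a, gene_b, weight) edges
    reliability: dict mapping a source name to a weight (hypothetical values)
    seeds:       genes already known to be related to the disease
    candidates:  genes to be ranked
    """
    # Integrate all sources into one undirected network, scaling each edge
    # by the reliability assigned to the source it came from.
    combined = defaultdict(float)
    for source, edges in networks.items():
        for gene_a, gene_b, weight in edges:
            combined[frozenset((gene_a, gene_b))] += reliability[source] * weight

    # Score each candidate by its total integrated association with the seeds.
    scores = {
        gene: sum(combined.get(frozenset((gene, seed)), 0.0) for seed in seeds)
        for gene in candidates
    }
    ranking = sorted(candidates, key=lambda gene: scores[gene], reverse=True)
    return ranking, scores
```

A candidate linked to a known gene through a highly reliable source can outrank one supported only by weaker sources, which is the behaviour the abstract motivates.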
Contributors: Lee, Jang (Author) / Gonzalez, Graciela (Thesis advisor) / Ye, Jieping (Committee member) / Davulcu, Hasan (Committee member) / Gallitano-Mendel, Amelia (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
Rewired biological pathways and/or rewired microRNA (miRNA)-mRNA interactions might also influence the activity of biological pathways. Here, rewired biological pathways are defined as the differential (rewiring) effects of genes on the topology of biological pathways between controls and cases. Similarly, rewired miRNA-mRNA interactions are defined as the differential (rewiring) effects of miRNAs on the topology of biological pathways between controls and cases. This dissertation discusses how rewired biological pathways (Chapter 1) and/or rewired miRNA-mRNA interactions (Chapter 2) aberrantly influence the activity of biological pathways and their association with disease.

This dissertation proposes two PageRank-based analytical methods, Pathways of Topological Rank Analysis (PoTRA) and miR2Pathway, discussed in Chapter 1 and Chapter 2, respectively. PoTRA focuses on detecting pathways with an altered number of hub genes in corresponding pathways between two phenotypes. The basis for PoTRA is that the loss of connectivity is a common topological trait of cancer networks, as well as the prior knowledge that a normal biological network is a scale-free network whose degree distribution follows a power law where a small number of nodes are hubs and a large number of nodes are non-hubs. However, from normal to cancer, the process of the network losing connectivity might be the process of disrupting the scale-free structure of the network; that is, the number of hub genes might be altered in cancer compared to that in normal samples. Hence, it is hypothesized that if the number of hub genes is different in a pathway between normal and cancer, this pathway might be involved in cancer. MiR2Pathway focuses on quantifying the differential effects of miRNAs on the activity of a biological pathway when miRNA-mRNA connections are altered from normal to disease, and on ranking the disease risk of rewired miRNA-mediated biological pathways. This dissertation explores how rewired gene-gene interactions and rewired miRNA-mRNA interactions lead to aberrant activity of biological pathways, and ranks pathways for their disease risk. The two methods proposed here can be used to complement existing genomics analysis methods to facilitate the study of biological mechanisms behind disease at the systems level.
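The hub-count comparison that motivates PoTRA can be sketched as follows. This is an illustrative approximation under stated assumptions, not the dissertation's implementation: the networks are treated as undirected, PageRank is computed by plain power iteration, and the hub cutoff (`factor` times the mean score) is a hypothetical choice.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected gene network given as (gene, gene) pairs."""
    neighbors = {}
    for gene_a, gene_b in edges:
        neighbors.setdefault(gene_a, set()).add(gene_b)
        neighbors.setdefault(gene_b, set()).add(gene_a)
    nodes = sorted(neighbors)
    n = len(nodes)
    rank = {gene: 1.0 / n for gene in nodes}
    for _ in range(iterations):
        updated = {}
        for gene in nodes:
            # Each neighbor shares its current rank equally among its own neighbors.
            incoming = sum(rank[other] / len(neighbors[other]) for other in neighbors[gene])
            updated[gene] = (1 - damping) / n + damping * incoming
        rank = updated
    return rank


def count_hubs(edges, factor=1.5):
    """Count genes whose PageRank exceeds `factor` times the mean score (hypothetical cutoff)."""
    rank = pagerank(edges)
    mean_score = sum(rank.values()) / len(rank)
    return sum(1 for score in rank.values() if score > factor * mean_score)
```

Comparing `count_hubs` on the same pathway built from normal and from cancer samples gives the kind of difference the hypothesis refers to; deciding whether that difference is statistically meaningful would need a test that this sketch does not include.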
Contributors: Li, Chaoxing (Author) / Dinu, Valentin (Thesis advisor) / Kuang, Yang (Thesis advisor) / Liu, Li (Committee member) / Wang, Xiao (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Proliferation of social media websites and discussion forums in the last decade has resulted in social media mining emerging as an effective mechanism to extract consumer patterns. Most research on social media and pharmacovigilance has concentrated on Adverse Drug Reaction (ADR) identification. Such methods employ a step of drug search followed by classification of the associated text as containing an ADR or not. Although this method works efficiently for ADR classification, if ADR evidence is present in users' posts over time, drug mentions fail to capture such ADRs. It also fails to record additional user information which may provide an opportunity to perform an in-depth analysis of lifestyle habits and possible reasons for any medical problems.

Pre-market clinical trials for drugs generally do not include pregnant women, and so their effects on pregnancy outcomes are not discovered early. This thesis presents a thorough, alternative strategy for assessing the safety profiles of drugs during pregnancy by utilizing user timelines from social media. I explore the use of a variety of state-of-the-art social media mining techniques, including rule-based and machine learning techniques, to identify pregnant women, monitor their drug usage patterns, categorize their birth outcomes, and attempt to discover associations between drugs and bad birth outcomes.

The technique used models user timelines as longitudinal patient networks, which provide us with a variety of key information about pregnancy, drug usage, and post-birth reactions. I evaluate the distinct parts of the pipeline separately, validating the usefulness of each step. Using user timelines in this fashion has produced very encouraging results, and the approach can be employed for a range of other important tasks where users/patients are required to be followed over time to derive population-based measures.
Contributors: Chandrashekar, Pramod Bharadwaj (Author) / Davulcu, Hasan (Thesis advisor) / Gonzalez, Graciela (Thesis advisor) / Hsiao, Sharon (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
In species with highly heteromorphic sex chromosomes, the degradation of one of the sex chromosomes can result in unequal gene expression between the sexes (e.g., between XX females and XY males) and between the sex chromosomes and the autosomes. Dosage compensation is a process whereby genes on the sex chromosomes achieve equal gene expression, which prevents deleterious side effects from having too much or too little expression of genes on sex chromosomes. The green anole is part of a group of species that recently underwent an adaptive radiation. The green anole has XX/XY sex determination, but the content of the X chromosome and its evolution have not been described. Given its status as a model species, better understanding the green anole genome could reveal insights into other species. Genomic analyses are crucial for a comprehensive picture of sex chromosome differentiation and dosage compensation, in addition to understanding speciation.

To address this, multiple comparative genomics and bioinformatics analyses were conducted to elucidate patterns of evolution in the green anole and across multiple anole species. Comparative genomics analyses were used to infer additional X-linked loci in the green anole, RNAseq data from male and female samples were analyzed to quantify patterns of sex-biased gene expression across the genome, and the extent of dosage compensation on the anole X chromosome was characterized, providing evidence that the sex chromosomes in the green anole are dosage compensated.

In addition, X-linked genes have a lower ratio of nonsynonymous to synonymous substitution rates than the autosomes when compared to other Anolis species, and pairwise rates of evolution in genes across the anole genome were analyzed. To conduct this analysis a new pipeline was created for filtering alignments and performing batch calculations for whole genome coding sequences. This pipeline has been made publicly available.
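One common way to look for evidence of dosage compensation in expression data is to compare the X-to-autosome expression ratio between the sexes. The sketch below illustrates that comparison only; it is not the thesis's pipeline, and the function name, the use of medians, and the input layout (one expression value per gene for a given sex) are assumptions.

```python
from statistics import median


def x_to_autosome_ratio(expression, x_linked):
    """Median expression of X-linked genes divided by median expression of autosomal genes.

    expression: dict mapping gene name to an expression value for one sex
    x_linked:   set of gene names assumed to lie on the X chromosome
    """
    x_values = [value for gene, value in expression.items() if gene in x_linked]
    autosomal_values = [value for gene, value in expression.items() if gene not in x_linked]
    return median(x_values) / median(autosomal_values)
```

If the ratio is similar in males and females even though males carry a single X, that pattern is consistent with dosage compensation; a real analysis would also need normalization, expression filtering, and a statistical test.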
Contributors: Rupp, Shawn Michael (Author) / Wilson Sayres, Melissa A (Thesis advisor) / Kusumi, Kenro (Committee member) / DeNardo, Dale (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks such as pharmacovigilance via the use of Natural Language Processing (NLP) techniques. One of the critical steps in information extraction pipelines is Named Entity Recognition (NER), where the mentions of entities such as diseases are located in text and their entity types are identified. However, the language in social media is highly informal, and user-expressed health-related concepts are often non-technical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and advanced machine learning-based NLP techniques have been underutilized. This work explores the effectiveness of different machine learning techniques, and particularly deep learning, to address the challenges associated with extraction of health-related concepts from social media. Deep learning has recently attracted a lot of attention in machine learning research and has shown remarkable success in several applications, particularly imaging and speech recognition. However, thus far, deep learning techniques are relatively unexplored for biomedical text mining and, in particular, this is the first attempt at applying deep learning for health information extraction from social media.

This work presents ADRMine, which uses a Conditional Random Field (CRF) sequence tagger for extraction of complex health-related concepts. It utilizes a large volume of unlabeled user posts for automatic learning of embedding cluster features, a novel application of deep learning in modeling the similarity between the tokens. ADRMine significantly improved the medical NER performance compared to the baseline systems.

This work also presents DeepHealthMiner, a deep learning pipeline for health-related concept extraction. Most machine learning methods require sophisticated task-specific manual feature design, which is a challenging step in processing the informal and noisy content of social media. DeepHealthMiner automatically learns classification features using neural networks and a large volume of unlabeled user posts. Using a relatively small labeled training set, DeepHealthMiner could accurately identify most of the concepts, including the consumer expressions that were not observed in the training data or in the standard medical lexicons, outperforming the state-of-the-art baseline techniques.
Contributors: Nikfarjam, Azadeh (Author) / Gonzalez, Graciela (Thesis advisor) / Greenes, Robert (Committee member) / Scotch, Matthew (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Analyses of high-throughput transcriptome data, such as single-cell ribonucleic acid sequencing (scRNA-seq) and circular ribonucleic acid (circRNA) data, have led to significant breakthroughs, especially in cancer genomics. Analysis of transcriptome time series data is central to identifying the time point(s) where drastic changes in gene transcription are associated with the transition from a homeostatic to a non-homeostatic cellular state (tipping points). In Chapter 2 of this dissertation, I present a novel cell-type specific and co-expression-based tipping point detection method to identify target gene (TG) versus transcription factor (TF) pairs whose differential co-expression across time points drives biological changes in different cell types, and the time point when these changes are observed. This method was applied to scRNA-seq data sets from a SARS-CoV-2 study (18 time points), a human cerebellum development study (9 time points), and a lung injury study (18 time points). Similarly, leveraging transcriptome data across treatment time points, I developed methodologies to identify treatment-induced and cell-type specific differentially co-expressed pairs (DCEPs). In Part 1 of Chapter 3, I present a pipeline that uses a series of statistical tests to detect DCEPs. This method was applied to scRNA-seq data of patients with non-small cell lung cancer (NSCLC) sequenced across cancer treatment times. However, this pipeline does not account for correlations among multiple single cells from the same sample and correlations among multiple samples from the same patient. In Part 2 of Chapter 3, I present a solution to this problem using a mixed-effect model. In Chapter 4, I present a summary of my work that focused on the cross-species analysis of circRNA transcriptome time series data. I compared circRNA profiles in neonatal pig and mouse hearts, identified orthologous circRNAs, and discussed regulation mechanisms of cardiomyocyte proliferation and myocardial regeneration conserved between mouse and pig at different time points.
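The notion of differential co-expression across time points can be illustrated with a small sketch: compute the correlation between a TF and a TG at each time point and locate the largest change between consecutive time points. This is a simplified illustration, not the dissertation's method; the use of Pearson correlation, the function names, and the rule for picking the time point are assumptions, and cell-type specificity and statistical testing are left out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of expression values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def largest_coexpression_shift(tf_by_time, tg_by_time):
    """Return (time_index, change) for the largest jump in TF-TG correlation.

    tf_by_time, tg_by_time: lists with one entry per time point, each entry a
    list of per-cell expression values for the TF and the TG respectively.
    time_index is the later of the two consecutive time points.
    """
    correlations = [pearson(tf, tg) for tf, tg in zip(tf_by_time, tg_by_time)]
    shifts = [abs(correlations[i + 1] - correlations[i]) for i in range(len(correlations) - 1)]
    best = max(range(len(shifts)), key=shifts.__getitem__)
    return best + 1, shifts[best]
```

A pair whose correlation flips sign between two time points, as in a TF switching from activating to repressing its target, would stand out under this rule.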
Contributors: Nyarige, Verah Mocheche (Author) / Liu, Li (Thesis advisor) / Wang, Junwen (Thesis advisor) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of infected host (LOIH) of viruses from public sequence databases such as GenBank and any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed for this task, along with the ambiguity of geographic locations, makes it especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for conducting in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis, are also challenging tasks, often requiring expert domain knowledge. Therefore, automated information extraction (IE) methods could help significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted the use of IE methods in this domain.

This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods were found to have a high accuracy and shown to be adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks.
Contributors: Tahsin, Tasnia (Author) / Gonzalez, Graciela (Thesis advisor) / Scotch, Matthew (Thesis advisor) / Runger, George C. (Committee member) / Arizona State University (Publisher)
Created: 2019
Description
Understanding intratumor heterogeneity and its driver genes is critical to designing personalized treatments and improving clinical outcomes of cancers. Such investigations require accurate delineation of the subclonal composition of a tumor, which to date can only be reliably inferred from deep-sequencing data (>300x depth). The algorithm resulting from the work presented here incorporates an adaptive error model into statistical decomposition of mixed populations, which corrects the mean-variance dependency of sequencing data at the subclonal level and enables accurate subclonal discovery in tumors sequenced at standard depths (30-50x). Tested on extensive computer simulations and real-world data, this new method, named model-based adaptive grouping of subclones (MAGOS), consistently outperforms existing methods on minimum sequencing depth, decomposition accuracy and computation efficiency. MAGOS supports subclone analysis using single nucleotide variants and copy number variants from one or more samples of an individual tumor. The GUST algorithm, on the other hand, is a novel method for detecting cancer-type-specific driver genes. Combining MAGOS and GUST results can provide insights into cancer progression. Applications of MAGOS and GUST to whole-exome sequencing data from samples of 33 different cancer types discovered a significant association between subclonal diversity, their drivers, and overall patient survival.
Contributors: Ahmadinejad, Navid (Author) / Liu, Li (Thesis advisor) / Maley, Carlo (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)
Created: 2019
Description
The severity of the health and economic devastation resulting from outbreaks of viruses such as Zika, Ebola, SARS-CoV-1 and, most recently, SARS-CoV-2 underscores the need for tools which aim to delineate critical disease dynamical features underlying observed patterns of infectious disease spread. The growing emphasis placed on genome sequencing to support pathogen outbreak response highlights the need to adapt traditional epidemiological metrics to leverage this increasingly rich data stream. Further, the rapidity with which pathogen molecular sequence data is now generated, coupled with the advent of sophisticated Bayesian statistical techniques for pathogen molecular sequence analysis, creates an unprecedented opportunity to disrupt and innovate public health surveillance using 21st century tools. Bayesian phylogeography is a modeling framework which assumes discrete traits -- such as age, location of sampling, or species -- evolve according to a continuous-time Markov chain process along a phylogenetic tree topology which is inferred from molecular sequence data.

While myriad studies exist which reconstruct patterns of discrete trait evolution along an inferred phylogeny, attempts to translate the results of phylogeographic analyses into actionable metrics that can be used by public health agencies to direct the development of interventions aimed at reducing pathogen spread are conspicuously absent from the literature. In this dissertation, I focus on developing an intuitive metric, the phylogenetic risk ratio (PRR), which I use to translate the results of Bayesian phylogeographic modeling studies into a form actionable by public health agencies. I apply the PRR to two case studies: i) age-associated diffusion of influenza A/H3N2 during the 2016-17 US epidemic and ii) host-associated diffusion of West Nile virus in the US. I discuss the limitations of this and other Bayesian phylogeographic approaches when studying non-geographic traits for which limited metadata is available in public molecular sequence databases, as well as statistically principled solutions to the missing metadata problem in the phylogenetic context. Then, I perform a simulation study to evaluate the statistical performance of the missing metadata solution. Finally, I provide a solution for researchers who are interested in using the PRR and phylogenetic UTMs in their own genomic epidemiological studies yet are deterred by the idiosyncratic, error-prone processes required to implement these methods using popular Bayesian phylogenetic inference software packages. My solution, Build-A-BEAST, is a publicly available, object-oriented system written in Python which aims to reduce the complexity and idiosyncrasy of creating the XML files necessary to perform the aforementioned analyses. This dissertation extends the conceptual framework of Bayesian phylogeographic methods, develops a method to translate the output of phylogenetic models into an actionable form, evaluates the use of priors for missing metadata, and, finally, provides a solution which eases the implementation of these methods. In doing so, I lay the foundation for future work in disseminating and implementing Bayesian phylogeographic methods for routine public health surveillance.
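The continuous-time Markov chain assumption mentioned in this abstract has a standard mathematical form, given here for orientation. The symbols are notation introduced for this note rather than taken from the dissertation: for a trait with K discrete states, Q is the K-by-K rate matrix and P(t) holds the probabilities of moving between states along a branch of length t.

```latex
P(t) = e^{Qt}, \qquad
Q = (q_{ij}), \quad q_{ij} \ge 0 \ \ (i \ne j), \quad q_{ii} = -\sum_{j \ne i} q_{ij}
```

Entry p_{ij}(t) of P(t) is the probability that a lineage in state i at the start of the branch is in state j at its end; each row of Q sums to zero, so each row of P(t) sums to one.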
Contributors: Vaiente, Matteo (Author) / Scotch, Matthew (Thesis advisor) / Mubayi, Anuj (Committee member) / Liu, Li (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
Biological processes such as cell growth, cell differentiation, development, and aging require a series of steps which are characterized by gene regulation. Studies have shown that gene regulation is the key to various traits and diseases. Various factors affect gene regulation, including genetic signals, epigenetic tracks, and genetic variants. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two different approaches to identifying these elements: TreeMap and DeepCORE. The first approach involves identifying putative causal genetic variants in cis-eQTL, accounting for multisite effects and genetic linkage at a locus. TreeMap performs an organized search for individual and multiple causal variants using a tree-guided nested machine learning method. DeepCORE, on the other hand, explores novel deep learning techniques that model the relationship between genetic, epigenetic and transcriptional patterns across tissues and cell lines and identify co-operative regulatory elements that affect gene regulation. These two methods are believed to be the link for genotype-phenotype association and a necessary step toward explaining various complex diseases and missing heritability.
Contributors: Chandrashekar, Pramod Bharadwaj (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)
Created: 2020