Theses and Dissertations
Displaying 1 - 2 of 2
Filtering by
- All Subjects: Biomedical Engineering
Description
While techniques for reading DNA in some capacity has been possible for decades,
the ability to accurately edit genomes at scale has remained elusive. Novel techniques
have been introduced recently to aid in the writing of DNA sequences. While writing
DNA is more accessible, it still remains expensive, justifying the increased interest in
in silico predictions of cell behavior. In order to accurately predict the behavior of
cells it is necessary to extensively model the cell environment, including gene-to-gene
interactions as completely as possible.
Significant algorithmic advances have been made for identifying these interactions,
but despite these improvements current techniques fail to infer some edges, and
fail to capture some complexities in the network. Much of this limitation is due to
heavily underdetermined problems, whereby tens of thousands of variables are to be
inferred using datasets with the power to resolve only a small fraction of the variables.
Additionally, failure to correctly resolve gene isoforms using short reads contributes
significantly to noise in gene quantification measures.
This dissertation introduces novel mathematical models, machine learning techniques,
and biological techniques to solve the problems described above. Mathematical
models are proposed for simulation of gene network motifs, and raw read simulation.
Machine learning techniques are shown for DNA sequence matching, and DNA
sequence correction.
Results provide novel insights into the low level functionality of gene networks. Also
shown is the ability to use normalization techniques to aggregate data for gene network
inference leading to larger data sets while minimizing increases in inter-experimental
noise. Results also demonstrate that high error rates experienced by third generation
sequencing are significantly different than previous error profiles, and that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for amending this DNA error that preserve the benefits of third generation sequencing.
the ability to accurately edit genomes at scale has remained elusive. Novel techniques
have been introduced recently to aid in the writing of DNA sequences. While writing
DNA is more accessible, it still remains expensive, justifying the increased interest in
in silico predictions of cell behavior. In order to accurately predict the behavior of
cells it is necessary to extensively model the cell environment, including gene-to-gene
interactions as completely as possible.
Significant algorithmic advances have been made for identifying these interactions,
but despite these improvements current techniques fail to infer some edges, and
fail to capture some complexities in the network. Much of this limitation is due to
heavily underdetermined problems, whereby tens of thousands of variables are to be
inferred using datasets with the power to resolve only a small fraction of the variables.
Additionally, failure to correctly resolve gene isoforms using short reads contributes
significantly to noise in gene quantification measures.
This dissertation introduces novel mathematical models, machine learning techniques,
and biological techniques to solve the problems described above. Mathematical
models are proposed for simulation of gene network motifs, and raw read simulation.
Machine learning techniques are shown for DNA sequence matching, and DNA
sequence correction.
Results provide novel insights into the low level functionality of gene networks. Also
shown is the ability to use normalization techniques to aggregate data for gene network
inference leading to larger data sets while minimizing increases in inter-experimental
noise. Results also demonstrate that high error rates experienced by third generation
sequencing are significantly different than previous error profiles, and that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for amending this DNA error that preserve the benefits of third generation sequencing.
ContributorsFaucon, Philippe Christophe (Author) / Liu, Huan (Thesis advisor) / Wang, Xiao (Committee member) / Crook, Sharon M (Committee member) / Wang, Yalin (Committee member) / Sarjoughian, Hessam S. (Committee member) / Arizona State University (Publisher)
Created2017
Description
Online social networks are the hubs of social activity in cyberspace, and using them to exchange knowledge, experiences, and opinions is common. In this work, an advanced topic modeling framework is designed to analyse complex longitudinal health information from social media with minimal human annotation, and Adverse Drug Events and Reaction (ADR) information is extracted and automatically processed by using a biased topic modeling method. This framework improves and extends existing topic modelling algorithms that incorporate background knowledge. Using this approach, background knowledge such as ADR terms and other biomedical knowledge can be incorporated during the text mining process, with scores which indicate the presence of ADR being generated. A case control study has been performed on a data set of twitter timelines of women that announced their pregnancy, the goals of the study is to compare the ADR risk of medication usage from each medication category during the pregnancy.
In addition, to evaluate the prediction power of this approach, another important aspect of personalized medicine was addressed: the prediction of medication usage through the identification of risk groups. During the prediction process, the health information from Twitter timeline, such as diseases, symptoms, treatments, effects, and etc., is summarized by the topic modelling processes and the summarization results is used for prediction. Dimension reduction and topic similarity measurement are integrated into this framework for timeline classification and prediction. This work could be applied to provide guidelines for FDA drug risk categories. Currently, this process is done based on laboratory results and reported cases.
Finally, a multi-dimensional text data warehouse (MTD) to manage the output from the topic modelling is proposed. Some attempts have been also made to incorporate topic structure (ontology) and the MTD hierarchy. Results demonstrate that proposed methods show promise and this system represents a low-cost approach for drug safety early warning.
In addition, to evaluate the prediction power of this approach, another important aspect of personalized medicine was addressed: the prediction of medication usage through the identification of risk groups. During the prediction process, the health information from Twitter timeline, such as diseases, symptoms, treatments, effects, and etc., is summarized by the topic modelling processes and the summarization results is used for prediction. Dimension reduction and topic similarity measurement are integrated into this framework for timeline classification and prediction. This work could be applied to provide guidelines for FDA drug risk categories. Currently, this process is done based on laboratory results and reported cases.
Finally, a multi-dimensional text data warehouse (MTD) to manage the output from the topic modelling is proposed. Some attempts have been also made to incorporate topic structure (ontology) and the MTD hierarchy. Results demonstrate that proposed methods show promise and this system represents a low-cost approach for drug safety early warning.
ContributorsYang, Jian (Author) / Gonzalez, Graciela (Thesis advisor) / Davulcu, Hasan (Thesis advisor) / Liu, Huan (Committee member) / Papotti, Paolo (Committee member) / Arizona State University (Publisher)
Created2017