Filtering by
- All Subjects: Bioinformatics
- Creators: Kusumi, Kenro
Adverse Drug Reaction (ADR) identification. Such methods employ a step of drug search followed by classification of the associated text as consisting an ADR or not. Although this method works efficiently for ADR classifications, if ADR evidence is present in users posts over time, drug mentions fail to capture such ADRs. It also fails to record additional user information which may provide an opportunity to perform an in-depth analysis for lifestyle habits and possible reasons for any medical problems.
Pre-market clinical trials for drugs generally do not include pregnant women, and so their effects on pregnancy outcomes are not discovered early. This thesis presents a thorough, alternative strategy for assessing the safety profiles of drugs during pregnancy by utilizing user timelines from social media. I explore the use of a variety of state-of-the-art social media mining techniques, including rule-based and machine learning techniques, to identify pregnant women, monitor their drug usage patterns, categorize their birth outcomes, and attempt to discover associations between drugs and bad birth outcomes.
The technique used models user timelines as longitudinal patient networks, which provide us with a variety of key information about pregnancy, drug usage, and post-
birth reactions. I evaluate the distinct parts of the pipeline separately, validating the usefulness of each step. The approach to use user timelines in this fashion has produced very encouraging results, and can be employed for a range of other important tasks where users/patients are required to be followed over time to derive population-based measures.
In order to address this, multiple comparative genomics and bioinformatics analyses were conducted to elucidate patterns of evolution in the green anole and across multiple anole species. Comparative genomics analyses were used to infer additional X-linked loci in the green anole, RNAseq data from male and female samples were anayzed to quantify patterns of sex-biased gene expression across the genome, and the extent of dosage compensation on the anole X chromosome was characterized, providing evidence that the sex chromosomes in the green anole are dosage compensated.
In addition, X-linked genes have a lower ratio of nonsynonymous to synonymous substitution rates than the autosomes when compared to other Anolis species, and pairwise rates of evolution in genes across the anole genome were analyzed. To conduct this analysis a new pipeline was created for filtering alignments and performing batch calculations for whole genome coding sequences. This pipeline has been made publicly available.
This work presents ADRMine that uses a Conditional Random Field (CRF) sequence tagger for extraction of complex health-related concepts. It utilizes a large volume of unlabeled user posts for automatic learning of embedding cluster features, a novel application of deep learning in modeling the similarity between the tokens. ADRMine significantly improved the medical NER performance compared to the baseline systems.
This work also presents DeepHealthMiner, a deep learning pipeline for health-related concept extraction. Most of the machine learning methods require sophisticated task-specific manual feature design which is a challenging step in processing the informal and noisy content of social media. DeepHealthMiner automatically learns classification features using neural networks and utilizing a large volume of unlabeled user posts. Using a relatively small labeled training set, DeepHealthMiner could accurately identify most of the concepts, including the consumer expressions that were not observed in the training data or in the standard medical lexicons outperforming the state-of-the-art baseline techniques.
Agassiz’s desert tortoise (Gopherus agassizii) is a long-lived species native to the Mojave Desert and is listed as threatened under the US Endangered Species Act. To aid conservation efforts for preserving the genetic diversity of this species, we generated a whole genome reference sequence with an annotation based on deep transcriptome sequences of adult skeletal muscle, lung, brain, and blood. The draft genome assembly for G. agassizii has a scaffold N50 length of 252 kbp and a total length of 2.4 Gbp. Genome annotation reveals 20,172 protein-coding genes in the G. agassizii assembly, and that gene structure is more similar to chicken than other turtles. We provide a series of comparative analyses demonstrating (1) that turtles are among the slowest-evolving genome-enabled reptiles, (2) amino acid changes in genes controlling desert tortoise traits such as shell development, longevity and osmoregulation, and (3) fixed variants across the Gopherus species complex in genes related to desert adaptations, including circadian rhythm and innate immune response. This G. agassizii genome reference and annotation is the first such resource for any tortoise, and will serve as a foundation for future analysis of the genetic basis of adaptations to the desert environment, allow for investigation into genomic factors affecting tortoise health, disease and longevity, and serve as a valuable resource for additional studies in this species complex.
Data Availability: All genomic and transcriptomic sequence files are available from the NIH-NCBI BioProject database (accession numbers PRJNA352725, PRJNA352726, and PRJNA281763). All genome assembly, transcriptome assembly, predicted protein, transcript, genome annotation, repeatmasker, phylogenetic trees, .vcf and GO enrichment files are available on Harvard Dataverse (doi:10.7910/DVN/EH2S9K).
This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods were found to have a high accuracy and shown to be adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks.