Bayesian Biogeographical Analyses with Beast: Assessment Using Simulated Data

Description

Biogeography is the study of the spatial distribution of the earth's biota, both in the present and the past. Traditionally, biogeographical studies have relied on a combination of surveys of existing populations, fossil evidence, and the geological record of the earth. However, with the advent of relatively inexpensive methods of DNA sequencing, it is now possible to use information concerning the genetic relatedness of individuals in populations to address questions about how those populations came to be where they are today. For example, biogeographical studies of HIV-1 provide strong support for the hypothesis that this virus arose in Africa through a host switch from chimpanzees to humans and only began to spread to human populations located on other continents some 60 to 70 years ago (Sharp & Hahn, 2010).

Contributors: Zheng, Wenyu (Author) / Taylor, Jesse (Thesis director) / Escalante, Ananias (Committee member) / Thieme, Horst (Committee member) / Barrett, The Honors College (Contributor)

Created: 2015-05

Evolution of multigene families and single copy genes in Plasmodium spp

Description

The complex life cycle and widespread range of infection of Plasmodium parasites, the causal agents of malaria in humans, make them the perfect organisms for the study of various evolutionary mechanisms. In particular, multigene families are considered one of the main sources of genome adaptability and innovation. Within Plasmodium, numerous species- and clade-specific multigene families have major functions in the development and maintenance of infection. Nonetheless, while the evolutionary mechanisms predominant in many species- and clade-specific multigene families have been previously studied, far fewer studies have been dedicated to analyzing genus common multigene families (GCMFs). I studied the patterns of natural selection and recombination in 90 GCMFs with diverse numbers of gene gain/loss events. I found that the majority of GCMFs were formed by duplication events that predate speciation of mammalian Plasmodium species, with many paralogs being neutrally maintained thereafter. In general, multigene families involved in immune evasion and host cell invasion commonly showed signs of positive selection and species-specific gain/loss events, particularly in Plasmodium species in the simian and rodent clades. One particular multigene family, the merozoite surface protein-7 (msp7) family, is found in all Plasmodium species and has functions related to erythrocyte invasion. Within Plasmodium vivax, differences in the number of paralogs in this multigene family have been previously explained, at least in part, as potential adaptations to the human host. To investigate this, I studied msp7 orthologs in closely related non-human primate parasites where homology was evident. I also estimated the paralogs’ evolutionary history and genetic polymorphism. The emerging patterns were compared with those of Plasmodium falciparum.
I found that the evolution of the msp7 multigene family is consistent with a Birth-and-Death model in which duplications, pseudogenization, and gene loss events are common. In order to study additional aspects of the evolution of Plasmodium, I evaluated the trends of long-term and short-term evolution and the putative effects of the vertebrate host’s immune pressure on gametocytes across various Plasmodium species. Gametocytes represent the only sexual stage within the Plasmodium life cycle and are also the transition stage from the vertebrate host to the mosquito vector. I found that, while male and female gametocytes showed different levels of immunogenicity, signs of positive selection were not entirely related to the location and presence of immune epitope regions. Overall, these studies further highlight the complex evolutionary patterns observed in Plasmodium.

Contributors: Castillo Siri, Andreina I (Author) / Rosenberg, Michael (Thesis advisor) / Escalante, Ananias (Committee member) / Taylor, Jesse (Committee member) / Collins, James (Committee member) / Arizona State University (Publisher)

Created: 2016

Methods for Detecting Mutations in Non-model Organisms

Description

Next-generation sequencing is a powerful tool for detecting genetic variation. However, it is also error-prone, with error rates that are much larger than mutation rates. This can make mutation detection difficult; and while increasing sequencing depth can often help, sequence-specific errors and other non-random biases cannot be detected by increased depth. The problem of accurate genotyping is exacerbated when there is not a reference genome or other auxiliary information available.
I explore several methods for sensitively detecting mutations in non-model organisms using an example Eucalyptus melliodora individual. I use the structure of the tree to find bounds on its somatic mutation rate and evaluate several algorithms for variant calling. I find that conventional methods are suitable if the genome of a close relative can be adapted to the study organism. However, with structured data, a likelihood framework that is aware of this structure is more accurate. I use the techniques developed here to evaluate a reference-free variant calling algorithm.
I also use this data to evaluate a k-mer based base quality score recalibrator (KBBQ), a tool I developed to recalibrate base quality scores attached to sequencing data. Base quality scores can help detect errors in sequencing reads, but are often inaccurate. The most popular method for correcting this issue requires a known set of variant sites, which is unavailable in most cases. I simulate data and show that errors in this set of variant sites can cause calibration errors. I then show that KBBQ accurately recalibrates base quality scores while requiring no reference or other information and performs as well as other methods.
Finally, I use the Eucalyptus data to investigate the impact of quality score calibration on the quality of output variant calls and show that improved base quality score calibration increases the sensitivity and reduces the false positive rate of a variant calling algorithm.
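
As background for the base quality scores discussed in the abstract above, the following is a minimal sketch of the standard Phred scale, Q = -10 log10(p), and of what "calibration" means in that setting: comparing a reported quality with the quality implied by an observed error rate. It is not KBBQ's method, which the abstract does not detail; the function names and the add-one smoothing are assumptions chosen only for illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


def empirical_quality(n_errors, n_bases):
    """Quality implied by an observed error count among n_bases base calls.

    Add-one smoothing (an illustrative assumption) keeps the estimate finite
    when no errors are observed.
    """
    p = (n_errors + 1) / (n_bases + 2)
    return error_prob_to_phred(p)


def calibration_gap(reported_q, n_errors, n_bases):
    """Reported minus empirical quality; positive means overconfident scores."""
    return reported_q - empirical_quality(n_errors, n_bases)
```

For instance, bases reported at Q30 that actually show about 1 error per 100 calls have an empirical quality near Q20, so `calibration_gap` would be close to 10; a well-calibrated score would give a gap near zero.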

Contributors: Orr, Adam James (Author) / Cartwright, Reed (Thesis advisor) / Wilson, Melissa (Committee member) / Kusumi, Kenro (Committee member) / Taylor, Jesse (Committee member) / Pfeifer, Susanne (Committee member) / Arizona State University (Publisher)

Created: 2020

Statistical Sequence Alignment of Protein Coding Regions

Description

Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequence artifacts and errors made in alignment reconstruction can impact downstream analyses, leading to erroneous conclusions in comparative and functional genomic studies. While such errors are eventually fixed in the reference genomes of model organisms, many genomes used by researchers contain these artifacts, often forcing researchers to discard large amounts of data to prevent artifacts from impacting results. I developed COATi, a statistical, codon-aware pairwise aligner designed to align protein-coding sequences in the presence of artifacts commonly introduced by sequencing or annotation errors, such as early stop codons and abiological frameshifts. Unlike common sequence aligners, which rely on amino acid translations, only model insertions and deletions between codons, or lack a statistical model, COATi combines a codon substitution model specifically designed for protein-coding regions, a complex insertion-deletion model, and a sequencing base calling error step. The alignment algorithm is based on finite state transducers (FSTs), computational machines well-suited for modeling sequence evolution. I show that COATi outperforms available methods using a simulated empirical pairwise alignment dataset as a benchmark. The FST-based model and alignment algorithm in COATi are resource-intensive for sequences longer than a few kilobases. To address this constraint, I developed an approximate model compatible with traditional dynamic programming alignment algorithms. I describe how the original codon substitution model is transformed to build an approximate model and how the alignment algorithm is implemented by modifying the popular Gotoh algorithm. I simulated a benchmark of alignments and measured how well the marginal models approximate the original method.
Finally, I present a novel tool for analyzing sequence alignments. Available metrics can measure the similarity between two alignments or the column uncertainty within an alignment but cannot produce a site-specific comparison of two or more alignments. AlnDotPlot is an R software package inspired by traditional dot plots that can provide valuable insights when comparing pairwise alignments. I describe AlnDotPlot and showcase its utility in displaying a single alignment, comparing different pairwise alignments, and summarizing alignment space.
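
For readers unfamiliar with the Gotoh algorithm named in the abstract above, the following is a minimal sketch of the textbook three-matrix affine-gap recursion for global pairwise alignment. It scores plain nucleotide matches and mismatches and is not COATi's modified version or its codon model; the function name `gotoh_score` and all scoring parameters are assumptions chosen for illustration.

```python
def gotoh_score(a, b, match=1.0, mismatch=-1.0, gap_open=-2.0, gap_extend=-0.5):
    """Global alignment score with affine gap penalties (textbook Gotoh).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap from M or Y, or extend an existing gap in X.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            # Open a new gap from M or X, or extend an existing gap in Y.
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping separate matrices for "in a gap" states is what lets the recursion charge less for extending a gap than for opening one while still running in time proportional to the product of the sequence lengths. With the default parameters assumed here, aligning `"ACGT"` to `"AT"` prefers one gap of length two over two separate gaps.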

Contributors: Garcia Mesa, Juan Jose (Author) / Cartwright, Reed A (Thesis advisor) / Taylor, Jesse (Committee member) / Pavlic, Theodore (Committee member) / Ozkan, Banu (Committee member) / Arizona State University (Publisher)

Created: 2023
