An Analysis of the Benchmark Test lzbench for Open-Source Compressors

Description

With the rising data output and falling costs of Next Generation Sequencing technologies, research into data compression is crucial to maintaining storage efficiency and controlling costs. High-throughput sequencers such as the HiSeqX Ten can produce up to 1.8 terabases of data per run, and such large storage demands are even more important to consider for institutions that rely on their own servers rather than large data centers (cloud storage) [1]. Compression algorithms aim to reduce the space taken up by large genomic datasets by encoding the most frequently occurring symbols with the shortest bit codewords and by reordering the data to make it easier to encode. Depending on the probability distribution of the symbols in the dataset or the structure of the data, choosing the wrong algorithm could produce a compressed file larger than the original, or a poorly compressed file that wastes time and space [2]. To test the efficiency of compression algorithms for each file type, 37 open-source compression algorithms were used to compress six types of genomic datasets (FASTA, VCF, BCF, GFF, GTF, and SAM) and were evaluated on compression speed, decompression speed, compression ratio, and file size using the benchmark test lzbench. Compressors that outperformed the popular bioinformatics compressor Gzip (zlib -6) were evaluated against one another by ratio and speed for each file type and across the geometric means of all file types. Compressors that exhibited fast compression and decompression speeds were also evaluated by transmission time through variable-speed internet pipes in scenarios where the file was compressed only once or compressed multiple times.
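
The metrics named in this description can be sketched in a few lines of Python. This is a minimal illustration and not lzbench itself: it uses Python's standard `zlib` module at level 6 as a stand-in for the Gzip baseline, and the helper names (`compression_ratio`, `geometric_mean`, `transfer_seconds`) and the sample record are assumptions made for this example. The ratio is defined here as original size divided by compressed size; lzbench may report it differently.

```python
import math
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Original size divided by compressed size (higher is better)."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)


def geometric_mean(values):
    """Geometric mean, used to summarise ratios across file types."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time needed to push a file of the given size through a link."""
    return size_bytes / bandwidth_bytes_per_s


# Hypothetical FASTA-like record: highly repetitive, so it compresses well.
sample = b">seq1\n" + b"ACGTACGTAC" * 1000 + b"\n"
ratio = compression_ratio(sample)
print(f"zlib -6 ratio: {ratio:.1f}")
```

The geometric mean is used rather than the arithmetic mean so that a single file type with a very high ratio does not dominate the summary.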

Date Created
  • 2017-05

Improving the Valley Fever Gene Annotation Through Proteogenomic Analysis

Description

Valley Fever, also known as coccidioidomycosis, is a respiratory disease that affects 10,000 people annually, primarily in Arizona and California. Due to a lack of gene annotation, diagnosis and treatment of Valley Fever are severely limited. In turn, gene annotation efforts are also hampered by incomplete genome sequencing. We intend to use proteogenomic analysis to reannotate the Coccidioides posadasii str. Silveira genome from protein-level data. Protein samples extracted from both phases of Silveira were fragmented into peptides, sequenced, and compared against databases of known and predicted protein sequences, as well as a de novo six-frame translation of the genome. We located 288 unique peptides that did not match a known Silveira annotation, and of those, 169 were associated with another Coccidioides strain. Additionally, 17 peptides were found at the boundary of, or outside of, the current gene annotation, comprising four distinct clusters. For one of these clusters, we were able to calculate a lower bound and an estimate for the size of the gap between two Silveira contigs using the Coccidioides immitis RS transcript associated with that cluster's peptides; these predictions were consistent with the current annotation's scaffold structure. Three peptides were associated with an actively translated transposon, and a putative active site was located within an intact LTR retrotransposon. We note that gene annotation is necessarily hindered by the quality and level of detail in prior genome sequencing efforts, and recommend that future studies involving reannotation include additional sequencing as well as gene annotation via proteogenomics or other methods.
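
A six-frame translation reads a DNA sequence in three offsets on each strand, so peptides can be matched regardless of where annotated genes begin. The sketch below is a generic illustration and not the thesis's software; it assumes the standard genetic code, and the function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])
```

Stop codons appear as `*`, so each frame can be split on `*` to obtain candidate open reading frames for peptide matching.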

Date Created
  • 2016-12

Neoantigen Prediction Pipeline

Description

Cells become cancerous due to changes in their genetic makeup. In cancers, an altered amino acid due to a tumor mutation can result in proteins that are identified as "foreign" by the immune system. An MHC molecule will bind to these "foreign" peptide fragments, also called neoantigens. There are two classes of MHC molecules. While the MHC I complex is found in all cells with a nucleus, MHC II complexes are mostly found in antigen-presenting cells (APCs), such as macrophages, B cells, and dendritic cells. The MHC molecule then presents the neoantigen on the cell's surface. If an immune cell, such as a T cell, is able to bind to the neoantigen, it can then destroy the tumor cell. However, there are molecules that act as checkpoints on certain immune cells and that have to be activated or inactivated to start an immune response. This ensures that healthy cells are not killed. Sometimes, however, cancer cells find ways to use these checkpoints to avoid being attacked. One immunotherapy that has had clinical success is checkpoint blockade inhibition, which blocks the activity of immune checkpoint proteins in order to release the "brakes" on the immune system and increase its ability to destroy cancer cells. Studies have found a correlation between mutational load and response to immunotherapy. The goal of this project was to create a pipeline that identifies tumor neoantigens, which involved researching various software tools and implementing them to work together. The resulting neoantigen prediction pipeline works with TGen's genomics pipeline to help understand a patient's immune response. The pipeline first creates two protein FASTA files from the high-quality non-synonymous mutations, frameshifts, codon insertions, and codon deletions reported by vcfmerger. One of the protein FASTA files includes the mutations, while the other does not and so represents the wildtype protein.
The pipeline then predicts both classes of HLA genotypes of the MHC molecules using DNA or RNA expression data in the form of FASTQ files. The protein FASTA files and each HLA type are fed into IEDB to obtain peptide-MHC binding predictions. Wildtype peptides and neoantigens with low binding affinities are then removed. RNA expression information is then added to the final text file from the dseq and sailfish files produced by TGen's genomics pipeline.
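
The peptide-generation and filtering steps can be illustrated with a short sketch. It is not the pipeline's code and does not call IEDB: the function names, the 9-residue window, and the 500 nM IC50 cutoff are assumptions for this example, and the affinity scores would come from an external predictor. Only single-residue substitutions (equal-length wildtype and mutant proteins) are handled.

```python
def peptides_spanning(protein: str, position: int, length: int = 9):
    """All windows of `length` residues that include the 0-based `position`."""
    first = max(0, position - length + 1)
    last = min(position, len(protein) - length)
    return [protein[s:s + length] for s in range(first, last + 1)]


def candidate_neoantigens(wildtype: str, mutant: str, position: int, length: int = 9):
    """Pair each mutant peptide with the wildtype peptide at the same offset."""
    return list(zip(peptides_spanning(mutant, position, length),
                    peptides_spanning(wildtype, position, length)))


def keep_strong_binders(scored, max_ic50_nm: float = 500.0):
    """Keep peptides whose predicted IC50 (nM) is at or below the cutoff.

    A lower IC50 means stronger binding; the cutoff is an assumed value.
    """
    return [peptide for peptide, ic50 in scored if ic50 <= max_ic50_nm]


# Hypothetical protein with a K -> E substitution at 0-based position 7.
wildtype = "MKTAYIAKQRQISFVK"
mutant = "MKTAYIAEQRQISFVK"
pairs = candidate_neoantigens(wildtype, mutant, position=7)
print(len(pairs), pairs[0])
```

Keeping the wildtype counterpart next to each mutant peptide makes it straightforward to discard candidates whose wildtype form binds equally well.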

Date Created
  • 2017-05

A review of pathway-based visualization and quantification analysis tools using microarray data

Description

Pathway analysis helps researchers gain insight into the biology behind gene expression-based data. By applying this data to known biological pathways, we can learn about mutations or other changes in cellular function, such as those seen in cancer. There are many tools that can be used to analyze pathways; however, it can be difficult to find out which tool is optimal for a given experiment. This thesis aims to comprehensively review four tools, Cytoscape, PaxtoolsR, PathOlogist, and Reactome, and their role in pathway analysis. This is done by applying a known microarray data set to each tool and testing their different functions. The functions of these programs are then analyzed to determine their roles in learning about biology and assisting new researchers with their experiments. It was found that each tool holds a unique and important role in pathway analysis. Visualization pathways have the role of exploring individual pathways and interpreting genomic results. Quantification pathways use statistical tests to determine pathway significance. Used together, they allow one to find pathways of interest and then explore areas of interest within them.
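
As an illustration of a statistical test for pathway significance, the sketch below computes a hypergeometric over-representation p-value. The description does not say which tests the reviewed tools use, so this choice, the function name, and the example numbers are all assumptions.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, selected_genes, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(total, pathway, selected)."""
    upper = min(pathway_genes, selected_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, selected_genes)


# Hypothetical experiment: 20,000 genes on the array, 100 in the pathway,
# 500 differentially expressed, 10 of which fall in the pathway.
p = enrichment_p_value(20000, 100, 500, 10)
print(f"p = {p:.3g}")
```

A small p-value indicates that the pathway contains more differentially expressed genes than would be expected by chance.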

Date Created
  • 2020-05

Evaluating variant calling best practices

Description

Analyzing human DNA sequence data allows researchers to identify variants associated with disease, reconstruct the demographic histories of human populations, and further understand the structure and function of the genome. Identifying variants in whole genome sequences is a crucial bioinformatics step in sequence data processing and can be performed using multiple approaches. To investigate the consistency between different bioinformatics methods, we compared the accuracy and sensitivity of two genotyping strategies, joint variant calling and single-sample variant calling. Autosomal and sex chromosome variant call sets were produced by joint and single-sample variant calling for 10 female individuals. The accuracy of variant calls was assessed using SNP array genotype data collected from each individual. To compare the ability of joint and single-sample calling to capture low-frequency variants, folded site frequency spectra were constructed from the variant call sets. To investigate the potential for these different variant calling methods to impact downstream analyses, we estimated nucleotide diversity for call sets produced using each approach. We found that while both methods were equally accurate when validated by SNP array sites, single-sample calling identified a greater number of singletons. However, estimates of nucleotide diversity were robust to these differences in the site frequency spectrum between call sets. Our results suggest that despite single-sample calling’s greater sensitivity for low-frequency variants, the differences between approaches have a minimal effect on downstream analyses. While joint calling may be a more efficient approach for genotyping many samples, in situations that preclude large sample sizes, our study suggests that single-sample calling is a suitable alternative.
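
The two summary statistics mentioned here, the folded site frequency spectrum and nucleotide diversity, can be sketched as follows. This is a minimal illustration for biallelic sites with no missing data; the function names and the example allele counts are assumptions and do not come from the study.

```python
from collections import Counter


def folded_sfs(alt_counts, n_chromosomes):
    """Histogram of minor-allele counts over variable sites."""
    sfs = Counter()
    for k in alt_counts:
        minor = min(k, n_chromosomes - k)
        if minor > 0:  # skip sites that are fixed in the sample
            sfs[minor] += 1
    return dict(sfs)


def nucleotide_diversity(alt_counts, n_chromosomes, sequence_length):
    """Mean number of pairwise differences per site (pi)."""
    pairs = n_chromosomes * (n_chromosomes - 1) / 2
    differences = sum(k * (n_chromosomes - k) for k in alt_counts)
    return differences / pairs / sequence_length


# Hypothetical call set: 10 diploid individuals give 20 chromosomes;
# each entry is the alternate-allele count at one variant site.
alt_counts = [1, 19, 5, 10, 20]
sfs = folded_sfs(alt_counts, 20)
pi = nucleotide_diversity(alt_counts, 20, sequence_length=1000)
print(sfs, round(pi, 6))
```

Singletons fall in the first bin of the folded spectrum (`sfs[1]`). Each singleton adds only `1 * (n - 1)` pairwise differences, which helps explain why extra singletons change nucleotide diversity very little.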

Date Created
  • 2020-05

Conservation of m6A in evolving long-term E. coli populations

Description

Many factors are at play within the genome of an organism, contributing to much of the diversity and variation across the tree of life. While the genome is generally encoded by four nucleotides, A, C, T, and G, this code can be expanded. One particular mechanism that we examine in this thesis is the modification of bases, specifically methylation of adenine (m6A) within the GATC motif of Escherichia coli. These methylated adenines are especially important in a process called methyl-directed mismatch repair (MMR), a pathway responsible for repairing errors in the DNA sequence produced by replication. In this pathway, methylated adenines identify the parent strand and direct the repair proteins to correct the erroneous base in the daughter strand. While the primary role of methylated adenines at GATC sites is to direct the MMR pathway, this methylation has also been found to affect other processes, such as gene expression, the activity of transposable elements, and the timing of DNA replication. However, in the absence of MMR, the ability of these other processes to maintain adenine methylation and its targets is unknown.
To determine whether disruption of the MMR pathway results in reduced conservation of methylated adenines, as well as an increased tolerance for mutations that result in the loss of existing GATC sites or the gain of new ones, we used whole-genome sequencing to survey individual clones isolated from experimentally evolving wild-type and MMR-deficient (mutL-; conferring a 150x increase in mutation rate) populations of E. coli. Initial analysis revealed a lack of mutations affecting methylation sites (GATC tetranucleotides) in wild-type clones. However, the inherently low mutation rates conferred by the wild-type background render this result inconclusive, due to a lack of statistical power, and reveal a need for a more direct measure of changes in methylation status. Thus, as a first step toward comparative methylomics, we benchmarked four different methylation-calling pipelines on three biological replicates of the wild-type progenitor strain for our evolved populations.
While it is understood that these methylated sites play a role in the MMR pathway, the full extent of their effect on the genome is not yet understood. Thus, the goal of this thesis was to better understand the forces that maintain the genome, specifically concerning m6A within the GATC motif.
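
Detecting mutations that create or destroy GATC sites can be sketched as follows. This is an illustration only, not the thesis's analysis code; the function names and the toy sequence are assumptions. Because GATC is its own reverse complement, scanning a single strand is sufficient.

```python
def gatc_sites(genome: str):
    """0-based start positions of every GATC tetranucleotide."""
    genome = genome.upper()
    sites, start = [], genome.find("GATC")
    while start != -1:
        sites.append(start)
        start = genome.find("GATC", start + 1)
    return sites


def site_changes(genome: str, position: int, new_base: str):
    """GATC sites lost and gained by a single-base substitution."""
    mutated = genome[:position] + new_base + genome[position + 1:]
    before, after = set(gatc_sites(genome)), set(gatc_sites(mutated))
    return sorted(before - after), sorted(after - before)


# Hypothetical sequence with one GATC site and one near-miss (GATG).
genome = "AAGATCTTGATGAA"
print(gatc_sites(genome))
print(site_changes(genome, 11, "C"))  # G -> C turns GATG into GATC
```

Applying `site_changes` to each substitution called in an evolved clone gives counts of lost and gained sites that can be compared between wild-type and MMR-deficient backgrounds.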

Date Created
  • 2020-05

Utilizing MRI Texture Analysis and APOE Genotype to Predict the Aging Brain as a Potential Method for Early Assessment of Alzheimer's Disease

Description

Background: Noninvasive MRI methods that can accurately detect subtle brain changes are highly desirable when studying disease-modifying interventions. Texture analysis is a novel imaging technique that extracts a large number of image features with high specificity and predictive power. In this investigation, we use texture analysis to assess and classify age-related changes in the right and left hippocampal regions, the areas known to show some of the earliest changes in Alzheimer's disease (AD). The e4 allele of apolipoprotein E (APOE) confers an increased risk for AD, so studying differences in APOE e4 carriers may help to ascertain subtle brain changes before there has been an obvious change in behavior. We examined texture analysis measures that predict age-related changes, which reflect atrophy, in a group of cognitively normal individuals. We hypothesized that the APOE e4 carriers would exhibit significant age-related differences in texture features compared to non-carriers, so that the predictive texture features hold promise for early assessment of AD.
Methods: 120 normal adults between the ages of 32 and 90 were recruited for this neuroimaging study from a larger parent study at Mayo Clinic Arizona studying longitudinal cognitive functioning (Caselli et al., 2009). As part of the parent study, the participants were genotyped for APOE genetic polymorphisms and received comprehensive cognitive testing every two years, on average. Neuroimaging was done at Barrow Neurological Institute, and a 3D T1-weighted magnetic resonance image was obtained during scanning that allowed for subsequent texture analysis processing. Voxel-based features of the appearance, structure, and arrangement of these regions of interest were extracted using the Mayo Clinic Python Texture Analysis Pipeline (pyTAP). Algorithms applied in feature extraction included Grey-Level Co-Occurrence Matrix (GLCM), Gabor Filter Banks (GFB), Local Binary Patterns (LBP), Discrete Orthogonal Stockwell Transform (DOST), and Laplacian-of-Gaussian Histograms (LoGH). Principal component (PC) analysis was used to reduce the dimensionality of the algorithmically selected features to 13 PCs. A stepwise forward regression model was used to determine the effect of APOE status (APOE e4 carriers vs. noncarriers) and the texture feature principal components on age (as a continuous variable). After identification of 5 significant predictors of age in the model, the individual feature coefficients of those principal components were examined to determine which features contributed most significantly to the prediction of an aging brain.
Results: 70 texture features were extracted for the two regions of interest in each participant's scan. The texture features were coded as 70 initial components and were rotated to generate 13 principal components (PCs) that accounted for 75% of the variance in the dataset by scree plot analysis. The forward stepwise regression model used in this exploratory study significantly predicted age, accounting for approximately 40% of the variance in the data. The regression model revealed 5 significant regressors (2 right PCs, APOE status, and 2 left PC by APOE interactions). Finally, the specific texture features that contributed to each significant PC were identified.
Conclusion: Analysis of image texture features resulted in a statistical model that was able to detect subtle changes in brain integrity associated with age in a group of participants who are cognitively normal but have an increased risk of developing AD based on the presence of the APOE e4 phenotype. This is an important finding, given that detecting subtle changes in regions vulnerable to the effects of AD could allow certain texture features to serve as noninvasive, sensitive biomarkers predictive of AD. Even with only a small number of patients, the ability to determine sensitive imaging biomarkers could facilitate great improvement in the speed of detection and the effectiveness of AD interventions.
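
Of the feature-extraction algorithms listed, the grey-level co-occurrence matrix is the simplest to show in code. The sketch below is a generic 2D illustration and does not reproduce pyTAP; the function names, the single horizontal offset, and the tiny 3-level image are assumptions for this example.

```python
def glcm(image, levels, dx=1, dy=0):
    """Co-occurrence counts of grey levels for pixel pairs offset by (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """GLCM contrast: sum of (i - j)^2 weighted by normalised counts."""
    total = sum(map(sum, matrix))
    return sum(
        (i - j) ** 2 * count / total
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# Hypothetical image quantised to three grey levels (0, 1, 2).
image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
matrix = glcm(image, levels=3)
print(matrix, round(contrast(matrix), 3))
```

Contrast is one of several scalar features commonly derived from a GLCM; features like this, computed over a region of interest, are the kind of input that a principal component analysis would then reduce.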

Date Created
  • 2016-05

Identifying Novel Nanobodies for Traumatic Brain Injury Therapeutics

Description

Traumatic brain injury (TBI) is a serious health problem around the world with few available treatments. TBI pathology can be divided into two phases: the primary insult and the secondary injury. The primary insult results from the bump or blow to the head that causes the initial injury. Secondary injury lasts from hours to months after the initial injury and worsens the primary insult, creating a greater area of tissue damage and cell death. Many current treatments focus on lessening the severity of secondary injury. Secondary injury results from the cyclical nature of tissue damage: inflammatory pathways cause damage to tissue, which in turn reinforces inflammation. Since many inflammatory pathways are interconnected, targeting individual products within these pathways is impractical. A target at the beginning of the pathway, such as a receptor, must be chosen to break the cycle. This project aims to identify novel nanobodies that could temporarily inactivate the CD36 receptor, which is found on many immune and endothelial cells. CD36 initiates and perpetuates the immune system's inflammatory responses. By inactivating this receptor temporarily, inflammation and immune cell entry could be lessened, and therefore secondary injury could be attenuated. This project used phage display as a method of nanobody selection. The specific phage library used in this experiment consists of human heavy chain (V_H) segments, also known as domain antibodies (dAbs), displayed on M13 filamentous bacteriophage. Phage display mimics the process of immune selection. The target is bound to a well as a means of displaying it to the phage. The phage library is then incubated with the target to allow antibodies to bind. Afterward, the well is washed thoroughly to detach any phage that are not strongly bound. The remaining phage are then amplified in bacteria and run again through the same assay to select for mutations that resulted in higher-affinity binding.
This process, called biopanning, was performed three times for this project. After biopanning, the library was sequenced using next-generation sequencing (NGS). This platform enables the entire library to be sequenced, as opposed to traditional Sanger sequencing, which can only sequence single selected clones at a time, thereby limiting population sampling. This type of genetic sequencing allows trends in the complementarity-determining regions (CDRs) of the domain antibody library to be analyzed using bioinformatics programs such as RStudio, FastAptamer, and Swiss Model. Ultimately, two nanobody candidates were identified for the CD36 receptor.
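
One way to look for trends in sequenced CDRs is to compare how often each sequence appears before and after selection. The sketch below assumes that reads from two time points are available, which the description does not state; the function names, the pseudo-frequency for unseen sequences, and the example reads are all hypothetical.

```python
from collections import Counter


def clone_frequencies(reads):
    """Fraction of reads carrying each CDR sequence."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def enrichment(before_reads, after_reads, pseudo=1e-6):
    """Fold change in frequency for each sequence seen after selection.

    Sequences absent before selection are given the assumed `pseudo`
    frequency to avoid division by zero.
    """
    before = clone_frequencies(before_reads)
    after = clone_frequencies(after_reads)
    return {seq: freq / before.get(seq, pseudo) for seq, freq in after.items()}


# Hypothetical CDR3 amino-acid sequences read from two libraries.
before_reads = ["ARDYW", "ARDYW", "AKGGF", "ASSLY"]
after_reads = ["AKGGF", "AKGGF", "AKGGF", "ARDYW"]
folds = enrichment(before_reads, after_reads)
print(sorted(folds.items(), key=lambda item: item[1], reverse=True))
```

Sequences with the largest fold change are natural candidates for follow-up, because selection should enrich strong binders relative to the starting library.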

Date Created
  • 2018-05

ConstrictR and ConstrictPy: R Package and Python Tool for Microbiome Analysis

Description

I, Christopher Negrich, am the sole author of this paper, but the tools described were designed in collaboration with Andrew Hoetker. ConstrictR (constrictor) and ConstrictPy are an R package and a Python tool designed together. ConstrictPy implements the functions and methods defined in ConstrictR and adds data handling, data parsing, input/output (I/O), and a user interface to increase usability. ConstrictR implements a variety of common data analysis methods used for statistical and subnetwork analysis. The majority of these methods are inspired by Lionel Guidi's 2016 paper, Plankton networks driving carbon export in the oligotrophic ocean. Additional methods were added to expand functionality, usability, and applicability to different areas of data science. Both ConstrictR and ConstrictPy are currently publicly available and usable; however, they are both ongoing projects. ConstrictR is available at github.com/cnegrich and ConstrictPy is available at github.com/ahoetker. Currently, ConstrictR has implemented functions for descriptive statistics, correlation, covariance, rank, sparsity, and weighted correlation network analysis, with clustering, centrality, profiling, error handling, and data parsing methods to be released soon. ConstrictPy has fully implemented and integrated the features in ConstrictR and has added functions for I/O and for conversion between pandas and R data frames, with a full-featured user interface to be released soon. Both ConstrictR and ConstrictPy are designed to work with minimal dependencies and maximum available information on the algorithms implemented. As a result, ConstrictR depends only on base R (v3.4.4) functions, with no libraries imported. ConstrictPy depends only on pandas, Rpy2, and ConstrictR. This was done to increase the longevity and independence of these tools. Additionally, all mathematical information is documented alongside the code, increasing the available information on how these tools function.
Although neither tool is in its final version, this paper documents the code, mathematics, and instructions for use, in addition to plans for future work, for the current versions of ConstrictR (v0.0.1) and ConstrictPy (v0.0.1).
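
To make the listed statistics concrete, the sketch below implements covariance, Pearson correlation, and ranking using only the Python standard library, in the same minimal-dependency spirit. It is not the ConstrictR or ConstrictPy API; every function name here is hypothetical, and the sample-covariance denominator and tie-averaged ranks are assumed conventions.

```python
from statistics import mean


def covariance(x, y):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson(x, y):
    """Pearson correlation coefficient."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


print(ranks([10, 30, 20, 30]))
print(pearson([1, 2, 3], [2, 4, 6]))
```

Applying `pearson` to the output of `ranks` for two variables gives a Spearman rank correlation, which is one way these building blocks combine.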

Date Created
  • 2018-05

Identifying Variation Within Substitution Rates in Mammary Gland Development Genes within Primate Genomes

Description

Mammary gland development in humans during puberty involves the enlargement of breast tissue, but this is not true in non-human primates. To identify potential causes of this difference, I examined variation in substitution rates across genes related to mammary development. Genes undergoing purifying selection show slower-than-average substitution rates, while genes undergoing positive selection show faster rates. These may be related to the difference between humans and other primates. The three genes found to be accelerated were FOXF1, IGFBP5, and ATP2B2, but only the last was found in humans, and it seems unlikely to be related to the differences in mammary gland development at puberty between humans and non-human primates.

Date Created
  • 2016-05