Structural variant detection: a novel approach

Description

Genomic structural variation (SV) is defined as gross alterations in the genome, broadly classified as insertions/duplications, deletions, inversions and translocations. DNA sequencing moved structural variant discovery beyond laboratory detection techniques to high-resolution informatics approaches. Bioinformatics tools for computational discovery of SVs, however, still miss variants in the complex cancer genome. This study aimed to define the genomic context leading to tool failure and to design a novel algorithm addressing this context.

Methods: The study tested the widely held but unproven hypothesis that tools fail to detect variants that lie in repeat regions. The publicly available 1000 Genomes dataset with experimentally validated variants was tested with the SVDetect tool for the presence of true positive (TP) SVs versus false negative (FN) SVs, expecting that FNs would be overrepresented in repeat regions. Further, the novel algorithm, designed to informatically capture the biological etiology of translocations (the context: non-allelic homologous recombination and the 3-D placement of chromosomes in cells), was tested using a simulated dataset. Translocations were created in known translocation hotspots, and the novel-algorithm tool was compared with SVDetect and BreakDancer.

Results: 53% of FN deletions were within repeat structure, compared to 81% of TP deletions. Similarly, 33% of FN insertions versus 42% of TP, 26% of FN duplications versus 57% of TP, and 54% of FN novel sequences versus 62% of TP were within repeats. Repeat structure was therefore not driving the tool's inability to detect variants and could not be used as context. The novel algorithm with a redefined context, when tested against SVDetect and BreakDancer, detected 10/10 simulated translocations in the 30X coverage dataset with 100% allele frequency, while SVDetect captured 4/10 and BreakDancer detected 6/10. For the 15X coverage dataset with 100% allele frequency, the novel algorithm detected all ten translocations, albeit with fewer supporting reads; BreakDancer detected 4/10 and SVDetect detected 2/10.

Conclusion: This study showed that the presence of repetitive elements in general within a structural variant did not influence the tool's ability to capture it. This context-based algorithm performed better than current tools even with half the genome coverage of the accepted protocol, and provides an important first step for novel translocation discovery in the cancer genome.
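The repeat-region comparison described in this abstract can be illustrated with a minimal sketch. It assumes variants and repeats are given as half-open `(chromosome, start, end)` intervals; the function names and the toy coordinates are hypothetical and are not taken from the thesis or from SVDetect.

```python
def overlaps_repeat(variant, repeats):
    """Return True if the variant interval overlaps any repeat interval.

    Intervals are (chromosome, start, end) tuples with half-open
    coordinates; this representation is an assumption for illustration.
    """
    chrom, start, end = variant
    return any(
        r_chrom == chrom and r_start < end and start < r_end
        for r_chrom, r_start, r_end in repeats
    )


def fraction_in_repeats(variants, repeats):
    """Fraction of variants that overlap at least one repeat."""
    if not variants:
        return 0.0
    hits = sum(overlaps_repeat(v, repeats) for v in variants)
    return hits / len(variants)


# Toy data (not from the study): one annotated repeat on chr1.
repeats = [("chr1", 100, 200)]
true_positives = [("chr1", 150, 160), ("chr1", 300, 400)]
false_negatives = [("chr1", 190, 250), ("chr2", 150, 160)]

tp_fraction = fraction_in_repeats(true_positives, repeats)
fn_fraction = fraction_in_repeats(false_negatives, repeats)
```

If missed variants were concentrated in repeats, `fn_fraction` would be expected to exceed `tp_fraction`; the abstract reports the opposite pattern for every variant class.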

Contributors: Shetty, Sheetal (Author) / Dinu, Valentin (Thesis advisor) / Bussey, Kimberly (Committee member) / Scotch, Matthew (Committee member) / Wallstrom, Garrick (Committee member) / Arizona State University (Publisher)

Created: 2014

Functional and proteome differences in skeletal muscle mitochondria between lean and obese humans

Description

Skeletal muscle (SM) mitochondria generate the majority of adenosine triphosphate (ATP) in SM and help regulate whole-body energy expenditure. Obesity is associated with alterations in SM mitochondria, which are unique with respect to their arrangement within cells: some mitochondria are located directly beneath the sarcolemma (i.e., subsarcolemmal (SS) mitochondria), while others are nested between the myofibrils (i.e., intermyofibrillar (IMF) mitochondria). Functional and proteome differences specific to SS versus IMF mitochondria in obese individuals may contribute to the reduced capacity for muscle ATP production seen in obesity. The overall goals of this work were to (1) isolate functional muscle SS and IMF mitochondria from lean and obese individuals, (2) assess enzyme activities associated with the electron transport chain and ATP production, (3) determine whether elevated plasma amino acids enhance SS and IMF mitochondrial respiration and ATP production rates in SM of obese humans, and (4) determine differences in the mitochondrial proteome regulating energy metabolism and key biological processes associated with SS and IMF mitochondria between lean and obese humans.

Polarography was used to determine functional differences in isolated SS and IMF mitochondria between lean (37 ± 3 yrs; n = 10) and obese (35 ± 3 yrs; n = 11) subjects during either saline (control) or amino acid (AA) infusions. AA infusion increased ADP-stimulated respiration (i.e., coupled respiration), non-ADP stimulated respiration (i.e., uncoupled respiration), and ATP production rates in SS, but not IMF mitochondria in lean (n = 10; P < 0.05). Neither infusion increased any of the above parameters in muscle SS or IMF mitochondria of the obese subjects.

Using label-free quantitative mass spectrometry, we determined differences in the proteomes of SM SS and IMF mitochondria between lean (33 ± 3 yrs; n = 16) and obese (32 ± 3 yrs; n = 17) subjects. Differentially expressed mitochondrial proteins in SS versus IMF mitochondria of obese subjects were associated with biological processes that regulate the electron transport chain (P<0.0001), citric acid cycle (P<0.0001), oxidative phosphorylation (P<0.001), branched-chain amino acid degradation (P<0.0001), and fatty acid degradation (P<0.001). Overall, these findings show that obesity is associated with redistribution of key biological processes within the mitochondrial reticulum responsible for regulating energy metabolism in human skeletal muscle.

Contributors: Kras, Katon Anthony (Author) / Katsanos, Christos (Thesis advisor) / Chandler, Douglas (Committee member) / Dinu, Valentin (Committee member) / Mor, Tsafrir S. (Committee member) / Arizona State University (Publisher)

Created: 2017

Novel methods of biomarker discovery and predictive modeling using Random Forest

Description

Random forest (RF) is a popular and powerful technique. It can be used for classification, regression and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is used as a predictive model to generate predictions for new observations. Recent research has proposed several RF-based methods for feature selection and for generating prediction intervals. However, they are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset, and is used as the basis for two novel methods for biomarker discovery and for generating prediction intervals.

Firstly, a biodosimetry model is developed using RF to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve the prediction accuracy of the biodosimetry model, day-specific models were built to deal with the day interaction effect, and a technique of nested modeling was proposed. The nested models can fit these complex data, which show large variability and non-linear relationships.

Secondly, a panel of biomarkers was selected using a data-driven feature selection method as well as hand-picking, considering prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method can incorporate domain knowledge as a penalty term to regulate the selection of candidate features in RF. It adds more flexibility to data-driven feature selection and can improve the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers. In simulated datasets, the method can also compete with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge.

Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is widely applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets.

Contributors: Guan, Xin (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Thesis advisor) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2017

Investigation of DNA methylation in obesity and its underlying insulin resistance

Description

Obesity and its underlying insulin resistance are caused by environmental and genetic factors. DNA methylation provides a mechanism by which environmental factors can regulate transcriptional activity. The overall goals of the work herein were to (1) identify alterations in DNA methylation in human skeletal muscle with obesity and its underlying insulin resistance, (2) determine whether these changes in methylation can be altered through weight loss induced by bariatric surgery, and (3) identify DNA methylation biomarkers in whole blood that can be used as a surrogate for skeletal muscle.

Assessment of DNA methylation was performed on human skeletal muscle and blood using reduced representation bisulfite sequencing (RRBS) for high-throughput identification and pyrosequencing for site-specific confirmation. In skeletal muscle, Sorbin and SH3 homology domain 3 (SORBS3) showed increased methylation (+5.0 to +24.4%) in the promoter and 5′ untranslated region (UTR) in the obese participants (n=10) compared to lean participants (n=12), and this finding corresponded with a decrease in gene expression (fold change: -1.9, P=0.0001). Furthermore, in a separate cohort of morbidly obese participants (n=7) undergoing surgery-induced weight loss, SORBS3 decreased in methylation (-5.6 to -24.2%) and increased in gene expression (fold change: +1.7; P=0.05) post-surgery. Moreover, SORBS3 promoter methylation was demonstrated in vitro to inhibit transcriptional activity (P=0.000003). The methylation and transcriptional changes for SORBS3 were significantly (P≤0.05) correlated with obesity measures and fasting insulin levels. SORBS3 was not identified in the blood methylation analysis of lean (n=10) and obese (n=10) participants, suggesting that it is a muscle-specific marker. However, solute carrier family 19 member 1 (SLC19A1) was identified in blood and skeletal muscle to have decreased 5′ UTR methylation in obese participants, and this was significantly (P≤0.05) predicted by insulin sensitivity.

These findings suggest SLC19A1 as a potential blood-based biomarker for obese, insulin-resistant states. The collective findings on SORBS3 DNA methylation and gene expression, together with the dynamic changes to SORBS3 in response to metabolic improvements and surgery-induced weight loss, present an exciting novel target in skeletal muscle for further understanding obesity and its underlying insulin resistance.

Contributors: Day, Samantha Elaine (Author) / Coletta, Dawn K. (Thesis advisor) / Katsanos, Christos (Committee member) / Mandarino, Lawrence J. (Committee member) / Shaibi, Gabriel Q. (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2017

Topological analysis of biological pathways : genes, microRNAs and pathways involved in hepatocellular carcinoma

Description

Rewired biological pathways and/or rewired microRNA (miRNA)-mRNA interactions might also influence the activity of biological pathways. Here, rewired biological pathways are defined as the differential (rewiring) effects of genes on the topology of biological pathways between controls and cases. Similarly, rewired miRNA-mRNA interactions are defined as the differential (rewiring) effects of miRNAs on the topology of biological pathways between controls and cases. The dissertation discusses how rewired biological pathways (Chapter 1) and/or rewired miRNA-mRNA interactions (Chapter 2) aberrantly influence the activity of biological pathways, and their association with disease.

This dissertation proposes two PageRank-based analytical methods, Pathways of Topological Rank Analysis (PoTRA) and miR2Pathway, discussed in Chapter 1 and Chapter 2, respectively. PoTRA focuses on detecting pathways in which the number of hub genes differs between two phenotypes. The basis for PoTRA is that loss of connectivity is a common topological trait of cancer networks, together with the prior knowledge that a normal biological network is a scale-free network whose degree distribution follows a power law, where a small number of nodes are hubs and a large number of nodes are non-hubs. From normal to cancer, however, the process of the network losing connectivity might disrupt its scale-free structure; that is, the number of hub genes might be altered in cancer compared to normal samples. Hence, it is hypothesized that if the number of hub genes in a pathway differs between normal and cancer, this pathway might be involved in cancer. MiR2Pathway focuses on quantifying the differential effects of miRNAs on the activity of a biological pathway when miRNA-mRNA connections are altered from normal to disease, and on ranking the disease risk of rewired miRNA-mediated biological pathways. This dissertation explores how rewired gene-gene interactions and rewired miRNA-mRNA interactions lead to aberrant activity of biological pathways, and ranks pathways by their disease risk. The two methods proposed here can complement existing genomics analysis methods to facilitate the study of biological mechanisms behind disease at the systems level.
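The hub-counting idea behind PoTRA can be illustrated with a minimal sketch. The abstract does not give PoTRA's actual hub criterion or PageRank settings, so everything here is an assumption for illustration: the damping factor, the rule that a hub is a gene whose PageRank exceeds 1.5 times the uniform score, the function names, and the two toy networks.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an undirected graph.

    adj maps each node to the list of its neighbours; every neighbour
    must also appear as a key.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            neighbours = adj[v]
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for u in neighbours:
                    new[u] += share
            else:
                # Isolated node: spread its score uniformly over all nodes.
                share = damping * rank[v] / n
                for u in nodes:
                    new[u] += share
        rank = new
    return rank


def count_hubs(adj, factor=1.5):
    """Count nodes whose PageRank exceeds factor / n (assumed hub rule)."""
    rank = pagerank(adj)
    threshold = factor / len(adj)
    return sum(1 for score in rank.values() if score > threshold)


# Toy "normal" pathway: a star network with one highly connected gene.
normal = {"A": ["B", "C", "D", "E"], "B": ["A"], "C": ["A"], "D": ["A"], "E": ["A"]}
# Toy "cancer" pathway: same genes rewired into a ring, so no gene dominates.
cancer = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D", "A"]}

hubs_normal = count_hubs(normal)
hubs_cancer = count_hubs(cancer)
```

Under these assumptions the star network has one hub and the ring has none, which is the kind of difference in hub counts between phenotypes that the abstract describes as the signal of interest.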

Contributors: Li, Chaoxing (Author) / Dinu, Valentin (Thesis advisor) / Kuang, Yang (Thesis advisor) / Liu, Li (Committee member) / Wang, Xiao (Committee member) / Arizona State University (Publisher)

Created: 2017

Discovering subclones and their driver genes in tumors sequenced at standard depths

Description

Understanding intratumor heterogeneity and its driver genes is critical to designing personalized treatments and improving clinical outcomes of cancers. Such investigations require accurate delineation of the subclonal composition of a tumor, which to date can only be reliably inferred from deep-sequencing data (>300x depth). The algorithm resulting from the work presented here incorporates an adaptive error model into statistical decomposition of mixed populations, which corrects the mean-variance dependency of sequencing data at the subclonal level and enables accurate subclonal discovery in tumors sequenced at standard depths (30-50x). Tested on extensive computer simulations and real-world data, this new method, named model-based adaptive grouping of subclones (MAGOS), consistently outperforms existing methods on minimum sequencing depth, decomposition accuracy and computational efficiency. MAGOS supports subclone analysis using single nucleotide variants and copy number variants from one or more samples of an individual tumor. The GUST algorithm, on the other hand, is a novel method for detecting cancer-type-specific driver genes. Combining MAGOS and GUST results can provide insights into cancer progression. Applications of MAGOS and GUST to whole-exome sequencing data from samples of 33 different cancer types discovered a significant association between subclonal diversity, along with its drivers, and overall patient survival.

Contributors: Ahmadinejad, Navid (Author) / Liu, Li (Thesis advisor) / Maley, Carlo (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2019

Fine Mapping Functional Noncoding Genetic Elements Via Machine Learning

Description

All biological processes, such as cell growth, cell differentiation, development, and aging, require a series of steps that are characterized by gene regulation. Studies have shown that gene regulation is key to various traits and diseases. Various factors affect gene regulation, including genetic signals, epigenetic tracks, and genetic variants. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two different approaches to identifying these elements: TreeMap and DeepCORE. The first approach involves identifying putative causal genetic variants in cis-eQTL, accounting for multisite effects and genetic linkage at a locus. TreeMap performs an organized search for individual and multiple causal variants using a tree-guided nested machine learning method. DeepCORE, on the other hand, explores novel deep learning techniques that model the relationship between genetic, epigenetic and transcriptional patterns across tissues and cell lines and identify co-operative regulatory elements that affect gene regulation. These two methods are believed to provide a link for genotype-phenotype association and a necessary step toward explaining various complex diseases and missing heritability.

Contributors: Chandrashekar, Pramod Bharadwaj (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)

Created: 2020

Statistical Methods for Analysis of Genomic Data with Applications in Oncology

Description

This dissertation presents three novel algorithms with real-world applications to genomic oncology. While the methodologies presented here were all developed to overcome various challenges associated with the adoption of high-throughput genomic data in clinical oncology, they can be used in other domains as well. First, a network-informed feature ranking algorithm is presented, which shows a significant increase in the ability to select true predictive features from simulated data sets when compared to other state-of-the-art graphical feature ranking methods. The methodology also shows an increased ability to predict pathological complete response to preoperative chemotherapy from genomic sequencing data of breast cancer patients, utilizing domain knowledge from protein-protein interaction networks. Second, an algorithm is presented that classifies microsatellite instability (MSI) status from next-generation sequencing (NGS) data while overcoming the population biases inherent in the use of a human reference genome developed primarily from European populations. The methodology significantly increases the accuracy of MSI status prediction in African and African American ancestries. Finally, a single-variable model is presented to capture the bimodality inherent in genomic data stemming from heterogeneous diseases. This model shows improvements over other parametric models in the measurement of receiver-operator characteristic (ROC) curves for bimodal data. The model is used to estimate ROC curves for heterogeneous biomarkers in a dataset containing breast cancer and cancer-free specimens.

Contributors: Saul, Michelle (Author) / Dinu, Valentin (Thesis advisor) / Liu, Li (Committee member) / Wang, Junwen (Committee member) / Arizona State University (Publisher)

Created: 2021

Delineation of Concentration Ranges and Longitudinal Changes of Human Plasma Protein Variants

Description

Human protein diversity arises as a result of alternative splicing, single nucleotide polymorphisms (SNPs) and posttranslational modifications. Because of these processes, each protein can exist as multiple variants in vivo. Tailored strategies are needed to study these protein variants and understand their role in health and disease. In this work we utilized quantitative mass spectrometric immunoassays to determine the concentrations of protein variants of beta-2-microglobulin, cystatin C, retinol binding protein, and transthyretin in a population of 500 healthy individuals. Additionally, we determined the longitudinal concentration changes for the protein variants from four individuals over a 6-month period. Along with the native forms of the four proteins, 13 posttranslationally modified variants and 7 SNP-derived variants were detected and their concentrations determined. Correlations of the variant concentrations with geographical origin, gender, and age of the individuals were also examined. This work represents an important step toward building a catalog of protein variant concentrations and examining their longitudinal changes.

Contributors: Trenchevska, Olgica (Author) / Phillips, David A. (Author) / Nelson, Randall (Author) / Nedelkov, Dobrin (Author) / Biodesign Institute (Contributor)

Created: 2014-06-23

Mass Spectrometric Immunoassays in Characterization of Clinically Significant Proteoforms

Description

Proteins can exist as multiple proteoforms in vivo, as a result of alternative splicing and single-nucleotide polymorphisms (SNPs), as well as posttranslational processing. To address their clinical significance in the context of diagnostic information, proteoforms require a more in-depth analysis. Mass spectrometric immunoassays (MSIA) have been devised for studying structural diversity in human proteins. MSIA enables protein profiling in a simple and high-throughput manner by combining the selectivity of targeted immunoassays with the specificity of mass spectrometric detection. MSIA has been used for qualitative and quantitative analysis of single and multiple proteoforms, distinguishing between normal fluctuations and changes related to clinical conditions. This mini review offers an overview of the development and application of mass spectrometric immunoassays for clinical and population proteomics studies. Examples of some recent developments are provided, and the trends and challenges in mass spectrometry-based immunoassays for the next phase of clinical applications are discussed.

Contributors: Trenchevska, Olgica (Author) / Nelson, Randall (Author) / Nedelkov, Dobrin (Author) / Biodesign Institute (Contributor)

Created: 2016-03-17