A Robust scRNA-seq Data Analysis Pipeline for Measuring Gene Expression Noise

Description

The past decade has seen a sharp increase in collaboration between Computer Science (CS) and Molecular Biology (MB). Current foci in CS such as deep learning require very large amounts of data, and MB research can often be rapidly advanced by analysis and models from CS. CS can aid MB in, for example, the analysis of sequences to find binding sites and the prediction of protein folding patterns. Stem-like cells can be maintained and replicated over long periods and can differentiate into various tissue types. These behaviors are made possible by controlling the expression of specific genes, which then cascade into a network effect by either promoting or repressing downstream gene expression. The expression level of all gene transcripts within a single cell can be analyzed using single-cell RNA sequencing (scRNA-seq), which uses next-generation sequencing to measure genome-scale gene expression levels at single-cell resolution. A significant portion of the noise in scRNA-seq data results from extrinsic factors and can only be removed by a customized scRNA-seq analysis pipeline.

Almost every step during analysis and quantification requires an often empirically determined threshold, which makes quantification of noise less accurate. In addition, each research group often develops its own data analysis pipeline, making it impossible to compare data from different groups. To remedy this problem, a streamlined and standardized scRNA-seq data analysis and normalization protocol was designed and developed. After analyzing multiple experiments, we identified the possible pipeline stages and the tools needed. Our pipeline can handle data with adapters and barcodes, which was not the case with pipelines from some experiments. It can be used to analyze scRNA-seq data from a single experiment and to compare scRNA-seq data across experiments. Processes such as data gathering, file conversion, and data merging were automated in the pipeline. The main focus was to standardize and normalize single-cell RNA-seq data to minimize technical noise introduced by disparate platforms.
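As a rough illustration of the normalization and noise-measurement steps described above, the sketch below scales each cell to counts per million (CPM) and then computes the squared coefficient of variation (CV²) per gene across cells. The abstract does not say which normalization or noise metric the pipeline uses, so both choices, and the function names `normalize_cpm`, `squared_cv`, and `gene_noise`, are assumptions made for illustration only.

```python
def normalize_cpm(counts):
    """Scale raw counts so every cell sums to one million (CPM).

    counts: dict mapping cell id -> dict mapping gene -> raw read count.
    CPM is an assumed stand-in; the source does not name its method.
    """
    normalized = {}
    for cell, genes in counts.items():
        total = sum(genes.values())
        if total == 0:
            # An empty cell carries no information; keep zeros.
            normalized[cell] = {gene: 0.0 for gene in genes}
            continue
        normalized[cell] = {gene: c * 1e6 / total for gene, c in genes.items()}
    return normalized


def squared_cv(values):
    """Return CV^2 = variance / mean^2, using the population variance."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return float("nan")  # noise is undefined for an unexpressed gene
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / mean ** 2


def gene_noise(normalized):
    """Compute CV^2 per gene across all cells (missing genes count as 0)."""
    genes = set()
    for cell_genes in normalized.values():
        genes.update(cell_genes)
    return {
        gene: squared_cv([cell.get(gene, 0.0) for cell in normalized.values()])
        for gene in sorted(genes)
    }


if __name__ == "__main__":
    raw = {
        "cell_1": {"GENE_A": 10, "GENE_B": 90},
        "cell_2": {"GENE_A": 40, "GENE_B": 160},
    }
    print(gene_noise(normalize_cpm(raw)))
```

Normalizing before computing CV² keeps differences in sequencing depth between cells from being counted as expression noise, which is one simple way technical variation can be separated from biological variation.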

Contributors: Balachandran, Parithi (Author) / Wang, Xiao (Thesis advisor) / Brafman, David (Committee member) / Lockhart, Thurmon (Committee member) / Arizona State University (Publisher)

Created: 2017

Generation of isogenic pluripotent stem cell lines for study of APOE, an Alzheimer’s risk factor

Description

Alzheimer’s disease (AD), despite over a century of research, does not have a clearly defined pathogenesis for the sporadic form that makes up the majority of disease incidence. A variety of correlative risk factors have been identified, including the three isoforms of apolipoprotein E (ApoE), a cholesterol transport protein in the central nervous system. ApoE ε3 is the wild-type variant with no effect on risk. ApoE ε2, the protective and rarest variant, reduces the risk of developing AD by 40%. ApoE ε4, the risk variant, increases risk 3.2-fold and 14.9-fold for heterozygous and homozygous carriers, respectively. Study of these isoforms has historically been complex, but the advent of human induced pluripotent stem cells (hiPSCs) provides the means for highly controlled, longitudinal in vitro study. The effect of ApoE variants can be further elucidated using this platform by generating isogenic hiPSC lines through precise genetic modification, which is the objective of this research. Because the difference between alleles is determined by two cytosine-thymine polymorphisms, a specialized CRISPR/Cas9 system for direct base conversion was successfully employed. The base conversion method for converting the ε3 allele to the ε2 allele was first verified using the HEK293 cell line as a model, with delivery via electroporation. Following this verification, the transfection method was optimized using two hiPSC lines derived from ε4/ε4 patients, with a lipofection technique ultimately resulting in successful base conversion at the same site verified in the HEK293 model. Additional research included characterization of the pre-modification genotype with respect to likely off-target sites and methods of isolating clonal variants.
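The two cytosine-thymine polymorphisms mentioned above can be represented as a small lookup table. The sketch below is illustrative only: the SNP identifiers rs429358 and rs7412 and the per-allele bases come from general knowledge of APOE rather than from the abstract, bases are assumed to be reported on the coding strand, and the function names are hypothetical.

```python
# Assumed site order: (rs429358, rs7412), bases on the coding strand.
SITES = ("rs429358", "rs7412")

# Base at each site for the three common APOE alleles.
APOE_ALLELES = {
    "e2": ("T", "T"),
    "e3": ("T", "C"),
    "e4": ("C", "C"),
}


def classify_allele(rs429358, rs7412):
    """Return 'e2', 'e3' or 'e4' for a haplotype, or None if unrecognized."""
    observed = (rs429358.upper(), rs7412.upper())
    for name, bases in APOE_ALLELES.items():
        if bases == observed:
            return name
    return None  # e.g. the rare (C, T) combination or invalid input


def base_conversions(source, target):
    """List the (site, from_base, to_base) changes between two alleles."""
    changes = []
    for site, src, dst in zip(SITES, APOE_ALLELES[source], APOE_ALLELES[target]):
        if src != dst:
            changes.append((site, src, dst))
    return changes


if __name__ == "__main__":
    print(classify_allele("T", "C"))     # haplotype of the wild-type allele
    print(base_conversions("e3", "e2"))  # edits needed for e3 -> e2
    print(base_conversions("e4", "e2"))  # edits needed for e4 -> e2
```

Under these assumptions, every conversion toward ε2 is a C-to-T change, which is consistent with the abstract's use of a direct base conversion system rather than a double-strand-break repair approach.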

Contributors: Lakers, Mary Frances (Author) / Brafman, David (Thesis advisor) / Haynes, Karmella (Committee member) / Wang, Xiao (Committee member) / Arizona State University (Publisher)

Created: 2017

Increased Enrichment and Generation of Isogenic Lines Using a Transient Reporter for Editing Enrichment

Description

Alzheimer’s disease (AD) affects over 5 million individuals each year in the United States. Furthermore, most cases of AD are sporadic, making it extremely difficult to model and study in vitro. CRISPR/Cas9 and base editing technologies have been of recent interest because of their ability to create single-nucleotide edits at nearly any genomic sequence using a Cas9 protein and a single guide RNA (sgRNA). Currently, there is no available phenotype to differentiate edited cells from unedited cells. Past research has employed fluorescent proteins bound to Cas9 proteins to attempt to enrich for edited cells; however, these methods are only reporters of transfection (RoT) and are not indicative of actual base editing. Thus, this study proposes a transient reporter for editing enrichment (TREE) and Cas9-mediated adenosine TREE (CasMasTREE), which use plasmids co-transfected with CRISPR/Cas9 technologies to serve as an indicator of base editing. Specifically, TREE features a blue fluorescent protein (BFP) mutant that, upon a C-to-T conversion, shifts its emission spectrum to that of green fluorescent protein (GFP). CasMasTREE features mCherry and GFP separated by a stop codon that can be removed by an A-to-G conversion. By employing an sgRNA that targets one of the TREE plasmids and at least one genomic site, edited cells can be enriched by sorting for GFP(+) cells. Using these methods, base-edited isogenic hiPSC line generation using TREE (BIG-TREE) was developed to generate isogenic hiPSC lines with AD-relevant edits. For example, BIG-TREE demonstrates the capability of converting Apolipoprotein E (APOE), a gene associated with AD risk, from wild type (3/3) into another isoform, APOE2/2, to create isogenic hiPSC lines. The capabilities of TREE are vast and can be applied to generate various models of diseases with specific genomic edits.

Contributors: Nguyen, Toan Thai Tran (Author) / Brafman, David (Thesis advisor) / Wang, Xiao (Committee member) / Tian, Xiaojun (Committee member) / Arizona State University (Publisher)

Created: 2020

Engineering of Synthetic DNA/RNA Modules for Manipulating Gene Expression and Circuit Dynamics

Description

Gene circuit engineering facilitates the discovery and understanding of fundamental biology and has been widely used in various biological applications. In synthetic biology, gene circuits are often constructed using one of two main strategies: monocistronic or polycistronic construction. The latter architecture is commonly found in prokaryotes, eukaryotes, and viruses and has been widely applied in gene circuit engineering. In this work, the effects of adjacent genes and noncoding regions are systematically investigated through the construction of batteries of gene circuits in diverse scenarios. Data-driven analysis yields a protein expression metric that strongly correlates with the features of adjacent transcriptional regions (ATRs). This novel mathematical tool helps guide circuit construction and has implications for the design of synthetic ATRs to tune gene expression, illustrating its potential to facilitate the engineering of complex gene networks. The ability to tune RNA dynamics is greatly needed for biotech applications, including therapeutics and diagnostics. Diverse methods have been developed to tune gene expression through transcriptional or translational manipulation. Control of RNA stability and degradation is often overlooked and can be a lightweight alternative for regulating protein yields. To further extend the utility of engineered ATRs to regulate gene expression, a library of RNA modules named degradation-tuning RNAs (dtRNAs) is designed with the ability to form specific 5’ secondary structures upstream of the ribosome binding site (RBS). These modules can modulate transcript stability while interfering minimally with translation initiation. Optimization of their functional structural features enables gene expression levels to be tuned over a wide dynamic range. These engineered dtRNAs are capable of regulating gene circuit dynamics as well as noncoding RNA levels and can be further extended to cell-free systems for gene expression control in vitro. Finally, integrating dtRNAs with synthetic toehold sensors enables improved paper-based viral diagnostics, illustrating the potential of using synthetic dtRNAs for biomedical applications.

Contributors: Zhang, Qi (Author) / Wang, Xiao (Thesis advisor) / Green, Alexander (Committee member) / Brafman, David (Committee member) / Tian, Xiaojun (Committee member) / Plaisier, Christopher (Committee member) / Arizona State University (Publisher)

Created: 2020

Landscape of Gene Regulatory Network Motifs

Description

The human transcriptional regulatory machinery uses hundreds of transcription factors that bind to specific genic sites, resulting in either activation or repression of targeted genes. Networks composed of nodes and edges can be constructed to model the relationships between regulators and their targets. Within these biological networks, small enriched structural patterns containing at least three nodes can be identified as potential building blocks from which a network is organized. A first-iteration computational pipeline was designed to generate a disease-specific gene regulatory network for motif detection using established computational tools. The first goal was to identify motifs that can express themselves in a state that results in differential patient survival in one of the 32 cancer types studied. This study identified issues in detecting strongly correlated motifs that also affect patient survival, yielding preliminary results on possible drivers of cancer etiology. Second, the topology of network motifs was compared across multiple data types to identify possible divergence from a conserved enrichment pattern in network-perturbing diseases. The topology of enriched motifs across all the datasets converged on a single conserved pattern reported in a previous study, and this pattern did not appear to diverge depending on the type of disease. This report highlights possible methods to improve detection of disease-driving motifs, which can aid in identifying possible treatment targets in cancer. Finally, networks were only minimally perturbed, suggesting that regulatory programs were run from evolved circuits into a cancer context.
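To make the idea of a three-node motif concrete, the sketch below enumerates feed-forward loops (A regulates B, B regulates C, and A also regulates C) in a small directed network. The abstract does not say which motif types or tools the pipeline uses, so the choice of the feed-forward loop, the brute-force search, and the function name `find_feed_forward_loops` are assumptions for illustration. Counting alone does not establish enrichment; that would additionally require comparison against randomized networks.

```python
from itertools import permutations


def find_feed_forward_loops(edges):
    """Return all (a, b, c) triples where a->b, b->c and a->c are edges.

    edges: iterable of (regulator, target) pairs in a directed network.
    Brute force over ordered node triples; fine for small examples only.
    """
    edge_set = set(edges)
    nodes = sorted({node for edge in edge_set for node in edge})
    loops = []
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set:
            loops.append((a, b, c))
    return loops


if __name__ == "__main__":
    # Hypothetical toy network: TF1 regulates TF2 and GENE, TF2 regulates GENE.
    toy_edges = [
        ("TF1", "TF2"),
        ("TF2", "GENE"),
        ("TF1", "GENE"),
        ("GENE", "OTHER"),
    ]
    print(find_feed_forward_loops(toy_edges))
```

The search is cubic in the number of nodes, so a genome-scale regulatory network would need a dedicated motif-detection tool rather than this enumeration.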

Contributors: Striker, Shawn Scott (Author) / Plaisier, Christopher (Thesis advisor) / Brafman, David (Committee member) / Wang, Xiao (Committee member) / Arizona State University (Publisher)

Created: 2020

Investigating the Role of APOE2 in Alzheimer's Disease Using Human Induced Pluripotent Stem Cell Derived Neurons and Astrocytes

Description

Genome-wide association studies (GWAS) have identified polymorphism in the Apolipoprotein E (APOE) gene as the most prominent risk factor for Alzheimer’s disease (AD). Compared to individuals homozygous for the APOE3 variant, individuals with the APOE4 variant have a significantly elevated risk of AD. On the other hand, longitudinal studies have shown that the presence of the APOE2 variant reduces the lifetime risk of developing AD by 40 percent. While significant research has identified the risk-inducing effects of APOE4, the underlying mechanisms by which APOE2 influences AD onset and progression have not been extensively explored. The hallmarks of AD pathology manifest in human neurons in the form of extracellular amyloid deposits and intracellular neurofibrillary tangles, whereas astrocytes are the primary source of the APOE protein in the brain. In this study, an isogenic human induced pluripotent stem cell (hiPSC)-based system is used to demonstrate that conversion of APOE3 to APOE2 greatly reduced the production of amyloid-beta (Aβ) peptides in hiPSC-derived neural cultures. Mechanistically, analysis of pure populations of neurons and astrocytes derived from these neural cultures revealed that the mitigating effects of APOE2 are mediated by cell-autonomous and non-autonomous effects. In particular, it was demonstrated that the reduction in Aβ and pathogenic β-C-terminal fragments (APP-βCTF) is potentially driven by a mechanism related to non-amyloidogenic processing of amyloid precursor protein (APP), suggesting a gain of protective function of the APOE2 variant. Together, these results provide insights into the risk-modifying effects associated with the APOE2 allele and establish a platform to probe the mechanisms by which APOE2 enhances neuroprotection against AD.

Contributors: Raman, Sreedevi (Author) / Brafman, David (Thesis advisor) / Smith, Barbara (Committee member) / Plaisier, Christopher (Committee member) / Wang, Xiao (Committee member) / Tian, Xiaojun (Committee member) / Arizona State University (Publisher)

Created: 2021

Methods for Detecting Mutations in Non-model Organisms

Description

Next-generation sequencing is a powerful tool for detecting genetic variation. However, it is also error-prone, with error rates that are much larger than mutation rates. This can make mutation detection difficult; and while increasing sequencing depth can often help, sequence-specific errors and other non-random biases cannot be detected by increased depth. The problem of accurate genotyping is exacerbated when there is not a reference genome or other auxiliary information available. I explore several methods for sensitively detecting mutations in non-model organisms using an example Eucalyptus melliodora individual. I use the structure of the tree to find bounds on its somatic mutation rate and evaluate several algorithms for variant calling. I find that conventional methods are suitable if the genome of a close relative can be adapted to the study organism. However, with structured data, a likelihood framework that is aware of this structure is more accurate. I use the techniques developed here to evaluate a reference-free variant calling algorithm. I also use this data to evaluate a k-mer based base quality score recalibrator (KBBQ), a tool I developed to recalibrate base quality scores attached to sequencing data. Base quality scores can help detect errors in sequencing reads, but are often inaccurate. The most popular method for correcting this issue requires a known set of variant sites, which is unavailable in most cases. I simulate data and show that errors in this set of variant sites can cause calibration errors. I then show that KBBQ accurately recalibrates base quality scores while requiring no reference or other information and performs as well as other methods. Finally, I use the Eucalyptus data to investigate the impact of quality score calibration on the quality of output variant calls and show that improved base quality score calibration increases the sensitivity and reduces the false positive rate of a variant calling algorithm.
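For readers unfamiliar with base quality scores, the sketch below shows the standard Phred relationship Q = -10 log10(p) and a simple way to compare reported scores with empirically observed error rates. It is not KBBQ and does not reproduce its k-mer based error detection; it assumes that each base has already been labeled as an error or not by some other means, and the function names and the add-one smoothing are illustrative choices.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality(observations):
    """Estimate the true quality of each reported-quality bin.

    observations: iterable of (reported_q, is_error) pairs, where is_error
    is assumed to come from an external source of truth.
    Uses (errors + 1) / (total + 2) so error-free bins stay finite.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        if is_error:
            errors[reported_q] += 1
    return {
        q: error_prob_to_phred((errors[q] + 1) / (totals[q] + 2))
        for q in totals
    }


if __name__ == "__main__":
    # Hypothetical data: bases reported as Q30 that are wrong far too often.
    data = [(30, True)] * 9 + [(30, False)] * 89
    for q, empirical in empirical_quality(data).items():
        print(f"reported Q{q} -> empirical Q{empirical:.1f}")
```

A well-calibrated data set would show empirical values close to the reported ones; a large gap, as in the toy data, is the kind of miscalibration that recalibration tools aim to correct.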

Contributors: Orr, Adam James (Author) / Cartwright, Reed (Thesis advisor) / Wilson, Melissa (Committee member) / Kusumi, Kenro (Committee member) / Taylor, Jesse (Committee member) / Pfeifer, Susanne (Committee member) / Arizona State University (Publisher)

Created: 2020

Pathways of Distinction Analysis of Liver Cancer Data: Genetic Differences Between Males and Females

Description

The Pathways of Distinction Analysis (PoDA) program calculates relationships between a disease state and a given group of genes contained within a pathway. It was used here to investigate liver cancer and to explore how genetic variability may contribute to the different rates of development of the disease in males and females. The goal of the study was to identify germline variation that differs by sex in hepatocellular carcinoma. Using the program, multiple pathways and genes were identified as having significant differences in their relationship to liver cancer in males and females. In animal studies, the genes identified using PoDA have been shown to impact liver cancer, often with different results for males and females. While these genes are often the focus in animal models, they are absent from current Genome Wide Association Studies (GWAS) catalogs for humans. By working to bridge the results of animal studies and human studies, the results help to identify the causes of liver cancer and, more specifically, the reason the disease affects males at much higher rates. The differences in the pathways identified as significant for the two sexes indicate that germline variation may play sex-specific roles in the development of hepatocellular carcinoma. Additionally, these results reinforce the capacity of PoDA to identify genes that may be missed by more traditional GWAS methods. This study lays the groundwork for further investigations into the identified genes and pathways and how they behave differently in males and females.

Contributors: Olson, Erik Jon (Author) / Buetow, Kenneth (Thesis advisor) / Wilson, Melissa (Committee member) / Cartwright, Reed (Committee member) / Arizona State University (Publisher)

Created: 2021

Delineation of Concentration Ranges and Longitudinal Changes of Human Plasma Protein Variants

Description

Human protein diversity arises as a result of alternative splicing, single nucleotide polymorphisms (SNPs), and posttranslational modifications. Because of these processes, each protein can exist as multiple variants in vivo. Tailored strategies are needed to study these protein variants and understand their role in health and disease. In this work, we used quantitative mass spectrometric immunoassays to determine the concentrations of protein variants of beta-2-microglobulin, cystatin C, retinol binding protein, and transthyretin in a population of 500 healthy individuals. Additionally, we determined the longitudinal concentration changes of the protein variants in four individuals over a 6-month period. Along with the native forms of the four proteins, 13 posttranslationally modified variants and 7 SNP-derived variants were detected and their concentrations determined. Correlations of the variant concentrations with geographical origin, gender, and age of the individuals were also examined. This work represents an important step toward building a catalog of protein variant concentrations and examining their longitudinal changes.

Contributors: Trenchevska, Olgica (Author) / Phillips, David A. (Author) / Nelson, Randall (Author) / Nedelkov, Dobrin (Author) / Biodesign Institute (Contributor)

Created: 2014-06-23

Endogenous WNT Signaling Regulates hPSC-Derived Neural Progenitor Cell Heterogeneity and Specifies Their Regional Identity

Description

Neural progenitor cells (NPCs) derived from human pluripotent stem cells (hPSCs) are a multipotent cell population that is capable of nearly indefinite expansion and subsequent differentiation into the various neuronal and supporting cell types that comprise the CNS. However, current protocols for differentiating NPCs toward neuronal lineages result in a mixture of neurons from various regions of the CNS. In this study, we determined that endogenous WNT signaling is a primary contributor to the heterogeneity observed in NPC cultures and neuronal differentiation. Furthermore, exogenous manipulation of WNT signaling during neural differentiation, through either activation or inhibition, reduces this heterogeneity in NPC cultures, thereby promoting the formation of regionally homogeneous NPC and neuronal cultures. The ability to manipulate WNT signaling to generate regionally specific NPCs and neurons will be useful for studying human neural development and will greatly enhance the translational potential of hPSCs for neural-related therapies.

Contributors: Moya, Noel (Author) / Cutts, Joshua (Author) / Gaasterland, Terry (Author) / Willert, Karl (Author) / Brafman, David (Author) / Ira A. Fulton Schools of Engineering (Contributor)

Created: 2014-12-09
