This dissertation proposes two PageRank-based analytical methods, Pathways of Topological Rank Analysis (PoTRA) and miR2Pathway, discussed in Chapter 1 and Chapter 2, respectively. PoTRA detects pathways in which the number of hub genes differs between two phenotypes. It rests on two observations: loss of connectivity is a common topological trait of cancer networks, and a normal biological network is a scale-free network whose degree distribution follows a power law, with a small number of hub nodes and a large number of non-hub nodes. As a network loses connectivity in the transition from normal to cancer, its scale-free structure may be disrupted, so the number of hub genes in cancer samples may differ from that in normal samples. Hence, it is hypothesized that a pathway whose number of hub genes differs between normal and cancer samples might be involved in cancer. The second method, miR2Pathway, quantifies the differential effects of miRNAs on the activity of a biological pathway when miRNA-mRNA connections are altered from normal to disease, and ranks the disease risk of rewired miRNA-mediated biological pathways. Together, the two methods explore how rewired gene-gene interactions and rewired miRNA-mRNA interactions lead to aberrant activity of biological pathways, and they rank pathways by disease risk. They can complement existing genomics analysis methods to facilitate the study of biological mechanisms behind disease at the systems level.
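The hub-counting idea behind PoTRA can be illustrated with a minimal sketch. It is not the PoTRA implementation: the toy networks, the function names `pagerank` and `count_hubs`, the damping factor of 0.85, and the rule that a hub is any node whose PageRank score exceeds 1.5 times the average score are all assumptions made for illustration.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on an undirected graph {node: set(neighbors)}."""
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by isolated nodes is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {}
        for v in nodes:
            # Each neighbor u shares its rank equally among its own neighbors.
            incoming = sum(rank[u] / len(adjacency[u]) for u in adjacency[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def count_hubs(adjacency, factor=1.5):
    """Count nodes whose score exceeds `factor` times the average score (assumed hub rule)."""
    ranks = pagerank(adjacency)
    threshold = factor / len(adjacency)  # average PageRank score is 1/n
    return sum(1 for score in ranks.values() if score > threshold)


# Hypothetical "normal" pathway: gene A is a hub connected to five other genes.
normal_net = {
    "A": {"B", "C", "D", "E", "F"},
    "B": {"A"}, "C": {"A"}, "D": {"A"}, "E": {"A"}, "F": {"A"},
}

# Hypothetical "cancer" pathway: same genes, rewired into a ring with no hub.
cancer_net = {
    "A": {"B", "F"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E", "A"},
}

print("hubs in normal network:", count_hubs(normal_net))
print("hubs in cancer network:", count_hubs(cancer_net))
```

In this toy case the hub count drops when the star-shaped network is rewired, which is the kind of difference the hypothesis above treats as a signal that a pathway may be involved in cancer. A real analysis would also need a statistical test of whether the difference in hub counts is significant.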
The overarching goal of my research unfolds over three aims: (i) evaluating circRNAs and their predicted impact on transcriptional regulatory networks in cell-specific RNAseq data; (ii) developing a novel solution for de novo detection of full-length circRNAs as well as in silico validation of selected circRNA junctions using assembly; and (iii) applying these assembly-based detection and validation workflows, together with existing tools, to systematically identify and characterize circRNAs in functionally distinct human brain regions. To this end, I have developed novel bioinformatics workflows that are applicable to non-polyA-selected RNAseq datasets and can be used to characterize circRNA expression across various sample types and diseases. Further, I establish a reference dataset of circRNA expression profiles and regulatory networks in a brain region-specific manner. This resource, along with existing databases such as circBase, will be invaluable in advancing circRNA research and in improving our understanding of the role of circRNAs in transcriptional regulation and various neurological conditions.
Firstly, a biodosimetry model was developed using random forest (RF) to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve its prediction accuracy, day-specific models were built to handle the day interaction effect, and a nested modeling technique was proposed. The nested models can fit these complex data, which show large variability and non-linear relationships.
Secondly, a panel of biomarkers was selected using a data-driven feature selection method as well as hand-picking, taking into account prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method incorporates domain knowledge as a penalty term that regulates the selection of candidate features in RF. It adds flexibility to data-driven feature selection and can improve the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers. On simulated datasets, the method is also competitive with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge.
Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets.
Bridging social capital describes the diffusion of information across networks built between individuals of different social identities. This project aims to understand whether the bridging ties of economic connectedness (EC) can predict variations in COVID-19 infection risk across Arizona ZIP code tabulation areas (ZCTAs). EC is measured from Facebook friendship data and calculated as the average share of high socioeconomic status friends that an individual of low socioeconomic status has. EC values across Arizona ZCTAs were examined, along with the correlation of EC with various social and demographic factors such as age, sex, race and ethnicity, educational background, income, and health insurance coverage. A multiple linear regression model was fitted to examine the association of EC with the biweekly COVID-19 growth rate from October 2020 to November 2021, and to examine longitudinal trends in this association. The study found that EC has a significant effect, with an effect size comparable to that of other demographic features, and that it may help identify vulnerabilities and health disparities in communities during the pandemic.
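The EC definition used above (the average share of high socioeconomic status friends among individuals of low socioeconomic status) can be made concrete with a short sketch. The friendship data, the labels, and the function name `economic_connectedness` are hypothetical, and the code follows the definition as worded in this abstract; published EC measures may apply additional normalization that is not reproduced here.

```python
def economic_connectedness(friends, is_high_ses):
    """Average, over low-SES individuals, of the share of their friends who are high-SES.

    friends:     {person: list of friends}   (hypothetical input format)
    is_high_ses: {person: True/False}        (hypothetical SES labels)
    """
    shares = []
    for person, friend_list in friends.items():
        # Only low-SES individuals with at least one friend contribute.
        if is_high_ses[person] or not friend_list:
            continue
        high = sum(1 for f in friend_list if is_high_ses[f])
        shares.append(high / len(friend_list))
    return sum(shares) / len(shares) if shares else float("nan")


# Toy friendship network: "a" and "c" are low-SES, "b" and "d" are high-SES.
friends = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["c"],
}
is_high_ses = {"a": False, "b": True, "c": False, "d": True}

ec = economic_connectedness(friends, is_high_ses)
print(ec)  # both low-SES individuals have one high-SES friend out of two -> 0.5
```

In the study, a value of this kind would be available per ZCTA and would enter the regression as one predictor alongside the demographic covariates.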
Computational and systems biology are rapidly growing fields of academic study, but researchers unfamiliar with them are impeded by a lack of accessible, programming-optional modelling tools. To address this gap, I developed BioSSA, a web framework built on JavaScript and D3.js that allows users to explore a small library of curated biophysical models as well as create and simulate their own reaction networks. The mathematical foundation of BioSSA is the Stochastic Gillespie Algorithm (SGA), which is widely used in mathematical modeling and biology to represent chemical reaction systems. SGA is particularly well suited as an introductory modelling tool because of its flexibility, broad applicability, and ability to numerically approximate systems when analytical solutions are not available. BioSSA is freely available to the community, and further improvements are planned.
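To show what a Gillespie-type simulation involves, the following is a minimal sketch for a birth-death reaction network (production of a species X at a constant rate, degradation at a rate proportional to the number of X molecules). It is written in Python for illustration and is not BioSSA's JavaScript code; the function name `gillespie_birth_death`, the rate constants, and the choice of model are assumptions.

```python
import math
import random


def gillespie_birth_death(x0, k_birth, k_death, t_end, seed=0):
    """Simulate  0 -> X  (rate k_birth)  and  X -> 0  (rate k_death * x)."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        # Propensities of the two reactions in the current state.
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death
        if a_total == 0:
            break  # no reaction can fire
        # Waiting time to the next reaction is exponential with rate a_total.
        t += -math.log(1.0 - rng.random()) / a_total
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = gillespie_birth_death(x0=0, k_birth=5.0, k_death=0.5, t_end=10.0, seed=1)
print("reactions simulated:", len(times) - 1)
print("final molecule count:", counts[-1])
```

Each run produces one random trajectory; repeating the simulation with different seeds and averaging gives an estimate of the mean behavior, which for this model settles around k_birth / k_death molecules.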