Matching Items (13)
Filtering by

Clear all filters

156200-Thumbnail Image.png
Description
Modern, advanced statistical tools from data mining and machine learning have become commonplace in molecular biology in large part because of the “big data” demands of various kinds of “-omics” (e.g., genomics, transcriptomics, metabolomics, etc.). However, in other fields of biology where empirical data sets are conventionally smaller, more

Modern, advanced statistical tools from data mining and machine learning have become commonplace in molecular biology in large part because of the “big data” demands of various kinds of “-omics” (e.g., genomics, transcriptomics, metabolomics, etc.). However, in other fields of biology where empirical data sets are conventionally smaller, more traditional statistical methods of inference are still very effective and widely used. Nevertheless, with the decrease in cost of high-performance computing, these fields are starting to employ simulation models to generate insights into questions that have been elusive in the laboratory and field. Although these computational models allow for exquisite control over large numbers of parameters, they also generate data at a qualitatively different scale than most experts in these fields are accustomed to. Thus, more sophisticated methods from big-data statistics have an opportunity to better facilitate the often-forgotten area of bioinformatics that might be called “in-silicomics”.

As a case study, this thesis develops methods for the analysis of large amounts of data generated from a simulated ecosystem designed to understand how mammalian biomechanics interact with environmental complexity to modulate the outcomes of predator–prey interactions. These simulations investigate how other biomechanical parameters relating to the agility of animals in predator–prey pairs are better predictors of pursuit outcomes. Traditional modelling techniques such as forward, backward, and stepwise variable selection are initially used to study these data, but the number of parameters and potentially relevant interaction effects render these methods impractical. Consequently, new modelling techniques such as LASSO regularization are used and compared to the traditional techniques in terms of accuracy and computational complexity. Finally, the splitting rules and instances in the leaves of classification trees provide the basis for future simulation with an economical number of additional runs. In general, this thesis shows the increased utility of these sophisticated statistical techniques with simulated ecological data compared to the approaches traditionally used in these fields. These techniques combined with methods from industrial Design of Experiments will help ecologists extract novel insights from simulations that combine habitat complexity, population structure, and biomechanics.
ContributorsSeto, Christian (Author) / Pavlic, Theodore (Thesis advisor) / Li, Jing (Committee member) / Yan, Hao (Committee member) / Arizona State University (Publisher)
Created2018
156575-Thumbnail Image.png
Description
Mathematical modeling and decision-making within the healthcare industry have given means to quantitatively evaluate the impact of decisions into diagnosis, screening, and treatment of diseases. In this work, we look into a specific, yet very important disease, the Alzheimer. In the United States, Alzheimer’s Disease (AD) is the 6th leading

Mathematical modeling and decision-making within the healthcare industry have given means to quantitatively evaluate the impact of decisions into diagnosis, screening, and treatment of diseases. In this work, we look into a specific, yet very important disease, the Alzheimer. In the United States, Alzheimer’s Disease (AD) is the 6th leading cause of death. Diagnosis of AD cannot be confidently confirmed until after death. This has prompted the importance of early diagnosis of AD, based upon symptoms of cognitive decline. A symptom of early cognitive decline and indicator of AD is Mild Cognitive Impairment (MCI). In addition to this qualitative test, Biomarker tests have been proposed in the medical field including p-Tau, FDG-PET, and hippocampal. These tests can be administered to patients as early detectors of AD thus improving patients’ life quality and potentially reducing the costs of the health structure. Preliminary work has been conducted in the development of a Sequential Tree Based Classifier (STC), which helps medical providers predict if a patient will contract AD or not, by sequentially testing these biomarker tests. The STC model, however, has its limitations and the need for a more complex, robust model is needed. In fact, STC assumes a general linear model as the status of the patient based upon the tests results. We take a simulation perspective and try to define a more complex model that represents the patient evolution in time.

Specifically, this thesis focuses on the formulation of a Markov Chain model that is complex and robust. This Markov Chain model emulates the evolution of MCI patients based upon doctor visits and the sequential administration of biomarker tests. Data provided to create this Markov Chain model were collected by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The data lacked detailed information of the sequential administration of the biomarker tests and therefore, different analytical approaches were tried and conducted in order to calibrate the model. The resulting Markov Chain model provided the capability to conduct experiments regarding different parameters of the Markov Chain and yielded different results of patients that contracted AD and those that did not, leading to important insights into effect of thresholds and sequence on patient prediction capability as well as health costs reduction.



The data in this thesis was provided from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI investigators did not contribute to any analysis or writing of this thesis. A list of the ADNI investigators can be found at: http://adni.loni.usc.edu/about/governance/principal-investigators/ .
ContributorsCamarena, Raquel (Author) / Pedrielli, Giulia (Thesis advisor) / Li, Jing (Thesis advisor) / Wu, Teresa (Committee member) / Arizona State University (Publisher)
Created2018
156932-Thumbnail Image.png
Description
Transfer learning is a sub-field of statistical modeling and machine learning. It refers to methods that integrate the knowledge of other domains (called source domains) and the data of the target domain in a mathematically rigorous and intelligent way, to develop a better model for the target domain than a

Transfer learning is a sub-field of statistical modeling and machine learning. It refers to methods that integrate the knowledge of other domains (called source domains) and the data of the target domain in a mathematically rigorous and intelligent way, to develop a better model for the target domain than a model using the data of the target domain alone. While transfer learning is a promising approach in various application domains, my dissertation research focuses on the particular application in health care, including telemonitoring of Parkinson’s Disease (PD) and radiomics for glioblastoma.

The first topic is a Mixed Effects Transfer Learning (METL) model that can flexibly incorporate mixed effects and a general-form covariance matrix to better account for similarity and heterogeneity across subjects. I further develop computationally efficient procedures to handle unknown parameters and large covariance structures. Domain relations, such as domain similarity and domain covariance structure, are automatically quantified in the estimation steps. I demonstrate METL in an application of smartphone-based telemonitoring of PD.

The second topic focuses on an MRI-based transfer learning algorithm for non-invasive surgical guidance of glioblastoma patients. Limited biopsy samples per patient create a challenge to build a patient-specific model for glioblastoma. A transfer learning framework helps to leverage other patient’s knowledge for building a better predictive model. When modeling a target patient, not every patient’s information is helpful. Deciding the subset of other patients from which to transfer information to the modeling of the target patient is an important task to build an accurate predictive model. I define the subset of “transferrable” patients as those who have a positive rCBV-cell density correlation, because a positive correlation is confirmed by imaging theory and the its respective literature.

The last topic is a Privacy-Preserving Positive Transfer Learning (P3TL) model. Although negative transfer has been recognized as an important issue by the transfer learning research community, there is a lack of theoretical studies in evaluating the risk of negative transfer for a transfer learning method and identifying what causes the negative transfer. My work addresses this issue. Driven by the theoretical insights, I extend Bayesian Parameter Transfer (BPT) to a new method, i.e., P3TL. The unique features of P3TL include intelligent selection of patients to transfer in order to avoid negative transfer and maintain patient privacy. These features make P3TL an excellent model for telemonitoring of PD using an At-Home Testing Device.
ContributorsYoon, Hyunsoo (Author) / Li, Jing (Thesis advisor) / Wu, Teresa (Committee member) / Yan, Hao (Committee member) / Hu, Leland S. (Committee member) / Arizona State University (Publisher)
Created2018
154578-Thumbnail Image.png
Description
Buildings consume nearly 50% of the total energy in the United States, which drives the need to develop high-fidelity models for building energy systems. Extensive methods and techniques have been developed, studied, and applied to building energy simulation and forecasting, while most of work have focused on developing dedicated modeling

Buildings consume nearly 50% of the total energy in the United States, which drives the need to develop high-fidelity models for building energy systems. Extensive methods and techniques have been developed, studied, and applied to building energy simulation and forecasting, while most of work have focused on developing dedicated modeling approach for generic buildings. In this study, an integrated computationally efficient and high-fidelity building energy modeling framework is proposed, with the concentration on developing a generalized modeling approach for various types of buildings. First, a number of data-driven simulation models are reviewed and assessed on various types of computationally expensive simulation problems. Motivated by the conclusion that no model outperforms others if amortized over diverse problems, a meta-learning based recommendation system for data-driven simulation modeling is proposed. To test the feasibility of the proposed framework on the building energy system, an extended application of the recommendation system for short-term building energy forecasting is deployed on various buildings. Finally, Kalman filter-based data fusion technique is incorporated into the building recommendation system for on-line energy forecasting. Data fusion enables model calibration to update the state estimation in real-time, which filters out the noise and renders more accurate energy forecast. The framework is composed of two modules: off-line model recommendation module and on-line model calibration module. Specifically, the off-line model recommendation module includes 6 widely used data-driven simulation models, which are ranked by meta-learning recommendation system for off-line energy modeling on a given building scenario. Only a selective set of building physical and operational characteristic features is needed to complete the recommendation task. The on-line calibration module effectively addresses system uncertainties, where data fusion on off-line model is applied based on system identification and Kalman filtering methods. The developed data-driven modeling framework is validated on various genres of buildings, and the experimental results demonstrate desired performance on building energy forecasting in terms of accuracy and computational efficiency. The framework could be easily implemented into building energy model predictive control (MPC), demand response (DR) analysis and real-time operation decision support systems.
ContributorsCui, Can (Author) / Wu, Teresa (Thesis advisor) / Weir, Jeffery D. (Thesis advisor) / Li, Jing (Committee member) / Fowler, John (Committee member) / Hu, Mengqi (Committee member) / Arizona State University (Publisher)
Created2016
187808-Thumbnail Image.png
Description
This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth

This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth theoretical study. Next, various computational approaches to estimating causal effects with machine learning methods are compared with these theoretical desiderata in mind. Several improvements to current methods for causal machine learning are identified and compelling angles for further study are pinpointed. Finally, a common method used for “explaining” predictions of machine learning algorithms, SHAP, is evaluated critically through a statistical lens.
ContributorsHerren, Andrew (Author) / Hahn, P Richard (Thesis advisor) / Kao, Ming-Hung (Committee member) / Lopes, Hedibert (Committee member) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Arizona State University (Publisher)
Created2023
158527-Thumbnail Image.png
Description
With the increased demand for genetically modified T-cells in treating hematological malignancies, the need for an optimized measurement policy within the current good manufacturing practices for better quality control has grown greatly. There are several steps involved in manufacturing gene therapy. These steps are for the autologous-type gene therapy, in

With the increased demand for genetically modified T-cells in treating hematological malignancies, the need for an optimized measurement policy within the current good manufacturing practices for better quality control has grown greatly. There are several steps involved in manufacturing gene therapy. These steps are for the autologous-type gene therapy, in chronological order, are harvesting T-cells from the patient, activation of the cells (thawing the cryogenically frozen cells after transport to manufacturing center), viral vector transduction, Chimeric Antigen Receptor (CAR) attachment during T-cell expansion, then infusion into patient. The need for improved measurement heuristics within the transduction and expansion portions of the manufacturing process has reached an all-time high because of the costly nature of manufacturing the product, the high cycle time (approximately 14-28 days from activation to infusion), and the risk for external contamination during manufacturing that negatively impacts patients post infusion (such as illness and death).

The main objective of this work is to investigate and improve measurement policies on the basis of quality control in the transduction/expansion bio-manufacturing processes. More specifically, this study addresses the issue of measuring yield within the transduction/expansion phases of gene therapy. To do so, it was decided to model the process as a Markov Decision Process where the decisions being made are optimally chosen to create an overall optimal measurement policy; for a set of predefined parameters.
ContributorsStarkey, Michaela (Author) / Pedrielli, Giulia (Thesis advisor) / Li, Jing (Committee member) / Wu, Teresa (Committee member) / Arizona State University (Publisher)
Created2020
187395-Thumbnail Image.png
Description
This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models which are designed to assess the particular challenges of different datasets. Chapter 3 explores the question of whether or not forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcy via a “self-fulfilling prophecy”. A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach loses point identification of causal effects and thus a sensitivity analysis is developed to study a wide range of plausible scenarios of the causal effect of going concern opinions on bankruptcy. Reported in the simulations are different performance metrics of the model in comparison with other popular methods and a robust analysis of the sensitivity of the model to mis-specification. Results on empirical data indicate that forecasts of bankruptcies likely do have a small causal effect. Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of mechanistic approaches as well as traditional statistical methods to epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic’s progression. In the first stage, the model decomposes the observed mortality data into component surges, and later uses this information in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.
ContributorsPapakostas, Demetrios (Author) / Hahn, Paul (Thesis advisor) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Kao, Ming-Hung (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)
Created2023
171927-Thumbnail Image.png
Description
Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems.

Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems. Eilertson et al. (2019) propose using a state-space model combined with maximum likelihood methods for estimating measles transmission. A Bayesian approach that uses particle Markov Chain Monte Carlo (pMCMC) is proposed to estimate the parameters of the non-linear state-space model developed in Eilertson et al. (2019) and similar previous studies. This dissertation illustrates the performance of this approach by calculating posterior estimates of the model parameters and predictions of the unobserved states in simulations and case studies. Also, Iteration Filtering (IF2) is used as a support method to verify the Bayesian estimation and to inform the selection of prior distributions. In the second half of the thesis, a birth-death process is proposed to model the unobserved population size of a disease vector. This model studies the effect of a disease vector population size on a second affected population. The second population follows a non-homogenous Poisson process when conditioned on the vector process with a transition rate given by a scaled version of the vector population. The observation model also measures a potential threshold event when the host species population size surpasses a certain level yielding a higher transmission rate. A maximum likelihood procedure is developed for this model, which combines particle filtering with the Minorize-Maximization (MM) algorithm and extends the work of Crawford et al. (2014).
ContributorsMartinez Rivera, Wilmer Osvaldo (Author) / Fricks, John (Thesis advisor) / Reiser, Mark (Committee member) / Zhou, Shuang (Committee member) / Cheng, Dan (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)
Created2022
191496-Thumbnail Image.png
Description
This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP) that combines Gaussian process regression with trained BART

This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP) that combines Gaussian process regression with trained BART trees. This allows for extrapolation based on the most relevant data points and covariate variables determined by the trees' structure. The local GP technique is extended to the Bayesian causal forest (BCF) models to address the positivity violation issue in causal inference. Additionally, I introduce the LongBet model to estimate time-varying, heterogeneous treatment effects in panel data. Furthermore, I present a Poisson-based model, with a modified likelihood for XBART for the multi-class classification problem.
ContributorsWang, Meijia (Author) / Hahn, Paul (Thesis advisor) / He, Jingyu (Committee member) / Lan, Shiwei (Committee member) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Arizona State University (Publisher)
Created2024
158093-Thumbnail Image.png
Description
Model-based clustering is a sub-field of statistical modeling and machine learning. The mixture models use the probability to describe the degree of the data point belonging to the cluster, and the probability is updated iteratively during the clustering. While mixture models have demonstrated the superior performance in handling noisy data

Model-based clustering is a sub-field of statistical modeling and machine learning. The mixture models use the probability to describe the degree of the data point belonging to the cluster, and the probability is updated iteratively during the clustering. While mixture models have demonstrated the superior performance in handling noisy data in many fields, there exist some challenges for high dimensional dataset. It is noted that among a large number of features, some may not indeed contribute to delineate the cluster profiles. The inclusion of these “noisy” features will confuse the model to identify the real structure of the clusters and cost more computational time. Recognizing the issue, in this dissertation, I propose a new feature selection algorithm for continuous dataset first and then extend to mixed datatype. Finally, I conduct uncertainty quantification for the feature selection results as the third topic.

The first topic is an embedded feature selection algorithm termed Expectation-Selection-Maximization (ESM) model that can automatically select features while optimizing the parameters for Gaussian Mixture Model. I introduce a relevancy index (RI) revealing the contribution of the feature in the clustering process to assist feature selection. I demonstrate the efficacy of the ESM by studying two synthetic datasets, four benchmark datasets, and an Alzheimer’s Disease dataset.

The second topic focuses on extending the application of ESM algorithm to handle mixed datatypes. The Gaussian mixture model is generalized to Generalized Model of Mixture (GMoM), which can not only handle continuous features, but also binary and nominal features.

The last topic is about Uncertainty Quantification (UQ) of the feature selection. A new algorithm termed ESOM is proposed, which takes the variance information into consideration while conducting feature selection. Also, a set of outliers are generated in the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated by visualization comparison.
ContributorsFu, Yinlin (Author) / Wu, Teresa (Thesis advisor) / Mirchandani, Pitu (Committee member) / Li, Jing (Committee member) / Pedrielli, Giulia (Committee member) / Arizona State University (Publisher)
Created2020