A Data Mining Approach to Modeling Customer Preference: A Case Study of Intel Corporation

Description

Understanding customer preference is crucial for new product planning and marketing decisions. This thesis explores how historical data can be leveraged to understand and predict customer preference. It presents a decision support framework that provides a holistic view of customer preference by following a two-phase procedure. Phase 1 uses cluster analysis to create product profiles, from which customer profiles are derived. Phase 2 then examines each customer profile and investigates the causality behind its preferences using Bayesian networks. The thesis illustrates the framework using the case of Intel Corporation, the world’s largest semiconductor manufacturing company.

Contributors: Ram, Sudarshan Venkat (Author) / Kempf, Karl G. (Thesis advisor) / Wu, Teresa (Thesis advisor) / Ju, Feng (Committee member) / Arizona State University (Publisher)

Created: 2017

A Disease Progression Modeling Framework for Nonalcoholic Steatohepatitis Using Multiparametric Serial Magnetic Resonance Imaging and Elastography

Description

Nonalcoholic Steatohepatitis (NASH) is a severe form of nonalcoholic fatty liver disease that is caused by excessive calorie intake and a sedentary lifestyle in the absence of severe alcohol consumption. It is widely prevalent in the United States and in many other developed countries, affecting up to 25 percent of the population. Because it is asymptomatic, it usually goes unnoticed and may lead to liver failure if not treated in time. Currently, liver biopsy is the gold standard for diagnosing NASH, but as an invasive procedure it comes with its own complications, along with the inconvenience of repeated sampling over a period of time. Hence, noninvasive procedures to assess NASH are urgently required. Magnetic Resonance Elastography (MRE) based shear stiffness and loss modulus, along with Magnetic Resonance Imaging based proton density fat fraction, have been successfully combined to predict NASH stages. However, their role in the prediction of disease progression remains to be investigated. This thesis therefore combines features from serial MRE observations to develop statistical models that predict NASH progression. It uses data from an experiment conducted on male mice to develop progressive and regressive NASH, and trains ordinal models (ordered probit regression and ordinal forest) on labels generated from a logistic regression model. The models are assessed on histological data collected at the end point of the experiment. The resulting models provide a framework for using a noninvasive tool to predict NASH disease progression.

Contributors: Deshpande, Eeshan (Author) / Ju, Feng (Thesis advisor) / Wu, Teresa (Committee member) / Yan, Hao (Committee member) / Arizona State University (Publisher)

Created: 2021

A Study on Optimization Measurement Policies for Quality Control Improvements in Gene Therapy Manufacturing

Description

With the increased demand for genetically modified T-cells in treating hematological malignancies, the need for an optimized measurement policy within current good manufacturing practices for better quality control has grown greatly. Manufacturing gene therapy involves several steps. For autologous-type gene therapy, these steps are, in chronological order: harvesting T-cells from the patient, activation of the cells (thawing the cryogenically frozen cells after transport to the manufacturing center), viral vector transduction, Chimeric Antigen Receptor (CAR) attachment during T-cell expansion, and infusion into the patient. The need for improved measurement heuristics within the transduction and expansion portions of the manufacturing process has reached an all-time high because of the costly nature of manufacturing the product, the high cycle time (approximately 14-28 days from activation to infusion), and the risk of external contamination during manufacturing, which can harm patients post-infusion (for example, through illness and death).

The main objective of this work is to investigate and improve measurement policies on the basis of quality control in the transduction/expansion bio-manufacturing processes. More specifically, this study addresses the issue of measuring yield within the transduction/expansion phases of gene therapy. To do so, the process is modeled as a Markov Decision Process in which decisions are chosen optimally to create an overall optimal measurement policy for a set of predefined parameters.
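As a rough illustration of how a measurement policy can be derived from a Markov Decision Process, the sketch below runs value iteration on a toy problem. It is not the model from the thesis: the three uncertainty levels, the `wait`/`measure` actions, the cost values, and the discount factor are all hypothetical choices made only for this example.

```python
# Toy measurement-policy MDP (all numbers are illustrative assumptions).
# State s is a hypothetical uncertainty level about yield: 0 (low) to 2 (high).
ACTIONS = ["wait", "measure"]
RISK_COST = [0.0, 1.0, 5.0]  # assumed per-step cost of operating at each level
MEASURE_COST = 2.0           # assumed cost of taking one measurement
N_STATES = len(RISK_COST)


def next_state(state, action):
    """Measuring resets uncertainty; waiting lets it grow by one level."""
    if action == "measure":
        return 0
    return min(state + 1, N_STATES - 1)


def step_cost(state, action):
    """Immediate cost: risk at the current level plus any measurement cost."""
    return RISK_COST[state] + (MEASURE_COST if action == "measure" else 0.0)


def q_value(state, action, values, gamma):
    """Discounted cost of taking `action` in `state`, then following `values`."""
    return step_cost(state, action) + gamma * values[next_state(state, action)]


def value_iteration(gamma=0.9, tol=1e-8):
    """Repeat Bellman updates until the value function stops changing."""
    values = [0.0] * N_STATES
    while True:
        new_values = [
            min(q_value(s, a, values, gamma) for a in ACTIONS)
            for s in range(N_STATES)
        ]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    # The policy picks the cheapest action in each state.
    policy = [
        min(ACTIONS, key=lambda a: q_value(s, a, values, gamma))
        for s in range(N_STATES)
    ]
    return values, policy


values, policy = value_iteration()
```

With these assumed costs, the computed policy waits at the lowest uncertainty level and measures at the two higher ones. Changing `MEASURE_COST` or `RISK_COST` shifts that threshold.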

Contributors: Starkey, Michaela (Author) / Pedrielli, Giulia (Thesis advisor) / Li, Jing (Committee member) / Wu, Teresa (Committee member) / Arizona State University (Publisher)

Created: 2020

Case Studies in Machine Learning of Reduced Form Models for Causal Inference

Description

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview of common techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models which are designed to assess the particular challenges of different datasets. Chapter 3 explores the question of whether forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcy via a “self-fulfilling prophecy”. A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach loses point identification of causal effects, and thus a sensitivity analysis is developed to study a wide range of plausible scenarios of the causal effect of going concern opinions on bankruptcy. The simulations report performance metrics of the model in comparison with other popular methods, along with a robust analysis of the sensitivity of the model to mis-specification. Results on empirical data indicate that forecasts of bankruptcies likely do have a small causal effect. Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of applying mechanistic approaches, as well as traditional statistical methods, to epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic’s progression. In the first stage, the model decomposes the observed mortality data into component surges, and later uses this information in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.

Contributors: Papakostas, Demetrios (Author) / Hahn, Paul (Thesis advisor) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Kao, Ming-Hung (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)

Created: 2023

Estimation for Disease Models Across Scales

Description

Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult because not every infection can be recorded by public health authorities. Notably, this may happen with whole-country measles case reports, even in countries with robust registration systems. Eilertson et al. (2019) propose using a state-space model combined with maximum likelihood methods for estimating measles transmission. A Bayesian approach that uses particle Markov Chain Monte Carlo (pMCMC) is proposed to estimate the parameters of the non-linear state-space model developed in Eilertson et al. (2019) and similar previous studies. This dissertation illustrates the performance of this approach by calculating posterior estimates of the model parameters and predictions of the unobserved states in simulations and case studies. Also, Iterated Filtering (IF2) is used as a support method to verify the Bayesian estimation and to inform the selection of prior distributions. In the second half of the thesis, a birth-death process is proposed to model the unobserved population size of a disease vector. This model studies the effect of a disease vector population size on a second affected population. Conditioned on the vector process, the second population follows a non-homogeneous Poisson process with a transition rate given by a scaled version of the vector population. The observation model also measures a potential threshold event when the host species population size surpasses a certain level, yielding a higher transmission rate. A maximum likelihood procedure is developed for this model, which combines particle filtering with the Minorize-Maximization (MM) algorithm and extends the work of Crawford et al. (2014).

Contributors: Martinez Rivera, Wilmer Osvaldo (Author) / Fricks, John (Thesis advisor) / Reiser, Mark (Committee member) / Zhou, Shuang (Committee member) / Cheng, Dan (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)

Created: 2022

Multi-Variant Spatially Informed Rapid Testing for Epidemic Model

Description

The COVID-19 outbreak that started in 2020 brought the world to its knees and is still a menace after three years. Over eighty-five million cases and over a million deaths have occurred due to COVID-19 during that time in the United States alone. A great deal of research has gone into building epidemic models that show the impact of the virus by plotting the cases, deaths, and hospitalizations due to COVID-19. However, there is very little research on mapping the different variants of COVID-19. SARS-CoV-2, the virus that causes COVID-19, constantly mutates, and multiple variants have emerged over time. The major variants include Beta, Gamma, Delta, and the most recent one, Omicron. The purpose of the research in this thesis is to modify one epidemic model, the Spatially Informed Rapid Testing for Epidemic Model (SIRTEM), so that multiple variants of the virus are modeled at the same time. The model will be assessed by adding the Omicron and Delta variants; in doing so, the effects of different variants can be studied by looking at the positive cases, hospitalizations, and deaths from both variants for the Arizona population. The focus will be to find the best infection rate and testing rate using random numbers, so that the published positive cases and the positive cases derived from the model have the least mean square error.
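The calibration idea in the last sentence, drawing random infection and testing rates and keeping the pair with the smallest mean square error against published cases, can be sketched as follows. This is only an illustration under stated assumptions: the simple single-variant compartment model, the parameter ranges, and the synthetic "published" series are hypothetical stand-ins, not SIRTEM or real Arizona data.

```python
import random


def simulate_positive_cases(infection_rate, testing_rate, days=60,
                            population=1_000_000, initial_infected=100,
                            recovery_rate=0.1):
    """Toy discrete-time SIR model (an assumption, not SIRTEM).

    Returns daily reported positives, taken here to be the tested
    fraction of new infections.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    reported = []
    for _ in range(days):
        new_infections = infection_rate * susceptible * infected / population
        new_recoveries = recovery_rate * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        reported.append(testing_rate * new_infections)
    return reported


def mean_squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def random_search(published, n_trials=2000, seed=0):
    """Sample (infection_rate, testing_rate) uniformly; keep the lowest-MSE pair."""
    rng = random.Random(seed)
    best_params, best_mse = None, float("inf")
    for _ in range(n_trials):
        candidate = (rng.uniform(0.05, 1.0), rng.uniform(0.05, 1.0))
        mse = mean_squared_error(simulate_positive_cases(*candidate), published)
        if mse < best_mse:
            best_params, best_mse = candidate, mse
    return best_params, best_mse


# Synthetic stand-in for published case counts, generated from known rates.
published = simulate_positive_cases(infection_rate=0.3, testing_rate=0.5)
best_params, best_mse = random_search(published)
```

Plain random search is easy to implement but wasteful. A real calibration would likely need many more trials, or a guided optimizer, once several variants and spatial regions are involved.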

Contributors: Varghese, Allen Moncey (Author) / Pedrielli, Giulia (Thesis advisor) / Candan, Kasim S (Committee member) / Wu, Teresa (Committee member) / Arizona State University (Publisher)

Created: 2022

Computational Challenges in BART Modeling: Extrapolation, Classification, and Causal Inference

Description

This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP) that combines Gaussian process regression with trained BART trees. This allows for extrapolation based on the most relevant data points and covariates, as determined by the trees' structure. The local GP technique is extended to Bayesian causal forest (BCF) models to address the positivity violation issue in causal inference. Additionally, I introduce the LongBet model to estimate time-varying, heterogeneous treatment effects in panel data. Furthermore, I present a Poisson-based model with a modified likelihood for XBART to handle the multi-class classification problem.

Contributors: Wang, Meijia (Author) / Hahn, Paul (Thesis advisor) / He, Jingyu (Committee member) / Lan, Shiwei (Committee member) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Arizona State University (Publisher)

Created: 2024

Novel Semi-Supervised Learning Models to Balance Data Inclusivity and Usability in Healthcare Applications

Description

Semi-supervised learning (SSL) is a sub-field of statistical machine learning that is useful for problems with only a few labeled instances, which have both predictor (X) and target (Y) information, and an abundance of unlabeled instances, which have only predictor (X) information. SSL harnesses the target information available in the limited labeled data, as well as the information in the abundant unlabeled data, to build strong predictive models. However, not all the included information is useful. For example, some features may correspond to noise, and including them will hurt predictive model performance. Additionally, some instances may not be as relevant to model building, and their inclusion will increase training time and potentially hurt model performance. The objective of this research is to develop novel SSL models to balance data inclusivity and usability. My dissertation research focuses on applications of SSL in healthcare, driven by problems in brain cancer radiomics, migraine imaging, and Parkinson’s Disease telemonitoring.

The first topic introduces an integration of machine learning (ML) and a mechanistic model (PI) to develop an SSL model applied to predicting cell density of glioblastoma brain cancer using multi-parametric medical images. The proposed ML-PI hybrid model integrates imaging information from unbiopsied regions of the brain as well as underlying biological knowledge from the mechanistic model to predict spatial tumor density in the brain.

The second topic develops a multi-modality imaging-based diagnostic decision support system (MMI-DDS). MMI-DDS consists of modality-wise principal components analysis to incorporate imaging features at different aggregation levels (e.g., voxel-wise, connectivity-based, etc.), a constrained particle swarm optimization (cPSO) feature selection algorithm, and a clinical utility engine that utilizes inverse operators on chosen principal components for white-box classification models.

The final topic develops a new SSL regression model with integrated feature and instance selection called s2SSL (with “s2” referring to selection in two different ways: feature and instance). s2SSL integrates cPSO feature selection and graph-based instance selection to simultaneously choose the optimal features and instances and build accurate models for continuous prediction. s2SSL was applied to smartphone-based telemonitoring of Parkinson’s Disease patients.

Contributors: Gaw, Nathan (Author) / Li, Jing (Thesis advisor) / Wu, Teresa (Committee member) / Yan, Hao (Committee member) / Hu, Leland (Committee member) / Arizona State University (Publisher)

Created: 2019

Embedded Feature Selection for Model-based Clustering

Description

Model-based clustering is a sub-field of statistical modeling and machine learning. Mixture models use probabilities to describe the degree to which a data point belongs to a cluster, and these probabilities are updated iteratively during clustering. While mixture models have demonstrated superior performance in handling noisy data in many fields, challenges remain for high-dimensional datasets. Among a large number of features, some may not contribute to delineating the cluster profiles. Including these “noisy” features makes it harder for the model to identify the real structure of the clusters and costs more computational time. Recognizing this issue, in this dissertation I first propose a new feature selection algorithm for continuous datasets and then extend it to mixed datatypes. Finally, I conduct uncertainty quantification for the feature selection results as the third topic.

The first topic is an embedded feature selection algorithm, termed the Expectation-Selection-Maximization (ESM) model, that can automatically select features while optimizing the parameters of a Gaussian Mixture Model. I introduce a relevancy index (RI) that reveals the contribution of each feature to the clustering process and assists feature selection. I demonstrate the efficacy of ESM by studying two synthetic datasets, four benchmark datasets, and an Alzheimer’s Disease dataset.

The second topic focuses on extending the ESM algorithm to handle mixed datatypes. The Gaussian mixture model is generalized to the Generalized Model of Mixture (GMoM), which can handle not only continuous features but also binary and nominal features.

The last topic is Uncertainty Quantification (UQ) of the feature selection. A new algorithm termed ESOM is proposed, which takes variance information into consideration while conducting feature selection. In addition, a set of outliers is generated in the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated by visual comparison.

Contributors: Fu, Yinlin (Author) / Wu, Teresa (Thesis advisor) / Mirchandani, Pitu (Committee member) / Li, Jing (Committee member) / Pedrielli, Giulia (Committee member) / Arizona State University (Publisher)

Created: 2020

Spatial Regression and Gaussian Process BART

Description

Spatial regression is one of the central topics in spatial statistics. Depending on the goal (interpretation or prediction), spatial regression models can be classified into two categories: linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real-world applications. New methods and models were proposed to overcome the challenges that arise in practice. There are three major parts in the dissertation.

In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges: zero-inflated data and out-of-sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions. Three strategies were proposed to solve the out-of-sample prediction problem. The results and discussions showed that the nonlinear prediction had the advantages of high accuracy, low bias, and good performance across multiple resolutions.

In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. In the first stage, a spatial linear mixed model captured the linear and stationary effects. In the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model offered good interpretability in understanding the effect of covariates while maintaining high prediction accuracy, competitive with popular machine learning models such as random forest, xgboost, and support vector machine.

A new nonlinear regression model, Gaussian process BART (Bayesian additive regression tree), was proposed in the third part. Combining the advantages of both BART and Gaussian processes, the model could capture the nonlinear effects of both observed and latent covariates. To develop the model, the traditional BART was first generalized to accommodate correlated errors. Then, the failure of likelihood-based Markov chain Monte Carlo (MCMC) in parameter estimation was discussed. Based on the idea of analysis of variation, two techniques, back comparing and tuning range, were proposed to tackle this failure. Finally, the effectiveness of the new model was examined through experiments on both simulated and real data.

Contributors: Lu, Xuetao (Author) / McCulloch, Robert (Thesis advisor) / Hahn, Paul (Committee member) / Lan, Shiwei (Committee member) / Zhou, Shuang (Committee member) / Saul, Steven (Committee member) / Arizona State University (Publisher)

Created: 2020
