Matching Items (167)

Description

Public health surveillance is a special case of the general problem where counts (or rates) of events are monitored for changes. Modern data complements event counts with many additional measurements (such as geographic, demographic, and others) that comprise high-dimensional covariates. This leads to an important challenge: detecting a change that occurs only within a region, initially unspecified, that is defined by these covariates. Current methods are typically limited to spatial and/or temporal covariate information and often fail to use all the information available in modern data that can be paramount in unveiling these subtle changes. Additional complexities associated with modern health data that are often not accounted for by traditional methods include: covariates of mixed type, missing values, and high-order interactions among covariates. This work proposes a transform of public health surveillance to supervised learning, so that an appropriate learner can inherently address all the complexities described previously. At the same time, quantitative measures from the learner can be used to define signal criteria to detect changes in rates of events. A Feature Selection (FS) method is used to identify covariates that contribute to a model and to generate a signal. A measure of statistical significance is included to control false alarms. An alternative Percentile method identifies the specific cases that lead to changes using class probability estimates from tree-based ensembles. This second method is intended to be less computationally intensive and significantly simpler to implement. Finally, a third method labeled Rule-Based Feature Value Selection (RBFVS) is proposed for identifying the specific regions in high-dimensional space where the changes are occurring. Results on simulated examples are used to compare the FS method and the Percentile method. Note that this work emphasizes the application of the proposed methods to public health surveillance. Nonetheless, these methods can easily be extended to a variety of applications where counts (or rates) of events are monitored for changes. Such problems commonly occur in domains such as manufacturing, economics, environmental systems, and engineering, as well as in public health.
Contributors: Davila, Saylisse (Author) / Runger, George C. (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Young, Dennis (Committee member) / Gel, Esma (Committee member) / Arizona State University (Publisher)
Created: 2010
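To make the idea of recasting surveillance as supervised learning concrete, here is a minimal Python sketch, assuming scikit-learn's RandomForestClassifier and synthetic data; the injected shift, the 95th-percentile rule, and the threshold are illustrative stand-ins, not the dissertation's FS, Percentile, or RBFVS procedures.

```python
# Illustrative sketch only: recasting change detection as supervised learning.
# Labels distinguish a baseline window (0) from the current window (1); if the
# two windows are exchangeable, class probabilities should hover near 0.5.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical covariates: baseline window vs. current window, with a change
# confined to a small region of the covariate space (x0 > 1.0).
X_base = rng.normal(size=(1000, 5))
X_curr = rng.normal(size=(1000, 5))
X = np.r_[X_base, X_curr]
y = np.r_[np.zeros(1000), np.ones(1000)]
X[1000:, 1] += 1.5 * (X[1000:, 0] > 1.0)   # localized shift in the current window

forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)

# Percentile-style signal: if the upper tail of the class-1 probabilities for
# current-window cases sits well above 0.5, some covariate region has changed.
p_curr = forest.oob_decision_function_[1000:, 1]
signal = np.percentile(p_curr, 95) > 0.6   # illustrative threshold
print("signal:", signal)

# Feature importances hint at which covariates define the changed region.
print("importances:", np.round(forest.feature_importances_, 3))
```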
Description

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAM teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constitutive trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model is misspecified. The second study develops two novel interaction measures. These measures could be used within, but are not restricted to, the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.
Contributors: Valdivia, Arturo (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Broatch, Jennifer (Committee member) / Arizona State University (Publisher)
Created: 2013
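The node-proportion and covariate-proportion VIMs are not specified in the abstract, so the sketch below ranks hypothetical teacher effects with an off-the-shelf random-forest importance measure (scikit-learn's permutation importance), the kind of existing VIM the study compares against; the data, effect sizes, and features are made up for illustration.

```python
# Illustrative only: ranking grouped effects with an off-the-shelf random-forest
# variable importance measure. The node-proportion and covariate-proportion VIMs
# proposed in the dissertation are not reproduced here.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: dummy indicators for 5 "teachers" with different true
# effects, plus one student-level covariate.
teacher = rng.integers(0, 5, size=n)
effects = np.array([0.0, 0.2, -0.1, 0.5, -0.4])
x_student = rng.normal(size=n)
y = effects[teacher] + 0.3 * x_student + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.eye(5)[teacher], x_student])

forest = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=10, random_state=1)

# Rank the teacher indicators by importance and compare with |true effect|.
rank_est = np.argsort(-imp.importances_mean[:5])
rank_true = np.argsort(-np.abs(effects))
print("estimated ranking:", rank_est)
print("true ranking:     ", rank_true)
```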
Description

This thesis is concerned with experimental designs for studies with a controllable independent variable X, a continuous response variable Y, and a binary response variable Z. It is known that a judiciously selected design allows experimenters to collect informative data for making precise and valid statistical inferences with minimum cost. However, for the complex setting that this thesis considers, designs that yield a high expected estimation precision may still have a high probability of yielding non-estimable parameters, especially when the sample size is small. Such an observation has been reported in some previous work on the separation issue for, e.g., logistic regression. Therefore, when selecting a study design, it is important to consider both the expected variances of the parameter estimates and the probability of having non-estimable parameters. A comparison of two approaches for constructing designs for the previously mentioned setting with a mixed-responses model is presented in this work. The two design approaches are the locally A-optimal design approach and a penalized A-optimal design approach, in which the A-optimality criterion is augmented with a penalty term to reduce the chance of including design points that have a high probability of making some parameters non-estimable.
Contributors: Abdullah, Bayan Abdulaziz (Author) / Kao, Ming-Hung (Thesis advisor) / Cheng, Dan (Committee member) / Zheng, Yi (Committee member) / Arizona State University (Publisher)
Created: 2021
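As a toy illustration of the kind of criterion being penalized, the snippet below evaluates an A-optimality value, tr(M(ξ)^{-1}), for candidate designs under a simple quadratic regression model, with a generic penalty hook; the mixed-response (continuous plus binary) model, the actual penalty form, and the estimability probabilities from the thesis are not reproduced.

```python
# Toy sketch: A-optimality of a candidate design plus a generic penalty hook.
# A simple quadratic regression model stands in for the mixed-responses model.
import numpy as np

def a_criterion(x_points, penalty_weight=0.0, penalty=0.0):
    """tr(M^{-1}) for a quadratic regression design, optionally penalized."""
    F = np.column_stack([np.ones_like(x_points), x_points, x_points**2])
    M = F.T @ F / len(x_points)                 # normalized information matrix
    value = np.trace(np.linalg.inv(M))          # A-optimality: smaller is better
    return value + penalty_weight * penalty     # penalty term is a placeholder

# Compare two candidate 6-point designs on [-1, 1].
design_spread = np.array([-1.0, -0.6, -0.2, 0.2, 0.6, 1.0])
design_ends = np.array([-1.0, -1.0, 0.0, 0.0, 1.0, 1.0])
print("spread design  :", round(a_criterion(design_spread), 3))
print("endpoint design:", round(a_criterion(design_ends), 3))
```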
Description

The continuous time-tagging of photon arrival times for high count rate sources is necessary for applications such as optical communications, quantum key encryption, and astronomical measurements. Detection of Hanbury-Brown and Twiss (HBT) single photon correlations from thermal sources, such as stars, requires a combination of high dynamic range, long integration times, and low systematics in the photon detection and time tagging system. The continuous nature of the measurements and the need for highly accurate timing resolution require a customized time-to-digital converter (TDC). A custom-built, two-channel, field programmable gate array (FPGA)-based TDC capable of continuously time tagging single photons with sub-clock-cycle timing resolution was characterized. Auto-correlation and cross-correlation measurements were used to constrain spurious systematic effects in the pulse count data as a function of system variables. These variables included, but were not limited to, incident photon count rate, incoming signal attenuation, and measurements of fixed signals. Additionally, a generalized likelihood ratio test (GLRT) using maximum likelihood estimators (MLEs) was derived as a means to detect and estimate correlated photon signal parameters. The derived GLRT was capable of detecting correlated photon signals in a laboratory setting with a high degree of statistical confidence. A proof is presented in which the MLE for the amplitude of the correlated photon signal is shown to be the minimum variance unbiased estimator (MVUE). The fully characterized TDC was used in preliminary measurements of astronomical sources using ground-based telescopes. Finally, preliminary theoretical groundwork is established for the deep space optical communications system of the proposed Breakthrough Starshot project, in which low-mass craft will travel to the Alpha Centauri system to collect scientific data from Proxima B. This theoretical groundwork utilizes recent and upcoming space-based optical communication systems as starting points for the Starshot communication system.
Contributors: Hodges, Todd Michael William (Author) / Mauskopf, Philip (Thesis advisor) / Trichopoulos, George (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Bliss, Daniel (Committee member) / Arizona State University (Publisher)
Created: 2022
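The sketch below shows only the generic GLRT recipe mentioned in the abstract, applied to a toy model of counts with an added signal template and Gaussian noise of known variance; the photon-correlation model, the MVUE proof, and the TDC hardware details are not reproduced.

```python
# Generic GLRT sketch: detect a signal amplitude A in noisy counts
# y = b + A*s + noise (Gaussian noise, known sigma). Only the general recipe is
# shown: maximize the likelihood under H0 (A = 0) and H1, then compare
# 2*log(likelihood ratio) to a chi-square(1) threshold.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, sigma = 500, 1.0
s = np.sin(np.linspace(0, 8 * np.pi, n))           # assumed known signal template
y = 10.0 + 0.3 * s + rng.normal(scale=sigma, size=n)

# MLEs by least squares (equivalent to ML for Gaussian noise).
X1 = np.column_stack([np.ones(n), s])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)     # H1: baseline + amplitude
b0 = y.mean()                                      # H0: baseline only

rss0 = np.sum((y - b0) ** 2)
rss1 = np.sum((y - X1 @ beta1) ** 2)
glrt = (rss0 - rss1) / sigma**2                    # 2*log likelihood ratio

threshold = chi2.ppf(0.999, df=1)                  # controls the false-alarm rate
print(f"GLRT = {glrt:.1f}, threshold = {threshold:.1f}, detect = {glrt > threshold}")
print(f"estimated amplitude A_hat = {beta1[1]:.3f}")
```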
Description

Recently, Generative Adversarial Networks (GANs) have been applied to the problem of Cold-Start Recommendation, but the training performance of these models is hampered by the extreme sparsity in warm user purchase behavior. This thesis introduces a novel representation for user vectors by combining user demographics and user preferences, making the model a hybrid system that uses Collaborative Filtering and Content-Based Recommendation. This system models user purchase behavior using weighted user-product preferences (explicit feedback) rather than binary user-product interactions (implicit feedback). Using this representation, a novel sparse adversarial model, the Sparse ReguLarized Generative Adversarial Network (SRLGAN), is developed for Cold-Start Recommendation. SRLGAN leverages the sparse user-purchase behavior, which ensures training stability and avoids over-fitting on warm users. The performance of SRLGAN is evaluated on two popular datasets and demonstrates state-of-the-art results.
Contributors: Shah, Aksheshkumar Ajaykumar (Author) / Venkateswara, Hemanth (Thesis advisor) / Berman, Spring (Thesis advisor) / Ladani, Leila J (Committee member) / Arizona State University (Publisher)
Created: 2022
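A small sketch of the hybrid user-vector idea: concatenating demographics with weighted (explicit-feedback) product preferences into one representation. Field names, normalizations, and sizes are hypothetical, and the SRLGAN generator, discriminator, and sparsity regularization are not shown.

```python
# Sketch of a hybrid user vector: demographics concatenated with weighted
# user-product preferences (explicit feedback). Fields are hypothetical.
import numpy as np

N_PRODUCTS = 6

def user_vector(age, gender, location_id, ratings, n_locations=3):
    """Build [normalized demographics | dense preference weights]."""
    demo = np.zeros(2 + n_locations)
    demo[0] = age / 100.0                     # crude normalization
    demo[1] = 1.0 if gender == "F" else 0.0
    demo[2 + location_id] = 1.0               # one-hot location
    prefs = np.zeros(N_PRODUCTS)
    for product_id, rating in ratings.items():
        prefs[product_id] = rating / 5.0      # weighted, not binary, feedback
    return np.concatenate([demo, prefs])

# Warm user (has ratings) vs. cold-start user (demographics only).
warm = user_vector(34, "F", 1, {0: 5, 3: 2, 4: 4})
cold = user_vector(27, "M", 2, {})
print(warm)
print(cold)
```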
Description

For untargeted volatile metabolomics analyses, comprehensive two-dimensional gas chromatography (GC×GC) is a powerful tool for separating complex mixtures and can provide highly specific information about the chemical composition of a variety of samples. With respect to human disease, the application of GC×GC in untargeted metabolomics is contributing to the development of diagnostics for a range of diseases, most notably bacterial infections. Pseudomonas aeruginosa, in particular, is an important human pathogen, and for individuals with cystic fibrosis (CF), chronic P. aeruginosa lung infections significantly increase morbidity and mortality. Developing non-invasive tools that detect these infections earlier is critical for improving patient outcomes, and untargeted profiling of P. aeruginosa volatile metabolites could be leveraged to meet this challenge. The work presented in this dissertation serves as a case study of the application of GC×GC in this area. Using headspace solid-phase microextraction and time-of-flight mass spectrometry coupled with GC×GC (HS-SPME GC×GC-TOFMS), the volatile metabolomes of P. aeruginosa isolates from early and late chronic CF lung infections were characterized. Through this study, the size of the P. aeruginosa pan-volatilome was increased by almost 40%, and differences in the relative abundances of the volatile metabolites between early- and late-infection isolates were identified. These differences were also strongly associated with isolate phenotype. Subsequent analyses sought to connect these metabolome-phenome trends to the genome by profiling the volatile metabolomes of P. aeruginosa strains harboring mutations in genes that are important for regulating chronic infection phenotypes. Subsets of volatile metabolites that accurately distinguish between wild-type and mutant strains were identified. Together, these results highlight the utility of GC×GC in the search for prognostic volatile biomarkers for P. aeruginosa CF lung infections. Finally, the complex data sets acquired from untargeted GC×GC studies pose major challenges in downstream statistical analysis. Missing data, in particular, severely limits even the most robust statistical tools and must be remediated, commonly through imputation. A comparison of imputation strategies showed that algorithmic approaches such as Random Forest have superior performance over simpler methods, and imputing within replicate samples reinforces volatile metabolite reproducibility.
Contributors: Davis, Trenton James (Author) / Bean, Heather D (Thesis advisor) / Haydel, Shelley E (Committee member) / Lake, Douglas F (Committee member) / Runger, George C (Committee member) / Arizona State University (Publisher)
Created: 2022
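The closing point about imputation can be illustrated with scikit-learn: a simple mean imputer against an iterative imputer driven by a random-forest regressor (a stand-in for the Random Forest imputation mentioned above), applied to a synthetic, correlated peak table; the real GC×GC data and the dissertation's evaluation design are not reproduced.

```python
# Sketch: compare mean imputation with a Random Forest-driven iterative imputer
# on a small synthetic, correlated peak table with values missing at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
latent = rng.normal(size=(60, 1))                          # shared structure
loadings = rng.normal(size=(1, 8))
true = np.exp(2.0 + latent @ loadings + 0.3 * rng.normal(size=(60, 8)))
mask = rng.random(true.shape) < 0.2                        # 20% missing at random
observed = np.where(mask, np.nan, true)

mean_imp = SimpleImputer(strategy="mean").fit_transform(observed)
rf_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=3),
    random_state=3, max_iter=5,
).fit_transform(observed)

def rmse(imputed):
    return np.sqrt(np.mean((imputed[mask] - true[mask]) ** 2))

print("mean imputation RMSE:", round(rmse(mean_imp), 3))
print("RF iterative RMSE   :", round(rmse(rf_imp), 3))
```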
Description

The climate-driven volumetric response of unsaturated soils (shrink-swell and frost heave) frequently causes costly distresses in lightly loaded structures (pavements and shallow foundations) due to sporadic climatic fluctuations and soil heterogeneity that are not captured during geotechnical design. The complexity associated with unsaturated soil mechanics, combined with the high degree of variability in both the natural characteristics of soil and the empirical models that are commonly implemented, tends to lead to engineering judgment outweighing the results of deterministic computations as the basis of design. Recent advances in the application of statistical techniques and Bayesian inference in geotechnical modeling allow for the inclusion of both parameter and model uncertainty, providing a quantifiable representation of this invaluable engineering judgment. The overall goal achieved in this study was to develop, validate, and implement a new method to evaluate climate-driven volume change of shrink-swell soils using a framework that encompasses predominantly stochastic time-series techniques and mechanistic shrink-swell volume change computations. Four valuable objectives were accomplished during this research study while on the path to complete the overall goal: 1) development of a procedure for automating the selection of the Fourier series form of the soil suction diffusion equations used to represent the natural seasonal variations in suction at the ground surface, 2) development of an improved framework for deterministic estimation of shrink-swell soil volume change using historical climate data and the Fourier series suction model, 3) development of a Bayesian approach to randomly generate combinations of correlated soil properties for use in stochastic simulations, and 4) development of a procedure to stochastically forecast the climatic parameters required for shrink-swell soil volume change estimations. The models presented can be easily implemented into existing foundation and pavement design procedures or used for forensic evaluations using historical data. For pavement design, the new framework for stochastically forecasting the variability of shrink-swell soil volume change provides significant improvement over the existing empirical models that have been used for more than four decades.
Contributors: Olaiz, Austin Hunter (Author) / Zapata, Claudia (Thesis advisor) / Houston, Sandra (Committee member) / Kavazanjian, Edward (Committee member) / Soltanpour, Yasser (Committee member) / Arizona State University (Publisher)
Created: 2022
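A brief sketch of fitting a seasonal Fourier series of the general form u(t) = a0 + Σn [an cos(2πnt/T) + bn sin(2πnt/T)] to synthetic surface-suction data by least squares; the automated order-selection procedure, the Bayesian generation of correlated soil properties, and the volume-change computations from the dissertation are not reproduced, and the data here are invented.

```python
# Sketch: least-squares fit of a seasonal Fourier series
#   u(t) = a0 + sum_n [ a_n cos(2*pi*n*t/T) + b_n sin(2*pi*n*t/T) ]
# to a synthetic surface-suction time series (period T = 365 days).
import numpy as np

T = 365.0
t = np.arange(3 * 365, dtype=float)                         # three years, daily
rng = np.random.default_rng(4)
suction = 3.5 + 0.8 * np.cos(2 * np.pi * t / T - 0.6) + 0.2 * rng.normal(size=t.size)

def fourier_design(t, n_harmonics):
    cols = [np.ones_like(t)]
    for n in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * n * t / T))
        cols.append(np.sin(2 * np.pi * n * t / T))
    return np.column_stack(cols)

F = fourier_design(t, n_harmonics=2)
coef, *_ = np.linalg.lstsq(F, suction, rcond=None)
fitted = F @ coef

print("a0 =", round(coef[0], 3))
print("RMS residual =", round(np.sqrt(np.mean((suction - fitted) ** 2)), 3))
```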
Description

Longitudinal data involving multiple subjects are common in medical and social science areas. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data, and the problem of searching for optimal designs under such models. In this case, based on optimal design theory, the optimality criteria depend on the estimated parameters, which leads to local optimality. Moreover, the information matrix under a GLMM does not have a closed-form expression. My dissertation includes three topics related to this design problem. The first part is searching for locally optimal designs under GLMMs with longitudinal data. I apply the penalized quasi-likelihood (PQL) method to approximate the information matrix and compare several approximations to show the superiority of PQL over the alternatives. Under different local parameters and design restrictions, locally D- and A-optimal designs are constructed based on the approximation. An interesting finding is that locally optimal designs sometimes assign different designs to different subjects. Finally, the robustness of these locally optimal designs is discussed. In the second part, an unknown observational covariate is added to the previous model. With an unknown observational variable in the experiment, expected optimality criteria are considered. Under different assumptions on the unknown variable and different parameter settings, locally optimal designs are constructed and discussed. In the last part, Bayesian optimal designs are considered under logistic mixed models. Considering different priors for the local parameters, Bayesian optimal designs are generated. Bayesian design under such a model is usually computationally expensive; in this dissertation, the running time is reduced to an acceptable level while maintaining accurate results. I also discuss the robustness of these Bayesian optimal designs, which is the motivation for applying such an approach.
Contributors: Shi, Yao (Author) / Stufken, John (Thesis advisor) / Kao, Ming-Hung (Thesis advisor) / Lan, Shiwei (Committee member) / Pan, Rong (Committee member) / Reiser, Mark (Committee member) / Arizona State University (Publisher)
Created: 2022
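As a simplified stand-in for the locally optimal design computations described above, the snippet below evaluates D- and A-optimality criteria from the information matrix of a plain logistic regression at assumed local parameter values; no random effects are included, so no PQL approximation is needed, and the GLMM results of the dissertation are not reproduced.

```python
# Simplified stand-in: local D- and A-optimality criteria for a plain logistic
# model (no random effects), evaluated at assumed local parameter values.
import numpy as np

beta = np.array([0.5, 1.5])                      # assumed local parameter values

def info_matrix(design_points):
    F = np.column_stack([np.ones_like(design_points), design_points])
    eta = F @ beta
    w = np.exp(eta) / (1 + np.exp(eta)) ** 2     # logistic weights p * (1 - p)
    return (F * w[:, None]).T @ F / len(design_points)

def d_value(M):
    return np.linalg.det(M)                      # larger is better (D-optimality)

def a_value(M):
    return np.trace(np.linalg.inv(M))            # smaller is better (A-optimality)

for name, pts in [("uniform", np.linspace(-2, 2, 6)),
                  ("two-point", np.array([-1.5, -1.5, -1.5, 1.0, 1.0, 1.0]))]:
    M = info_matrix(pts)
    print(f"{name:10s}  det(M) = {d_value(M):.4f}   tr(M^-1) = {a_value(M):.2f}")
```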
Description

This dissertation contains two research projects: Multiple Change Point Detection in Linear Models and Statistical Inference for Implicit Network Structures. In the first project, a new method to detect the number and locations of change points in piecewise linear models under stationary Gaussian noise is proposed. The method transforms the problem of detecting change points into the detection of local extrema by kernel smoothing and differentiating the data sequence. The change points are detected by computing the p-values for all local extrema using the derived peak height distributions of smooth Gaussian processes, and then applying the Benjamini-Hochberg procedure to identify significant local extrema. Theoretical results show that the method guarantees asymptotic control of the False Discovery Rate (FDR) and power consistency as the length of the sequence and the sizes of the slope changes and jumps grow large. In addition, compared to traditional methods for change point detection based on recursive segmentation, the proposed method tests the candidate local extrema only once, achieving the smallest computational complexity. Numerical studies show that the properties of FDR control and power consistency are maintained in non-asymptotic cases. In the second project, identifiability and estimation consistency of the hub model are proved under mild conditions. The hub model is a model-based approach, introduced by Zhao and Weko (2019), for inferring implicit network structures from grouping behavior. The hub model assumes that each group is brought together by a member of the group, called the hub. This work generalizes the hub model by introducing a model component that allows hubless groups, in which individual nodes appear spontaneously, independent of any other individual. The new model bridges the gap between the hub model and the degenerate case of the mixture model, the Bernoulli product. Furthermore, a penalized likelihood approach is proposed to estimate the set of hubs when it is unknown.
ContributorsHe, Zhibing (Author) / Zhao, Yunpeng YZ (Thesis advisor) / Cheng, Dan DC (Thesis advisor) / Lopes, Hedibert HL (Committee member) / Fricks, John JF (Committee member) / Kao, Ming-Hung MK (Committee member) / Arizona State University (Publisher)
Created2022
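A rough sketch of the first project's pipeline: kernel-smooth and differentiate the sequence, locate local extrema of the derivative, and screen them with Benjamini-Hochberg. The p-values below use a crude normal approximation based on an estimated noise scale, not the peak-height distributions of smooth Gaussian processes derived in the dissertation, and the data and tuning constants are invented.

```python
# Rough sketch: kernel-smooth and differentiate the sequence, find local extrema
# of the derivative, and apply Benjamini-Hochberg to crude p-values.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelextrema
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)
n = 1200
x = np.arange(n)
mean_seq = np.where(x < 400, 0.0, np.where(x < 800, 2.0, 0.5))   # two jumps
y = mean_seq + rng.normal(scale=0.5, size=n)

deriv = gaussian_filter1d(y, sigma=15, order=1)          # smoothed first derivative
cand = np.concatenate([argrelextrema(deriv, np.greater, order=20)[0],
                       argrelextrema(deriv, np.less, order=20)[0]])

# Crude p-values: compare each extremum height with the derivative's noise scale
# estimated from an early flat stretch (a stand-in for the peak-height theory).
noise_sd = np.std(deriv[:300])
pvals = 2 * norm.sf(np.abs(deriv[cand]) / noise_sd)

reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("estimated change points:", np.sort(cand[reject]))
```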
Description

A goodness-of-fit test is a hypothesis test used to assess whether a given model fits the data well. It is extremely difficult to find a universal goodness-of-fit test that can test all types of statistical models. Moreover, the traditional Pearson’s chi-square goodness-of-fit test is an omnibus test rather than a directional test, so it is hard to find the source of poor fit when the null hypothesis is rejected, and the test loses its validity and effectiveness under certain special conditions. Sparseness is one such condition. One effective way to overcome the adverse effects of sparseness is to use limited-information statistics. This dissertation includes two topics on constructing and using limited-information statistics to overcome sparseness for binary data. In the first topic, the theoretical framework of pairwise concordance and the transformation matrix used to extract the corresponding marginals, along with their generalizations, are provided. Then a series of new chi-square test statistics and corresponding orthogonal components are proposed, which are used to detect model misspecification for longitudinal binary data. One important conclusion is that the test statistic $X^2_{2c}$ can be taken as an extension of $X^2_{[2]}$, the statistic based on the second-order marginals of the traditional Pearson’s chi-square. In the second topic, the research interest is to investigate the effect of different intercept patterns when using the Lagrange multiplier (LM) test to find the source of misfit for two items in a 2-PL IRT model. Several other directional chi-square test statistics are included in the comparison. The simulation results show that the intercept pattern does affect the performance of the goodness-of-fit test, especially the power to find the source of misfit when such a source exists. More specifically, the power is directly affected by the "intercept distance" between the two misfitting variables. Another finding is that the LM test statistic strikes the best balance between accurate Type I error rates and high empirical power, which indicates that the LM test is a robust test.
Contributors: Xu, Jinhui (Author) / Reiser, Mark (Thesis advisor) / Kao, Ming-Hung (Committee member) / Wilson, Jeffrey (Committee member) / Zheng, Yi (Committee member) / Edwards, Michael (Committee member) / Arizona State University (Publisher)
Created: 2022
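To illustrate why sparseness motivates limited-information statistics, the sketch below contrasts the 2^T-cell full cross-classification of T binary items with the well-populated pairwise 2×2 marginal tables; the statistics $X^2_{2c}$ and the LM test themselves are not reproduced, and the binary responses are simulated for illustration only.

```python
# Sketch: sparseness of the full 2^T cross-classification versus second-order
# (pairwise) marginal tables for T binary items.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, T = 300, 10
data = (rng.random((n, T)) < 0.5).astype(int)     # hypothetical binary responses

# Full-information table: 2^T cells, most of them empty at this sample size.
cells = 2 ** T
codes = data @ (2 ** np.arange(T))                # index of each response pattern
observed_full = np.bincount(codes, minlength=cells)
print(f"full table: {cells} cells, {np.mean(observed_full == 0):.0%} empty")

# Second-order (pairwise) marginals: every 2x2 table stays well populated.
min_cell = min(
    np.bincount(2 * data[:, i] + data[:, j], minlength=4).min()
    for i, j in combinations(range(T), 2)
)
print(f"smallest cell count across all pairwise 2x2 tables: {min_cell}")
```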