Search Content

Alternative methods via random forest to identify interactions in a general framework and variable importance in the context of value-added models

Description

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’…

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAMs teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constitutive trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model was misspecified. The second study develops two novel interaction measures. These measures could be used within but are not restricted to the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.

ContributorsValdivia, Arturo (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Broatch, Jennifer (Committee member) / Arizona State University (Publisher)

Created2013

Testing independence of parallel pseudorandom number streams: incorporating the data's multivariate nature

Description

Parallel Monte Carlo applications require the pseudorandom numbers used on each processor to be independent in a probabilistic sense. The TestU01 software package is the standard testing suite for detecting stream dependence and other properties that make certain pseudorandom generators ineffective in parallel (as well as serial) settings. TestU01 employs…

Parallel Monte Carlo applications require the pseudorandom numbers used on each processor to be independent in a probabilistic sense. The TestU01 software package is the standard testing suite for detecting stream dependence and other properties that make certain pseudorandom generators ineffective in parallel (as well as serial) settings. TestU01 employs two basic schemes for testing parallel generated streams. The first applies serial tests to the individual streams and then tests the resulting P-values for uniformity. The second turns all the parallel generated streams into one long vector and then applies serial tests to the resulting concatenated stream. Various forms of stream dependence can be missed by each approach because neither one fully addresses the multivariate nature of the accumulated data when generators are run in parallel. This dissertation identifies these potential faults in the parallel testing methodologies of TestU01 and investigates two different methods to better detect inter-stream dependencies: correlation motivated multivariate tests and vector time series based tests. These methods have been implemented in an extension to TestU01 built in C++ and the unique aspects of this extension are discussed. A variety of different generation scenarios are then examined using the TestU01 suite in concert with the extension. This enhanced software package is found to better detect certain forms of inter-stream dependencies than the original TestU01 suites of tests.

ContributorsIsmay, Chester (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Kao, Ming-Hung (Committee member) / Lanchier, Nicolas (Committee member) / Reiser, Mark R. (Committee member) / Arizona State University (Publisher)

Created2013

Network interdependence and information dynamics in cyber-physical systems

Description

The cyber-physical systems (CPS) are emerging as the underpinning technology for major industries in the 21-th century. This dissertation is focused on two fundamental issues in cyber-physical systems: network interdependence and information dynamics. It consists of the following two main thrusts. The first thrust is targeted at understanding the impact…

The cyber-physical systems (CPS) are emerging as the underpinning technology for major industries in the 21-th century. This dissertation is focused on two fundamental issues in cyber-physical systems: network interdependence and information dynamics. It consists of the following two main thrusts. The first thrust is targeted at understanding the impact of network interdependence. It is shown that a cyber-physical system built upon multiple interdependent networks are more vulnerable to attacks since node failures in one network may result in failures in the other network, causing a cascade of failures that would potentially lead to the collapse of the entire infrastructure. There is thus a need to develop a new network science for modeling and quantifying cascading failures in multiple interdependent networks, and to develop network management algorithms that improve network robustness and ensure overall network reliability against cascading failures. To enhance the system robustness, a "regular" allocation strategy is proposed that yields better resistance against cascading failures compared to all possible existing strategies. Furthermore, in view of the load redistribution feature in many physical infrastructure networks, e.g., power grids, a CPS model is developed where the threshold model and the giant connected component model are used to capture the node failures in the physical infrastructure network and the cyber network, respectively. The second thrust is centered around the information dynamics in the CPS. One speculation is that the interconnections over multiple networks can facilitate information diffusion since information propagation in one network can trigger further spread in the other network. With this insight, a theoretical framework is developed to analyze information epidemic across multiple interconnecting networks. It is shown that the conjoining among networks can dramatically speed up message diffusion. Along a different avenue, many cyber-physical systems rely on wireless networks which offer platforms for information exchanges. To optimize the QoS of wireless networks, there is a need to develop a high-throughput and low-complexity scheduling algorithm to control link dynamics. To that end, distributed link scheduling algorithms are explored for multi-hop MIMO networks and two CSMA algorithms under the continuous-time model and the discrete-time model are devised, respectively.

ContributorsQian, Dajun (Author) / Zhang, Junshan (Thesis advisor) / Ying, Lei (Committee member) / Zhang, Yanchao (Committee member) / Cochran, Douglas (Committee member) / Arizona State University (Publisher)

Created2012

Statistical signal processing of ESI-TOF-MS for biomarker discovery

Description

Signal processing techniques have been used extensively in many engineering problems and in recent years its application has extended to non-traditional research fields such as biological systems. Many of these applications require extraction of a signal or parameter of interest from degraded measurements. One such application is mass spectrometry immunoassay…

Signal processing techniques have been used extensively in many engineering problems and in recent years its application has extended to non-traditional research fields such as biological systems. Many of these applications require extraction of a signal or parameter of interest from degraded measurements. One such application is mass spectrometry immunoassay (MSIA) which has been one of the primary methods of biomarker discovery techniques. MSIA analyzes protein molecules as potential biomarkers using time of flight mass spectrometry (TOF-MS). Peak detection in TOF-MS is important for biomarker analysis and many other MS related application. Though many peak detection algorithms exist, most of them are based on heuristics models. One of the ways of detecting signal peaks is by deploying stochastic models of the signal and noise observations. Likelihood ratio test (LRT) detector, based on the Neyman-Pearson (NP) lemma, is an uniformly most powerful test to decision making in the form of a hypothesis test. The primary goal of this dissertation is to develop signal and noise models for the electrospray ionization (ESI) TOF-MS data. A new method is proposed for developing the signal model by employing first principles calculations based on device physics and molecular properties. The noise model is developed by analyzing MS data from careful experiments in the ESI mass spectrometer. A non-flat baseline in MS data is common. The reasons behind the formation of this baseline has not been fully comprehended. A new signal model explaining the presence of baseline is proposed, though detailed experiments are needed to further substantiate the model assumptions. Signal detection schemes based on these signal and noise models are proposed. A maximum likelihood (ML) method is introduced for estimating the signal peak amplitudes. The performance of the detection methods and ML estimation are evaluated with Monte Carlo simulation which shows promising results. An application of these methods is proposed for fractional abundance calculation for biomarker analysis, which is mathematically robust and fundamentally different than the current algorithms. Biomarker panels for type 2 diabetes and cardiovascular disease are analyzed using existing MS analysis algorithms. Finally, a support vector machine based multi-classification algorithm is developed for evaluating the biomarkers' effectiveness in discriminating type 2 diabetes and cardiovascular diseases and is shown to perform better than a linear discriminant analysis based classifier.

ContributorsBuddi, Sai (Author) / Taylor, Thomas (Thesis advisor) / Cochran, Douglas (Thesis advisor) / Nelson, Randall (Committee member) / Duman, Tolga (Committee member) / Arizona State University (Publisher)

Created2012

Operator-valued frames associated with measure spaces

Description

Since Duffin and Schaeffer's introduction of frames in 1952, the concept of a frame has received much attention in the mathematical community and has inspired several generalizations. The focus of this thesis is on the concept of an operator-valued frame (OVF) and a more general concept called herein an operator-valued…

Since Duffin and Schaeffer's introduction of frames in 1952, the concept of a frame has received much attention in the mathematical community and has inspired several generalizations. The focus of this thesis is on the concept of an operator-valued frame (OVF) and a more general concept called herein an operator-valued frame associated with a measure space (MS-OVF), which is sometimes called a continuous g-frame. The first of two main topics explored in this thesis is the relationship between MS-OVFs and objects prominent in quantum information theory called positive operator-valued measures (POVMs). It has been observed that every MS-OVF gives rise to a POVM with invertible total variation in a natural way. The first main result of this thesis is a characterization of which POVMs arise in this way, a result obtained by extending certain existing Radon-Nikodym theorems for POVMs. The second main topic investigated in this thesis is the role of the theory of unitary representations of a Lie group G in the construction of OVFs for the L^2-space of a relatively compact subset of G. For G=R, Duffin and Schaeffer have given general conditions that ensure a sequence of (one-dimensional) representations of G, restricted to (-1/2,1/2), forms a frame for L^{2}(-1/2,1/2), and similar conditions exist for G=R^n. The second main result of this thesis expresses conditions related to Duffin and Schaeffer's for two more particular Lie groups: the Euclidean motion group on R^2 and the (2n+1)-dimensional Heisenberg group. This proceeds in two steps. First, for a Lie group admitting a uniform lattice and an appropriate relatively compact subset E of G, the Selberg Trace Formula is used to obtain a Parseval OVF for L^{2}(E) that is expressed in terms of irreducible representations of G. Second, for the two particular Lie groups an appropriate set E is found, and it is shown that for each of these groups, with suitably parametrized unitary duals, the Parseval OVF remains an OVF when perturbations are made to the parameters of the included representations.

ContributorsRobinson, Benjamin (Author) / Cochran, Douglas (Thesis advisor) / Moran, William (Thesis advisor) / Boggess, Albert (Committee member) / Milner, Fabio (Committee member) / Spielberg, John (Committee member) / Arizona State University (Publisher)

Created2014

Estimation of subspace occupancy

Description

The ability to identify unoccupied resources in the radio spectrum is a key capability for opportunistic users in a cognitive radio environment. This paper draws upon and extends geometrically based ideas in statistical signal processing to develop estimators for the rank and the occupied subspace in a multi-user environment from…

The ability to identify unoccupied resources in the radio spectrum is a key capability for opportunistic users in a cognitive radio environment. This paper draws upon and extends geometrically based ideas in statistical signal processing to develop estimators for the rank and the occupied subspace in a multi-user environment from multiple temporal samples of the signal received at a single antenna. These estimators enable identification of resources, such as the orthogonal complement of the occupied subspace, that may be exploitable by an opportunistic user. This concept is supported by simulations showing the estimation of the number of users in a simple CDMA system using a maximum a posteriori (MAP) estimate for the rank. It was found that with suitable parameters, such as high SNR, sufficient number of time epochs and codes of appropriate length, the number of users could be correctly estimated using the MAP estimator even when the noise variance is unknown. Additionally, the process of identifying the maximum likelihood estimate of the orthogonal projector onto the unoccupied subspace is discussed.

ContributorsBeaudet, Kaitlyn (Author) / Cochran, Douglas (Thesis advisor) / Turaga, Pavan (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created2014

Non-linear system identification using compressed sensing

Description

This thesis describes an approach to system identification based on compressive sensing and demonstrates its efficacy on a challenging classical benchmark single-input, multiple output (SIMO) mechanical system consisting of an inverted pendulum on a cart. Due to its inherent non-linearity and unstable behavior, very few techniques currently exist that are…

This thesis describes an approach to system identification based on compressive sensing and demonstrates its efficacy on a challenging classical benchmark single-input, multiple output (SIMO) mechanical system consisting of an inverted pendulum on a cart. Due to its inherent non-linearity and unstable behavior, very few techniques currently exist that are capable of identifying this system. The challenge in identification also lies in the coupled behavior of the system and in the difficulty of obtaining the full-range dynamics. The differential equations describing the system dynamics are determined from measurements of the system's input-output behavior. These equations are assumed to consist of the superposition, with unknown weights, of a small number of terms drawn from a large library of nonlinear terms. Under this assumption, compressed sensing allows the constituent library elements and their corresponding weights to be identified by decomposing a time-series signal of the system's outputs into a sparse superposition of corresponding time-series signals produced by the library components. The most popular techniques for non-linear system identification entail the use of ANN's (Artificial Neural Networks), which require a large number of measurements of the input and output data at high sampling frequencies. The method developed in this project requires very few samples and the accuracy of reconstruction is extremely high. Furthermore, this method yields the Ordinary Differential Equation (ODE) of the system explicitly. This is in contrast to some ANN approaches that produce only a trained network which might lose fidelity with change of initial conditions or if facing an input that wasn't used during its training. This technique is expected to be of value in system identification of complex dynamic systems encountered in diverse fields such as Biology, Computation, Statistics, Mechanics and Electrical Engineering.

ContributorsNaik, Manjish Arvind (Author) / Cochran, Douglas (Thesis advisor) / Kovvali, Narayan (Committee member) / Kawski, Matthias (Committee member) / Platte, Rodrigo (Committee member) / Arizona State University (Publisher)

Created2011

Chi-square orthogonal components for assessing goodness-of-fit of multidimensional multinomial data

Description

It is common in the analysis of data to provide a goodness-of-fit test to assess the performance of a model. In the analysis of contingency tables, goodness-of-fit statistics are frequently employed when modeling social science, educational or psychological data where the interest is often directed at investigating the association among…

It is common in the analysis of data to provide a goodness-of-fit test to assess the performance of a model. In the analysis of contingency tables, goodness-of-fit statistics are frequently employed when modeling social science, educational or psychological data where the interest is often directed at investigating the association among multi-categorical variables. Pearson's chi-squared statistic is well-known in goodness-of-fit testing, but it is sometimes considered to produce an omnibus test as it gives little guidance to the source of poor fit once the null hypothesis is rejected. However, its components can provide powerful directional tests. In this dissertation, orthogonal components are used to develop goodness-of-fit tests for models fit to the counts obtained from the cross-classification of multi-category dependent variables. Ordinal categories are assumed. Orthogonal components defined on marginals are obtained when analyzing multi-dimensional contingency tables through the use of the QR decomposition. A subset of these orthogonal components can be used to construct limited-information tests that allow one to identify the source of lack-of-fit and provide an increase in power compared to Pearson's test. These tests can address the adverse effects presented when data are sparse. The tests rely on the set of first- and second-order marginals jointly, the set of second-order marginals only, and the random forest method, a popular algorithm for modeling large complex data sets. The performance of these tests is compared to the likelihood ratio test as well as to tests based on orthogonal polynomial components. The derived goodness-of-fit tests are evaluated with studies for detecting two- and three-way associations that are not accounted for by a categorical variable factor model with a single latent variable. In addition the tests are used to investigate the case when the model misspecification involves parameter constraints for large and sparse contingency tables. The methodology proposed here is applied to data from the 38th round of the State Survey conducted by the Institute for Public Policy and Michigan State University Social Research (2005) . The results illustrate the use of the proposed techniques in the context of a sparse data set.

ContributorsMilovanovic, Jelena (Author) / Young, Dennis (Thesis advisor) / Reiser, Mark R. (Thesis advisor) / Wilson, Jeffrey (Committee member) / Eubank, Randall (Committee member) / Yang, Yan (Committee member) / Arizona State University (Publisher)

Created2011

The detection of reliability prediction cues in manufacturing data from statistically controlled processes

Description

Many products undergo several stages of testing ranging from tests on individual components to end-item tests. Additionally, these products may be further "tested" via customer or field use. The later failure of a delivered product may in some cases be due to circumstances that have no correlation with the product's…

Many products undergo several stages of testing ranging from tests on individual components to end-item tests. Additionally, these products may be further "tested" via customer or field use. The later failure of a delivered product may in some cases be due to circumstances that have no correlation with the product's inherent quality. However, at times, there may be cues in the upstream test data that, if detected, could serve to predict the likelihood of downstream failure or performance degradation induced by product use or environmental stresses. This study explores the use of downstream factory test data or product field reliability data to infer data mining or pattern recognition criteria onto manufacturing process or upstream test data by means of support vector machines (SVM) in order to provide reliability prediction models. In concert with a risk/benefit analysis, these models can be utilized to drive improvement of the product or, at least, via screening to improve the reliability of the product delivered to the customer. Such models can be used to aid in reliability risk assessment based on detectable correlations between the product test performance and the sources of supply, test stands, or other factors related to product manufacture. As an enhancement to the usefulness of the SVM or hyperplane classifier within this context, L-moments and the Western Electric Company (WECO) Rules are used to augment or replace the native process or test data used as inputs to the classifier. As part of this research, a generalizable binary classification methodology was developed that can be used to design and implement predictors of end-item field failure or downstream product performance based on upstream test data that may be composed of single-parameter, time-series, or multivariate real-valued data. Additionally, the methodology provides input parameter weighting factors that have proved useful in failure analysis and root cause investigations as indicators of which of several upstream product parameters have the greater influence on the downstream failure outcomes.

ContributorsMosley, James (Author) / Morrell, Darryl (Committee member) / Cochran, Douglas (Committee member) / Papandreou-Suppappola, Antonia (Committee member) / Roberts, Chell (Committee member) / Spanias, Andreas (Committee member) / Arizona State University (Publisher)

Created2011

Micro-particle streak velocimetry: theory, simulation methods and applications

Description

This dissertation describes a novel, low cost strategy of using particle streak (track) images for accurate micro-channel velocity field mapping. It is shown that 2-dimensional, 2-component fields can be efficiently obtained using the spatial variation of particle track lengths in micro-channels. The velocity field is a critical performance feature of…

This dissertation describes a novel, low cost strategy of using particle streak (track) images for accurate micro-channel velocity field mapping. It is shown that 2-dimensional, 2-component fields can be efficiently obtained using the spatial variation of particle track lengths in micro-channels. The velocity field is a critical performance feature of many microfluidic devices. Since it is often the case that un-modeled micro-scale physics frustrates principled design methodologies, particle based velocity field estimation is an essential design and validation tool. Current technologies that achieve this goal use particle constellation correlation strategies and rely heavily on costly, high-speed imaging hardware. The proposed image/ video processing based method achieves comparable accuracy for fraction of the cost. In the context of micro-channel velocimetry, the usability of particle streaks has been poorly studied so far. Their use has remained restricted mostly to bulk flow measurements and occasional ad-hoc uses in microfluidics. A second look at the usability of particle streak lengths in this work reveals that they can be efficiently used, after approximately 15 years from their first use for micro-channel velocimetry. Particle tracks in steady, smooth microfluidic flows is mathematically modeled and a framework for using experimentally observed particle track lengths for local velocity field estimation is introduced here, followed by algorithm implementation and quantitative verification. Further, experimental considerations and image processing techniques that can facilitate the proposed methods are also discussed in this dissertation. Unavailability of benchmarked particle track image data motivated the implementation of a simulation framework with the capability to generate exposure time controlled particle track image sequence for velocity vector fields. This dissertation also describes this work and shows that arbitrary velocity fields designed in computational fluid dynamics software tools can be used to obtain such images. Apart from aiding gold-standard data generation, such images would find use for quick microfluidic flow field visualization and help improve device designs.

ContributorsMahanti, Prasun (Author) / Cochran, Douglas (Thesis advisor) / Taylor, Thomas (Thesis advisor) / Hayes, Mark (Committee member) / Zhang, Junshan (Committee member) / Arizona State University (Publisher)

Created2011

Filtering by