Matching Items (11)

154174-Thumbnail Image.png

Multi-variate time series similarity measures and their robustness against temporal asynchrony

Description

The amount of time series data generated is increasing due to the integration of sensor technologies with everyday applications, such as gesture recognition, energy optimization, health care, video surveillance. The

The amount of time series data generated is increasing due to the integration of sensor technologies with everyday applications, such as gesture recognition, energy optimization, health care, video surveillance. The use of multiple sensors simultaneously

for capturing different aspects of the real world attributes has also led to an increase in dimensionality from uni-variate to multi-variate time series. This has facilitated richer data representation but also has necessitated algorithms determining similarity between two multi-variate time series for search and analysis.

Various algorithms have been extended from uni-variate to multi-variate case, such as multi-variate versions of Euclidean distance, edit distance, dynamic time warping. However, it has not been studied how these algorithms account for asynchronous in time series. Human gestures, for example, exhibit asynchrony in their patterns as different subjects perform the same gesture with varying movements in their patterns at different speeds. In this thesis, we propose several algorithms (some of which also leverage metadata describing the relationships among the variates). In particular, we present several techniques that leverage the contextual relationships among the variates when measuring multi-variate time series similarities. Based on the way correlation is leveraged, various weighing mechanisms have been proposed that determine the importance of a dimension for discriminating between the time series as giving the same weight to each dimension can led to misclassification. We next study the robustness of the considered techniques against different temporal asynchronies, including shifts and stretching.

Exhaustive experiments were carried on datasets with multiple types and amounts of temporal asynchronies. It has been observed that accuracy of algorithms that rely on data to discover variate relationships can be low under the presence of temporal asynchrony, whereas in case of algorithms that rely on external metadata, robustness against asynchronous distortions tends to be stronger. Specifically, algorithms using external metadata have better classification accuracy and cluster separation than existing state-of-the-art work, such as EROS, PCA, and naive dynamic time warping.

Contributors

Agent

Created

Date Created
  • 2015

154587-Thumbnail Image.png

Graph-based estimation of information divergence functions

Description

Information divergence functions, such as the Kullback-Leibler divergence or the Hellinger distance, play a critical role in statistical signal processing and information theory; however estimating them can be challenge. Most

Information divergence functions, such as the Kullback-Leibler divergence or the Hellinger distance, play a critical role in statistical signal processing and information theory; however estimating them can be challenge. Most often, parametric assumptions are made about the two distributions to estimate the divergence of interest. In cases where no parametric model fits the data, non-parametric density estimation is used. In statistical signal processing applications, Gaussianity is usually assumed since closed-form expressions for common divergence measures have been derived for this family of distributions. Parametric assumptions are preferred when it is known that the data follows the model, however this is rarely the case in real-word scenarios. Non-parametric density estimators are characterized by a very large number of parameters that have to be tuned with costly cross-validation. In this dissertation we focus on a specific family of non-parametric estimators, called direct estimators, that bypass density estimation completely and directly estimate the quantity of interest from the data. We introduce a new divergence measure, the $D_p$-divergence, that can be estimated directly from samples without parametric assumptions on the distribution. We show that the $D_p$-divergence bounds the binary, cross-domain, and multi-class Bayes error rates and, in certain cases, provides provably tighter bounds than the Hellinger divergence. In addition, we also propose a new methodology that allows the experimenter to construct direct estimators for existing divergence measures or to construct new divergence measures with custom properties that are tailored to the application. To examine the practical efficacy of these new methods, we evaluate them in a statistical learning framework on a series of real-world data science problems involving speech-based monitoring of neuro-motor disorders.

Contributors

Agent

Created

Date Created
  • 2017

157145-Thumbnail Image.png

Robustness of the General Factor Mean Difference Estimation in Bifactor Ordinal Data

Description

A simulation study was conducted to explore the robustness of general factor mean difference estimation in bifactor ordered-categorical data. In the No Differential Item Functioning (DIF) conditions, the data generation

A simulation study was conducted to explore the robustness of general factor mean difference estimation in bifactor ordered-categorical data. In the No Differential Item Functioning (DIF) conditions, the data generation conditions varied were sample size, the number of categories per item, effect size of the general factor mean difference, and the size of specific factor loadings; in data analysis, misspecification conditions were introduced in which the generated bifactor data were fit using a unidimensional model, and/or ordered-categorical data were treated as continuous data. In the DIF conditions, the data generation conditions varied were sample size, the number of categories per item, effect size of latent mean difference for the general factor, the type of item parameters that had DIF, and the magnitude of DIF; the data analysis conditions varied in whether or not setting equality constraints on the noninvariant item parameters.

Results showed that falsely fitting bifactor data using unidimensional models or failing to account for DIF in item parameters resulted in estimation bias in the general factor mean difference, while treating ordinal data as continuous had little influence on the estimation bias as long as there was no severe model misspecification. The extent of estimation bias produced by misspecification of bifactor datasets with unidimensional models was mainly determined by the degree of unidimensionality (i.e., size of specific factor loadings) and the general factor mean difference size. When the DIF was present, the estimation accuracy of the general factor mean difference was completely robust to ignoring noninvariance in specific factor loadings while it was very sensitive to failing to account for DIF in threshold parameters. With respect to ignoring the DIF in general factor loadings, the estimation bias of the general factor mean difference was substantial when the DIF was -0.15, and it can be negligible for smaller sizes of DIF. Despite the impact of model misspecification on estimation accuracy, the power to detect the general factor mean difference was mainly influenced by the sample size and effect size. Serious Type I error rate inflation only occurred when the DIF was present in threshold parameters.

Contributors

Agent

Created

Date Created
  • 2019

149367-Thumbnail Image.png

Multivariate charts for multivariate poisson-distributed data

Description

There has been much research involving simultaneous monitoring of several correlated quality characteristics that rely on the assumptions of multivariate normality and independence. In real world applications, these assumptions are

There has been much research involving simultaneous monitoring of several correlated quality characteristics that rely on the assumptions of multivariate normality and independence. In real world applications, these assumptions are not always met, particularly when small counts are of interest. In general, the use of normal approximation to the Poisson distribution seems to be justified when the Poisson means are large enough. A new two-sided Multivariate Poisson Exponentially Weighted Moving Average (MPEWMA) control chart is proposed, and the control limits are directly derived from the multivariate Poisson distribution. The MPEWMA and the conventional Multivariate Exponentially Weighted Moving Average (MEWMA) charts are evaluated by using the multivariate Poisson framework. The MPEWMA chart outperforms the MEWMA with the normal-theory limits in terms of the in-control average run lengths. An extension study of the two-sided MPEWMA to a one-sided version is performed; this is useful for detecting an increase in the count means. The results of comparison with the one-sided MEWMA chart are quite similar to the two-sided case. The implementation of the MPEWMA scheme for multiple count data is illustrated, with step by step guidelines and several examples. In addition, the method is compared to other model-based control charts that are used to monitor the residual values such as the regression adjustment. The MPEWMA scheme shows better performance on detecting the mean shift in count data when positive correlation exists among all variables.

Contributors

Agent

Created

Date Created
  • 2010

153604-Thumbnail Image.png

Monitoring complex supply chains

Description

The complexity of supply chains (SC) has grown rapidly in recent years, resulting in an increased difficulty to evaluate and visualize performance. Consequently, analytical approaches to evaluate SC performance in

The complexity of supply chains (SC) has grown rapidly in recent years, resulting in an increased difficulty to evaluate and visualize performance. Consequently, analytical approaches to evaluate SC performance in near real time relative to targets and plans are important to detect and react to deviations in order to prevent major disruptions.

Manufacturing anomalies, inaccurate forecasts, and other problems can lead to SC disruptions. Traditional monitoring methods are not sufficient in this respect, because com- plex SCs feature changes in manufacturing tasks (dynamic complexity) and carry a large number of stock keeping units (detail complexity). Problems are easily confounded with normal system variations.

Motivated by these real challenges faced by modern SC, new surveillance solutions are proposed to detect system deviations that could lead to disruptions in a complex SC. To address supply-side deviations, the fitness of different statistics that can be extracted from the enterprise resource planning system is evaluated. A monitoring strategy is first proposed for SCs featuring high levels of dynamic complexity. This presents an opportunity for monitoring methods to be applied in a new, rich domain of SC management. Then a monitoring strategy, called Heat Map Contrasts (HMC), which converts monitoring into a series of classification problems, is used to monitor SCs with both high levels of dynamic and detail complexities. Data from a semiconductor SC simulator are used to compare the methods with other alternatives under various failure cases, and the results illustrate the viability of our methods.

To address demand-side deviations, a new method of quantifying forecast uncer- tainties using the progression of forecast updates is presented. It is illustrated that a rich amount of information is available in rolling horizon forecasts. Two proactive indicators of future forecast errors are extracted from the forecast stream. This quantitative method re- quires no knowledge of the forecasting model itself and has shown promising results when applied to two datasets consisting of real forecast updates.

Contributors

Agent

Created

Date Created
  • 2015

150488-Thumbnail Image.png

Statistical monitoring and control of locally proactive routing protocols in MANETs

Description

Mobile ad hoc networks (MANETs) have attracted attention for mission critical applications. This dissertation investigates techniques of statistical monitoring and control for overhead reduction in a proactive MANET routing protocol.

Mobile ad hoc networks (MANETs) have attracted attention for mission critical applications. This dissertation investigates techniques of statistical monitoring and control for overhead reduction in a proactive MANET routing protocol. Proactive protocols transmit overhead periodically. Instead, we propose that the local conditions of a node should determine this transmission decision. While the goal is to minimize overhead, a balance in the amount of overhead transmitted and the performance achieved is required. Statistical monitoring consists of techniques to determine if a characteristic has shifted away from an in-control state. A basic tool for monitoring is a control chart, a time-oriented representation of the characteristic. When a sample deviates outside control limits, a significant change has occurred and corrective actions are required to return to the in-control state. We investigate the use of statistical monitoring of local conditions in the Optimized Link State Routing (OLSR) protocol. Three versions are developed. In A-OLSR, each node uses a Shewhart chart to monitor betweenness of its two-hop neighbourhood. Betweenness is a social network metric that measures a node's influence; betweenness is larger when a node has more influence. Changes in topology are associated with changes in betweenness. We incorporate additional local node conditions including speed, density, packet arrival rate, and number of flows it forwards in A+-OLSR. Response Surface Methodology (RSM) is used to optimize timer values. As well, the Shewhart chart is replaced by an Exponentially Weighted Moving Average (EWMA) chart, which is more sensitive to small changes in the characteristic. It is known that control charts do not work as well in the presence of correlation. Hence, in A*-OLSR the autocorrelation in the time series is removed and an Auto-Regressive Integrated Moving Average (ARIMA) model found; this removes the dependence on node speed. A*-OLSR also extends monitoring to two characteristics concurrently using multivariate cumulative sum (MCUSUM) charts. The protocols are evaluated in simulation, and compared to OLSR and its variants. The techniques for statistical monitoring and control are general and have great potential to be applied to the adaptive control of many network protocols.

Contributors

Agent

Created

Date Created
  • 2012

156621-Thumbnail Image.png

Assessing measurement invariance and latent mean differences with bifactor multidimensional data in structural equation modeling

Description

Investigation of measurement invariance (MI) commonly assumes correct specification of dimensionality across multiple groups. Although research shows that violation of the dimensionality assumption can cause bias in model parameter estimation

Investigation of measurement invariance (MI) commonly assumes correct specification of dimensionality across multiple groups. Although research shows that violation of the dimensionality assumption can cause bias in model parameter estimation for single-group analyses, little research on this issue has been conducted for multiple-group analyses. This study explored the effects of mismatch in dimensionality between data and analysis models with multiple-group analyses at the population and sample levels. Datasets were generated using a bifactor model with different factor structures and were analyzed with bifactor and single-factor models to assess misspecification effects on assessments of MI and latent mean differences. As baseline models, the bifactor models fit data well and had minimal bias in latent mean estimation. However, the low convergence rates of fitting bifactor models to data with complex structures and small sample sizes caused concern. On the other hand, effects of fitting the misspecified single-factor models on the assessments of MI and latent means differed by the bifactor structures underlying data. For data following one general factor and one group factor affecting a small set of indicators, the effects of ignoring the group factor in analysis models on the tests of MI and latent mean differences were mild. In contrast, for data following one general factor and several group factors, oversimplifications of analysis models can lead to inaccurate conclusions regarding MI assessment and latent mean estimation.

Contributors

Agent

Created

Date Created
  • 2018

156580-Thumbnail Image.png

Supervised and ensemble classification of multivariate functional data: applications to lupus diagnosis

Description

This dissertation investigates the classification of systemic lupus erythematosus (SLE) in the presence of non-SLE alternatives, while developing novel curve classification methodologies with wide ranging applications. Functional data representations

This dissertation investigates the classification of systemic lupus erythematosus (SLE) in the presence of non-SLE alternatives, while developing novel curve classification methodologies with wide ranging applications. Functional data representations of plasma thermogram measurements and the corresponding derivative curves provide predictors yet to be investigated for SLE identification. Functional nonparametric classifiers form a methodological basis, which is used herein to develop a) the family of ESFuNC segment-wise curve classification algorithms and b) per-pixel ensembles based on logistic regression and fused-LASSO. The proposed methods achieve test set accuracy rates as high as 94.3%, while returning information about regions of the temperature domain that are critical for population discrimination. The undertaken analyses suggest that derivate-based information contributes significantly in improved classification performance relative to recently published studies on SLE plasma thermograms.

Contributors

Agent

Created

Date Created
  • 2018

150996-Thumbnail Image.png

Multivariate generalization of reduced major axis regression

Description

A least total area of triangle method was proposed by Teissier (1948) for fitting a straight line to data from a pair of variables without treating either variable as the

A least total area of triangle method was proposed by Teissier (1948) for fitting a straight line to data from a pair of variables without treating either variable as the dependent variable while allowing each of the variables to have measurement errors. This method is commonly called Reduced Major Axis (RMA) regression and is often used instead of Ordinary Least Squares (OLS) regression. Results for confidence intervals, hypothesis testing and asymptotic distributions of coefficient estimates in the bivariate case are reviewed. A generalization of RMA to more than two variables for fitting a plane to data is obtained by minimizing the sum of a function of the volumes obtained by drawing, from each data point, lines parallel to each coordinate axis to the fitted plane (Draper and Yang 1997; Goodman and Tofallis 2003). Generalized RMA results for the multivariate case obtained by Draper and Yang (1997) are reviewed and some investigations of multivariate RMA are given. A linear model is proposed that does not specify a dependent variable and allows for errors in the measurement of each variable. Coefficients in the model are estimated by minimization of the function of the volumes previously mentioned. Methods for obtaining coefficient estimates are discussed and simulations are used to investigate the distribution of coefficient estimates. The effects of sample size, sampling error and correlation among variables on the estimates are studied. Bootstrap methods are used to obtain confidence intervals for model coefficients. Residual analysis is considered for assessing model assumptions. Outlier and influential case diagnostics are developed and a forward selection method is proposed for subset selection of model variables. A real data example is provided that uses the methods developed. Topics for further research are discussed.

Contributors

Agent

Created

Date Created
  • 2012

149409-Thumbnail Image.png

Multilevel mediation analysis: statistical assumptions and centering

Description

Mediation analysis is a statistical approach that examines the effect of a treatment (e.g., prevention program) on an outcome (e.g., substance use) achieved by targeting and changing one or more

Mediation analysis is a statistical approach that examines the effect of a treatment (e.g., prevention program) on an outcome (e.g., substance use) achieved by targeting and changing one or more intervening variables (e.g., peer drug use norms). The increased use of prevention intervention programs with outcomes measured at multiple time points following the intervention requires multilevel modeling techniques to account for clustering in the data. Estimating multilevel mediation models, in which all the variables are measured at individual level (Level 1), poses several challenges to researchers. The first challenge is to conceptualize a multilevel mediation model by clarifying the underlying statistical assumptions and implications of those assumptions on cluster-level (Level-2) covariance structure. A second challenge is that variables measured at Level 1 potentially contain both between- and within-cluster variation making interpretation of multilevel analysis difficult. As a result, multilevel mediation analyses may yield coefficient estimates that are composites of coefficient estimates at different levels if proper centering is not used. This dissertation addresses these two challenges. Study 1 discusses the concept of a correctly specified multilevel mediation model by examining the underlying statistical assumptions and implication of those assumptions on Level-2 covariance structure. Further, Study 1 presents analytical results showing algebraic relationships between the population parameters in a correctly specified multilevel mediation model. Study 2 extends previous work on centering in multilevel mediation analysis. First, different centering methods in multilevel analysis including centering within cluster with the cluster mean as a Level-2 predictor of intercept (CWC2) are discussed. Next, application of the CWC2 strategy to accommodate multilevel mediation models is explained. It is shown that the CWC2 centering strategy separates the between- and within-cluster mediated effects. Next, Study 2 discusses assumptions underlying a correctly specified CWC2 multilevel mediation model and defines between- and within-cluster mediated effects. In addition, analytical results for the algebraic relationships between the population parameters in a CWC2 multilevel mediation model are presented. Finally, Study 2 shows results of a simulation study conducted to verify derived algebraic relationships empirically.

Contributors

Agent

Created

Date Created
  • 2010