Matching Items (12)

Filtering by

Clear all filters

149723-Thumbnail Image.png

System complexity reduction via feature selection

Description

This dissertation transforms a set of system complexity reduction problems to feature selection problems. Three systems are considered: classification based on association rules, network structure learning, and time series classification. Furthermore, two variable importance measures are proposed to reduce the

This dissertation transforms a set of system complexity reduction problems to feature selection problems. Three systems are considered: classification based on association rules, network structure learning, and time series classification. Furthermore, two variable importance measures are proposed to reduce the feature selection bias in tree models. Associative classifiers can achieve high accuracy, but the combination of many rules is difficult to interpret. Rule condition subset selection (RCSS) methods for associative classification are considered. RCSS aims to prune the rule conditions into a subset via feature selection. The subset then can be summarized into rule-based classifiers. Experiments show that classifiers after RCSS can substantially improve the classification interpretability without loss of accuracy. An ensemble feature selection method is proposed to learn Markov blankets for either discrete or continuous networks (without linear, Gaussian assumptions). The method is compared to a Bayesian local structure learning algorithm and to alternative feature selection methods in the causal structure learning problem. Feature selection is also used to enhance the interpretability of time series classification. Existing time series classification algorithms (such as nearest-neighbor with dynamic time warping measures) are accurate but difficult to interpret. This research leverages the time-ordering of the data to extract features, and generates an effective and efficient classifier referred to as a time series forest (TSF). The computational complexity of TSF is only linear in the length of time series, and interpretable features can be extracted. These features can be further reduced, and summarized for even better interpretability. Lastly, two variable importance measures are proposed to reduce the feature selection bias in tree-based ensemble models. It is well known that bias can occur when predictor attributes have different numbers of values. Two methods are proposed to solve the bias problem. One uses an out-of-bag sampling method called OOBForest, and the other, based on the new concept of a partial permutation test, is called a pForest. Experimental results show the existing methods are not always reliable for multi-valued predictors, while the proposed methods have advantages.

Contributors

Agent

Created

Date Created
2011

152506-Thumbnail Image.png

Cluster metrics and temporal coherency in pixel based matrices

Description

In this thesis, the application of pixel-based vertical axes used within parallel coordinate plots is explored in an attempt to improve how existing tools can explain complex multivariate interactions across temporal data. Several promising visualization techniques are combined, such as:

In this thesis, the application of pixel-based vertical axes used within parallel coordinate plots is explored in an attempt to improve how existing tools can explain complex multivariate interactions across temporal data. Several promising visualization techniques are combined, such as: visual boosting to allow for quicker consumption of large data sets, the bond energy algorithm to find finer patterns and anomalies through contrast, multi-dimensional scaling, flow lines, user guided clustering, and row-column ordering. User input is applied on precomputed data sets to provide for real time interaction. General applicability of the techniques are tested against industrial trade, social networking, financial, and sparse data sets of varying dimensionality.

Contributors

Agent

Created

Date Created
2014

152235-Thumbnail Image.png

A visual analytics based decision support methodology for evaluating low energy building design alternatives

Description

The ability to design high performance buildings has acquired great importance in recent years due to numerous federal, societal and environmental initiatives. However, this endeavor is much more demanding in terms of designer expertise and time. It requires a whole

The ability to design high performance buildings has acquired great importance in recent years due to numerous federal, societal and environmental initiatives. However, this endeavor is much more demanding in terms of designer expertise and time. It requires a whole new level of synergy between automated performance prediction with the human capabilities to perceive, evaluate and ultimately select a suitable solution. While performance prediction can be highly automated through the use of computers, performance evaluation cannot, unless it is with respect to a single criterion. The need to address multi-criteria requirements makes it more valuable for a designer to know the "latitude" or "degrees of freedom" he has in changing certain design variables while achieving preset criteria such as energy performance, life cycle cost, environmental impacts etc. This requirement can be met by a decision support framework based on near-optimal "satisficing" as opposed to purely optimal decision making techniques. Currently, such a comprehensive design framework is lacking, which is the basis for undertaking this research. The primary objective of this research is to facilitate a complementary relationship between designers and computers for Multi-Criterion Decision Making (MCDM) during high performance building design. It is based on the application of Monte Carlo approaches to create a database of solutions using deterministic whole building energy simulations, along with data mining methods to rank variable importance and reduce the multi-dimensionality of the problem. A novel interactive visualization approach is then proposed which uses regression based models to create dynamic interplays of how varying these important variables affect the multiple criteria, while providing a visual range or band of variation of the different design parameters. The MCDM process has been incorporated into an alternative methodology for high performance building design referred to as Visual Analytics based Decision Support Methodology [VADSM]. VADSM is envisioned to be most useful during the conceptual and early design performance modeling stages by providing a set of potential solutions that can be analyzed further for final design selection. The proposed methodology can be used for new building design synthesis as well as evaluation of retrofits and operational deficiencies in existing buildings.

Contributors

Agent

Created

Date Created
2013

151511-Thumbnail Image.png

Learning from asymmetric models and matched pairs

Description

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also studied variable selection of matched data sets and proposed a solution when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exists in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.

Contributors

Agent

Created

Date Created
2013

152992-Thumbnail Image.png

Cascading curtainmap: an interactive visualization for depicting large and flexible hierarchies

Description

In visualizing information hierarchies, icicle plots are efficient diagrams in that they provide the user a straightforward layout for different levels of data in a hierarchy and enable the user to compare items based on the item width. However, as

In visualizing information hierarchies, icicle plots are efficient diagrams in that they provide the user a straightforward layout for different levels of data in a hierarchy and enable the user to compare items based on the item width. However, as the size of the hierarchy grows large, the items in an icicle plot end up being small and indistinguishable. In this thesis, by maintaining the positive characteristics of traditional

icicle plots and incorporating new features such as dynamic diagram and active layer, we developed an interactive visualization that allows the user to selectively drill down or roll up to review different levels of data in a large hierarchy, to change the hierarchical

structure to detect potential patterns, and to maintain an overall understanding of the

current hierarchical structure.

Contributors

Agent

Created

Date Created
2014

152768-Thumbnail Image.png

Surgical instrument reprocessing in a hospital setting analyzed with statistical process control and data mining techniques

Description

In a healthcare setting, the Sterile Processing Department (SPD) provides ancillary services to the Operating Room (OR), Emergency Room, Labor & Delivery, and off-site clinics. SPD's function is to reprocess reusable surgical instruments and return them to their home departments.

In a healthcare setting, the Sterile Processing Department (SPD) provides ancillary services to the Operating Room (OR), Emergency Room, Labor & Delivery, and off-site clinics. SPD's function is to reprocess reusable surgical instruments and return them to their home departments. The management of surgical instruments and medical devices can impact patient safety and hospital revenue. Any time instrumentation or devices are not available or are not fit for use, patient safety and revenue can be negatively impacted. One step of the instrument reprocessing cycle is sterilization. Steam sterilization is the sterilization method used for the majority of surgical instruments and is preferred to immediate use steam sterilization (IUSS) because terminally sterilized items can be stored until needed. IUSS Items must be used promptly and cannot be stored for later use. IUSS is intended for emergency situations and not as regular course of action. Unfortunately, IUSS is used to compensate for inadequate inventory levels, scheduling conflicts, and miscommunications. If IUSS is viewed as an adverse event, then monitoring IUSS incidences can help healthcare organizations meet patient safety goals and financial goals along with aiding in process improvement efforts. This work recommends statistical process control methods to IUSS incidents and illustrates the use of control charts for IUSS occurrences through a case study and analysis of the control charts for data from a health care provider. Furthermore, this work considers the application of data mining methods to IUSS occurrences and presents a representative example of data mining to the IUSS occurrences. This extends the application of statistical process control and data mining in healthcare applications.

Contributors

Agent

Created

Date Created
2014

151341-Thumbnail Image.png

Spatio-temporal data mining to detect changes and clusters in trajectories

Description

With the rapid development of mobile sensing technologies like GPS, RFID, sensors in smartphones, etc., capturing position data in the form of trajectories has become easy. Moving object trajectory analysis is a growing area of interest these days owing to

With the rapid development of mobile sensing technologies like GPS, RFID, sensors in smartphones, etc., capturing position data in the form of trajectories has become easy. Moving object trajectory analysis is a growing area of interest these days owing to its applications in various domains such as marketing, security, traffic monitoring and management, etc. To better understand movement behaviors from the raw mobility data, this doctoral work provides analytic models for analyzing trajectory data. As a first contribution, a model is developed to detect changes in trajectories with time. If the taxis moving in a city are viewed as sensors that provide real time information of the traffic in the city, a change in these trajectories with time can reveal that the road network has changed. To detect changes, trajectories are modeled with a Hidden Markov Model (HMM). A modified training algorithm, for parameter estimation in HMM, called m-BaumWelch, is used to develop likelihood estimates under assumed changes and used to detect changes in trajectory data with time. Data from vehicles are used to test the method for change detection. Secondly, sequential pattern mining is used to develop a model to detect changes in frequent patterns occurring in trajectory data. The aim is to answer two questions: Are the frequent patterns still frequent in the new data? If they are frequent, has the time interval distribution in the pattern changed? Two different approaches are considered for change detection, frequency-based approach and distribution-based approach. The methods are illustrated with vehicle trajectory data. Finally, a model is developed for clustering and outlier detection in semantic trajectories. A challenge with clustering semantic trajectories is that both numeric and categorical attributes are present. Another problem to be addressed while clustering is that trajectories can be of different lengths and also have missing values. A tree-based ensemble is used to address these problems. The approach is extended to outlier detection in semantic trajectories.

Contributors

Agent

Created

Date Created
2012

151176-Thumbnail Image.png

Novel statistical models for complex data structures

Description

Rapid advance in sensor and information technology has resulted in both spatially and temporally data-rich environment, which creates a pressing need for us to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns

Rapid advance in sensor and information technology has resulted in both spatially and temporally data-rich environment, which creates a pressing need for us to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns from these massive datasets. The statistical challenges for addressing these massive datasets lay in their complex structures, such as high-dimensionality, hierarchy, multi-modality, heterogeneity and data uncertainty. Besides the statistical challenges, the associated computational approaches are also considered essential in achieving efficiency, effectiveness, as well as the numerical stability in practice. On the other hand, some recent developments in statistics and machine learning, such as sparse learning, transfer learning, and some traditional methodologies which still hold potential, such as multi-level models, all shed lights on addressing these complex datasets in a statistically powerful and computationally efficient way. In this dissertation, we identify four kinds of general complex datasets, including "high-dimensional datasets", "hierarchically-structured datasets", "multimodality datasets" and "data uncertainties", which are ubiquitous in many domains, such as biology, medicine, neuroscience, health care delivery, manufacturing, etc. We depict the development of novel statistical models to analyze complex datasets which fall under these four categories, and we show how these models can be applied to some real-world applications, such as Alzheimer's disease research, nursing care process, and manufacturing.

Contributors

Agent

Created

Date Created
2012

154558-Thumbnail Image.png

Distinct feature learning and nonlinear variation pattern discovery using regularized autoencoders

Description

Feature learning and the discovery of nonlinear variation patterns in high-dimensional data is an important task in many problem domains, such as imaging, streaming data from sensors, and manufacturing. This dissertation presents several methods for learning and visualizing nonlinear variation

Feature learning and the discovery of nonlinear variation patterns in high-dimensional data is an important task in many problem domains, such as imaging, streaming data from sensors, and manufacturing. This dissertation presents several methods for learning and visualizing nonlinear variation in high-dimensional data. First, an automated method for discovering nonlinear variation patterns using deep learning autoencoders is proposed. The approach provides a functional mapping from a low-dimensional representation to the original spatially-dense data that is both interpretable and efficient with respect to preserving information. Experimental results indicate that deep learning autoencoders outperform manifold learning and principal component analysis in reproducing the original data from the learned variation sources.

A key issue in using autoencoders for nonlinear variation pattern discovery is to encourage the learning of solutions where each feature represents a unique variation source, which we define as distinct features. This problem of learning distinct features is also referred to as disentangling factors of variation in the representation learning literature. The remainder of this dissertation highlights and provides solutions for this important problem.

An alternating autoencoder training method is presented and a new measure motivated by orthogonal loadings in linear models is proposed to quantify feature distinctness in the nonlinear models. Simulated point cloud data and handwritten digit images illustrate that standard training methods for autoencoders consistently mix the true variation sources in the learned low-dimensional representation, whereas the alternating method produces solutions with more distinct patterns.

Finally, a new regularization method for learning distinct nonlinear features using autoencoders is proposed. Motivated in-part by the properties of linear solutions, a series of learning constraints are implemented via regularization penalties during stochastic gradient descent training. These include the orthogonality of tangent vectors to the manifold, the correlation between learned features, and the distributions of the learned features. This regularized learning approach yields low-dimensional representations which can be better interpreted and used to identify the true sources of variation impacting a high-dimensional feature space. Experimental results demonstrate the effectiveness of this method for nonlinear variation pattern discovery on both simulated and real data sets.

Contributors

Agent

Created

Date Created
2016

156679-Thumbnail Image.png

Machine Learning Models for High-dimensional Biomedical Data

Description

The recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of the high-dimensional biomedical data creates the needs of new machine learning models for effective data analysis and knowledge discovery.

The recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of the high-dimensional biomedical data creates the needs of new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover the patterns and improve the decision making. All the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on the data clustering. Data clustering is often the first step for analyzing a dataset without the label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging, yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both the accuracy and the interpretation. GCRNN contains a convolutional network component to extract high-level features, and a recurrent network component to enhance the modeling of the temporal characteristics. A feed-forward fully connected network with the sparse group lasso regularization is used to generate the final classification and provide good interpretability.

The last topic centers around the dimensionality reduction methods for time series data. A good dimensionality reduction method is important for the storage, decision making and pattern visualization for time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error, but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.

Contributors

Agent

Created

Date Created
2018