Propensity score estimation with random forests

Description

Random Forests is a statistical learning method which has been proposed for propensity score estimation models that involve complex interactions, nonlinear relationships, or both, among the covariates. In this dissertation I conducted a simulation study to examine the effects of three Random Forests model specifications in propensity score analysis. The results suggested that, depending on the nature of the data, optimal specification of (1) decision rules to select the covariate and its split value in a Classification Tree, (2) the number of covariates randomly sampled for selection, and (3) methods of estimating Random Forests propensity scores could potentially produce an unbiased average treatment effect estimate after adjustment using propensity score weighting by the odds. Compared to the logistic regression estimation model using the true propensity score model, Random Forests had an additional advantage in producing an unbiased estimated standard error and correct statistical inference of the average treatment effect. The relationship between the balance on the covariates' means and the bias of the average treatment effect estimate was examined both within and between conditions of the simulation. Within conditions, across repeated samples there was no noticeable correlation between the covariates' mean differences and the magnitude of bias of the average treatment effect estimate for the covariates that were imbalanced before adjustment. Between conditions, small mean differences of covariates after propensity score adjustment were not sensitive enough to identify the optimal Random Forests model specification for propensity score analysis.
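
The weighting-by-the-odds adjustment named in the abstract can be illustrated with a minimal sketch. It assumes the propensity scores have already been estimated (in the dissertation they come from Random Forests; here they are plain numbers supplied by the caller). In this scheme treated units keep a weight of 1 and comparison units are weighted by e/(1 - e), which typically targets the effect among the treated. The function names and toy data are hypothetical and are not taken from the dissertation.

```python
def odds_weights(treated, propensity):
    """Weighting by the odds: treated units get 1, comparison units e / (1 - e)."""
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 if t == 1 else e / (1.0 - e))
    return weights


def weighted_effect(outcome, treated, propensity):
    """Difference between the weighted outcome means of the two groups."""
    w = odds_weights(treated, propensity)
    means = {}
    for group in (1, 0):
        num = sum(wi * y for wi, y, t in zip(w, outcome, treated) if t == group)
        den = sum(wi for wi, t in zip(w, treated) if t == group)
        means[group] = num / den
    return means[1] - means[0]


# Toy data (hypothetical): two treated units, two comparison units.
treated = [1, 1, 0, 0]
propensity = [0.8, 0.5, 0.5, 0.2]
outcome = [5.0, 3.0, 2.0, 1.0]
estimate = weighted_effect(outcome, treated, propensity)
```

The sketch covers only the weighting step; how the propensity scores are estimated, which is the subject of the simulation study, is left outside it.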

Contributors: Cham, Hei Ning (Author) / Tein, Jenn-Yun (Thesis advisor) / Enders, Stephen G (Thesis advisor) / Enders, Craig K. (Committee member) / Mackinnon, David P (Committee member) / Arizona State University (Publisher)

Created: 2013

Spatio-temporal data mining to detect changes and clusters in trajectories

Description

With the rapid development of mobile sensing technologies like GPS, RFID, sensors in smartphones, etc., capturing position data in the form of trajectories has become easy. Moving object trajectory analysis is a growing area of interest these days owing to its applications in various domains such as marketing, security, traffic monitoring and management, etc. To better understand movement behaviors from the raw mobility data, this doctoral work provides analytic models for analyzing trajectory data. As a first contribution, a model is developed to detect changes in trajectories with time. If the taxis moving in a city are viewed as sensors that provide real time information of the traffic in the city, a change in these trajectories with time can reveal that the road network has changed. To detect changes, trajectories are modeled with a Hidden Markov Model (HMM). A modified training algorithm, for parameter estimation in HMM, called m-BaumWelch, is used to develop likelihood estimates under assumed changes and used to detect changes in trajectory data with time. Data from vehicles are used to test the method for change detection. Secondly, sequential pattern mining is used to develop a model to detect changes in frequent patterns occurring in trajectory data. The aim is to answer two questions: Are the frequent patterns still frequent in the new data? If they are frequent, has the time interval distribution in the pattern changed? Two different approaches are considered for change detection, frequency-based approach and distribution-based approach. The methods are illustrated with vehicle trajectory data. Finally, a model is developed for clustering and outlier detection in semantic trajectories. A challenge with clustering semantic trajectories is that both numeric and categorical attributes are present. Another problem to be addressed while clustering is that trajectories can be of different lengths and also have missing values. A tree-based ensemble is used to address these problems. The approach is extended to outlier detection in semantic trajectories.
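
The likelihood-based change detection described above rests on evaluating how probable an observed sequence is under a fitted HMM. The sketch below shows only that building block, using the standard scaled forward algorithm; it is not the m-BaumWelch procedure from the dissertation, and the function name, parameters, and the idea of encoding trajectory locations as integer symbols are assumptions made for illustration.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: returns log P(obs | HMM).

    obs   -- list of integer observation symbols
    start -- start[i] is the initial probability of hidden state i
    trans -- trans[i][j] is the probability of moving from state i to state j
    emit  -- emit[i][k] is the probability that state i emits symbol k
    """
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step and weight by the emission probability.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik
```

One possible use, under the same assumptions: fit a model on older trajectories, then compare the per-observation log-likelihood of older and newer sequences. A marked drop for the newer data would be one hint that movement patterns have changed.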

Contributors: Kondaveeti, Anirudh (Author) / Runger, George C. (Thesis advisor) / Mirchandani, Pitu (Committee member) / Pan, Rong (Committee member) / Maciejewski, Ross (Committee member) / Arizona State University (Publisher)

Created: 2012

Visual Analytic Tools for Geo-Genealogy and Geo-Demographics

Description

This work explores the development of a visual analytics tool for geodemographic exploration in an online environment. We mine 78 million records from the United States white pages, link the location data to demographic data (specifically income) from the United States Census Bureau, and allow users to interactively compare distributions of names with regard to spatial location similarity and income. In order to enable interactive similarity exploration, we explore methods of pre-processing the data as well as on-the-fly lookups. As data becomes larger and more complex, the development of appropriate data storage and analytics solutions has become even more critical when enabling online visualization. We discuss problems faced in implementation, design decisions, and directions for future work.

Contributors: Ibarra, Jose Luis (Author) / Maciejewski, Ross (Thesis director) / Mack, Elizabeth (Committee member) / Longley, Paul (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created: 2014-05

Visual analytics for spatiotemporal cluster analysis

Description

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.

Contributors: Zhang, Yifan (Author) / Maciejewski, Ross (Thesis advisor) / Mack, Elizabeth (Committee member) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created: 2016

The role of teamwork in predicting movie earnings

Description

Intelligence analysts’ work has become progressively complex due to increasing security threats and data availability. In order to study “big” data exploration within the intelligence domain, the intelligence analyst task was abstracted and replicated in a laboratory (controlled environment). Participants used a computer interface and movie database to determine the opening weekend gross movie earnings of three pre-selected movies. Data consisted of Twitter tweets and predictive models. These data were displayed in various formats such as graphs, charts, and text. Participants used these data to make their predictions. It was expected that teams (a team is a group with members who have different specialties and who work interdependently) would outperform individuals and groups. That is, teams would be significantly better at predicting “Opening Weekend Gross” than individuals or groups. Results indicated that teams outperformed individuals and groups in the first prediction, underperformed in the second prediction, and performed better than individuals in the third prediction (but not better than groups). Insights and future directions are discussed.

Contributors: Buchanan, Verica (Author) / Cooke, Nancy J. (Thesis advisor) / Maciejewski, Ross (Committee member) / Craig, Scotty D. (Committee member) / Arizona State University (Publisher)

Created: 2016

Methodologies in Predictive Visual Analytics

Description

Predictive analytics embraces an extensive area of techniques from statistical modeling to machine learning to data mining and is applied in business intelligence, public health, disaster management and response, and many other fields. To date, visualization has been broadly used to support tasks in the predictive analytics pipeline under the assumption that a human-in-the-loop can aid the analysis by integrating domain knowledge that might not be broadly captured by the system. Primary uses of visualization in the predictive analytics pipeline have focused on data cleaning, exploratory analysis, and diagnostics. More recently, numerous visual analytics systems for feature selection, incremental learning, and various prediction tasks have been proposed to support the growing use of complex models, agent-specific optimization, and comprehensive model comparison and result exploration. Such work is being driven by advances in interactive machine learning and the desire of end-users to understand and engage with the modeling process. However, despite the numerous and promising applications of visual analytics to predictive analytics tasks, work to assess the effectiveness of predictive visual analytics is lacking.

This thesis studies the current methodologies in predictive visual analytics. It first defines the scope of predictive analytics and presents a predictive visual analytics (PVA) pipeline. Following the proposed pipeline, a predictive visual analytics framework is developed to be used to explore under what circumstances a human-in-the-loop prediction process is most effective. This framework combines sentiment analysis, feature selection mechanisms, similarity comparisons and model cross-validation through a variety of interactive visualizations to support analysts in model building and prediction. To test the proposed framework, an instantiation for movie box-office prediction is developed and evaluated. Results from small-scale user studies are presented and discussed, and a generalized user study is carried out to assess the role of predictive visual analytics under a movie box-office prediction scenario.

Contributors: Lu, Yafeng (Author) / Maciejewski, Ross (Thesis advisor) / Cooke, Nancy J. (Committee member) / Liu, Huan (Committee member) / He, Jingrui (Committee member) / Arizona State University (Publisher)

Created: 2017

Visual Analytics Methods for Exploring Geographically Networked Phenomena

Description

The connections between different entities define different kinds of networks, and many such networked phenomena are influenced by their underlying geographical relationships. By integrating network and geospatial analysis, the goal is to extract information about interaction topologies and the relationships to related geographical constructs. In recent decades, much work has been done analyzing the dynamics of spatial networks; however, many challenges remain in this field. First, the development of social media and transportation technologies has greatly reshaped the typologies of communications between different geographical regions. Second, the distance metrics used in spatial analysis should also be enriched with the underlying network information to develop accurate models.

Visual analytics provides methods for data exploration, pattern recognition, and knowledge discovery. However, despite the long history of geovisualizations and network visual analytics, little work has been done to develop visual analytics tools that focus specifically on geographically networked phenomena. This thesis develops a variety of visualization methods to present data values and geospatial network relationships, which enables users to interactively explore the data. Users can investigate the connections in both virtual networks and geospatial networks and the underlying geographical context can be used to improve knowledge discovery. The focus of this thesis is on social media analysis and geographical hotspots optimization. A framework is proposed for social network analysis to unveil the links between social media interactions and their underlying networked geospatial phenomena. This will be combined with a novel hotspot approach to improve hotspot identification and boundary detection with the networks extracted from urban infrastructure. Several real world problems have been analyzed using the proposed visual analytics frameworks. The primary studies and experiments show that visual analytics methods can help analysts explore such data from multiple perspectives and help the knowledge discovery process.

Contributors: Wang, Feng (Author) / Maciejewski, Ross (Thesis advisor) / Davulcu, Hasan (Committee member) / Grubesic, Anthony (Committee member) / Shakarian, Paulo (Committee member) / Arizona State University (Publisher)

Created: 2017

A Visual Analytics Workflow for Detecting Transition Regions in Large Scale Molecular Dynamics Simulations

Description

Molecular Dynamics (MD) simulations are ubiquitous throughout the physical sciences; they are critical in understanding how particle structures evolve over time given a particular energy function. A software package called ParSplice introduced a new method to generate these simulations in parallel that has significantly inflated their length. Typically, simulations are short discrete Markov chains, only capturing a few microseconds of a particle’s behavior and containing tens of thousands of transitions between states; in contrast, a typical ParSplice simulation can be as long as a few milliseconds, containing tens of millions of transitions. Naturally, sifting through data of this size is impossible by hand, and there are a number of visualization systems that provide comprehensive and intuitive analyses of particle structures throughout MD simulations. However, no visual analytics systems have been built that can manage the simulations that ParSplice produces. To analyze these large data-sets, I built a visual analytics system that provides multiple coordinated views that simultaneously describe the data temporally, within its structural context, and based on its properties. The system provides fluid and powerful user interactions regardless of the size of the data, allowing the user to drill down into the data-set to get detailed insights, as well as run and save various calculations, most notably the Nudged Elastic Band method. The system also allows the comparison of multiple trajectories, revealing more information about the general behavior of particles at different temperatures, energy states etc.

Contributors: Hnatyshyn, Rostyslav (Author) / Maciejewski, Ross (Thesis advisor) / Bryan, Chris (Committee member) / Ahrens, James (Committee member) / Arizona State University (Publisher)

Created: 2022

Explaining the Vulnerabilities of Machine Learning through Visual Analytics

Description

Machine learning models are increasingly being deployed in real-world applications where their predictions are used to make critical decisions in a variety of domains. The proliferation of such models has led to a burgeoning need to ensure the reliability and safety of these models, given the potential negative consequences of model vulnerabilities. The complexity of machine learning models, along with the extensive data sets they analyze, can result in unpredictable and unintended outcomes. Model vulnerabilities may manifest due to errors in data input, algorithm design, or model deployment, which can have significant implications for both individuals and society. To prevent such negative outcomes, it is imperative to identify model vulnerabilities at an early stage in the development process. This will aid in guaranteeing the integrity, dependability, and safety of the models, thus mitigating potential risks and enabling the full potential of these technologies to be realized. However, enumerating vulnerabilities can be challenging due to the complexity of the real-world environment. Visual analytics, situated at the intersection of human-computer interaction, computer graphics, and artificial intelligence, offers a promising approach for achieving high interpretability of complex black-box models, thus reducing the cost of obtaining insights into potential vulnerabilities of models. This research is devoted to designing novel visual analytics methods to support the identification and analysis of model vulnerabilities. Specifically, generalizable visual analytics frameworks are instantiated to explore vulnerabilities in machine learning models concerning security (adversarial attacks and data perturbation) and fairness (algorithmic bias). In the end, a visual analytics approach is proposed to enable domain experts to explain and diagnose the model improvement of addressing identified vulnerabilities of machine learning models in a human-in-the-loop fashion. The proposed methods hold the potential to enhance the security and fairness of machine learning models deployed in critical real-world applications.

Contributors: Xie, Tiankai (Author) / Maciejewski, Ross (Thesis advisor) / Liu, Huan (Committee member) / Bryan, Chris (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created: 2023

Visual analytics methodologies on causality analysis

Description

Causality analysis is the process of identifying cause-effect relationships among variables. This process is challenging because causal relationships cannot be tested solely based on statistical indicators, as additional information is always needed to reduce the ambiguity caused by factors beyond those covered by the statistical test. Traditionally, controlled experiments are carried out to identify causal relationships, but recently there has been growing interest in causality analysis with observational data due to the increasing availability of data and tools. This type of analysis will often involve automatic algorithms that extract causal relations from large amounts of data and rely on expert judgment to scrutinize and verify the relations. Over-reliance on these automatic algorithms is dangerous because models trained on observational data are susceptible to bias that can be difficult to spot even with expert oversight. Visualization has proven to be effective at bridging the gap between human experts and statistical models by enabling an interactive exploration and manipulation of the data and models. This thesis develops a visual analytics framework to support the interaction between human experts and automatic models in causality analysis. Three case studies were conducted to demonstrate the application of the visual analytics framework in which feature engineering, insight generation, correlation analysis, and causality inspections were showcased.

Contributors: Wang, Hong, Ph.D (Author) / Maciejewski, Ross (Thesis advisor) / He, Jingrui (Committee member) / Davulcu, Hasan (Committee member) / Thies, Cameron (Committee member) / Arizona State University (Publisher)

Created: 2019
