Search Content

Expanding data mining theory for industrial applications

Description

The field of Data Mining is widely recognized and accepted for its applications in many business problems to guide decision-making processes based on data. However, in recent times, the scope of these problems has swollen and the methods are under scrutiny for applicability and relevance to real-world circumstances. At the…

The field of Data Mining is widely recognized and accepted for its applications in many business problems to guide decision-making processes based on data. However, in recent times, the scope of these problems has swollen and the methods are under scrutiny for applicability and relevance to real-world circumstances. At the crossroads of innovation and standards, it is important to examine and understand whether the current theoretical methods for industrial applications (which include KDD, SEMMA and CRISP-DM) encompass all possible scenarios that could arise in practical situations. Do the methods require changes or enhancements? As part of the thesis I study the current methods and delineate the ideas of these methods and illuminate their shortcomings which posed challenges during practical implementation. Based on the experiments conducted and the research carried out, I propose an approach which illustrates the business problems with higher accuracy and provides a broader view of the process. It is then applied to different case studies highlighting the different aspects to this approach.

ContributorsAnand, Aneeth (Author) / Liu, Huan (Thesis advisor) / Kempf, Karl G. (Thesis advisor) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2012

Diffusion in Networks: Source Localization, History Reconstruction and Real-Time Network Robustification

Description

Diffusion processes in networks can be used to model many real-world processes, such as the propagation of a rumor on social networks and cascading failures on power networks. Analysis of diffusion processes in networks can help us answer important questions such as the role and the importance of each node…

Diffusion processes in networks can be used to model many real-world processes, such as the propagation of a rumor on social networks and cascading failures on power networks. Analysis of diffusion processes in networks can help us answer important questions such as the role and the importance of each node in the network for spreading the diffusion and how to top or contain a cascading failure in the network. This dissertation consists of three parts.

In the first part, we study the problem of locating multiple diffusion sources in networks under the Susceptible-Infected-Recovered (SIR) model. Given a complete snapshot of the network, we developed a sample-path-based algorithm, named clustering and localization, and proved that for regular trees, the estimators produced by the proposed algorithm are within a constant distance from the real sources with a high probability. Then, we considered the case in which only a partial snapshot is observed and proposed a new algorithm, named Optimal-Jordan-Cover (OJC). The algorithm first extracts a subgraph using a candidate selection algorithm that selects source candidates based on the number of observed infected nodes in their neighborhoods. Then, in the extracted subgraph, OJC finds a set of nodes that "cover" all observed infected nodes with the minimum radius. The set of nodes is called the Jordan cover, and is regarded as the set of diffusion sources. We proved that OJC can locate all sources with probability one asymptotically with partial observations in the Erdos-Renyi (ER) random graph. Multiple experiments on different networks were done, which show our algorithms outperform others.

In the second part, we tackle the problem of reconstructing the diffusion history from partial observations. We formulated the diffusion history reconstruction problem as a maximum a posteriori (MAP) problem and proved the problem is NP hard. Then we proposed a step-by- step reconstruction algorithm, which can always produce a diffusion history that is consistent with the partial observations. Our experimental results based on synthetic and real networks show that the algorithm significantly outperforms some existing methods.

In the third part, we consider the problem of improving the robustness of an interdependent network by rewiring a small number of links during a cascading attack. We formulated the problem as a Markov decision process (MDP) problem. While the problem is NP-hard, we developed an effective and efficient algorithm, RealWire, to robustify the network and to mitigate the damage during the attack. Extensive experimental results show that our algorithm outperforms other algorithms on most of the robustness metrics.

ContributorsChen, Zhen (Author) / Ying, Lei (Thesis advisor) / Tong, Hanghang (Thesis advisor) / Zhang, Junshan (Committee member) / He, Jingrui (Committee member) / Arizona State University (Publisher)

Created2018

Network Representation Learning in Social Media

Description

The popularity of social media has generated abundant large-scale social networks, which advances research on network analytics. Good representations of nodes in a network can facilitate many network mining tasks. The goal of network representation learning (network embedding) is to learn low-dimensional vector representations of social network nodes that capture…

The popularity of social media has generated abundant large-scale social networks, which advances research on network analytics. Good representations of nodes in a network can facilitate many network mining tasks. The goal of network representation learning (network embedding) is to learn low-dimensional vector representations of social network nodes that capture certain properties of the networks. With the learned node representations, machine learning and data mining algorithms can be applied for network mining tasks such as link prediction and node classification. Because of its ability to learn good node representations, network representation learning is attracting increasing attention and various network embedding algorithms are proposed.

Despite the success of these network embedding methods, the majority of them are dedicated to static plain networks, i.e., networks with fixed nodes and links only; while in social media, networks can present in various formats, such as attributed networks, signed networks, dynamic networks and heterogeneous networks. These social networks contain abundant rich information to alleviate the network sparsity problem and can help learn a better network representation; while plain network embedding approaches cannot tackle such networks. For example, signed social networks can have both positive and negative links. Recent study on signed networks shows that negative links have added value in addition to positive links for many tasks such as link prediction and node classification. However, the existence of negative links challenges the principles used for plain network embedding. Thus, it is important to study signed network embedding. Furthermore, social networks can be dynamic, where new nodes and links can be introduced anytime. Dynamic networks can reveal the concept drift of a user and require efficiently updating the representation when new links or users are introduced. However, static network embedding algorithms cannot deal with dynamic networks. Therefore, it is important and challenging to propose novel algorithms for tackling different types of social networks.

In this dissertation, we investigate network representation learning in social media. In particular, we study representative social networks, which includes attributed network, signed networks, dynamic networks and document networks. We propose novel frameworks to tackle the challenges of these networks and learn representations that not only capture the network structure but also the unique properties of these social networks.

ContributorsWang, Suhang (Author) / Liu, Huan (Thesis advisor) / Aggarwal, Charu (Committee member) / Sen, Arunabha (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created2018

Automatic tracking of linguistic changes for monitoring cognitive-linguistic health

Description

Many neurological disorders, especially those that result in dementia, impact speech and language production. A number of studies have shown that there exist subtle changes in linguistic complexity in these individuals that precede disease onset. However, these studies are conducted on controlled speech samples from a specific task. This thesis…

Many neurological disorders, especially those that result in dementia, impact speech and language production. A number of studies have shown that there exist subtle changes in linguistic complexity in these individuals that precede disease onset. However, these studies are conducted on controlled speech samples from a specific task. This thesis explores the possibility of using natural language processing in order to detect declining linguistic complexity from more natural discourse. We use existing data from public figures suspected (or at risk) of suffering from cognitive-linguistic decline, downloaded from the Internet, to detect changes in linguistic complexity. In particular, we focus on two case studies. The first case study analyzes President Ronald Reagan’s transcribed spontaneous speech samples during his presidency. President Reagan was diagnosed with Alzheimer’s disease in 1994, however my results showed declining linguistic complexity during the span of the 8 years he was in office. President George Herbert Walker Bush, who has no known diagnosis of Alzheimer’s disease, shows no decline in the same measures. In the second case study, we analyze transcribed spontaneous speech samples from the news conferences of 10 current NFL players and 18 non-player personnel since 2007. The non-player personnel have never played professional football. Longitudinal analysis of linguistic complexity showed contrasting patterns in the two groups. The majority (6 of 10) of current players showed decline in at least one measure of linguistic complexity over time. In contrast, the majority (11 out of 18) of non-player personnel showed an increase in at least one linguistic complexity measure.

ContributorsWang, Shuai (Author) / Berisha, Visar (Thesis advisor) / LaCross, Amy (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created2016

SPSR efficient processing of socially k-nearest neighbors with spatial range filter

Description

Social media has become popular in the past decade. Facebook for example has 1.59 billion active users monthly. With such massive social networks generating lot of data, everyone is constantly looking for ways of leveraging the knowledge from social networks to make their systems more personalized to their end users.…

Social media has become popular in the past decade. Facebook for example has 1.59 billion active users monthly. With such massive social networks generating lot of data, everyone is constantly looking for ways of leveraging the knowledge from social networks to make their systems more personalized to their end users. And with rapid increase in the usage of mobile phones and wearables, social media data is being tied to spatial networks. This research document proposes an efficient technique that answers socially k-Nearest Neighbors with Spatial Range Filter. The proposed approach performs a joint search on both the social and spatial domains which radically improves the performance compared to straight forward solutions. The research document proposes a novel index that combines social and spatial indexes. In other words, graph data is stored in an organized manner to filter it based on spatial (region of interest) and social constraints (top-k closest vertices) at query time. That leads to pruning necessary paths during the social graph traversal procedure, and only returns the top-K social close venues. The research document then experimentally proves how the proposed approach outperforms existing baseline approaches by at least three times and also compare how each of our algorithms perform under various conditions on a real geo-social dataset extracted from Yelp.

ContributorsPasumarthy, Nitin (Author) / Sarwat, Mohamed (Thesis advisor) / Papotti, Paolo (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2016

Video2Vec: learning semantic spatio-temporal embedding for video representations

Description

High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been purposed, such…

High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been purposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolution neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolution architectures designed for static image classifications. Then simple average, concatenation or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner considering both semantic and spatio-temporal information.

In this thesis, I propose a novel architecture to learn semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic video retrieval (word-to-video) retrieval, using the UCF101 action recognition dataset.

ContributorsHu, Sheng-Hung (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Liang, Jianming (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created2016

Algorithmic Foundations of Self-Organizing Programmable Matter

Description

Imagine that we have a piece of matter that can change its physical properties like its shape, density, conductivity, or color in a programmable fashion based on either user input or autonomous sensing. This is the vision behind what is commonly known as programmable matter. Envisioning systems of nano-sensors devices,…

Imagine that we have a piece of matter that can change its physical properties like its shape, density, conductivity, or color in a programmable fashion based on either user input or autonomous sensing. This is the vision behind what is commonly known as programmable matter. Envisioning systems of nano-sensors devices, programmable matter consists of systems of simple computational elements, called particles, that can establish and release bonds, compute, and can actively move in a self-organized way. In this dissertation the feasibility of solving fundamental problems relevant for programmable matter is investigated. As a model for such self-organizing particle systems (SOPS), the geometric amoebot model is introduced. In this model, particles only have local information and have modest computational power. They achieve locomotion by expanding and contracting, which resembles the behavior of amoeba. Under this model, efficient local-control algorithms for the leader election problem in SOPS are presented. As a central problem for programmable matter, shape formation problems are then studied. The limitations of solving the leader election problem and the shape formation problem on a more general version of the amoebot model are also discussed. The \smart paint" problem is also studied which aims at having the particles self-organize in order to uniformly coat the surface of an object of arbitrary shape and size, forming multiple coating layers if necessary. A Universal Coating algorithm is presented and shown to be asymptotically worst-case optimal both in terms of time with high probability and work. In particular, the algorithm always terminates within a linear number of rounds with high probability. A linear lower bound on the competitive gap between fully local coating algorithms and coating algorithms that rely on global information is presented, which implies that the proposed algorithm is also optimal in a competitive sense. Simulation results show that the competitive ratio of the proposed algorithm may be better than linear in practice. Developed algorithms utilize only local control, require only constant-size memory particles, and are asymptotically optimal in terms of the total number of particle movements needed to reach the desired shape configuration.

ContributorsDerakhshandeh, Zahra (Author) / Richa, Andrea (Thesis advisor) / Sen, Arunabha (Thesis advisor) / Xue, Guoliang (Committee member) / Scheideler, Christian (Committee member) / Arizona State University (Publisher)

Created2017

Heuristics for Arc Routing Problems and Their Applications

Description

Arc Routing Problems (ARPs) are a type of routing problem that finds routes of minimum total cost covering the edges or arcs in a graph representing street or road networks. They find application in many essential services such as residential waste collection, winter gritting, and others. Being NP-hard, solutions are…

Arc Routing Problems (ARPs) are a type of routing problem that finds routes of minimum total cost covering the edges or arcs in a graph representing street or road networks. They find application in many essential services such as residential waste collection, winter gritting, and others. Being NP-hard, solutions are usually found using heuristic methods. This dissertation contributes to heuristics for ARP, with a focus on the Capacitated Arc Routing Problem (CARP) with additional constraints. In operations such as residential waste collection, vehicle breakdown disruptions occur frequently. A new variant Capacitated Arc Re-routing Problem for Vehicle Break-down (CARP-VB) is introduced to address the need to re-route using only remaining vehicles to avoid missing services. A new heuristic Probe is developed to solve CARP-VB. Experiments on benchmark instances show that Probe is better in reducing the makespan and hence effective in reducing delays and avoiding missing services. In addition to total cost, operators are also interested in solutions that are attractive, that is, routes that are contiguous, compact, and non-overlapping to manage the work. Operators may not adopt a solution that is not attractive even if it is optimum. They are also interested in solutions that are balanced in workload to meet equity requirements. A new multi-objective memetic algorithm, MA-ABC is developed, that optimizes three objectives: Attractiveness, makespan, and total cost. On testing with benchmark instances, MA-ABC was found to be effective in providing attractive and balanced route solutions without affecting the total cost. Changes in the problem specification such as demand and topology occurs frequently in business operations. Machine learning be applied to learn the distribution behind these changes and generate solutions quickly at time of inference. Splice is a machine learning framework for CARP that generates closer to optimum solutions quickly using a graph neural network and deep Q-learning. Splice can solve several variants of node and arc routing problems using the same architecture without any modification. Splice was trained and tested using randomly generated instances. Splice generated solutions faster that are also better in comparison to popular metaheuristics.

ContributorsRamamoorthy, Muhilan (Author) / Syrotiuk, Violet R. (Thesis advisor) / Forrest, Stephanie (Committee member) / Mirchandani, Pitu (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2022

Cooperative Driving of Connected Autonomous Vehicles Using Responsibility Sensitive Safety Rules

Description

In the recent times, traffic congestion and motor accidents have been a major problem for transportation in major cities. Intelligent Transportation Systems has the potential to be an effective solution in order to tackle this issue. Connected Autonomous Vehicles can cooperate at intersections, ramp merging, lane change and other conflicting…

In the recent times, traffic congestion and motor accidents have been a major problem for transportation in major cities. Intelligent Transportation Systems has the potential to be an effective solution in order to tackle this issue. Connected Autonomous Vehicles can cooperate at intersections, ramp merging, lane change and other conflicting scenarios in order to resolve the conflicts and avoid collisions with other vehicles. A lot of works has been proposed for specific scenarios such as intersections, ramp merging or lane change which partially solve the conflict resolution problem. Also, one of the major issues in autonomous decision making - deadlocks have not been considered in some of the works. The existing works either do not consider deadlocks or lack a safety proof. This thesis proposes a cooperative driving solution that provides a complete navigation, conflict resolution and deadlock resolution for connected autonomous vehicles. A graph-based model is used to resolve the deadlocks between vehicles and the responsibility sensitive safety (RSS) rules have been used in order to ensure safety of the autonomous vehicles during conflict detection and resolution. This algorithm provides a complete navigation solution for an autonomous vehicle from its source to destination. The algorithm ensures that accidents do not occur even in the worst-case scenario and the decision making is deadlock free.

ContributorsAllamsetti, Harshith (Author) / Shrivastava, Aviral (Thesis advisor) / Sen, Arunabha (Committee member) / Ren, Fengbo (Committee member) / Arizona State University (Publisher)

Created2020

Learning from the Data Heterogeneity for Data Imputation

Description

Data mining, also known as big data analysis, has been identified as a critical and challenging process for a variety of applications in real-world problems. Numerous datasets are collected and generated every day to store the information. The rise in the number of data volumes and data modality has resulted…

Data mining, also known as big data analysis, has been identified as a critical and challenging process for a variety of applications in real-world problems. Numerous datasets are collected and generated every day to store the information. The rise in the number of data volumes and data modality has resulted in the increased demand for data mining methods and strategies of finding anomalies, patterns, and correlations within large data sets to predict outcomes. Effective machine learning methods are widely adapted to build the data mining pipeline for various purposes like business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The major challenges for effectively and efficiently mining big data include (1) data heterogeneity and (2) missing data. Heterogeneity is the natural characteristic of big data, as the data is typically collected from different sources with diverse formats. The missing value is the most common issue faced by the heterogeneous data analysis, which resulted from variety of factors including the data collecting processing, user initiatives, erroneous data entries, and so on. In response to these challenges, in this thesis, three main research directions with application scenarios have been investigated: (1) Mining and Formulating Heterogeneous Data, (2) missing value imputation strategy in various application scenarios in both offline and online manner, and (3) missing value imputation for multi-modality data. Multiple strategies with theoretical analysis are presented, and the evaluation of the effectiveness of the proposed algorithms compared with state-of-the-art methods is discussed.

Contributorsliu, Xu (Author) / He, Jingrui (Thesis advisor) / Xue, Guoliang (Thesis advisor) / Li, Baoxin (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created2021

Filtering by