Matching Items (11)
Filtering by

Clear all filters

149907-Thumbnail Image.png
Description
Most existing approaches to complex event processing over streaming data rely on the assumption that the matches to the queries are rare and that the goal of the system is to identify these few matches within the incoming deluge of data. In many applications, such as stock market analysis and

Most existing approaches to complex event processing over streaming data rely on the assumption that the matches to the queries are rare and that the goal of the system is to identify these few matches within the incoming deluge of data. In many applications, such as stock market analysis and user credit card purchase pattern monitoring, however the matches to the user queries are in fact plentiful and the system has to efficiently sift through these many matches to locate only the few most preferable matches. In this work, we propose a complex pattern ranking (CPR) framework for specifying top-k pattern queries over streaming data, present new algorithms to support top-k pattern queries in data streaming environments, and verify the effectiveness and efficiency of the proposed algorithms. The developed algorithms identify top-k matching results satisfying both patterns as well as additional criteria. To support real-time processing of the data streams, instead of computing top-k results from scratch for each time window, we maintain top-k results dynamically as new events come and old ones expire. We also develop new top-k join execution strategies that are able to adapt to the changing situations (e.g., sorted and random access costs, join rates) without having to assume a priori presence of data statistics. Experiments show significant improvements over existing approaches.
ContributorsWang, Xinxin (Author) / Candan, K. Selcuk (Thesis advisor) / Chen, Yi (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created2011
149714-Thumbnail Image.png
Description
This thesis deals with the analysis of interpersonal communication dynamics in online social networks and social media. Our central hypothesis is that communication dynamics between individuals manifest themselves via three key aspects: the information that is the content of communication, the social engagement i.e. the sociological framework emergent of the

This thesis deals with the analysis of interpersonal communication dynamics in online social networks and social media. Our central hypothesis is that communication dynamics between individuals manifest themselves via three key aspects: the information that is the content of communication, the social engagement i.e. the sociological framework emergent of the communication process, and the channel i.e. the media via which communication takes place. Communication dynamics have been of interest to researchers from multi-faceted domains over the past several decades. However, today we are faced with several modern capabilities encompassing a host of social media websites. These sites feature variegated interactional affordances, ranging from blogging, micro-blogging, sharing media elements as well as a rich set of social actions such as tagging, voting, commenting and so on. Consequently, these communication tools have begun to redefine the ways in which we exchange information, our modes of social engagement, and mechanisms of how the media characteristics impact our interactional behavior. The outcomes of this research are manifold. We present our contributions in three parts, corresponding to the three key organizing ideas. First, we have observed that user context is key to characterizing communication between a pair of individuals. However interestingly, the probability of future communication seems to be more sensitive to the context compared to the delay, which appears to be rather habitual. Further, we observe that diffusion of social actions in a network can be indicative of future information cascades; that might be attributed to social influence or homophily depending on the nature of the social action. Second, we have observed that different modes of social engagement lead to evolution of groups that have considerable predictive capability in characterizing external-world temporal occurrences, such as stock market dynamics as well as collective political sentiments. Finally, characterization of communication on rich media sites have shown that conversations that are deemed "interesting" appear to have consequential impact on the properties of the social network they are associated with: in terms of degree of participation of the individuals in future conversations, thematic diffusion as well as emergent cohesiveness in activity among the concerned participants in the network. Based on all these outcomes, we believe that this research can make significant contribution into a better understanding of how we communicate online and how it is redefining our collective sociological behavior.
ContributorsDe Choudhury, Munmun (Author) / Sundaram, Hari (Thesis advisor) / Candan, K. Selcuk (Committee member) / Liu, Huan (Committee member) / Watts, Duncan J. (Committee member) / Seligmann, Doree D. (Committee member) / Arizona State University (Publisher)
Created2011
168435-Thumbnail Image.png
Description
Artificial Intelligence, as the hottest research topic nowadays, is mostly driven by data. There is no doubt that data is the king in the age of AI. However, natural high-quality data is precious and rare. In order to obtain enough and eligible data to support AI tasks, data processing is

Artificial Intelligence, as the hottest research topic nowadays, is mostly driven by data. There is no doubt that data is the king in the age of AI. However, natural high-quality data is precious and rare. In order to obtain enough and eligible data to support AI tasks, data processing is always required. To be even worse, the data preprocessing tasks are often dull and heavy, which require huge human labors to deal with. Statistics show 70% - 80% of the data scientists' time is spent on data integration process. Among various reasons, schema changes that commonly exist in the data warehouse are one significant obstacle that impedes the automation of the end-to-end data integration process. Traditional data integration applications rely on data processing operators such as join, union, aggregation and so on. Those operations are fragile and can be easily interrupted by schema changes. Whenever schema changes happen, the data integration applications will require human labors to solve the interruptions and downtime. The industries as well as the data scientists need a new mechanism to handle the schema changes in data integration tasks. This work proposes a new direction of data integration applications based on deep learning models. The data integration problem is defined in the scenario of integrating tabular-format data with natural schema changes, using the cell-based data abstraction. In addition, data augmentation and adversarial learning are investigated to boost the model robustness to schema changes. The experiments are tested on two real-world data integration scenarios, and the results demonstrate the effectiveness of the proposed approach.
ContributorsWang, Zijie (Author) / Zou, Jia (Thesis advisor) / Baral, Chitta (Committee member) / Candan, K. Selcuk (Committee member) / Arizona State University (Publisher)
Created2021
193344-Thumbnail Image.png
Description
Multivariate timeseries data are highly common in the healthcare domain, especially in the neuroscience field for detecting and predicting seizures to monitoring intracranial hypertension (ICH). Unfortunately, conventional techniques to leverage the available time series data do not provide high degrees of accuracy. To address this challenge, the dissertation focuses on

Multivariate timeseries data are highly common in the healthcare domain, especially in the neuroscience field for detecting and predicting seizures to monitoring intracranial hypertension (ICH). Unfortunately, conventional techniques to leverage the available time series data do not provide high degrees of accuracy. To address this challenge, the dissertation focuses on onset prediction models for children with brain trauma in collaboration with neurologists at Phoenix Children’s Hospital. The dissertation builds on the key hypothesis that leveraging spatial information underlying the electroencephalogram (EEG) sensor graphs can significantly boost the accuracy in a multi-modal environment, integrating EEG with intracranial pressure (ICP), arterial blood pressure (ABP) and electrocardiogram (ECG) modalities. Based on this key hypothesis, the dissertation focuses on novel metadata supported multi-variate time series analysis algorithms for onset detection and prediction. In particular, the dissertation investigates a model architecture with a dual attention mechanism to draw global dependencies between inputs and outputs, leveraging self-attention in EEG data using multi-head attention for transformers, and long short-term memory (LSTM). However, recognizing that the positional encoding used traditionally in transformers does not help capture the spatial/neighborhood context of EEG sensors, the dissertation investigates novel attention techniques for performing explicit spatial learning using a coupled model network. This dissertation has answered the question of leveraging transformers and LSTM to perform implicit and explicit learning using a metadata supported coupled model network a) Robust Multi-variate Temporal Features (RMT) model and LSTM, b) the convolutional neural network - scale space attention (CNN-SSA) and LSTM mapped together using Multi-Head Attention with explicit spatial metadata for EEG sensor graphs for seizure and ICH onset prediction respectively. In addition, this dissertation focuses on transfer learning between multiple groups where target patients have lesser number of EEG channels than the source patients. This incomplete data poses problems during pre-processing. Two approaches are explored using all predictors approach considering spatial context to guide the variates who are used as predictors for the missing EEG channels, and common core/subset of EEG channels. Under data imputation K-Nearest Neighbors (KNN) regression and multi-variate multi-scale neural network (M2NN) are implemented, to address the problem for target patients.
ContributorsRavindranath, Manjusha (Author) / Candan, K. Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Zou, Jia (Committee member) / Luisa Sapino, Maria (Committee member) / Arizona State University (Publisher)
Created2024
157543-Thumbnail Image.png
Description
With the development of modern technological infrastructures, such as social networks or the Internet of Things (IoT), data is being generated at a speed that is never before seen. Analyzing the content of this data helps us further understand underlying patterns and discover relationships among different subsets of data, enabling

With the development of modern technological infrastructures, such as social networks or the Internet of Things (IoT), data is being generated at a speed that is never before seen. Analyzing the content of this data helps us further understand underlying patterns and discover relationships among different subsets of data, enabling intelligent decision making. In this thesis, I first introduce the Low-rank, Win-dowed, Incremental Singular Value Decomposition (SVD) framework to inclemently maintain SVD factors over streaming data. Then, I present the Group Incremental Non-Negative Matrix Factorization framework to leverage redundancies in the data to speed up incremental processing. They primarily tackle the challenges of using factorization models in the scenarios with streaming textual data. In order to tackle the challenges in improving the effectiveness and efficiency of generative models in this streaming environment, I introduce the Incremental Dynamic Multiscale Topic Model framework, which identifies multi-scale patterns and their evolutions within streaming datasets. While the latent factor models assume the linear independence in the latent factors, the generative models assume the observation is generated from a set of latent variables with various distributions. Furthermore, some models may not be accessible or their underlying structures are too complex to understand, such as simulation ensembles, where there may be thousands of parameters with a huge parameter space, the only way to learn information from it is to execute real simulations. When performing knowledge discovery and decision making through data- and model-driven simulation ensembles, it is expensive to operate these ensembles continuously at large scale, due to the high computational. Consequently, given a relatively small simulation budget, it is desirable to identify a sparse ensemble that includes the most informative simulations, while still permitting effective exploration of the input parameter space. Therefore, I present Complexity-Guided Parameter Space Sampling framework, which is an intelligent, top-down sampling scheme to select the most salient simulation parameters to execute, given a limited computational budget. Moreover, I also present a Pivot-Guided Parameter Space Sampling framework, which incrementally maintains a diverse ensemble of models of the simulation ensemble space and uses a pivot guided mechanism for future sample selection.
ContributorsChen, Xilun (Author) / Candan, K. Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Pedrielli, Giulia (Committee member) / Sapino, Maria Luisa (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)
Created2019
152906-Thumbnail Image.png
Description
Multidimensional data have various representations. Thanks to their simplicity in modeling multidimensional data and the availability of various mathematical tools (such as tensor decompositions) that support multi-aspect analysis of such data, tensors are increasingly being used in many application domains including scientific data management, sensor data management, and social network

Multidimensional data have various representations. Thanks to their simplicity in modeling multidimensional data and the availability of various mathematical tools (such as tensor decompositions) that support multi-aspect analysis of such data, tensors are increasingly being used in many application domains including scientific data management, sensor data management, and social network data analysis. Relational model, on the other hand, enables semantic manipulation of data using relational operators, such as projection, selection, Cartesian-product, and set operators. For many multidimensional data applications, tensor operations as well as relational operations need to be supported throughout the data life cycle. In this thesis, we introduce a tensor-based relational data model (TRM), which enables both tensor- based data analysis and relational manipulations of multidimensional data, and define tensor-relational operations on this model. Then we introduce a tensor-relational data management system, so called, TensorDB. TensorDB is based on TRM, which brings together relational algebraic operations (for data manipulation and integration) and tensor algebraic operations (for data analysis). We develop optimization strategies for tensor-relational operations in both in-memory and in-database TensorDB. The goal of the TRM and TensorDB is to serve as a single environment that supports the entire life cycle of data; that is, data can be manipulated, integrated, processed, and analyzed.
ContributorsKim, Mijung (Author) / Candan, K. Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Sundaram, Hari (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)
Created2014
153003-Thumbnail Image.png
Description
Recent efforts in data cleaning have focused mostly on problems like data deduplication, record matching, and data standardization; few of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which

Recent efforts in data cleaning have focused mostly on problems like data deduplication, record matching, and data standardization; few of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. I thus avoid the necessity for a domain expert or master data. I also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. A Map-Reduce architecture to perform this computation in a distributed manner is also shown. I evaluate these methods over both synthetic and real data.
ContributorsDe, Sushovan (Author) / Kambhampati, Subbarao (Thesis advisor) / Chen, Yi (Committee member) / Candan, K. Selcuk (Committee member) / Liu, Huan (Committee member) / Arizona State University (Publisher)
Created2014
158774-Thumbnail Image.png
Description
Technological advances have allowed for the assimilation of a variety of data, driving a shift away from the use of simpler and constrained patterns to more complex and diverse patterns in retrieval and analysis of such data. This shift has inundated the conventional techniques and has stressed the need for

Technological advances have allowed for the assimilation of a variety of data, driving a shift away from the use of simpler and constrained patterns to more complex and diverse patterns in retrieval and analysis of such data. This shift has inundated the conventional techniques and has stressed the need for intelligent mechanisms that can model the complex patterns in the data. Deep neural networks have shown some success at capturing complex patterns, including the so-called attentioned networks, have significant shortcomings in distinguishing what is important in data from what is noise. This dissertation observes that the traditional neural networks primarily rely solely on gradient-based learning to model deep features maps while ignoring the key insight in the data that can be leveraged as complementary information to help learn an accurate model. In particular, this dissertation shows that the localized multi-scale features (captured implicitly or explicitly) can be leveraged to help improve model performance as these features capture salient informative points in the data.

This dissertation focuses on “working with the data, not just on data”, i.e. leveraging feature saliency through pre-training, in-training, and post-training analysis of the data. In particular, non-neural localized multi-scale feature extraction, in images and time series, are relatively cheap to obtain and can provide a rough overview of the patterns in the data. Furthermore, localized features coupled with deep features can help learn a high performing network. A pre-training analysis of sizes, complexities, and distribution of these localized features can help intelligently allocate a user-provided kernel budget in the network as a single-shot hyper-parameter search. Additionally, these localized features can be used as a secondary input modality to the network for cross-attention. Retraining pre-trained networks can be a costly process, yet, a post-training analysis of model inferences can allow for learning the importance of individual network parameters to the model inferences thus facilitating a retraining-free network sparsification with minimal impact on the model performance. Furthermore, effective in-training analysis of the intermediate features in the network help learn the importance of individual intermediate features (neural attention) and this analysis can be achieved through simulating local-extrema detection or learning features simultaneously and understanding their co-occurrences. In summary, this dissertation argues and establishes that, if appropriately leveraged, localized features and their feature saliency can help learn high-accurate, yet cheaper networks.
ContributorsGarg, Yash (Author) / Candan, K. Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Li, Baoxin (Committee member) / Sapino, Maria Luisa (Committee member) / Arizona State University (Publisher)
Created2020
161479-Thumbnail Image.png
Description
Tensors are commonly used for representing multi-dimensional data, such as Web graphs, sensor streams, and social networks. As a consequence of the increase in the use of tensors, tensor decomposition operations began to form the basis for many data analysis and knowledge discovery tasks, from clustering, trend detection, anomaly detection

Tensors are commonly used for representing multi-dimensional data, such as Web graphs, sensor streams, and social networks. As a consequence of the increase in the use of tensors, tensor decomposition operations began to form the basis for many data analysis and knowledge discovery tasks, from clustering, trend detection, anomaly detection to correlationanalysis [31, 38]. It is well known that Singular Value matrix Decomposition (SVD) [9] is used to extract latent semantics for matrix data. When apply SVD to tensors, which have more than two modes, it is tensor decomposition. The two most popular tensor decomposition algorithms are the Tucker [54] and the CP [19] decompositions. Intuitively, they both generalize SVD to tensors. However, one key problem with tensor decomposition is its computational complexity which may cause system bottleneck. Therefore, two phase block-centric CP tensor decomposition (2PCP) was proposed to partition the tensor into small sub-tensors, execute sub-tensor decomposition in parallel and combine the factors from each sub-tensor into final decomposition factors through iterative rerefinement process. Consequently, I proposed Sub-tensor Impact Graph (SIG) to account for inaccuracy propagation among sub-tensors and measure the impact of decomposition of sub-tensors on the other's decomposition, Based on SIG, I proposed several optimization strategies to optimize 2PCP's phase-2 refinement process. Furthermore, I applied SIG and optimization strategies for data focus, data evolution, and focus shifting in tensor analysis. Personalized Tensor Decomposition (PTD) is proposed to account for the users focus given the observations that in many applications, the user may have a focus of interest i.e., part of the data for which the user needs high accuracy and beyond this area focus, accuracy may not be as critical. PTD takes as input one or more areas of focus and performs the decomposition in such a way that, when reconstructed, the accuracy of the tensor is boosted for these areas of focus. A related challenge of data evolution in tensor analytics is incremental tensor decomposition since re-computation of the whole tensor decomposition with each update will cause high computational costs and incur large memory overheads. Especially for applications where data evolves over time and the tensor-based analysis results need to be continuouslymaintained. To avoid re-decomposition, I propose a two-phase block-incremental CP-based tensor decomposition technique, BICP, that efficiently and effectively maintains tensor decomposition results in the presence of dynamically evolving tensor data. I further extend the research focus on user focus shift. User focus may change over time as data is evolving along the time. Although PTD is efficient, re-computation for each user preference update can be the bottleneck for the system. Therefore I propose dynamic evolving user focus tensor decomposition which can smartly reuse the existing decomposition result to improve the efficiency of evolving user focus block decomposition.
ContributorsHuang, shengyu (Author) / Candan, K. Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Sapino, Maria Luisa (Committee member) / Tong, Hanghang (Committee member) / Zou, Jia (Committee member) / Arizona State University (Publisher)
Created2021
161901-Thumbnail Image.png
Description
The need of effective forecasting models for multi-variate time series has been underlined by the integration of sensory technologies into essential applications such as building energy optimizations, flight monitoring, and health monitoring. To meet this requirement, time series prediction techniques have been expanded from uni-variate to multi-variate. However, due to

The need of effective forecasting models for multi-variate time series has been underlined by the integration of sensory technologies into essential applications such as building energy optimizations, flight monitoring, and health monitoring. To meet this requirement, time series prediction techniques have been expanded from uni-variate to multi-variate. However, due to the extended models’ poor ability to capture the intrinsic relationships among variates, naïve extensions of prediction approaches result in an unwanted rise in the cost of model learning and, more critically, a significant loss in model performance. While recurrent models like Long Short-Term Memory (LSTM) and Recurrent Neural Network Network (RNN) are designed to capture the temporal intricacies in data, their performance can soon deteriorate. First, I claim in this thesis that (a) by exploiting temporal alignments of variates to quantify the importance of the recorded variates in relation to a target variate, one can build a more accurate forecasting model. I also argue that (b) traditional time series similarity/distance functions, such as Dynamic Time Warping (DTW), which require that variates have similar absolute patterns are fundamentally ill-suited for this purpose, and that should instead quantify temporal correlation in terms of temporal alignments of key “events” impacting these series, rather than series similarity. Further, I propose that (c) while learning a temporal model with recurrence-based techniques (such as RNN and LSTM – even when leveraging attention strategies) is challenging and expensive, the better results can be obtained by coupling simpler CNNs with an adaptive variate selection strategy. Putting these together, I introduce a novel Selego framework for variate selection based on these arguments, and I experimentally evaluate the performance of the proposed approach on various forecasting models, such as LSTM, RNN, and CNN, for different top-X% percent variates and different forecasting time in the future (lead), on multiple real-world data sets. Experiments demonstrate that the proposed framework can reduce the number of recorded variates required to train predictive models by 90 - 98% while also increasing accuracy. Finally, I present a fault onset detection technique that leverages the precise baseline forecasting models trained using the Selego framework. The proposed, Selego-enabled Fault Detection Framework (FDF-Selego) has been experimentally evaluated within the context of detecting the onset of faults in the building Heating, Ventilation, and Air Conditioning (HVAC) system.
ContributorsTiwaskar, Manoj (Author) / Candan, K. Selcuk (Thesis advisor) / Sapino, Maria Luisa (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created2021