- Member of: ASU Electronic Theses and Dissertations
This research investigates the problem of preference-aware skyline processing, which consists of inferring the preferences of a user and computing a skyline specific to that user that takes those preferences into account. This research proposes a model that transforms the data from a given space to a user preferential space where each attribute represents the preference of the user. This study proposes two techniques, "Preferential Skyline Processing" and "Latent Skyline Processing", to efficiently compute preference-aware skylines in the user preferential space. Finally, extensive experiments and performance analysis confirm the correctness of the recommendations and the algorithms' ability to outperform the naïve ones.
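As a rough illustration of the idea above, the following sketch maps items into a preference space and then computes a plain skyline there. It is not the "Preferential Skyline Processing" or "Latent Skyline Processing" technique from the abstract; the per-attribute preference functions, the hotel data, and the names `dominates` and `preference_skyline` are all assumptions made for this example, and the skyline is computed with a simple quadratic scan.

```python
from typing import Callable, List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as preferred as b in every attribute and
    strictly more preferred in at least one (larger means more preferred)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def preference_skyline(
    items: List[Tuple[float, ...]],
    prefs: Sequence[Callable[[float], float]],
) -> List[Tuple[float, ...]]:
    """Map each item into a (hypothetical) preference space, then keep the
    items whose mapped vectors are not dominated by any other mapped vector."""
    mapped = [tuple(f(v) for f, v in zip(prefs, item)) for item in items]
    return [
        item
        for item, m in zip(items, mapped)
        if not any(dominates(other, m) for other in mapped)
    ]


# Assumed example: hotels as (price, distance_km); this user prefers
# cheaper and closer, so both attributes are negated in preference space.
hotels = [(50, 3.0), (80, 1.0), (90, 2.5), (60, 3.5)]
user_prefs = [lambda price: -price, lambda dist: -dist]
result = preference_skyline(hotels, user_prefs)
```

With these assumed preferences, `result` keeps the cheapest hotel and the closest hotel and drops the two that are worse on both attributes than some other option.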
I conducted a user study to evaluate the effectiveness of this approach. Results show that users were able to identify relevant information more precisely via the visual interface than via the traditional list-based approach. Network visualization demonstrated effective search-result navigation support that facilitated users' tasks and improved query quality for successive queries. Subjective evaluation also showed that visualizing search results conveys more semantic information in an efficient manner and makes searching more effective.
guessed by different sources. The values of different properties can be obtained from
various sources, which leads to disagreement among the sources. An important task
is to obtain the truth from these sometimes contradictory sources. As an extension
of computing the truth, the reliability of the sources also needs to be computed. There are
models which compute the precision values. In those earlier models (Banerjee et al.,
2005; Dong and Naumann, 2009; Kasneci et al., 2011; Li et al., 2012; Marian and
Wu, 2011; Zhao and Han, 2012; Zhao et al., 2012), multiple properties are modeled
individually. In one existing work, the heterogeneous properties are modeled
jointly. In that work, the Conflict Resolution on Heterogeneous Data (CRH)
framework is based on single-objective optimization. Because this is a
non-convex optimization problem, only one local optimal solution is found, and that
solution depends on the initial point. This single-objective optimization problem is
converted into a multi-objective optimization problem, for which the Pareto
optimal points are computed. In an extension of that, the single-objective
optimization problem is solved with numerous initial points.
These two approaches are used to search for a solution better than the one
obtained by CRH with the median as the initial point for the continuous variables and
majority voting as the initial point for the categorical variables. In the experiments,
the solution from CRH lies among the Pareto optimal points of the multi-objective
optimization, and the solution from CRH is the optimum solution
in these experiments.
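The initial point described above (median for continuous properties, majority voting for categorical ones) can be sketched as follows. This shows only the initialization step, not the CRH optimization itself; the function name `initial_truths`, the input layout, and the weather data are assumptions made for illustration.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Set, Union

Value = Union[float, str]


def initial_truths(
    claims: Dict[str, List[Value]], categorical: Set[str]
) -> Dict[str, Value]:
    """Build an initial truth estimate from conflicting source claims.

    claims maps each property to the values claimed by the sources.
    Categorical properties use majority voting; all others use the median.
    """
    truths: Dict[str, Value] = {}
    for prop, values in claims.items():
        if prop in categorical:
            # Majority vote: the most frequently claimed value wins.
            truths[prop] = Counter(values).most_common(1)[0][0]
        else:
            # The median is robust to a single outlying source.
            truths[prop] = median(values)
    return truths


# Assumed example: three sources report the weather for one city.
claims = {
    "temperature": [71, 73, 90],
    "condition": ["sunny", "sunny", "rain"],
}
start = initial_truths(claims, categorical={"condition"})
```

Because the problem is non-convex, a different starting point can lead an iterative solver to a different local optimum, which is why the abstract also considers solving from numerous initial points.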
Yet, often, queries come as part of a query workload. Therefore, there is a need
for index structures that can take into consideration existence of multiple queries in a
query workload and efficiently produce accurate results for the entire query workload.
These index structures should be scalable to handle large amounts of data as well as
large query workloads.
The main objective of this dissertation is to create and design scalable index structures
that are optimized for range query workloads. Range queries are an important
type of query with wide-ranging applications. There are no existing index structures
that are optimized for efficient execution of range query workloads. There are
also unique challenges that need to be addressed for range queries in 1D, 2D, and
high-dimensional spaces. In this work, I introduce novel cost models, index selection
algorithms, and storage mechanisms that can tackle these challenges and efficiently
process a given range query workload in 1D, 2D, and high-dimensional spaces. In particular,
I introduce the index structures HCS (for 1D spaces), cSHB (for 2D spaces),
and PSLSH (for high-dimensional spaces), which are designed specifically to efficiently
handle range query workloads and the unique challenges arising from their respective
spaces. I experimentally show the effectiveness of the above proposed index structures
by comparing with state-of-the-art techniques.
Data integration involves the reconciliation of data from diverse data sources in order to obtain a unified data repository, upon which an end user such as a data analyst can run analytics sessions to explore the data and obtain useful insights. Supervised Machine Learning (ML) for data integration tasks such as ontology (schema) or entity (instance) matching requires many training examples in the form of manually curated, pre-labeled matching and non-matching schema concept or entity pairs, which are hard to obtain. Along similar lines, an analytics system without predictive capabilities about the impending workload can incur huge querying latencies, while leaving the user with the onus of understanding the underlying database schema and writing a meaningful query at every step of a data exploration session. In this dissertation, I will describe the human-in-the-loop Machine Learning (ML) systems that I have built for data integration and predictive analytics. I alleviate the need for extensive prior labeling by utilizing active learning (AL) for data integration. In each AL iteration, I detect the unlabeled entity or schema concept pairs that would strengthen the ML classifier and selectively query the human oracle for such labels in a budgeted fashion. Thus, I make use of human assistance for ML-based data integration. On the other hand, when the human is an end user exploring data through Online Analytical Processing (OLAP) queries, my goal is to proactively assist the human by predicting the top-K next queries that s/he is likely to be interested in. I will describe my proposed SQL predictor, a Business Intelligence (BI) query predictor, and a geospatial query cardinality estimator, with an emphasis on schema abstraction, query representation, and how I adapt the ML models for these tasks. For each system, I will discuss the evaluation metrics and how the proposed systems compare to the state-of-the-art baselines on multiple datasets and query workloads.
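One common way to pick which unlabeled pairs to send to the human oracle under a budget is uncertainty sampling, sketched below. The abstract does not say which selection criterion the dissertation uses, so this is an assumption; the function name `select_for_labeling`, the pair identifiers, and the probabilities are hypothetical.

```python
from typing import Dict, List


def select_for_labeling(match_probability: Dict[str, float], budget: int) -> List[str]:
    """Return up to `budget` candidate pairs whose predicted match
    probability is closest to 0.5, i.e. where the classifier is least sure."""
    ranked = sorted(
        match_probability,
        key=lambda pair_id: abs(match_probability[pair_id] - 0.5),
    )
    return ranked[:budget]


# Assumed classifier outputs for four unlabeled entity pairs.
scores = {"a-b": 0.97, "a-c": 0.52, "b-d": 0.10, "c-d": 0.41}
to_label = select_for_labeling(scores, budget=2)
```

In a full active learning loop, the labels returned by the oracle for `to_label` would be added to the training set and the classifier retrained before the next iteration.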
processing, time series, and genome data. In higher dimensions, the curse of
dimensionality undermines the effectiveness of most index structures, giving
way to approximate methods like Locality Sensitive Hashing (LSH) for answering similarity
searches. In addition to range searches and k-nearest neighbor searches, there
is a need to answer negative queries, formed by excluded regions, in high-dimensional
data. Though there has been a slew of LSH variants that improve efficiency, reduce
storage, and provide better accuracy, none of these techniques is capable of
answering queries in the presence of excluded regions.
This thesis provides a novel approach to handle such negative queries. This is
achieved by creating a prefix-based hierarchical index structure. First, the
high-dimensional space is projected to a lower-dimensional space. Then, a one-dimensional
ordering is developed, while retaining the hierarchical traits. The algorithm intelligently
prunes the irrelevant candidates while answering queries in the presence of
excluded regions. While naive LSH would need to filter out the negative query results
from the main results, the new algorithm minimizes the need to fetch the redundant
results in the first place. Experimental results show that this reduces post-processing
cost, thereby reducing the query processing time.
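For contrast, the naive post-filtering baseline mentioned above can be sketched as follows. This is not the prefix-based hierarchical index from the thesis: a brute-force scan stands in for LSH candidate retrieval, excluded regions are assumed to be balls, and the function names and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Ball = Tuple[Point, float]  # (center, radius)


def within(point: Sequence[float], center: Sequence[float], radius: float) -> bool:
    """True if point lies inside the ball (Euclidean distance <= radius)."""
    return math.dist(point, center) <= radius


def naive_query_with_exclusions(
    points: List[Point], center: Point, radius: float, excluded: List[Ball]
) -> List[Point]:
    """Fetch every point in the query range, then discard those that fall
    inside any excluded region (the post-processing step the thesis avoids)."""
    # Step 1: candidate retrieval (a real system would use LSH here).
    candidates = [p for p in points if within(p, center, radius)]
    # Step 2: post-filter the candidates against each excluded region.
    return [p for p in candidates if not any(within(p, c, r) for c, r in excluded)]


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
answer = naive_query_with_exclusions(
    data, center=(0.0, 0.0), radius=2.5, excluded=[((1.0, 0.0), 0.5)]
)
```

Here the point `(1.0, 0.0)` is fetched in step 1 and then thrown away in step 2; avoiding that wasted fetch is the saving the abstract attributes to the new index.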