This research investigates the problem of preference-aware skyline processing, which consists of inferring a user's preferences and computing a skyline specific to that user that takes those preferences into account. It proposes a model that transforms the data from a given space into a user preferential space in which each attribute represents a preference of the user. Two techniques, "Preferential Skyline Processing" and "Latent Skyline Processing," are proposed to efficiently compute preference-aware skylines in the user preferential space. Finally, extensive experiments and performance analysis confirm the correctness of the recommendations and show that the proposed algorithms outperform the naïve ones.
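The dominance test at the heart of skyline computation can be illustrated with a small sketch. This is a naïve O(n²) baseline, not the Preferential or Latent Skyline algorithms themselves, and the `score_fns` mapping into the preference space is a hypothetical stand-in for the inferred preference model:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every attribute
    (here: larger is better) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    """Naive O(n^2) skyline: keep every point not dominated by another."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def preference_skyline(points, score_fns):
    """Map each point into a (hypothetical) user-preference space,
    one score per inferred preference, and compute the skyline there."""
    pref_space = [tuple(f(p) for f in score_fns) for p in points]
    keep = skyline(pref_space)
    return [p for p, s in zip(points, pref_space) if s in keep]

# Illustrative data: hotels as (price, rating); the assumed user
# prefers cheaper and higher-rated, so price is negated.
hotels = [(100, 4.0), (80, 4.5), (120, 3.0)]
result = preference_skyline(hotels, [lambda p: -p[0], lambda p: p[1]])
```

Here the second hotel is both cheaper and better rated, so it dominates the others in the preference space and is the only skyline point.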
I conducted a user study to evaluate the effectiveness of this approach. Results show that users were able to identify relevant information more precisely via the visual interface than with a traditional list-based approach. The network visualization provided effective search-result navigation support that facilitated users' tasks and improved query quality for successive queries. Subjective evaluation also showed that visualizing search results conveys more semantic information in an efficient manner and makes searching more effective.
guessed by different sources. The values of different properties can be obtained from various sources, which may disagree with one another. An important task is to obtain the truth from these sometimes contradictory sources. Alongside computing the truth, the reliability of the sources needs to be estimated. Earlier models that compute such precision values, e.g., Banerjee et al. (2005), Dong and Naumann (2009), Kasneci et al. (2011), Li et al. (2012), Marian and Wu (2011), Zhao and Han (2012), and Zhao et al. (2012), model multiple properties individually. In one existing work, the heterogeneous properties are modeled jointly: the Conflict Resolution on Heterogeneous Data (CRH) framework, which is based on single-objective optimization. Because this single-objective problem is non-convex, only one locally optimal solution is found, and the point reached depends on the initial point. In this work, the single-objective optimization problem is converted into a multi-objective optimization problem, for which the Pareto optimal points are computed. As an extension, the single-objective problem is also solved from numerous initial points. These two approaches are used to find solutions better than the one obtained by CRH when initialized with the median for continuous variables and majority voting for categorical variables. In the experiments, the solution produced by CRH lies among the Pareto optimal points of the multi-objective optimization and turns out to be the optimum solution.
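The single-objective CRH idea, alternating between estimating truths and re-weighting sources by how far their claims fall from the current truth estimates, can be sketched as follows. This is a simplified illustration for continuous claims only (weighted average instead of weighted median, squared loss, log-based weight update); it is not the full CRH framework or the multi-objective extension:

```python
import numpy as np

def crh_truth_discovery(claims, n_iters=20):
    """Toy CRH-style truth discovery for continuous claims.

    claims[k][i] = value that source k reports for item i.
    Returns (truths, source_weights)."""
    claims = np.asarray(claims, dtype=float)
    n_sources = claims.shape[0]
    weights = np.ones(n_sources) / n_sources        # start from uniform weights
    for _ in range(n_iters):
        # Truth step: weighted average of the sources' claims per item.
        truths = weights @ claims / weights.sum()
        # Weight step: sources with smaller total squared error get a
        # larger weight via a negative-log update (the CRH-style idea).
        losses = ((claims - truths) ** 2).sum(axis=1)
        losses = losses / losses.sum()
        weights = -np.log(losses + 1e-12)
        weights = weights / weights.sum()
    return truths, weights

# Two consistent sources and one noisy source claiming values for 3 items.
claims = [[10.0, 20.0, 30.0],
          [10.1, 19.9, 30.2],
          [15.0, 25.0, 40.0]]
truths, weights = crh_truth_discovery(claims)
```

The noisy third source ends up with a much smaller weight, and the truth estimates converge toward the values agreed on by the two reliable sources.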
Yet queries often come as part of a query workload. Therefore, there is a need for index structures that can take into account the existence of multiple queries in a query workload and efficiently produce accurate results for the entire workload. These index structures should be scalable enough to handle large amounts of data as well as large query workloads.
The main objective of this dissertation is to design scalable index structures that are optimized for range query workloads. Range queries are an important type of query with wide-ranging applications, yet no existing index structures are optimized for the efficient execution of range query workloads. There are also unique challenges that need to be addressed for range queries in 1D, 2D, and high-dimensional spaces. In this work, I introduce novel cost models, index selection algorithms, and storage mechanisms that tackle these challenges and efficiently process a given range query workload in 1D, 2D, and high-dimensional spaces. In particular, I introduce the index structures HCS (for 1D spaces), cSHB (for 2D spaces), and PSLSH (for high-dimensional spaces), which are designed specifically to handle range query workloads efficiently along with the unique challenges arising from their respective spaces. I experimentally show the effectiveness of the proposed index structures by comparing them with state-of-the-art techniques.
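As a point of reference for the interface such structures must support, a minimal 1-D range-query baseline over a sorted array can be sketched as follows. This is only an illustrative baseline, not HCS, cSHB, or PSLSH, and it ignores the workload-aware cost models described above:

```python
import bisect

def build_index(values):
    """Baseline 1-D 'index': simply the sorted values."""
    return sorted(values)

def range_query(index, lo, hi):
    """Return all values v with lo <= v <= hi via binary search."""
    left = bisect.bisect_left(index, lo)
    right = bisect.bisect_right(index, hi)
    return index[left:right]

def run_workload(index, workload):
    """Execute a workload of (lo, hi) range queries one by one."""
    return [range_query(index, lo, hi) for lo, hi in workload]
```

A workload-optimized structure would instead plan across all `(lo, hi)` pairs at once, which is exactly the gap the dissertation targets.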
Data integration involves the reconciliation of data from diverse data sources in order to obtain a unified data repository, upon which an end user such as a data analyst can run analytics sessions to explore the data and obtain useful insights. Supervised Machine Learning (ML) for data integration tasks such as ontology (schema) or entity (instance) matching requires many training examples in the form of manually curated, pre-labeled matching and non-matching schema-concept or entity pairs, which are hard to obtain. Along similar lines, an analytics system without predictive capabilities about the impending workload can incur huge querying latencies, while leaving the onus of understanding the underlying database schema and writing a meaningful query at every step of a data exploration session on the user. In this dissertation, I describe the human-in-the-loop Machine Learning (ML) systems that I have built for data integration and predictive analytics. I alleviate the need for extensive prior labeling by utilizing active learning (AL) for data integration. In each AL iteration, I detect the unlabeled entity or schema-concept pairs that would most strengthen the ML classifier and selectively query the human oracle for those labels in a budgeted fashion. Thus, I make use of human assistance for ML-based data integration. On the other hand, when the human is an end user exploring data through Online Analytical Processing (OLAP) queries, my goal is to proactively assist them by predicting the top-K next queries they are likely to be interested in. I describe my proposed SQL predictor, a Business Intelligence (BI) query predictor, and a geospatial query cardinality estimator, with an emphasis on schema abstraction, query representation, and how I adapt the ML models for these tasks. For each system, I discuss the evaluation metrics and how the proposed systems compare to state-of-the-art baselines on multiple datasets and query workloads.
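The AL iteration described above can be sketched as a standard uncertainty-sampling loop. This is a generic sketch using scikit-learn's LogisticRegression as the classifier; the feature vectors, the `oracle` callable standing in for the human, and the budget handling are illustrative assumptions, not the dissertation's exact method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, seed_idx, budget):
    """Label the `budget` pairs the classifier is least certain about.

    X_pool: feature vectors for candidate entity/schema-concept pairs.
    oracle: callable idx -> 0/1 label (stands in for the human oracle).
    seed_idx: indices of an initial labeled seed (must cover both classes)."""
    labeled = {i: oracle(i) for i in seed_idx}
    clf = LogisticRegression()
    while len(labeled) - len(seed_idx) < budget:
        idx = sorted(labeled)
        clf.fit(X_pool[idx], [labeled[i] for i in idx])
        probs = clf.predict_proba(X_pool)[:, 1]
        # Uncertainty = closeness to the 0.5 decision boundary.
        uncertainty = -np.abs(probs - 0.5)
        for c in np.argsort(uncertainty)[::-1]:
            if c not in labeled:
                labeled[c] = oracle(c)      # query the human oracle
                break
    idx = sorted(labeled)
    clf.fit(X_pool[idx], [labeled[i] for i in idx])
    return clf, labeled

# Toy pool: one feature per pair; the assumed true label is feature > 5.
X_pool = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
oracle = lambda i: int(X_pool[i, 0] > 5)
clf, labeled = active_learning_loop(X_pool, oracle, seed_idx=[0, 6], budget=2)
```

Each pass retrains on everything labeled so far and spends one unit of budget on the most ambiguous unlabeled pair, which is the budgeted selective-querying pattern described above.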
For my creative project, I designed a website through which smaller, local tournament registration and management are possible. Users can register for tournaments through the registration page. Registered users can verify that their registration was successful by viewing a table of all competitors. If the list of competitors is too long, they can filter the results based on search criteria. Tournament management is made possible via a functioning timer, following WKF rules, that keeps track of both the match's score and time.