Matching Items (10)

Efficient processing of skyline queries on static data sources, data streams and incomplete datasets

Description

Skyline queries extract interesting points that are non-dominated and help paint the bigger picture of the data in question. They are valuable in many multi-criteria decision applications and are becoming a staple of decision support systems.

An assumption commonly made by many skyline algorithms is that a skyline query is applied to a single static data source or data stream. Unfortunately, this assumption does not hold in many applications in which a skyline query may involve attributes belonging to multiple data sources and requires a join operation to be performed before the skyline can be produced. Recently, various skyline-join algorithms have been proposed to address this problem in the context of static data sources. However, these algorithms suffer from several drawbacks: they often need to scan the data sources exhaustively to obtain the skyline-join results; moreover, the pruning techniques employed to eliminate tuples are largely based on expensive tuple-to-tuple comparisons. On the other hand, most data stream techniques focus on single stream skyline queries, thus rendering them unsuitable for skyline-join queries.

Another assumption typically made by most of the earlier skyline algorithms is that the data is complete and all skyline attribute values are available. Due to this constraint, these algorithms cannot be applied to incomplete data sources in which some of the attribute values are missing and are represented by NULL values. There exists a definition of dominance for incomplete data, but this leads to undesirable consequences such as non-transitive and cyclic dominance relations both of which are detrimental to skyline processing.

Based on the aforementioned observations, the main goal of the research described in this dissertation is the design and development of a framework of skyline operators that effectively handles three distinct types of skyline queries: 1) skyline-join queries on static data sources, 2) skyline-window-join queries over data streams, and 3) strata-skyline queries on incomplete datasets. This dissertation presents the unique challenges posed by these skyline queries and addresses the shortcomings of current skyline techniques by proposing efficient methods to tackle the added overhead in processing skyline queries on static data sources, data streams, and incomplete datasets.
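The cyclic-dominance problem on incomplete data mentioned above can be illustrated with a small sketch. It assumes one common convention that the abstract does not spell out: smaller values are preferred, and two tuples are compared only on the attributes where both values are known. The function names `dominates` and `skyline` are hypothetical and are not taken from the dissertation.

```python
from typing import Optional, Sequence

Point = Sequence[Optional[float]]


def dominates(p: Point, q: Point) -> bool:
    """Return True if p dominates q (assumption: smaller is better).

    Only attributes that are known (not None) in BOTH tuples are compared.
    """
    common = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not common:
        return False  # nothing to compare on, so no dominance either way
    no_worse = all(a <= b for a, b in common)
    strictly_better = any(a < b for a, b in common)
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> list:
    """Naive quadratic skyline: keep every point that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Complete data: (2, 2) is dominated by (1, 2); the other two are incomparable.
complete = [(1, 2), (2, 1), (2, 2)]
complete_skyline = skyline(complete)

# Incomplete data: a beats b on attribute 0, b beats c on attribute 1,
# and c beats a on attribute 2, so dominance forms a cycle.
a = (1, None, 2)
b = (2, 1, None)
c = (None, 2, 1)
incomplete_skyline = skyline([a, b, c])
```

Under this assumed definition every tuple in the incomplete example is dominated by another one, so the naive skyline comes out empty, which is the kind of undesirable consequence the abstract refers to.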

Contributors

Agent

Created

Date Created
  • 2014

Decentralized information search

Description

Our research focuses on finding answers through decentralized search for complex, imprecise queries (such as "Which is the best hair salon nearby?") in situations where there is a spatiotemporal constraint (say, the answer needs to be found within 15 minutes) associated with the query. In general, human networks are good at answering imprecise queries. We try to use the social network of a person to answer his or her query. Our research aims at designing a framework that exploits the user's social network in order to maximize the answers for a given query. Exploiting a user's social network poses several challenges. The major challenge is that the user's immediate social circle may not possess the answer to the given query, and hence the framework needs to carry out the query diffusion process across the network. The next challenge involves finding the right set of seeds in the user's social circle to pass the query to. Another challenge is to incentivize people in the social network to respond to the query and thereby maximize the quality and quantity of replies. Our proposed framework is a mobile application where an individual can either respond to the query or forward it to his or her friends. We simulated the query diffusion process in three types of graphs: Small World, Random and Preferential Attachment. Given a type of network and a particular query, we carried out the query diffusion by selecting seeds based on attributes of the seed. The main attributes are Topic relevance, Replying or Forwarding probability and Time to Respond. We found that there is a considerable increase in the number of replies attained, even without saturating the user's network, if we adopt an optimal seed selection process. We found the output of the optimal algorithm to be satisfactory, as the number of replies received at the interrogator's end was close to three times the number of neighbors an interrogator has.
We addressed the challenge of incentivizing people to respond by associating a particular number of points with each query asked, and awarding those points to the people involved in answering the query. Thus, we aim to design a mobile application based on our proposed framework so that it helps in maximizing the replies to the interrogator's query by diffusing the query across his or her social network.

Contributors

Agent

Created

Date Created
  • 2013

Locality sensitive indexing for efficient high-dimensional query answering in the presence of excluded regions

Description

Similarity search in high-dimensional spaces is popular for applications like image processing, time series, and genome data. In higher dimensions, the phenomenon of curse of dimensionality kills the effectiveness of most of the index structures, giving way to approximate methods like Locality Sensitive Hashing (LSH), to answer similarity searches. In addition to range searches and k-nearest neighbor searches, there is a need to answer negative queries formed by excluded regions, in high-dimensional data. Though there have been a slew of variants of LSH to improve efficiency, reduce storage, and provide better accuracies, none of the techniques are capable of answering queries in the presence of excluded regions.

This thesis provides a novel approach to handle such negative queries. This is achieved by creating a prefix based hierarchical index structure. First, the higher dimensional space is projected to a lower dimension space. Then, a one-dimensional ordering is developed, while retaining the hierarchical traits. The algorithm intelligently prunes the irrelevant candidates while answering queries in the presence of excluded regions. While naive LSH would need to filter out the negative query results from the main results, the new algorithm minimizes the need to fetch the redundant results in the first place. Experiment results show that this reduces post-processing cost thereby reducing the query processing time.

Contributors

Agent

Created

Date Created
  • 2016

Space adaptation techniques for preference oriented skyline processing

Description

Skyline queries are a well-established technique used in multi-criteria decision applications. There has been recent interest in the research community in computing skylines efficiently, but the problem of presenting a skyline that takes into account the preferences of the user is still open. Each user has varying interests in each attribute, and hence a "one size fits all" methodology might not satisfy all users. True user satisfaction can be obtained only when the skyline is tailored specifically to each user based on his or her preferences.

This research investigates the problem of preference-aware skyline processing, which consists of inferring the preferences of a user and computing a skyline specific to that user. This research proposes a model that transforms the data from a given space to a user preferential space where each attribute represents the preference of the user. This study proposes two techniques, "Preferential Skyline Processing" and "Latent Skyline Processing", to efficiently compute preference-aware skylines in the user preferential space. Finally, extensive experiments and performance analysis confirm the correctness of the recommendations and the ability of the proposed algorithms to outperform the naïve ones.

Contributors

Agent

Created

Date Created
  • 2014

An investigation of the cost and accuracy tradeoffs of supplanting AFDs with bayes network in query processing in the presence of incompleteness in autonomous databases

Description

As the information available to lay users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as Query Processing over Incomplete Autonomous Databases (QPIAD) aim to mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when there are tuples containing missing values for multiple correlated attributes. In this thesis, I present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. I learn this distribution in terms of Bayes networks. My approach involves mining/"learning" Bayes networks from a sample of the database, and using them to do both imputation (predict a missing value) and query rewriting (retrieve relevant results with incompleteness on the query-constrained attributes, when the data sources are autonomous). I present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayes networks do provide a significantly higher classification accuracy and (ii) the relevant possible answers retrieved by the queries reformulated using Bayes networks provide higher precision and recall than AFDs while keeping query processing costs manageable.

Contributors

Agent

Created

Date Created
  • 2011

Topic sensitive sourcerank: extending sourcerank for performing context-sensitive search over deep-web

Description

Source selection is one of the foremost challenges in searching the deep web. For a user query, source selection involves selecting a subset of deep-web sources expected to provide relevant answers to the user query. Existing source selection models employ query-similarity based local measures for assessing source quality. These local measures are necessary but not sufficient, as they are agnostic to source trustworthiness and result importance, which, given the autonomous and uncurated nature of the deep web, have become indispensable for searching it. SourceRank provides a global measure for assessing source quality based on source trustworthiness and result importance. SourceRank's effectiveness has been evaluated in single-topic deep-web environments. The goal of the thesis is to extend SourceRank to a multi-topic deep-web environment. Topic-sensitive SourceRank is introduced as an effective way of extending SourceRank to a deep-web environment containing a set of representative topics. In topic-sensitive SourceRank, multiple SourceRank vectors are created, each biased towards a representative topic. At query time, using the topic of the query keywords, a query-topic sensitive, composite SourceRank vector is computed as a linear combination of these pre-computed biased SourceRank vectors. Extensive experiments on more than a thousand sources in multiple domains show 18-85% improvements in result quality over Google Product Search and other existing methods.
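The query-time step described above, a linear combination of pre-computed topic-biased vectors, can be sketched as follows. This is only an illustration: the function name `composite_rank`, the dictionary layout, the example scores, and the way query-topic weights are obtained are all assumptions, not details from the thesis.

```python
def composite_rank(topic_ranks, query_topic_weights):
    """Combine topic-biased rank vectors into one query-specific vector.

    topic_ranks: {topic: {source: score}}, assumed to be pre-computed offline.
    query_topic_weights: {topic: weight}, e.g. an estimated probability that
        the query belongs to each topic (how it is estimated is out of scope).
    Returns {source: weighted sum of that source's per-topic scores}.
    """
    combined = {}
    for topic, weight in query_topic_weights.items():
        for source, score in topic_ranks.get(topic, {}).items():
            combined[source] = combined.get(source, 0.0) + weight * score
    return combined


# Hypothetical numbers purely for illustration.
topic_ranks = {
    "books":   {"s1": 0.8, "s2": 0.2},
    "cameras": {"s1": 0.1, "s2": 0.9},
}
weights = {"books": 0.75, "cameras": 0.25}

scores = composite_rank(topic_ranks, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the biased vectors are computed ahead of time, the query-time cost in this sketch is just one weighted sum per source and topic.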

Contributors

Agent

Created

Date Created
  • 2011

Query expansion for handling exploratory and ambiguous keyword queries

Description

Query expansion is a functionality of search engines that suggests a set of related queries for a user-issued keyword query. In the case of exploratory or ambiguous keyword queries, the main goal of the user is to identify and select a specific category of query results among different categorical options, in order to narrow down the search and reach the desired result. Typical corpus-driven keyword query expansion approaches return popular words in the results as expanded queries. These empirical methods fail to cover all semantics of the categories present in the query results. More importantly, these methods do not consider the semantic relationship between the keywords featured in an expanded query. Contrary to a normal keyword search setting, these factors are non-trivial in an exploratory and ambiguous query setting, where the user's precise discernment of the different categories present in the query results is more important for making subsequent search decisions. In this thesis, I propose a new framework for keyword query expansion: generating a set of queries that correspond to the categorization of the original query results, which is referred to as categorizing query expansion. Two classes of algorithms are proposed. The first performs clustering as a pre-processing step and then generates categorizing expanded queries based on the clusters. The second handles the case of generating quality expanded queries in the presence of imperfect clusters.

Contributors

Agent

Created

Date Created
  • 2011

CPR complex pattern ranking for evaluating top-k pattern queries over event streams

Description

Most existing approaches to complex event processing over streaming data rely on the assumption that the matches to the queries are rare and that the goal of the system is to identify these few matches within the incoming deluge of data. In many applications, such as stock market analysis and user credit card purchase pattern monitoring, however, the matches to the user queries are in fact plentiful and the system has to efficiently sift through these many matches to locate only the few most preferable ones. In this work, we propose a complex pattern ranking (CPR) framework for specifying top-k pattern queries over streaming data, present new algorithms to support top-k pattern queries in data streaming environments, and verify the effectiveness and efficiency of the proposed algorithms. The developed algorithms identify top-k matching results that satisfy both the patterns and additional criteria. To support real-time processing of the data streams, instead of computing top-k results from scratch for each time window, we maintain top-k results dynamically as new events arrive and old ones expire. We also develop new top-k join execution strategies that are able to adapt to changing situations (e.g., sorted and random access costs, join rates) without having to assume the a priori presence of data statistics. Experiments show significant improvements over existing approaches.
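The idea of maintaining top-k results incrementally, rather than recomputing them per window, can be sketched in a much simplified form. The sketch below assumes that matches arrive already scored and in non-decreasing timestamp order, that a match expires once it is at least `window` time units old, and that `k` is at least 1. It does no pattern matching, and the class name `SlidingTopK` is hypothetical; it is not the CPR algorithm itself.

```python
import bisect
from collections import deque


class SlidingTopK:
    """Keep the k highest-scoring matches inside a sliding time window."""

    def __init__(self, k, window):
        self.k = k              # assumed to be >= 1
        self.window = window
        self._by_time = deque()  # (timestamp, score, match), arrival order
        self._by_score = []      # (score, timestamp, match), ascending order

    def add(self, timestamp, score, match):
        """Insert a new match after evicting the ones that have expired."""
        self._expire(timestamp)
        self._by_time.append((timestamp, score, match))
        bisect.insort(self._by_score, (score, timestamp, match))

    def _expire(self, now):
        # Drop every match whose age has reached the window length.
        while self._by_time and self._by_time[0][0] <= now - self.window:
            ts, score, match = self._by_time.popleft()
            i = bisect.bisect_left(self._by_score, (score, ts, match))
            del self._by_score[i]

    def top(self):
        """Return the current top-k matches, best first."""
        return [m for _, _, m in reversed(self._by_score[-self.k:])]


w = SlidingTopK(k=2, window=10)
w.add(0, 5.0, "a")
w.add(3, 9.0, "b")
w.add(6, 7.0, "c")
first = w.top()
w.add(12, 1.0, "d")   # "a" (timestamp 0) expires here
second = w.top()
w.add(14, 8.0, "e")   # "b" (timestamp 3) expires here
third = w.top()
```

Each arrival touches only the expired entries and one sorted insertion, so the ranking is never rebuilt from scratch; a production system would likely replace the sorted list with a balanced tree or heap-based structure.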

Contributors

Agent

Created

Date Created
  • 2011

Querying for relevant people in online social networks

Description

Online social networks, including Twitter, have expanded in both scale and diversity of content, which has created significant challenges for the average user. These challenges include finding relevant information on a topic and building social ties with like-minded individuals. The fundamental question addressed by this thesis is whether an individual can leverage a social network to search for information that is relevant to him or her. We propose to answer this question by developing computational algorithms that analyze a user's social network. The features of the social network we analyze include the network topology and member communications of a specific user's social network. Determining the "social value" of one's contacts is a valuable outcome of this research. The algorithms we developed were tested on Twitter, which is an extremely popular social network. Twitter was chosen due to its popularity and because a majority of the communication artifacts on Twitter are publicly available. In this work, the social network of a user refers to the "following relationship" social network. Our algorithm is not specific to Twitter, and is applicable to other social networks where the network topology and communications are accessible. My approaches are as follows. For a user interested in using the system, I first determine the immediate social network of the user as well as the social contacts of each person in this network. Afterwards, I establish and extend the social network for each user. For each member of the social network, their tweet data are analyzed and represented using a word distribution. To accomplish this, I use WordNet, a popular lexical database, to determine semantic similarity between two words. My mechanism of search combines both communication distance between two users and social relationships to determine the search results. Additionally, I developed a search interface where a user can interactively query the system.
I conducted a preliminary user study to evaluate the quality and utility of my method and system against several baseline methods, including the default Twitter search. The experimental results from the user study indicate that my method is able to find relevant people and identify valuable contacts in one's social circle based on the query. The proposed system outperforms baseline methods in terms of standard information retrieval metrics.

Contributors

Agent

Created

Date Created
  • 2010

BioEve: user interface framework bridging IE and IR

Description

Continuous advancements in biomedical research have resulted in the production of vast amounts of scientific data and literature discussing them. The ultimate goal of computational biology is to translate these large amounts of data into actual knowledge of the complex biological processes and accurate life science models. The ability to rapidly and effectively survey the literature is necessary for the creation of large scale models of the relationships among biomedical entities as well as hypothesis generation to guide biomedical research. To reduce the effort and time spent in performing these activities, an intelligent search system is required. Even though many systems aid in navigating through this wide collection of documents, the vastness and depth of this information overload can be overwhelming. An automated extraction system coupled with a cognitive search and navigation service over these document collections would not only save time and effort, but also facilitate discovery of the unknown information implicitly conveyed in the texts. This thesis presents the different approaches used for large scale biomedical named entity recognition, and the challenges faced in each. It also proposes BioEve: an integrative framework to fuse a faceted search with information extraction to provide a search service that addresses the user's desire for "completeness" of the query results, not just the top-ranked ones. This information extraction system enables discovery of important semantic relationships between entities such as genes, diseases, drugs, and cell lines and events from biomedical text on MEDLINE, which is the largest publicly available database of the world's biomedical journal literature. It is an innovative search and discovery service that makes it easier to search, navigate and discover knowledge hidden in life sciences literature. To demonstrate the utility of this system, this thesis also details a prototype enterprise-quality search and discovery service that helps researchers with a guided step-by-step query refinement, by suggesting concepts enriched in intermediate results, and thereby facilitating the "discover more as you search" paradigm.

Contributors

Agent

Created

Date Created
  • 2010