Search Content

Client-driven dynamic database updates

Description

This thesis addresses the problem of online schema updates where the goal is to be able to update relational database schemas without reducing the database system's availability. Unlike some other work in this area, this thesis presents an approach which is completely client-driven and does not require specialized database management…

This thesis addresses the problem of online schema updates where the goal is to be able to update relational database schemas without reducing the database system's availability. Unlike some other work in this area, this thesis presents an approach which is completely client-driven and does not require specialized database management systems (DBMS). Also, unlike other client-driven work, this approach provides support for a richer set of schema updates including vertical split (normalization), horizontal split, vertical and horizontal merge (union), difference and intersection. The update process automatically generates a runtime update client from a mapping between the old the new schemas. The solution has been validated by testing it on a relatively small database of around 300,000 records per table and less than 1 Gb, but with limited memory buffer size of 24 Mb. This thesis presents the study of the overhead of the update process as a function of the transaction rates and the batch size used to copy data from the old to the new schema. It shows that the overhead introduced is minimal for medium size applications and that the update can be achieved with no more than one minute of downtime.

ContributorsTyagi, Preetika (Author) / Bazzi, Rida (Thesis advisor) / Candan, Kasim S (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)

Created2011

Time efficient and quality effective K nearest neighbor search in high dimension space

Description

K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains such as database and data mining, information retrieval, machine learning, pattern recognition and plagiarism detection. Locality sensitive hash (LSH) is so far the most practical approximate KNN search algorithm for high dimensional data. Algorithms such as Multi-Probe LSH and…

K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains such as database and data mining, information retrieval, machine learning, pattern recognition and plagiarism detection. Locality sensitive hash (LSH) is so far the most practical approximate KNN search algorithm for high dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon the basic LSH algorithm by varying hash bucket size dynamically at query time, so these two algorithms can answer different KNN queries adaptively. However, these two algorithms need a data access post-processing step after candidates' collection in order to get the final answer to the KNN query. In this thesis, Multi-Probe LSH with data access post-processing (Multi-Probe LSH with DAPP) algorithm and LSH-Forest with data access post-processing (LSH-Forest with DAPP) algorithm are improved by replacing the costly data access post-processing (DAPP) step with a much faster histogram-based post-processing (HBPP). Two HBPP algorithms: LSH-Forest with HBPP and Multi- Probe LSH with HBPP are presented in this thesis, both of them achieve the three goals for KNN search in large scale high dimensional data set: high search quality, high time efficiency, high space efficiency. None of the previous KNN algorithms can achieve all three goals. More specifically, it is shown that HBPP algorithms can always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) with much less time cost (one to several orders of magnitude speedup) and same memory usage. It is also shown that with almost same time cost and memory usage, HBPP algorithms can always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP). Moreover, to achieve a very high search quality, Multi-Probe with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size and dimension number of the data set.

ContributorsYu, Renwei (Author) / Candan, Kasim S (Thesis advisor) / Sapino, Maria L (Committee member) / Chen, Yi (Committee member) / Sundaram, Hari (Committee member) / Arizona State University (Publisher)

Created2011

Enhancing the usability of complex structured data by supporting keyword searches

Description

As pointed out in the keynote speech by H. V. Jagadish in SIGMOD'07, and also commonly agreed in the database community, the usability of structured data by casual users is as important as the data management systems' functionalities. A major hardness of using structured data is the problem of easily…

As pointed out in the keynote speech by H. V. Jagadish in SIGMOD'07, and also commonly agreed in the database community, the usability of structured data by casual users is as important as the data management systems' functionalities. A major hardness of using structured data is the problem of easily retrieving information from them given a user's information needs. Learning and using a structured query language (e.g., SQL and XQuery) is overwhelmingly burdensome for most users, as not only are these languages sophisticated, but the users need to know the data schema. Keyword search provides us with opportunities to conveniently access structured data and potentially significantly enhances the usability of structured data. However, processing keyword search on structured data is challenging due to various types of ambiguities such as structural ambiguity (keyword queries have no structure), keyword ambiguity (the keywords may not be accurate), user preference ambiguity (the user may have implicit preferences that are not indicated in the query), as well as the efficiency challenges due to large search space. This dissertation performs an expansive study on keyword search processing techniques as a gateway for users to access structured data and retrieve desired information. The key issues addressed include: (1) Resolving structural ambiguities in keyword queries by generating meaningful query results, which involves identifying relevant keyword matches, identifying return information, composing query results based on relevant matches and return information. (2) Resolving structural, keyword and user preference ambiguities through result analysis, including snippet generation, result differentiation, result clustering, result summarization/query expansion, etc. (3) Resolving the efficiency challenge in processing keyword search on structured data by utilizing and efficiently maintaining materialized views. These works deliver significant technical contributions towards building a full-fledged search engine for structured data.

ContributorsLiu, Ziyang (Author) / Chen, Yi (Thesis advisor) / Candan, Kasim S (Committee member) / Davulcu, Hasan (Committee member) / Jagadish, H V (Committee member) / Arizona State University (Publisher)

Created2011

An information diffusion approach to detecting emotional contagion in online social networks

Description

Internet sites that support user-generated content, so-called Web 2.0, have become part of the fabric of everyday life in technologically advanced nations. Users collectively spend billions of hours consuming and creating content on social networking sites, weblogs (blogs), and various other types of sites in the United States and around…

Internet sites that support user-generated content, so-called Web 2.0, have become part of the fabric of everyday life in technologically advanced nations. Users collectively spend billions of hours consuming and creating content on social networking sites, weblogs (blogs), and various other types of sites in the United States and around the world. Given the fundamentally emotional nature of humans and the amount of emotional content that appears in Web 2.0 content, it is important to understand how such websites can affect the emotions of users. This work attempts to determine whether emotion spreads through an online social network (OSN). To this end, a method is devised that employs a model based on a general threshold diffusion model as a classifier to predict the propagation of emotion between users and their friends in an OSN by way of mood-labeled blog entries. The model generalizes existing information diffusion models in that the state machine representation of a node is generalized from being binary to having n-states in order to support n class labels necessary to model emotional contagion. In the absence of ground truth, the prediction accuracy of the model is benchmarked with a baseline method that predicts the majority label of a user's emotion label distribution. The model significantly outperforms the baseline method in terms of prediction accuracy. The experimental results make a strong case for the existence of emotional contagion in OSNs in spite of possible alternative arguments such confounding influence and homophily, since these alternatives are likely to have negligible effect in a large dataset or simply do not apply to the domain of human emotions. A hybrid manual/automated method to map mood-labeled blog entries to a set of emotion labels is also presented, which enables the application of the model to a large set (approximately 900K) of blog entries from LiveJournal.

ContributorsCole, William David, M.S (Author) / Liu, Huan (Thesis advisor) / Sarjoughian, Hessam S. (Committee member) / Candan, Kasim S (Committee member) / Arizona State University (Publisher)

Created2011

Digital Fountain for Multi-node Aggregation of Data in Blockchains

Description

Blockchain scalability is one of the issues that concerns its current adopters. The current popular blockchains have initially been designed with imperfections that in- troduce fundamental bottlenecks which limit their ability to have a higher throughput and a lower latency.

One of the major bottlenecks for existing blockchain technologies is fast…

Blockchain scalability is one of the issues that concerns its current adopters. The current popular blockchains have initially been designed with imperfections that in- troduce fundamental bottlenecks which limit their ability to have a higher throughput and a lower latency.

One of the major bottlenecks for existing blockchain technologies is fast block propagation. A faster block propagation enables a miner to reach a majority of the network within a time constraint and therefore leading to a lower orphan rate and better profitability. In order to attain a throughput that could compete with the current state of the art transaction processing, while also keeping the block intervals same as today, a 24.3 Gigabyte block will be required every 10 minutes with an average transaction size of 500 bytes, which translates to 48600000 transactions every 10 minutes or about 81000 transactions per second.

In order to synchronize such large blocks faster across the network while maintain- ing consensus by keeping the orphan rate below 50%, the thesis proposes to aggregate partial block data from multiple nodes using digital fountain codes. The advantages of using a fountain code is that all connected peers can send part of data in an encoded form. When the receiving peer has enough data, it then decodes the information to reconstruct the block. Along with them sending only part information, the data can be relayed over UDP, instead of TCP, improving upon the speed of propagation in the current blockchains. Fountain codes applied in this research are Raptor codes, which allow construction of infinite decoding symbols. The research, when applied to blockchains, increases success rate of block delivery on decode failures.

ContributorsChawla, Nakul (Author) / Boscovic, Dragan (Thesis advisor) / Candan, Kasim S (Thesis advisor) / Zhao, Ming (Committee member) / Arizona State University (Publisher)

Created2018

Query Workload-Aware Index Structures for Range Searches in 1D, 2D, and High-Dimensional Spaces

Description

Most current database management systems are optimized for single query execution.

Yet, often, queries come as part of a query workload. Therefore, there is a need

for index structures that can take into consideration existence of multiple queries in a

query workload and efficiently produce accurate results for the entire query workload.

These index…

Most current database management systems are optimized for single query execution.

Yet, often, queries come as part of a query workload. Therefore, there is a need

for index structures that can take into consideration existence of multiple queries in a

query workload and efficiently produce accurate results for the entire query workload.

These index structures should be scalable to handle large amounts of data as well as

large query workloads.

The main objective of this dissertation is to create and design scalable index structures

that are optimized for range query workloads. Range queries are an important

type of queries with wide-ranging applications. There are no existing index structures

that are optimized for efficient execution of range query workloads. There are

also unique challenges that need to be addressed for range queries in 1D, 2D, and

high-dimensional spaces. In this work, I introduce novel cost models, index selection

algorithms, and storage mechanisms that can tackle these challenges and efficiently

process a given range query workload in 1D, 2D, and high-dimensional spaces. In particular,

I introduce the index structures, HCS (for 1D spaces), cSHB (for 2D spaces),

and PSLSH (for high-dimensional spaces) that are designed specifically to efficiently

handle range query workload and the unique challenges arising from their respective

spaces. I experimentally show the effectiveness of the above proposed index structures

by comparing with state-of-the-art techniques.

ContributorsNagarkar, Parth (Author) / Candan, Kasim S (Thesis advisor) / Davulcu, Hasan (Committee member) / Sapino, Maria Luisa (Committee member) / Sarwat, Mohamed (Committee member) / Arizona State University (Publisher)

Created2017

A Framework for Top-k queries over weighted RDF graphs

Description

The Resource Description Framework (RDF) is a specification that aims to support the conceptual modeling of metadata or information about resources in the form of a directed graph composed of triples of knowledge (facts). RDF also provides mechanisms to encode meta-information (such as source, trust, and certainty) about facts already…

The Resource Description Framework (RDF) is a specification that aims to support the conceptual modeling of metadata or information about resources in the form of a directed graph composed of triples of knowledge (facts). RDF also provides mechanisms to encode meta-information (such as source, trust, and certainty) about facts already existing in a knowledge base through a process called reification. In this thesis, an extension to the current RDF specification is proposed in order to enhance RDF triples with an application specific weight (cost). Unlike reification, this extension treats these additional weights as first class knowledge attributes in the RDF model, which can be leveraged by the underlying query engine. Additionally, current RDF query languages, such as SPARQL, have a limited expressive power which limits the capabilities of applications that use them. Plus, even in the presence of language extensions, current RDF stores could not provide methods and tools to process extended queries in an efficient and effective way. To overcome these limitations, a set of novel primitives for the SPARQL language is proposed to express Top-k queries using traditional query patterns as well as novel predicates inspired by those from the XPath language. Plus, an extended query processor engine is developed to support efficient ranked path search, join, and indexing. In addition, several query optimization strategies are proposed, which employ heuristics, advanced indexing tools, and two graph metrics: proximity and sub-result inter-arrival time. These strategies aim to find join orders that reduce the total query execution time while avoiding worst-case pattern combinations. Finally, extensive experimental evaluation shows that using these two metrics in query optimization has a significant impact on the performance and efficiency of Top-k queries. Further experiments also show that proximity and inter-arrival have an even greater, although sometimes undesirable, impact when combined through aggregation functions. Based on these results, a hybrid algorithm is proposed which acknowledges that proximity is more important than inter-arrival time, due to its more complete nature, and performs a fine-grained combination of both metrics by analyzing the differences between their individual scores and performing the aggregation only if these differences are negligible.

ContributorsCedeno, Juan Pablo (Author) / Candan, Kasim S (Thesis advisor) / Chen, Yi (Committee member) / Sapino, Maria L (Committee member) / Arizona State University (Publisher)

Created2010

Multi-Variant Spatially Informed Rapid Testing for Epidemic Model

Description

The COVID-19 outbreak that started in 2020, brought the world to its knees and is still a menace after three years. Over eighty-five million cases and over a million deaths have occurred due to COVID-19 during that time in the United States alone. A great deal of research has gone…

The COVID-19 outbreak that started in 2020, brought the world to its knees and is still a menace after three years. Over eighty-five million cases and over a million deaths have occurred due to COVID-19 during that time in the United States alone. A great deal of research has gone into making epidemic models to show the impact of the virus by plotting the cases, deaths, and hospitalization due to COVID-19. However, there is very less research that has anything to do with mapping different variants of COVID-19. SARS-CoV-2, the virus that causes COVID-19, constantly mutates and multiple variants have emerged over time. The major variants include Beta, Gamma, Delta and the recent one, Omicron. The purpose of the research done in this thesis is to modify one of the epidemic models i.e., the Spatially Informed Rapid Testing for Epidemic Model (SIRTEM), in such a way that various variants of the virus will be modelled at the same time. The model will be assessed by adding the Omicron and the Delta variants and in doing so, the effects of different variants can be studied by looking at the positive cases, hospitalizations, and deaths from both the variants for the Arizona Population. The focus will be to find the best infection rate and testing rate by using Random numbers so that the published positive cases and the positive cases derived from the model have the least mean square error.

ContributorsVarghese, Allen Moncey (Author) / Pedrielli, Giulia (Thesis advisor) / Candan, Kasim S (Committee member) / Wu, Teresa (Committee member) / Arizona State University (Publisher)

Created2022

Enabling Peer to Peer Energy Trading Marketplace Using Consortium Blockchain Networks

Description

Blockchain technology enables peer-to-peer transactions through the elimination of the need for a centralized entity governing consensus. Rather than having a centralized database, the data is distributed across multiple computers which enables crash fault tolerance as well as makes the system difficult to tamper with due to a distributed consensus…

Blockchain technology enables peer-to-peer transactions through the elimination of the need for a centralized entity governing consensus. Rather than having a centralized database, the data is distributed across multiple computers which enables crash fault tolerance as well as makes the system difficult to tamper with due to a distributed consensus algorithm.

In this research, the potential of blockchain technology to manage energy transactions is examined. The energy production landscape is being reshaped by distributed energy resources (DERs): photo-voltaic panels, electric vehicles, smart appliances, and battery storage. Distributed energy sources such as microgrids, household solar installations, community solar installations, and plug-in hybrid vehicles enable energy consumers to act as providers of energy themselves, hence acting as 'prosumers' of energy.

Blockchain Technology facilitates managing the transactions between involved prosumers using 'Smart Contracts' by tokenizing energy into assets. Better utilization of grid assets lowers costs and also presents the opportunity to buy energy at a reasonable price while staying connected with the utility company. This technology acts as a backbone for 2 models applicable to transactional energy marketplace viz. 'Real-Time Energy Marketplace' and 'Energy Futures'. In the first model, the prosumers are given a choice to bid for a price for energy within a stipulated period of time, while the Utility Company acts as an operating entity. In the second model, the marketplace is more liberal, where the utility company is not involved as an operator. The Utility company facilitates infrastructure and manages accounts for all users, but does not endorse or govern transactions related to energy bidding. These smart contracts are not time bounded and can be suspended by the utility during periods of network instability.

ContributorsSadaye, Raj Anil (Author) / Candan, Kasim S (Thesis advisor) / Boscovic, Dragan (Committee member) / Zhao, Ming (Committee member) / Arizona State University (Publisher)

Created2019

MedFabric4Me: Blockchain Based Patient Centric Electronic Health Records System

Description

Blockchain technology enables a distributed and decentralized environment without any central authority. Healthcare is one industry in which blockchain is expected to have significant impacts. In recent years, the Healthcare Information Exchange(HIE) has been shown to benefit the healthcare industry remarkably. It has been shown that blockchain could hel…

Blockchain technology enables a distributed and decentralized environment without any central authority. Healthcare is one industry in which blockchain is expected to have significant impacts. In recent years, the Healthcare Information Exchange(HIE) has been shown to benefit the healthcare industry remarkably. It has been shown that blockchain could help to improve multiple aspects of the HIE system.

When Blockchain technology meets HIE, there are only a few proposed systems and they all suffer from the following two problems. First, the existing systems are not patient-centric in terms of data governance. Patients do not own their data and have no direct control over it. Second, there is no defined protocol among different systems on how to share sensitive data.

To address the issues mentioned above, this paper proposes MedFabric4Me, a blockchain-based platform for HIE. MedFabric4Me is a patient-centric system where patients own their healthcare data and share on a need-to-know basis. First, analyzed the requirements for a patient-centric system which ensures tamper-proof sharing of data among participants. Based on the analysis, a Merkle root based mechanism is created to ensure that data has not tampered. Second, a distributed Proxy re-encryption system is used for secure encryption of data during storage and sharing of records. Third, combining off-chain storage and on-chain access management for both authenticability and privacy.

MedFabric4Me is a two-pronged solution platform, composed of on-chain and off-chain components. The on-chain solution is implemented on the secure network of Hyperledger Fabric(HLF) while the off-chain solution uses Interplanetary File System(IPFS) to store data securely. Ethereum based Nucypher, a proxy re-encryption network provides cryptographic access controls to actors for encrypted data sharing.

To demonstrate the practicality and scalability, a prototype solution of MedFabric4Me is implemented and evaluated the performance measure of the system against an already implemented HIE.

Results show that decentralization technology like blockchain could help to mitigate some issues that HIE faces today, like transparency for patients, slow emergency response, and better access control.

Finally, this research concluded with the benefits and shortcomings of MedFabric4Me with some directions and work that could benefit MedFabric4Me in terms of operation and performance.

ContributorsVishnoi, Manish (Author) / Boscovic, Dragan (Thesis advisor) / Candan, Kasim S (Thesis advisor) / Grando, Maria (Committee member) / Arizona State University (Publisher)

Created2020