Distributed SPARQL over big RDF data: a comparative analysis using Presto and MapReduce

Description

Processing large volumes of RDF data requires an efficient storage and query processing engine that scales well with the volume of data. Initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational database management systems. But as the volume of RDF data grew exponentially, the limitations of these systems became apparent, and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and a half years, however, heavy users of big data systems, like Facebook, noted limitations in the query performance of these systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook's Presto is one such example.

This thesis evaluates the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance of Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS, while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four- and eight-node Linux clusters installed on the Microsoft Windows Azure platform with RDF datasets of 10, 20, and 30 million triples. The results of the experiment show that Presto has much higher performance than Hive and can be used to process big RDF data. The thesis also proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data.
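As a rough illustration of what translating SPARQL into SQL involves, the sketch below turns a basic graph pattern into a self-join over a single three-column table. It is not the thesis's Flex/Bison compiler: the function name `bgp_to_sql`, the `triples(s, p, o)` layout, and the sample data are all assumptions made for this example, and SQLite stands in for Presto or Hive only so that the query can be run locally.

```python
import sqlite3

def bgp_to_sql(patterns, select_vars, table="triples"):
    """Translate (s, p, o) triple patterns into one SQL self-join.

    Terms starting with '?' are variables; anything else is a constant.
    Hypothetical helper: assumes all triples live in one table (s, p, o).
    """
    columns = ("s", "p", "o")
    bindings = {}    # variable -> first column reference that binds it
    conditions = []  # WHERE clauses, in the same order as params
    params = []
    for i, pattern in enumerate(patterns):
        for col, term in zip(columns, pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in bindings:
                    # A repeated variable becomes a join condition.
                    conditions.append(f"{ref} = {bindings[term]}")
                else:
                    bindings[term] = ref
            else:
                # A constant becomes a filter with a bound parameter.
                conditions.append(f"{ref} = ?")
                params.append(term)
    select = ", ".join(f"{bindings[v]} AS {v[1:]}" for v in select_vars)
    tables = ", ".join(f"{table} t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params

# SPARQL: SELECT ?x ?c WHERE { ?x type Student . ?x takes ?c }
sql, params = bgp_to_sql(
    [("?x", "type", "Student"), ("?x", "takes", "?c")], ["?x", "?c"]
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("alice", "type", "Student"),
        ("alice", "takes", "db101"),
        ("bob", "type", "Professor"),
        ("bob", "takes", "db101"),
    ],
)
rows = conn.execute(sql, params).fetchall()
```

Each triple pattern contributes one alias of the table, shared variables become equality joins, and constants become filters. A real translator would also need to handle prefixes, literals, OPTIONAL, FILTER, and the SQL dialect of the target engine.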

Contributors: Mammo, Mulugeta (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)

Created: 2014

MOOCLink: linking and maintaining quality of data provided by various MOOC providers

Description

The concept of Linked Data is gaining widespread popularity and importance. Linked Data is a method of publishing and linking structured data on the web. Its emergence has made it possible to make sense of huge volumes of data scattered all over the web and to link multiple heterogeneous sources. This leads to the challenge of maintaining the quality of Linked Data, i.e., ensuring that outdated data is removed and new data is included. The focus of this thesis is devising strategies to effectively integrate data from multiple sources, publish it as Linked Data, and maintain the quality of Linked Data. The domain used in the study is online education. With so many online courses offered by Massive Open Online Course (MOOC) providers, it is becoming increasingly difficult for an end user to gauge which course best fits his/her needs.

Users are spoiled for choice. It would be much easier for them to choose if there were a single place where they could visually compare the offerings of various MOOC providers for the course they are interested in. Previous work in this area was done through the MOOCLink project, which involved integrating data from Coursera, EdX, and Udacity and generating linked data, i.e., Resource Description Framework (RDF) triples.

The research objective of this thesis is to determine a methodology by which the quality of data available through the MOOCLink application is maintained, as new courses are constantly being added and old courses removed by data providers. This thesis presents the integration of data from various MOOC providers and algorithms for incrementally updating linked data to maintain its quality, and compares them against a naïve approach, in order to keep users engaged with up-to-date data. A master threshold value was determined through experiments and analysis that quantifies when one algorithm is better than the other in terms of time efficiency. An evaluation of the tool shows the effectiveness of the algorithms presented in this thesis.
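The abstract does not describe the update algorithms themselves, so the following is only a minimal sketch of the general contrast between a naïve full reload and an incremental update of a triple store. The function names and the course data are hypothetical, and the store is modeled as a plain Python set of (subject, predicate, object) tuples.

```python
def diff_triples(old, new):
    """Return (added, removed) triples between two snapshots."""
    old, new = set(old), set(new)
    return new - old, old - new

def update_naive(store, new_snapshot):
    """Naive approach: discard everything and reload the full snapshot."""
    store.clear()
    store.update(new_snapshot)
    return store

def update_incremental(store, old_snapshot, new_snapshot):
    """Incremental approach: touch only the triples that changed."""
    added, removed = diff_triples(old_snapshot, new_snapshot)
    store.difference_update(removed)
    store.update(added)
    return store

# Hypothetical course data: one course is dropped, one is added.
old = [
    ("course:ml", "provider", "Coursera"),
    ("course:db", "provider", "Udacity"),
]
new = [
    ("course:ml", "provider", "Coursera"),
    ("course:ai", "provider", "EdX"),
]

added, removed = diff_triples(old, new)
incremental_store = update_incremental(set(old), old, new)
naive_store = update_naive(set(old), new)
```

Both approaches end in the same state; they differ in how much work they do. The incremental path writes only the changed triples but pays for computing the difference, which is the kind of trade-off a threshold on the amount of change would capture.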

Contributors: Dhekne, Chinmay (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Sohoni, Sohum (Committee member) / Arizona State University (Publisher)

Created: 2016

A semantic framework for integrating and publishing linked data on the Web

Description

The semantic web is the web of data that provides a common framework and technologies for sharing and reusing data in various applications. In semantic web terminology, linked data is the term used to describe a method of exposing and connecting data on the web from different sources. The purpose of linked data and the semantic web is to publish data in an open and standard format and to link this data with existing data on the Linked Open Data Cloud. The goal of this thesis is to develop a semantic framework for integrating and publishing linked data on the web. Traditionally, integrating data from multiple sources involves an Extract-Transform-Load (ETL) framework to generate datasets for analytics and visualization. The thesis proposes introducing a semantic component in the ETL framework to semi-automate the generation and publishing of linked data. Various existing ETL tools and data integration techniques have been analyzed and their deficiencies identified. This thesis proposes a set of requirements for the semantic ETL framework, derived by conducting a manual process to integrate data from various sources such as weather, holidays, airports, and flight arrivals, departures, and delays. The research questions addressed are: (i) to what extent can the integration, generation, and publishing of linked data to the cloud using a semantic ETL framework be automated; (ii) does the use of semantic technologies produce a richer data model and integrated data. Details of the methodology, data collection, and an application that uses the generated linked data are presented. Evaluation is done by comparing the traditional data integration approach with the semantic ETL approach in terms of the effort involved in integration, the data model generated, and querying the data generated.

Contributors: Padki, Aparna (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)

Created: 2016

Distributed RDF Storage and Querying Using In-Memory Processing Engine

Description

The proliferation of semantic data in the form of RDF (Resource Description Framework) triples demands efficient, scalable, and distributed storage along with a highly available and fault-tolerant parallel processing strategy. There are three open issues with distributed RDF data management systems that existing work does not address together. The first is querying efficiency; the second is that solutions are optimized for certain types of query patterns and don’t necessarily work well for all types; and the third is reducing pre-processing cost. Therefore, the rapid growth of RDF data raises the need for an efficient partitioning strategy over distributed data management systems to improve SPARQL (SPARQL Protocol and RDF Query Language) query performance regardless of pattern shape and with minimal pre-processing overhead. In this context, the first contribution of this work is a distributed RDF data partitioning schema called 3CStore that extends the existing VP (Vertical Partitioning) approach by using a subset of triples from the VP tables based on different join correlations. This approach speeds up queries at the cost of additional pre-processing overhead. To address this, a relational partitioning schema called VPExp was developed by splitting predicates based on explicit type information of objects. This approach yields a significant query performance gain only for the specific type of query where the object is bound to a value for a particular predicate. To get efficient query performance on a wide range of query patterns, an improved solution is proposed that extends the existing Property Table approach to a Subset-Property Table and combines it with the VP approach.

Further investigation of distributed RDF processing and querying systems based on typical use cases led to a novel relational partitioning schema called PTP (Property Table Partitioning), which further partitions the whole Property Table by the number of unique properties to minimize query input size and join operations during query evaluation. Finally, an RDF data management system based on the SPARQL-over-SQL approach, called S3QLRDF, is developed; it generates the optimal query execution plan using statistics of the PTP tables to provide efficient SPARQL query processing on a distributed system.
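For readers unfamiliar with the two baseline layouts that this abstract builds on, the sketch below shows plain Vertical Partitioning and a plain Property Table over a handful of made-up triples. It does not implement 3CStore, VPExp, the Subset-Property Table, or PTP; the function names and the data are assumptions for illustration only.

```python
from collections import defaultdict

def vertical_partition(triples):
    """One two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

def property_table(triples):
    """One wide row per subject, with one column per predicate.

    Values are lists because a subject may have several objects
    for the same predicate.
    """
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s].setdefault(p, []).append(o)
    return dict(rows)

# Made-up sample data.
triples = [
    ("alice", "type", "Student"),
    ("alice", "takes", "db101"),
    ("alice", "takes", "ml201"),
    ("bob", "type", "Professor"),
]

vp = vertical_partition(triples)
pt = property_table(triples)
```

With Vertical Partitioning, a query on a single predicate reads only that predicate's table, but a query touching several predicates of the same subject needs one join per predicate. A Property Table answers such subject-centered queries from one row at the cost of wide, sparse tables.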

Contributors: Hassan, P M Mahmudul Mahmudul (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Davulcu, Hasan (Committee member) / Sarwat Abdelghany Aly Elsayed, Mohamed (Committee member) / Arizona State University (Publisher)

Created: 2021

ASU Electronic Theses and Dissertations
