Distributed SPARQL over big RDF data: a comparative analysis using Presto and MapReduce

Description

The processing of large volumes of RDF data requires an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational database management systems. But as the volume of RDF data grew exponentially, the limitations of these systems became apparent and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and a half years, however, heavy users of big data systems, like Facebook, noted limitations with the query performance of these big data systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook's Presto is one such example.

This thesis evaluates the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance of Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS, while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four- and eight-node Linux clusters installed on the Microsoft Windows Azure platform with RDF datasets of 10, 20, and 30 million triples. The results of the experiment show that Presto performs much better than Hive and can be used to process big RDF data. The thesis also proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data.

Contributors: Mammo, Mulugeta (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)

Created: 2014

Data Analytics to Identify the Genetic Basis for Resilience to Temperature Stress in Soybeans

Description

This paper explores the ability to predict soybean yields based on genetic and environmental factors. Based on the biology of soybeans, it has been shown that yields are best when soybeans grow within a certain temperature range. An event in which a soybean is exposed to temperatures outside its accepted range is labeled as an instance of stress. Currently, there are few models that use genetic information to predict how crops may respond to stress. Using data provided by an agricultural business, a model was developed that can categorically label soybean varieties by their yield response to stress using genetic data. The model clusters varieties based on their yield production in response to stress. The clustering criteria are based on variance distribution and correlation. A logistic regression is then fitted to identify significant gene markers in varieties with minimal yield variance. Such characteristics provide a probabilistic outlook of how certain varieties will perform when planted in different regions. Given changing global climate conditions, this model demonstrates the potential of using data to efficiently develop and grow crops adapted to climate change.

Contributors: Dean, Arlen (Co-author) / Ozcan, Ozkan (Co-author) / Travis, Daniel (Co-author) / Gel, Esma (Thesis director) / Armbruster, Dieter (Committee member) / Parry, Sam (Committee member) / Industrial, Systems and Operations Engineering Program (Contributor) / Department of Information Systems (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05