Matching Items (3)
152514-Thumbnail Image.png
Description
As the size and scope of valuable datasets has exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms

As the size and scope of valuable datasets has exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms which are capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools to address the shortcomings of the previous generation. One important example of this is the improved performance in the newer tools with the large class of machine learning algorithms which are highly iterative in nature. In this thesis project, I set about to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance processing the Netflix Prize dataset. I begin by describing several approaches which I attempted, but which did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach which uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogoal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes, with various degrees of parallelism. For these experiments, I utilized the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance. In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm.
ContributorsKrouse, Brian (Author) / Ye, Jieping (Thesis advisor) / Liu, Huan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created2014
153213-Thumbnail Image.png
Description
The processing of large volumes of RDF data require an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational databases management systems. But as the

The processing of large volumes of RDF data require an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational databases management systems. But as the volume of RDF data grew to exponential proportions, the limitations of these systems became apparent and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and half years, however, heavy users of big data systems, like Facebook, noted limitations with the query performance of these big data systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook's Presto is one such example.

This thesis deals with evaluating the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four and eight node Linux clusters installed on Microsoft Windows Azure platform with RDF datasets of size 10, 20, and 30 million triples. The results of the experiment show that Presto has a much higher performance than Hive can be used to process big RDF data. The thesis also proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data.
ContributorsMammo, Mulugeta (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)
Created2014
150189-Thumbnail Image.png
Description
This thesis research attempts to observe, measure and visualize the communication patterns among developers of an open source community and analyze how this can be inferred in terms of progress of that open source project. Here I attempted to analyze the Ubuntu open source project's email data (9 subproject log

This thesis research attempts to observe, measure and visualize the communication patterns among developers of an open source community and analyze how this can be inferred in terms of progress of that open source project. Here I attempted to analyze the Ubuntu open source project's email data (9 subproject log archives over a period of five years) and focused on drawing more precise metrics from different perspectives of the communication data. Also, I attempted to overcome the scalability issue by using Apache Pig libraries, which run on a MapReduce framework based Hadoop Cluster. I described four metrics based on which I observed and analyzed the data and also presented the results which show the required patterns and anomalies to better understand and infer the communication. Also described the usage experience with Pig Latin (scripting language of Apache Pig Libraries) for this research and how they brought the feature of scalability, simplicity, and visibility in this data intensive research work. These approaches are useful in project monitoring, to augment human observation and reporting, in social network analysis, to track individual contributions.
ContributorsMotamarri, Lakshminarayana (Author) / Santanam, Raghu (Thesis advisor) / Ye, Jieping (Thesis advisor) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created2011