Search Content

Distributed SPARQL over big RDF data: a comparative analysis using Presto and MapReduce

Description

The processing of large volumes of RDF data require an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational databases management systems. But as the…

The processing of large volumes of RDF data require an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational databases management systems. But as the volume of RDF data grew to exponential proportions, the limitations of these systems became apparent and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and half years, however, heavy users of big data systems, like Facebook, noted limitations with the query performance of these big data systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook's Presto is one such example.

This thesis deals with evaluating the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four and eight node Linux clusters installed on Microsoft Windows Azure platform with RDF datasets of size 10, 20, and 30 million triples. The results of the experiment show that Presto has a much higher performance than Hive can be used to process big RDF data. The thesis also proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data.

ContributorsMammo, Mulugeta (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)

Created2014

All Purpose Textual Data Information Extraction, Visualization and Querying

Description

Since the advent of the internet and even more after social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems are available but are generally too specific and often only extract certain kinds of information they deem necessary and…

Since the advent of the internet and even more after social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems are available but are generally too specific and often only extract certain kinds of information they deem necessary and extraction worthy. Using data visualization theory and fast, interactive querying methods, leaving out information might not really be necessary. This thesis explores textual data visualization techniques, intuitive querying, and a novel approach to all-purpose textual information extraction to encode large text corpus to improve human understanding of the information present in textual data.

This thesis presents a modified traversal algorithm on dependency parse output of text to extract all subject predicate object pairs from text while ensuring that no information is missed out. To support full scale, all-purpose information extraction from large text corpuses, a data preprocessing pipeline is recommended to be used before the extraction is run. The output format is designed specifically to fit on a node-edge-node model and form the building blocks of a network which makes understanding of the text and querying of information from corpus quick and intuitive. It attempts to reduce reading time and enhancing understanding of the text using interactive graph and timeline.

ContributorsHashmi, Syed Usama (Author) / Bansal, Ajay (Thesis advisor) / Bansal, Srividya (Committee member) / Gonzalez Sanchez, Javier (Committee member) / Arizona State University (Publisher)

Created2018

Template-Based Question Answering over Linked Data using Recursive Neural Networks

Description

The Semantic Web contains large amounts of related information in the form of knowledge graphs such as DBpedia. These knowledge graphs are typically enormous and are not easily accessible for users as they need specialized knowledge in query languages (such as SPARQL) as well as deep familiarity of the ontologies…

The Semantic Web contains large amounts of related information in the form of knowledge graphs such as DBpedia. These knowledge graphs are typically enormous and are not easily accessible for users as they need specialized knowledge in query languages (such as SPARQL) as well as deep familiarity of the ontologies used by these knowledge graphs. So, to make these knowledge graphs more accessible (even for non- experts) several question answering (QA) systems have been developed over the last decade. Due to the complexity of the task, several approaches have been undertaken that include techniques from natural language processing (NLP), information retrieval (IR), machine learning (ML) and the Semantic Web (SW). At a higher level, most question answering systems approach the question answering task as a conversion from the natural language question to its corresponding SPARQL query. These systems then utilize the query to retrieve the desired entities or literals. One approach to solve this problem, that is used by most systems today, is to apply deep syntactic and semantic analysis on the input question to derive the SPARQL query. This has resulted in the evolution of natural language processing pipelines that have common characteristics such as answer type detection, segmentation, phrase matching, part-of-speech-tagging, named entity recognition, named entity disambiguation, syntactic or dependency parsing, semantic role labeling, etc.

This has lead to NLP pipeline architectures that integrate components that solve a specific aspect of the problem and pass on the results to subsequent components for further processing eg: DBpedia Spotlight for named entity recognition, RelMatch for relational mapping, etc. A major drawback in this approach is error propagation that is a common problem in NLP. This can occur due to mistakes early on in the pipeline that can adversely affect successive steps further down the pipeline. Another approach is to use query templates either manually generated or extracted from existing benchmark datasets such as Question Answering over Linked Data (QALD) to generate the SPARQL queries that is basically a set of predefined queries with various slots that need to be filled. This approach potentially shifts the question answering problem into a classification task where the system needs to match the input question to the appropriate template (class label).

This thesis proposes a neural network approach to automatically learn and classify natural language questions into its corresponding template using recursive neural networks. An obvious advantage of using neural networks is the elimination for the need of laborious feature engineering that can be cumbersome and error prone. The input question would be encoded into a vector representation. The model will be trained and evaluated on the LC-QuAD Dataset (Large-scale Complex Question Answering Dataset). The dataset was created explicitly for machine learning based QA approaches for learning complex SPARQL queries. The dataset consists of 5000 questions along with their corresponding SPARQL queries over the DBpedia dataset spanning 5042 entities and 615 predicates. These queries were annotated based on 38 unique templates that the model will attempt to classify. The resulting model will be evaluated against both the LC-QuAD dataset and the Question Answering Over Linked Data (QALD-7) dataset.

The recursive neural network achieves template classification accuracy of 0.828 on the LC-QuAD dataset and an accuracy of 0.618 on the QALD-7 dataset. When the top-2 most likely templates were considered the model achieves an accuracy of 0.945 on the LC-QuAD dataset and 0.786 on the QALD-7 dataset.

After slot filling, the overall system achieves a macro F-score 0.419 on the LC- QuAD dataset and a macro F-score of 0.417 on the QALD-7 dataset.

ContributorsAthreya, Ram G (Author) / Bansal, Srividya (Thesis advisor) / Usbeck, Ricardo (Committee member) / Gary, Kevin (Committee member) / Arizona State University (Publisher)

Created2018

Helix: A First Game Retrospective

Description

The original version of Helix, the one I pitched when first deciding to make a video game
for my thesis, is an action-platformer, with the intent of metroidvania-style progression
and an interconnected world map.

The current version of Helix is a turn based role-playing game, with the intent of roguelike
gameplay and a dark…

The original version of Helix, the one I pitched when first deciding to make a video game
for my thesis, is an action-platformer, with the intent of metroidvania-style progression
and an interconnected world map.

The current version of Helix is a turn based role-playing game, with the intent of roguelike
gameplay and a dark fantasy theme. We will first be exploring the challenges that came
with programming my own game - not quite from scratch, but also without a prebuilt
engine - then transition into game design and how Helix has evolved from its original form
to what we see today.

ContributorsDiscipulo, Isaiah K (Author) / Meuth, Ryan (Thesis director) / Kobayashi, Yoshihiro (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

RecyclePlus: A Sustainability iOS Application

Description

RecyclePlus is an iOS mobile application that allows users to be knowledgeable in the realms of sustainability. It gives encourages users to be environmental responsible by providing them access to recycling information. In particular, it allows users to search up certain materials and learn about its recyclability and how to…

RecyclePlus is an iOS mobile application that allows users to be knowledgeable in the realms of sustainability. It gives encourages users to be environmental responsible by providing them access to recycling information. In particular, it allows users to search up certain materials and learn about its recyclability and how to properly dispose of the material. Some searches will show locations of facilities near users that collect certain materials and dispose of the materials properly. This is a full stack software project that explores open source software and APIs, UI/UX design, and iOS development.

ContributorsTran, Nikki (Author) / Ganesh, Tirupalavanam (Thesis director) / Meuth, Ryan (Committee member) / Watts College of Public Service & Community Solut (Contributor) / Department of Information Systems (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

Blockchain Design and Simulation

Description

This paper details the specification and implementation of a single-machine blockchain simulator. It also includes a brief introduction on the history & underlying concepts of blockchain, with explanations on features such as decentralization, openness, trustlessness, and consensus. The introduction features a brief overview of public interest and current implementations of…

This paper details the specification and implementation of a single-machine blockchain simulator. It also includes a brief introduction on the history & underlying concepts of blockchain, with explanations on features such as decentralization, openness, trustlessness, and consensus. The introduction features a brief overview of public interest and current implementations of blockchain before stating potential use cases for blockchain simulation software. The paper then gives a brief literature review of blockchain's role, both as a disruptive technology and a foundational technology. The literature review also addresses the potential and difficulties regarding the use of blockchain in Internet of Things (IoT) networks, and also describes the limitations of blockchain in general regarding computational intensity, storage capacity, and network architecture. Next, the paper gives the specification for a generic blockchain structure, with summaries on the behaviors and purposes of transactions, blocks, nodes, miners, public & private key cryptography, signature validation, and hashing. Finally, the author gives an overview of their specific implementation of the blockchain using C/C++ and OpenSSL. The overview includes a brief description of all the classes and data structures involved in the implementation, including their function and behavior. While the implementation meets the requirements set forward in the specification, the results are more qualitative and intuitive, as time constraints did not allow for quantitative measurements of the network simulation. The paper concludes by discussing potential applications for the simulator, and the possibility for future hardware implementations of blockchain.

ContributorsRauschenbach, Timothy Rex (Author) / Vrudhula, Sarma (Thesis director) / Nakamura, Mutsumi (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2017-12

Fielding an Autonomous Cobot in a University Environment: Engineering and Evaluation

Description

Many researchers aspire to create robotics systems that assist humans in common office tasks, especially by taking over delivery and messaging tasks. For meaningful interactions to take place, a mobile robot must be able to identify the humans it interacts with and communicate successfully with them. It must also be…

Many researchers aspire to create robotics systems that assist humans in common office tasks, especially by taking over delivery and messaging tasks. For meaningful interactions to take place, a mobile robot must be able to identify the humans it interacts with and communicate successfully with them. It must also be able to successfully navigate the office environment. While mobile robots are well suited for navigating and interacting with elements inside a deterministic office environment, attempting to interact with human beings in an office environment remains a challenge due to the limits on the amount of cost-efficient compute power onboard the robot. In this work, I propose the use of remote cloud services to offload intensive interaction tasks. I detail the interactions required in an office environment and discuss the challenges faced when implementing a human-robot interaction platform in a stochastic office environment. I also experiment with cloud services for facial recognition, speech recognition, and environment navigation and discuss my results. As part of my thesis, I have implemented a human-robot interaction system utilizing cloud APIs into a mobile robot, enabling it to navigate the office environment, identify humans within the environment, and communicate with these humans.

ContributorsDSouza, Daniel Anand (Author) / Kambhampati, Subbarao (Thesis director) / Zhang, Yu (Committee member) / Computer Science and Engineering Program (Contributor) / School of Computing, Informatics, and Decision Systems Engineering (Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Leye (Lie) Detector \u2014 A Study of Lie Detection using Eye Tracking, Facial Gestures, and EEG

Description

Lie detection is used prominently in contemporary society for many purposes such as for pre-employment screenings, granting security clearances, and determining if criminals or potential subjects may or may not be lying, but by no means is not limited to that scope. However, lie detection has been criticized for being…

Lie detection is used prominently in contemporary society for many purposes such as for pre-employment screenings, granting security clearances, and determining if criminals or potential subjects may or may not be lying, but by no means is not limited to that scope. However, lie detection has been criticized for being subjective, unreliable, inaccurate, and susceptible to deliberate manipulation. Furthermore, critics also believe that the administrator of the test also influences the outcome as well. As a result, the polygraph machine, the contemporary device used for lie detection, has come under scrutiny when used as evidence in the courts. The purpose of this study is to use three entirely different tools and concepts to determine whether eye tracking systems, electroencephalogram (EEG), and Facial Expression Emotion Analysis (FACET) are reliable tools for lie detection. This study found that certain constructs such as where the left eye is looking at in regard to its usual position and engagement levels in eye tracking and EEG respectively could distinguish between truths and lies. However, the FACET proved the most reliable tool out of the three by providing not just one distinguishing variable but seven, all related to emotions derived from movements in the facial muscles during the present study. The emotions associated with the FACET that were documented to possess the ability to distinguish between truthful and lying responses were joy, anger, fear, confusion, and frustration. In addition, an overall measure of the subject's neutral and positive emotional expression were found to be distinctive factors. The implications of this study and future directions are discussed.

ContributorsSeto, Raymond Hua (Author) / Atkinson, Robert (Thesis director) / Runger, George (Committee member) / W. P. Carey School of Business (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Visual Analytics and the Impact of Inter-Country Trade on Violence

Description

Global violent conflict has become an increasing problem in recent decades, especially in the African continent. Civil wars, terrorism, riots, and political violence has wrought havoc not only on civilian lives, but also on economic foundations. Trade networks are a way to measure these economic foundations. To summarize trade networks…

Global violent conflict has become an increasing problem in recent decades, especially in the African continent. Civil wars, terrorism, riots, and political violence has wrought havoc not only on civilian lives, but also on economic foundations. Trade networks are a way to measure these economic foundations. To summarize trade networks clustering coefficient as well as trade quantity/value summation measures are used. To understand effects of global trade on violent conflict, Pearson product-moment correlations are utilized. This work details a comparison of African national economies and violent conflict events using clustering coefficient, trade summation measures and Pearson correlation coefficient.

ContributorsKadambi, Sagarika Sanjay (Author) / Maciejewski, Ross (Thesis director) / Shutters, Shade (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Convolutional Neural Networks for Facial Expression Recognition

Description

This paper presents work that was done to create a system capable of facial expression recognition (FER) using deep convolutional neural networks (CNNs) and test multiple configurations and methods. CNNs are able to extract powerful information about an image using multiple layers of generic feature detectors. The extracted information can…

This paper presents work that was done to create a system capable of facial expression recognition (FER) using deep convolutional neural networks (CNNs) and test multiple configurations and methods. CNNs are able to extract powerful information about an image using multiple layers of generic feature detectors. The extracted information can be used to understand the image better through recognizing different features present within the image. Deep CNNs, however, require training sets that can be larger than a million pictures in order to fine tune their feature detectors. For the case of facial expression datasets, none of these large datasets are available. Due to this limited availability of data required to train a new CNN, the idea of using naïve domain adaptation is explored. Instead of creating and using a new CNN trained specifically to extract features related to FER, a previously trained CNN originally trained for another computer vision task is used. Work for this research involved creating a system that can run a CNN, can extract feature vectors from the CNN, and can classify these extracted features. Once this system was built, different aspects of the system were tested and tuned. These aspects include the pre-trained CNN that was used, the layer from which features were extracted, normalization used on input images, and training data for the classifier. Once properly tuned, the created system returned results more accurate than previous attempts on facial expression recognition. Based on these positive results, naïve domain adaptation is shown to successfully leverage advantages of deep CNNs for facial expression recognition.

ContributorsEusebio, Jose Miguel Ang (Author) / Panchanathan, Sethuraman (Thesis director) / McDaniel, Troy (Committee member) / Venkateswara, Hemanth (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2016-05

Filtering by