Search Content

Distributed SPARQL over big RDF data: a comparative analysis using Presto and MapReduce

Description

The processing of large volumes of RDF data require an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational databases management systems. But as the…

The processing of large volumes of RDF data require an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational databases management systems. But as the volume of RDF data grew to exponential proportions, the limitations of these systems became apparent and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and half years, however, heavy users of big data systems, like Facebook, noted limitations with the query performance of these big data systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook's Presto is one such example.

This thesis deals with evaluating the performance of Presto in processing big RDF data against Apache Hive. A comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done on four and eight node Linux clusters installed on Microsoft Windows Azure platform with RDF datasets of size 10, 20, and 30 million triples. The results of the experiment show that Presto has a much higher performance than Hive can be used to process big RDF data. The thesis also proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data.

ContributorsMammo, Mulugeta (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)

Created2014

Smooth surfaces for video game development

Description

The video game graphics pipeline has traditionally rendered the scene using a polygonal approach. Advances in modern graphics hardware now allow the rendering of parametric methods. This thesis explores various smooth surface rendering methods that can be integrated into the video game graphics engine. Moving over to parametric or smooth…

The video game graphics pipeline has traditionally rendered the scene using a polygonal approach. Advances in modern graphics hardware now allow the rendering of parametric methods. This thesis explores various smooth surface rendering methods that can be integrated into the video game graphics engine. Moving over to parametric or smooth surfaces from the polygonal domain has its share of issues and there is an inherent need to address various rendering bottlenecks that could hamper such a move. The game engine needs to choose an appropriate method based on in-game characteristics of the objects; character and animated objects need more sophisticated methods whereas static objects could use simpler techniques. Scaling the polygon count over various hardware platforms becomes an important factor. Much control is needed over the tessellation levels, either imposed by the hardware limitations or by the application, to be able to adaptively render the mesh without significant loss in performance. This thesis explores several methods that would help game engine developers in making correct design choices by optimally balancing the trade-offs while rendering the scene using smooth surfaces. It proposes a novel technique for adaptive tessellation of triangular meshes that vastly improves speed and tessellation count. It develops an approximate method for rendering Loop subdivision surfaces on tessellation enabled hardware. A taxonomy and evaluation of the methods is provided and a unified rendering system that provides automatic level of detail by switching between the methods is proposed.

ContributorsAmresh, Ashish (Author) / Farin, Gerlad (Thesis advisor) / Razdan, Anshuman (Thesis advisor) / Wonka, Peter (Committee member) / Hansford, Dianne (Committee member) / Arizona State University (Publisher)

Created2011

Lighting prediction and simulation in large nighttime urban scenes

Description

Night vision goggles (NVGs) are widely used by helicopter pilots for flight missions at night, but the equipment can present visually confusing images especially in urban areas. A simulation tool with realistic nighttime urban images would help pilots practice and train for flight with NVGs. However, there is a lack…

Night vision goggles (NVGs) are widely used by helicopter pilots for flight missions at night, but the equipment can present visually confusing images especially in urban areas. A simulation tool with realistic nighttime urban images would help pilots practice and train for flight with NVGs. However, there is a lack of tools for visualizing urban areas at night. This is mainly due to difficulties in gathering the light system data, placing the light systems at suitable locations, and rendering millions of lights with complex light intensity distributions (LID). Unlike daytime images, a city can have millions of light sources at night, including street lights, illuminated signs, and light shed from building interiors through windows. In this paper, a Procedural Lighting tool (PL), which predicts the positions and properties of street lights, is presented. The PL tool is used to accomplish three aims: (1) to generate vector data layers for geographic information systems (GIS) with statistically estimated information on lighting designs for streets, as well as the locations, orientations, and models for millions of streetlights; (2) to generate geo-referenced raster data to suitable for use as light maps that cover a large scale urban area so that the effect of millions of street light can be accurately rendered at real time, and (3) to extend existing 3D models by generating detailed light-maps that can be used as UV-mapped textures to render the model. An interactive graphical user interface (GUI) for configuring and previewing lights from a Light System Database (LDB) is also presented. The GUI includes physically accurate information about LID and also the lights' spectral power distributions (SPDs) so that a light-map can be generated for use with any sensor if the sensors luminosity function is known. Finally, for areas where more detail is required, a tool has been developed for editing and visualizing light effects over a 3D building from many light sources including area lights and windows. The above components are integrated in the PL tool to produce a night time urban view for not only a large-scale area but also a detail of a city building.

ContributorsChuang, Chia-Yuan (Author) / Femiani, John (Thesis advisor) / Razdan, Anshuman (Committee member) / Amresh, Ashish (Committee member) / Arizona State University (Publisher)

Created2011

All Purpose Textual Data Information Extraction, Visualization and Querying

Description

Since the advent of the internet and even more after social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems are available but are generally too specific and often only extract certain kinds of information they deem necessary and…

Since the advent of the internet and even more after social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems are available but are generally too specific and often only extract certain kinds of information they deem necessary and extraction worthy. Using data visualization theory and fast, interactive querying methods, leaving out information might not really be necessary. This thesis explores textual data visualization techniques, intuitive querying, and a novel approach to all-purpose textual information extraction to encode large text corpus to improve human understanding of the information present in textual data.

This thesis presents a modified traversal algorithm on dependency parse output of text to extract all subject predicate object pairs from text while ensuring that no information is missed out. To support full scale, all-purpose information extraction from large text corpuses, a data preprocessing pipeline is recommended to be used before the extraction is run. The output format is designed specifically to fit on a node-edge-node model and form the building blocks of a network which makes understanding of the text and querying of information from corpus quick and intuitive. It attempts to reduce reading time and enhancing understanding of the text using interactive graph and timeline.

ContributorsHashmi, Syed Usama (Author) / Bansal, Ajay (Thesis advisor) / Bansal, Srividya (Committee member) / Gonzalez Sanchez, Javier (Committee member) / Arizona State University (Publisher)

Created2018

Template-Based Question Answering over Linked Data using Recursive Neural Networks

Description

The Semantic Web contains large amounts of related information in the form of knowledge graphs such as DBpedia. These knowledge graphs are typically enormous and are not easily accessible for users as they need specialized knowledge in query languages (such as SPARQL) as well as deep familiarity of the ontologies…

The Semantic Web contains large amounts of related information in the form of knowledge graphs such as DBpedia. These knowledge graphs are typically enormous and are not easily accessible for users as they need specialized knowledge in query languages (such as SPARQL) as well as deep familiarity of the ontologies used by these knowledge graphs. So, to make these knowledge graphs more accessible (even for non- experts) several question answering (QA) systems have been developed over the last decade. Due to the complexity of the task, several approaches have been undertaken that include techniques from natural language processing (NLP), information retrieval (IR), machine learning (ML) and the Semantic Web (SW). At a higher level, most question answering systems approach the question answering task as a conversion from the natural language question to its corresponding SPARQL query. These systems then utilize the query to retrieve the desired entities or literals. One approach to solve this problem, that is used by most systems today, is to apply deep syntactic and semantic analysis on the input question to derive the SPARQL query. This has resulted in the evolution of natural language processing pipelines that have common characteristics such as answer type detection, segmentation, phrase matching, part-of-speech-tagging, named entity recognition, named entity disambiguation, syntactic or dependency parsing, semantic role labeling, etc.

This has lead to NLP pipeline architectures that integrate components that solve a specific aspect of the problem and pass on the results to subsequent components for further processing eg: DBpedia Spotlight for named entity recognition, RelMatch for relational mapping, etc. A major drawback in this approach is error propagation that is a common problem in NLP. This can occur due to mistakes early on in the pipeline that can adversely affect successive steps further down the pipeline. Another approach is to use query templates either manually generated or extracted from existing benchmark datasets such as Question Answering over Linked Data (QALD) to generate the SPARQL queries that is basically a set of predefined queries with various slots that need to be filled. This approach potentially shifts the question answering problem into a classification task where the system needs to match the input question to the appropriate template (class label).

This thesis proposes a neural network approach to automatically learn and classify natural language questions into its corresponding template using recursive neural networks. An obvious advantage of using neural networks is the elimination for the need of laborious feature engineering that can be cumbersome and error prone. The input question would be encoded into a vector representation. The model will be trained and evaluated on the LC-QuAD Dataset (Large-scale Complex Question Answering Dataset). The dataset was created explicitly for machine learning based QA approaches for learning complex SPARQL queries. The dataset consists of 5000 questions along with their corresponding SPARQL queries over the DBpedia dataset spanning 5042 entities and 615 predicates. These queries were annotated based on 38 unique templates that the model will attempt to classify. The resulting model will be evaluated against both the LC-QuAD dataset and the Question Answering Over Linked Data (QALD-7) dataset.

The recursive neural network achieves template classification accuracy of 0.828 on the LC-QuAD dataset and an accuracy of 0.618 on the QALD-7 dataset. When the top-2 most likely templates were considered the model achieves an accuracy of 0.945 on the LC-QuAD dataset and 0.786 on the QALD-7 dataset.

After slot filling, the overall system achieves a macro F-score 0.419 on the LC- QuAD dataset and a macro F-score of 0.417 on the QALD-7 dataset.

ContributorsAthreya, Ram G (Author) / Bansal, Srividya (Thesis advisor) / Usbeck, Ricardo (Committee member) / Gary, Kevin (Committee member) / Arizona State University (Publisher)

Created2018

Analyzing the effect of an "open learner model" represented through a feedback system in a teachable agent system

Description

For this master's thesis, an open learner model is integrated with Quinn, a teachable robotic agent developed at Arizona State University. This system is represented as a feedback system, which aims to improve a student’s understanding of a subject. It also helps to understand the effect of the learner model…

For this master's thesis, an open learner model is integrated with Quinn, a teachable robotic agent developed at Arizona State University. This system is represented as a feedback system, which aims to improve a student’s understanding of a subject. It also helps to understand the effect of the learner model when it is represented by performance of the teachable agent. The feedback system represents performance of the teachable agent, and not of a student. Data in the feedback system is thus updated according to a student's understanding of the subject. This provides students an opportunity to enhance their understanding of a subject by analyzing their performance. To test the effectiveness of the feedback system, student understanding in two different conditions is analyzed. In the first condition a feedback report is not provided to the students, while in the second condition the feedback report is provided in the form of the agent’s performance.

ContributorsUpadhyay, Abha (Author) / Walker, Erin (Thesis advisor) / Nelson, Brian (Committee member) / Amresh, Ashish (Committee member) / Arizona State University (Publisher)

Created2016

Development and use of an iPad-based resuscitation code-blue sheet for improving resuscitation outcomes during intensive patient care

Description

The American Heart Association recommended in 1997 the data elements that should be collected from resuscitations in hospitals. (15) Currently, data documentation from resuscitation events in hospitals, termed ‘code blue’ events, utilizes a paper form, which is institution-specific. Problems with data capture and transcription exists, due to the challenges of…

The American Heart Association recommended in 1997 the data elements that should be collected from resuscitations in hospitals. (15) Currently, data documentation from resuscitation events in hospitals, termed ‘code blue’ events, utilizes a paper form, which is institution-specific. Problems with data capture and transcription exists, due to the challenges of dynamic documentation of patient, event and outcome variables as the code blue event unfolds.

This thesis is based on the hypothesis that an electronic version of code blue real-time data capture would lead to improved resuscitation data transcription, and enable clinicians to address deficiencies in quality of care. The primary goal of this thesis is to create an iOS based application, primarily designed for iPads, for code blue events at the Mayo Clinic Hospital. The secondary goal is to build an open-source software development framework for converting paper-based hospital protocols into digital format.

The tool created in this study enabled data documentation to be completed electronically rather than on paper for resuscitation outcomes. The tool was evaluated for usability with twenty nurses, the end-users, at Mayo Clinic in Phoenix, Arizona. The results showed the preference of users for the iPad application. Furthermore, a qualitative survey showed the clinicians perceived the electronic version to be more accurate and efficient than paper-based documentation, both of which are essential for an emergency code blue resuscitation procedure.

ContributorsBokhari, Wasif (Author) / Patel, Vimla L. (Thesis advisor) / Amresh, Ashish (Thesis advisor) / Nelson, Brian (Committee member) / Sen, Ayan (Committee member) / Arizona State University (Publisher)

Created2015

A study of text mining framework for automated classification of software requirements in enterprise systems

Description

Text Classification is a rapidly evolving area of Data Mining while Requirements Engineering is a less-explored area of Software Engineering which deals the process of defining, documenting and maintaining a software system's requirements. When researchers decided to blend these two streams in, there was research on automating the process of…

Text Classification is a rapidly evolving area of Data Mining while Requirements Engineering is a less-explored area of Software Engineering which deals the process of defining, documenting and maintaining a software system's requirements. When researchers decided to blend these two streams in, there was research on automating the process of classification of software requirements statements into categories easily comprehensible for developers for faster development and delivery, which till now was mostly done manually by software engineers - indeed a tedious job. However, most of the research was focused on classification of Non-functional requirements pertaining to intangible features such as security, reliability, quality and so on. It is indeed a challenging task to automatically classify functional requirements, those pertaining to how the system will function, especially those belonging to different and large enterprise systems. This requires exploitation of text mining capabilities. This thesis aims to investigate results of text classification applied on functional software requirements by creating a framework in R and making use of algorithms and techniques like k-nearest neighbors, support vector machine, and many others like boosting, bagging, maximum entropy, neural networks and random forests in an ensemble approach. The study was conducted by collecting and visualizing relevant enterprise data manually classified previously and subsequently used for training the model. Key components for training included frequency of terms in the documents and the level of cleanliness of data. The model was applied on test data and validated for analysis, by studying and comparing parameters like precision, recall and accuracy.

ContributorsSwadia, Japa (Author) / Ghazarian, Arbi (Thesis advisor) / Bansal, Srividya (Committee member) / Gaffar, Ashraf (Committee member) / Arizona State University (Publisher)

Created2016

An adaptive time reduction technique for video lectures

Description

Lecture videos are a widely used resource for learning. A simple way to create

videos is to record live lectures, but these videos end up being lengthy, include long

pauses and repetitive words making the viewing experience time consuming. While

pauses are useful in live learning environments where students take notes, I question

the…

Lecture videos are a widely used resource for learning. A simple way to create

videos is to record live lectures, but these videos end up being lengthy, include long

pauses and repetitive words making the viewing experience time consuming. While

pauses are useful in live learning environments where students take notes, I question

the value of pauses in video lectures. Techniques and algorithms that can shorten such

videos can have a huge impact in saving students’ time and reducing storage space.

I study this problem of shortening videos by removing long pauses and adaptively

modifying the playback rate by emphasizing the most important sections of the video

and its effect on the student community. The playback rate is designed in such a

way to play uneventful sections faster and significant sections slower. Important and

unimportant sections of a video are identified using textual analysis. I use an existing

speech-to-text algorithm to extract the transcript and apply latent semantic analysis

and standard information retrieval techniques to identify the relevant segments of

the video. I compute relevance scores of different segments and propose a variable

playback rate for each of these segments. The aim is to reduce the amount of time

students spend on passive learning while watching videos without harming their ability

to follow the lecture. I validate the approach by conducting a user study among

computer science students and measuring their engagement. The results indicate

no significant difference in their engagement when this method is compared to the

original unedited video.

ContributorsPurushothama Shenoy, Sreenivas (Author) / Amresh, Ashish (Thesis advisor) / Femiani, John (Committee member) / Walker, Erin (Committee member) / Arizona State University (Publisher)

Created2016

Land use and land cover classification using deep learning techniques

Description

Large datasets of sub-meter aerial imagery represented as orthophoto mosaics are widely available today, and these data sets may hold a great deal of untapped information. This imagery has a potential to locate several types of features; for example, forests, parking lots, airports, residential areas, or freeways in the imagery.…

Large datasets of sub-meter aerial imagery represented as orthophoto mosaics are widely available today, and these data sets may hold a great deal of untapped information. This imagery has a potential to locate several types of features; for example, forests, parking lots, airports, residential areas, or freeways in the imagery. However, the appearances of these things vary based on many things including the time that the image is captured, the sensor settings, processing done to rectify the image, and the geographical and cultural context of the region captured by the image. This thesis explores the use of deep convolutional neural networks to classify land use from very high spatial resolution (VHR), orthorectified, visible band multispectral imagery. Recent technological and commercial applications have driven the collection a massive amount of VHR images in the visible red, green, blue (RGB) spectral bands, this work explores the potential for deep learning algorithms to exploit this imagery for automatic land use/ land cover (LULC) classification. The benefits of automatic visible band VHR LULC classifications may include applications such as automatic change detection or mapping. Recent work has shown the potential of Deep Learning approaches for land use classification; however, this thesis improves on the state-of-the-art by applying additional dataset augmenting approaches that are well suited for geospatial data. Furthermore, the generalizability of the classifiers is tested by extensively evaluating the classifiers on unseen datasets and we present the accuracy levels of the classifier in order to show that the results actually generalize beyond the small benchmarks used in training. Deep networks have many parameters, and therefore they are often built with very large sets of labeled data. Suitably large datasets for LULC are not easy to come by, but techniques such as refinement learning allow networks trained for one task to be retrained to perform another recognition task. Contributions of this thesis include demonstrating that deep networks trained for image recognition in one task (ImageNet) can be efficiently transferred to remote sensing applications and perform as well or better than manually crafted classifiers without requiring massive training data sets. This is demonstrated on the UC Merced dataset, where 96% mean accuracy is achieved using a CNN (Convolutional Neural Network) and 5-fold cross validation. These results are further tested on unrelated VHR images at the same resolution as the training set.

ContributorsUba, Nagesh Kumar (Author) / Femiani, John (Thesis advisor) / Razdan, Anshuman (Committee member) / Amresh, Ashish (Committee member) / Arizona State University (Publisher)

Created2016

Filtering by