Matching Items (67)

Description

The Semantic Web contains large amounts of related information in the form of knowledge graphs such as DBpedia. These knowledge graphs are typically enormous and are not easily accessible to users, as they require specialized knowledge of query languages (such as SPARQL) as well as deep familiarity with the ontologies used by these knowledge graphs. To make these knowledge graphs more accessible, even to non-experts, several question answering (QA) systems have been developed over the last decade. Due to the complexity of the task, these systems draw on techniques from natural language processing (NLP), information retrieval (IR), machine learning (ML), and the Semantic Web (SW). At a high level, most question answering systems treat the task as a conversion from a natural language question to its corresponding SPARQL query, and then use that query to retrieve the desired entities or literals. One approach, used by most systems today, is to apply deep syntactic and semantic analysis to the input question to derive the SPARQL query. This has resulted in natural language processing pipelines with common characteristics such as answer type detection, segmentation, phrase matching, part-of-speech tagging, named entity recognition, named entity disambiguation, syntactic or dependency parsing, and semantic role labeling.

This has led to NLP pipeline architectures in which each component solves a specific aspect of the problem and passes its results to subsequent components for further processing, e.g., DBpedia Spotlight for named entity recognition and RelMatch for relational mapping. A major drawback of this approach is error propagation, a common problem in NLP: mistakes early in the pipeline can adversely affect all of the steps that follow. Another approach is to use query templates, either manually written or extracted from existing benchmark datasets such as Question Answering over Linked Data (QALD), to generate the SPARQL queries. A template is essentially a predefined query with slots that need to be filled. This approach turns question answering into a classification task in which the system must match the input question to the appropriate template (class label).
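To make the template idea concrete, the sketch below shows a slot-based SPARQL skeleton and a naive fill step. This is only an illustration: the template text, slot names, and DBpedia resources are hypothetical examples, not taken from the LC-QuAD or QALD templates.

```python
# Illustrative only: a hypothetical slot-based SPARQL template and a naive fill step.
TEMPLATE = """
SELECT DISTINCT ?answer WHERE {{
  <{entity}> <{predicate}> ?answer .
}}
"""

# Hypothetical slot values for a question such as "Who is the mayor of Paris?"
slots = {
    "entity": "http://dbpedia.org/resource/Paris",
    "predicate": "http://dbpedia.org/ontology/mayor",
}

query = TEMPLATE.format(**slots)
print(query)
```

Once the template (class label) is predicted, only the entity and predicate slots remain to be resolved against the knowledge graph.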

This thesis proposes a neural network approach that uses recursive neural networks to automatically learn to classify natural language questions into their corresponding templates. An obvious advantage of neural networks is that they eliminate the need for laborious feature engineering, which can be cumbersome and error prone. The input question is encoded into a vector representation, and the model is trained and evaluated on the LC-QuAD dataset (Large-scale Complex Question Answering Dataset), which was created explicitly for machine learning based QA approaches to learning complex SPARQL queries. The dataset consists of 5,000 questions along with their corresponding SPARQL queries over DBpedia, spanning 5,042 entities and 615 predicates. These queries were annotated with 38 unique templates, which the model attempts to classify. The resulting model is evaluated against both the LC-QuAD dataset and the Question Answering over Linked Data (QALD-7) dataset.
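As a rough sketch of the classification step, the minimal PyTorch model below encodes a tokenized question into a vector and scores the 38 template classes. Note the hedge: the thesis uses a recursive (tree-structured) network, whereas this stand-in uses a plain recurrent (GRU) encoder, and all layer sizes, the vocabulary size, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_TEMPLATES = 38  # number of LC-QuAD template classes

class QuestionTemplateClassifier(nn.Module):
    """Encode a tokenized question into a vector and classify its template.

    A recurrent (GRU) stand-in for illustration; the thesis itself uses a
    recursive (tree-structured) neural network over the parse of the question.
    """
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, NUM_TEMPLATES)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        _, hidden = self.encoder(embedded)          # hidden: (1, batch, hidden_dim)
        return self.classifier(hidden.squeeze(0))   # (batch, NUM_TEMPLATES)

# Toy usage with a hypothetical vocabulary of 5,000 word indices.
model = QuestionTemplateClassifier(vocab_size=5000)
question = torch.randint(0, 5000, (1, 12))          # one question of 12 tokens
logits = model(question)
top2 = logits.topk(2, dim=-1).indices               # the two most likely templates
```

The top-2 prediction mirrors the evaluation reported below, where the correct template only needs to appear among the two highest-scoring classes.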

The recursive neural network achieves a template classification accuracy of 0.828 on the LC-QuAD dataset and 0.618 on the QALD-7 dataset. When the top-2 most likely templates are considered, the model achieves an accuracy of 0.945 on the LC-QuAD dataset and 0.786 on the QALD-7 dataset.

After slot filling, the overall system achieves a macro F-score of 0.419 on the LC-QuAD dataset and a macro F-score of 0.417 on the QALD-7 dataset.
Contributors: Athreya, Ram G (Author) / Bansal, Srividya (Thesis advisor) / Usbeck, Ricardo (Committee member) / Gary, Kevin (Committee member) / Arizona State University (Publisher)
Created: 2018
Description

Since the advent of the internet, and even more so since the rise of social media platforms, the explosive growth and availability of textual data have made analysis a tedious task. Information extraction systems exist, but they are generally too specific and often extract only the kinds of information they deem necessary and worth extracting. With data visualization theory and fast, interactive querying methods, leaving information out may not be necessary at all. This thesis explores textual data visualization techniques, intuitive querying, and a novel approach to all-purpose textual information extraction that encodes large text corpora to improve human understanding of the information present in textual data.

This thesis presents a modified traversal algorithm over the dependency parse output of text to extract all subject-predicate-object pairs while ensuring that no information is missed. To support full-scale, all-purpose information extraction from large text corpora, a data preprocessing pipeline is recommended before the extraction is run. The output format is designed specifically to fit a node-edge-node model and to form the building blocks of a network, which makes understanding the text and querying information from the corpus quick and intuitive. The approach attempts to reduce reading time and enhance understanding of the text through an interactive graph and timeline.
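A rough sketch of the core idea, extracting subject-predicate-object pairs from a dependency parse with spaCy, is shown below. The traversal here is a deliberately simplified illustration, not the modified algorithm described in the thesis, which also handles constructions (passives, conjunctions, prepositional objects, etc.) that this naive version ignores.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_spo(sentence):
    """Naive subject-predicate-object extraction from a dependency parse.

    Simplified illustration: a full-coverage extractor must also handle
    passives, conjunctions, prepositional objects, and clausal complements.
    """
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.lefts if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.rights if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_spo("The model extracts triples from raw text."))
# e.g. [('model', 'extract', 'triples')]
```

Each extracted triple maps directly onto the node-edge-node model: subject and object become nodes, and the predicate becomes the edge connecting them.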
Contributors: Hashmi, Syed Usama (Author) / Bansal, Ajay (Thesis advisor) / Bansal, Srividya (Committee member) / Gonzalez Sanchez, Javier (Committee member) / Arizona State University (Publisher)
Created: 2018
Description

There has been a vast increase in applications of Unmanned Aerial Vehicles (UAVs) in civilian domains. To operate in civilian airspace, a UAV must be able to sense and avoid both static and moving obstacles for flight safety. While indoor and low-altitude environments are mainly occupied by static obstacles, risks at higher altitudes primarily come from moving obstacles such as other aircraft or flying vehicles in the airspace. The ability to avoid moving obstacles therefore becomes a necessity for Unmanned Aerial Vehicles.

Toward enabling a UAV to autonomously sense and avoid moving obstacles, this thesis makes the following contributions. First, an image-based reactive motion planner is developed for a quadrotor to avoid a fast-approaching obstacle. Second, a Dubins curve based geometric method is developed as a global path planner for a fixed-wing UAV to avoid collisions with other aircraft. The image-based method cannot produce an optimal path, and the geometric method uses a simplified UAV model. To compensate for these two disadvantages, a series of algorithms built upon the Closed-Loop Rapidly-exploring Random Tree are developed as global path planners that generate collision avoidance paths in real time. The algorithms are validated in Software-In-the-Loop (SITL) and Hardware-In-the-Loop (HIL) simulations using a fixed-wing UAV model, and in real flight experiments using quadrotors. The algorithms are observed to enable a UAV to avoid moving obstacles approaching it from different directions and at different speeds.
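For readers unfamiliar with the underlying planner, the compact sketch below shows the basic RRT (Rapidly-exploring Random Tree) expansion loop in 2D with one static circular obstacle. It is only an illustration of the sampling-based idea: the closed-loop variant used in the thesis additionally forward-simulates the vehicle's closed-loop dynamics and replans against moving obstacles, which is omitted here, and all parameters below are illustrative assumptions.

```python
import math
import random

# Minimal 2D RRT sketch: grow a tree from START toward GOAL while avoiding
# one circular obstacle. Parameters are illustrative assumptions.
START, GOAL = (0.0, 0.0), (10.0, 10.0)
OBSTACLE, RADIUS = (5.0, 5.0), 1.5
STEP, GOAL_TOL, MAX_ITERS = 0.5, 0.5, 5000

def collision_free(p):
    return math.dist(p, OBSTACLE) > RADIUS

def steer(near, sample):
    d = math.dist(near, sample)
    t = min(1.0, STEP / d) if d > 0 else 0.0
    return (near[0] + t * (sample[0] - near[0]),
            near[1] + t * (sample[1] - near[1]))

def rrt():
    nodes, parent = [START], {START: None}
    for _ in range(MAX_ITERS):
        sample = (random.uniform(0, 10), random.uniform(0, 10))
        nearest = min(nodes, key=lambda n: math.dist(n, sample))
        new = steer(nearest, sample)
        if not collision_free(new):
            continue
        nodes.append(new)
        parent[new] = nearest
        if math.dist(new, GOAL) < GOAL_TOL:
            path, node = [], new
            while node is not None:          # walk back to the start
                path.append(node)
                node = parent[node]
            return list(reversed(path))
    return None

print(rrt())
```

In the closed-loop setting, each candidate edge would be checked by simulating the controller-plus-vehicle response rather than by a straight-line collision test as above.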
Contributors: Lin, Yucong (Author) / Saripalli, Srikanth (Thesis advisor) / Scowen, Paul (Committee member) / Fainekos, Georgios (Committee member) / Thangavelautham, Jekanthan (Committee member) / Youngbull, Cody (Committee member) / Arizona State University (Publisher)
Created: 2015
Description

Most embedded applications are constructed with multiple threads to handle concurrent events. For optimization and debugging of such programs, dynamic program analysis is widely used to collect execution information while the program is running. Unfortunately, the non-deterministic behavior of multithreaded embedded software makes dynamic analysis difficult. In addition, the instrumentation overhead of gathering execution information may change the execution of a program and lead to distorted analysis results, i.e., the probe effect. This thesis presents a framework that tackles the non-determinism and probe effect incurred in dynamic analysis of embedded software. The thesis consists largely of three parts. First, it presents a deterministic replay framework that provides reproducible execution: once a program execution is recorded, software instrumentation can be safely applied during replay without the probe effect. Second, it analyzes the probe effect and proposes a simulation-based analysis that detects execution changes of a program caused by instrumentation overhead; the analysis examines whether the recording instrumentation changes the original program execution. Lastly, the thesis discusses data race detection algorithms that help remove data races, which is necessary for the correctness of the replay and the simulation-based analysis. The focus is on making the detection efficient for C/C++ programs and on increasing its scalability on multi-core machines.
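As a toy illustration of dynamic data race detection, the sketch below applies a simplified lockset check (in the spirit of the classic Eraser algorithm) to a recorded access trace. The trace, addresses, and lock names are invented for the example, and the thesis's detectors for C/C++ programs and multi-core scalability are considerably more involved than this.

```python
# Simplified lockset-style race check over a recorded access trace.
# Each event: (thread_id, address, set_of_locks_held). Illustrative only.
trace = [
    ("T1", 0x100, {"L1"}),
    ("T2", 0x100, {"L1"}),   # consistent: both accesses hold L1
    ("T1", 0x200, {"L1"}),
    ("T2", 0x200, set()),    # candidate race: no common lock protects 0x200
]

def lockset_check(events):
    """Report addresses accessed by multiple threads with an empty common lockset."""
    candidate, owners, races = {}, {}, set()
    for tid, addr, locks in events:
        owners.setdefault(addr, set()).add(tid)
        candidate[addr] = candidate.get(addr, set(locks)) & locks
        if not candidate[addr] and len(owners[addr]) > 1:
            races.add(addr)
    return races

print(lockset_check(trace))  # {512}, i.e. address 0x200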
Contributors: Song, Young Wn (Author) / Lee, Yann-Hang (Thesis advisor) / Shrivastava, Aviral (Committee member) / Fainekos, Georgios (Committee member) / Lee, Joohyung (Committee member) / Arizona State University (Publisher)
Created: 2015
Description

Cisco estimates that by 2020, 50 billion devices will be connected to the Internet, yet 99% of things today remain isolated and unconnected. Differing connectivity protocols, proprietary access, varied device characteristics, and security concerns are the main reasons for this isolation. This project aims at designing and building a prototype gateway that exposes a simple and intuitive RESTful HTTP interface to access and manipulate devices and the data they produce, while addressing most of the issues listed above. Along with manipulating devices, the framework exposes sensor data in a way that can be used to create applications such as rules or events that make the home smarter. It also allows the user to represent high-level knowledge by aggregating low-level sensor data. This high-level representation can be considered a property of the environment or object rather than of the sensor itself, which makes interpreting the values more intuitive and accessible.
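A minimal sketch of such a gateway using Flask is shown below. The device names, the in-memory registry, and the endpoint paths are illustrative assumptions; a real gateway would bridge these endpoints to the devices' native connectivity protocols rather than serve a Python dictionary.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory device registry standing in for protocol adapters
# (ZigBee, Z-Wave, BLE, ...) that a real gateway would translate to.
devices = {
    "living-room-lamp": {"type": "switch", "state": "off"},
    "hall-thermostat": {"type": "sensor", "temperature_c": 21.5},
}

@app.route("/devices", methods=["GET"])
def list_devices():
    return jsonify(devices)

@app.route("/devices/<name>", methods=["GET"])
def read_device(name):
    device = devices.get(name)
    return (jsonify(device), 200) if device else (jsonify({"error": "unknown device"}), 404)

@app.route("/devices/<name>", methods=["PUT"])
def update_device(name):
    if name not in devices:
        return jsonify({"error": "unknown device"}), 404
    devices[name].update(request.get_json(force=True))
    return jsonify(devices[name])

if __name__ == "__main__":
    app.run(port=8080)
```

With this sketch running, a client could, for example, turn the lamp on with `curl -X PUT -d '{"state": "on"}' http://localhost:8080/devices/living-room-lamp`, and a rules engine could poll the thermostat endpoint to trigger events.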
Contributors: Nair, Shankar (Author) / Lee, Yann-Hang (Thesis advisor) / Lee, Joohyung (Committee member) / Fainekos, Georgios (Committee member) / Arizona State University (Publisher)
Created: 2015
Description

The Semantic Web is the web of data; it provides a common framework and technologies for sharing and reusing data across applications. In Semantic Web terminology, linked data describes a method of exposing and connecting data on the web from different sources. The purpose of linked data and the Semantic Web is to publish data in an open, standard format and to link it with existing data on the Linked Open Data Cloud. The goal of this thesis is to develop a semantic framework for integrating and publishing linked data on the web. Integrating data from multiple sources traditionally involves an Extract-Transform-Load (ETL) framework to generate datasets for analytics and visualization. This thesis proposes introducing a semantic component into the ETL framework to semi-automate the generation and publishing of linked data. Various existing ETL tools and data integration techniques are analyzed and their deficiencies identified. A set of requirements for the semantic ETL framework is derived by manually integrating data from sources such as weather, holidays, airports, and flight arrivals, departures, and delays. The research questions addressed are: (i) to what extent can the integration, generation, and publishing of linked data to the cloud using a semantic ETL framework be automated; and (ii) does the use of semantic technologies produce a richer data model and integrated data. Details of the methodology, data collection, and an application that uses the generated linked data are presented. Evaluation is done by comparing the traditional data integration approach with the semantic ETL approach in terms of the effort involved in integration, the data model generated, and querying of the generated data.
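A minimal sketch of the "semantic" transform step in such an ETL pipeline, using rdflib to turn one extracted record into RDF triples, follows. The namespace, property names, and the flight record are illustrative assumptions, not the vocabulary actually used in the thesis.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Hypothetical vocabulary for the integrated flight/weather dataset.
EX = Namespace("http://example.org/schema/")

def record_to_triples(graph, record):
    """Transform step: map one extracted row to RDF before loading/publishing."""
    flight = URIRef(f"http://example.org/flight/{record['flight_id']}")
    graph.add((flight, RDF.type, EX.Flight))
    graph.add((flight, EX.departureAirport, Literal(record["origin"])))
    graph.add((flight, EX.arrivalDelayMinutes,
               Literal(record["delay"], datatype=XSD.integer)))
    return graph

g = Graph()
record_to_triples(g, {"flight_id": "AA100", "origin": "PHX", "delay": 12})
print(g.serialize(format="turtle"))
```

In a full pipeline, the extract step would pull such records from the source APIs and the load step would publish the resulting graph to a triple store or the Linked Open Data Cloud.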
Contributors: Padki, Aparna (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)
Created: 2016
Description

Since the inception of the World Wide Web, the amount of data present on the internet has become tremendous, making the task of navigating through it quite difficult for the user. As users struggle to navigate this wealth of information, the need for an automated system that can extract the required information becomes urgent. The aim of this thesis is to develop a Question Answering system to ease the process of information retrieval.

Question Answering systems have been around for quite some time and form a sub-field of information retrieval and natural language processing. The task of any Question Answering system is to find an answer to a free-form factual question. The difficulty of pinpointing and verifying the precise answer makes question answering more challenging than the simple information retrieval done by search engines. The Text REtrieval Conference (TREC) is a yearly conference that provides large-scale infrastructure and resources to support research in the information retrieval domain. TREC has included a question answering track since 1999, whose questions dataset consists of factual questions (Vorhees & Tice, 1999). DBpedia (Bizer et al., 2009) is a community-driven effort to extract and structure the data present in Wikipedia.

The research objective of this thesis is to develop a novel approach to Question Answering based on a composition of conventional Information Retrieval and Natural Language Processing approaches. The focus is also on exploring the use of a structured and annotated knowledge base, as opposed to an unstructured one. The knowledge base used here is DBpedia, and the final system is evaluated on the TREC 2004 questions dataset.
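To illustrate what querying the structured knowledge base looks like once a question has been interpreted, the sketch below sends a SPARQL query to DBpedia's public endpoint with SPARQLWrapper. The example question and the chosen property are illustrative; mapping arbitrary question text to such a query is the hard part the thesis addresses.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Example: answering "Where was Barack Obama born?" against DBpedia.
# The query itself is only an illustration of the retrieval step.
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?birthPlace WHERE {
        dbr:Barack_Obama dbo:birthPlace ?birthPlace .
    }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["birthPlace"]["value"])
```

Compared with keyword search over unstructured text, the annotated triples allow the system to return a precise entity rather than a ranked list of documents.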
Contributors: Chandurkar, Avani (Author) / Bansal, Ajay (Thesis advisor) / Bansal, Srividya (Committee member) / Lindquist, Timothy (Committee member) / Arizona State University (Publisher)
Created: 2016
Description

A Cyber-Physical System consists of a computer monitoring and controlling physical processes, usually in a feedback loop. These systems are increasingly becoming part of our daily life, ranging from smart buildings to medical devices to automobiles. The controller comprises discrete software that may operate in one of many possible modes while interacting with a changing physical environment in a feedback loop. Systems with such a mix of discrete and continuous dynamics are usually termed hybrid systems. In general, these systems are safety-critical, so their correct operation must be verified. Model-Based Design (MBD) languages like Simulink are used extensively for the design and analysis of hybrid systems because of the ease of system design and automatic code generation; they also allow testing and verification of these systems before deployment. One of the main challenges in the verification of these systems is to test all the operating modes of the control software while reducing the amount of user intervention.

This research aims to provide an automated framework for the structural analysis and instrumentation of hybrid system models developed in Simulink. The behavior of the components that introduce discontinuities into the model is automatically extracted in the form of state transition graphs. The framework is integrated into the S-TaLiRo toolbox to demonstrate the improvement in mode coverage.
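The extracted information can be thought of as a state transition graph per discontinuous block, over which coverage is measured. The toy sketch below computes mode (state) and transition coverage from such a graph and a simulated trace; the block name, states, and trace are illustrative assumptions and do not reflect the thesis's actual Simulink analysis or S-TaLiRo integration.

```python
# Hypothetical state transition graph extracted from a Simulink switch block,
# plus a simulated trace of the modes visited during one test.
transition_graph = {
    "Switch1": {
        "states": {"LOW", "HIGH"},
        "transitions": {("LOW", "HIGH"), ("HIGH", "LOW")},
    }
}

observed_trace = ["LOW", "LOW", "HIGH", "HIGH"]

def mode_coverage(graph, block, trace):
    """Fraction of the block's states and transitions exercised by the trace."""
    visited_states = set(trace)
    visited_transitions = {(a, b) for a, b in zip(trace, trace[1:]) if a != b}
    states = graph[block]["states"]
    transitions = graph[block]["transitions"]
    return (len(visited_states & states) / len(states),
            len(visited_transitions & transitions) / len(transitions))

print(mode_coverage(transition_graph, "Switch1", observed_trace))  # (1.0, 0.5)
```

A test-generation loop can then steer new simulations toward the states and transitions that remain uncovered.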
Contributors: Thekkalore Srinivasa, Rahul (Author) / Fainekos, Georgios (Thesis advisor) / Mayyas, Abdel Ra’ouf (Committee member) / Sarjoughian, Hessam S. (Committee member) / Arizona State University (Publisher)
Created: 2016
Description

Robots are becoming an important part of our lives and of industry. Although many robot control interfaces have been developed to simplify control and improve the user experience, users still cannot control robots comfortably. As robot capabilities improve, the demands on the universality and ease of use of robot control interfaces also grow. This research introduces a graphical interface for Linear Temporal Logic (LTL) specifications for mobile robots. It is a sketch-based interface built on the Android platform, which makes the LTL control interface friendlier to non-expert users. By predefining a set of areas of interest, this interface can quickly and efficiently create plans that satisfy extended plan goals expressed in LTL. The interface also allows users to customize the paths of a plan by sketching a set of reference trajectories. Given the user's custom paths, the LTL specification, and the environment, the interface generates a plan that balances the customized paths against the LTL specification. We also show experimental results with the implemented interface.
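To give a flavor of specifications over predefined areas of interest, the toy sketch below checks a sequenced reach-avoid goal, in the spirit of an LTL formula such as F(A ∧ F B) ∧ G ¬C, against a discretized trajectory. The region names, geometry, and check are simplified illustrations and are not the interface's planner or its LTL machinery.

```python
# Areas of interest as axis-aligned boxes: (xmin, ymin, xmax, ymax). Illustrative only.
regions = {
    "A": (0, 0, 2, 2),
    "B": (8, 8, 10, 10),
    "C": (4, 4, 6, 6),   # region to avoid
}

def in_region(point, box):
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def satisfies_visit_a_then_b_avoid_c(trajectory):
    """Toy check for 'eventually A, then eventually B, while always avoiding C'."""
    if any(in_region(p, regions["C"]) for p in trajectory):
        return False
    try:
        first_a = next(i for i, p in enumerate(trajectory) if in_region(p, regions["A"]))
    except StopIteration:
        return False
    return any(in_region(p, regions["B"]) for p in trajectory[first_a:])

path = [(1, 1), (3, 1), (7, 3), (9, 9)]   # a sketched reference trajectory
print(satisfies_visit_a_then_b_avoid_c(path))  # True
```

A sketched reference trajectory that violates the specification would be adjusted by the planner rather than followed verbatim.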
Contributors: Wei, Wei (Author) / Fainekos, Georgios (Thesis advisor) / Amor, Hani Ben (Committee member) / Zhang, Yu (Committee member) / Arizona State University (Publisher)
Created: 2016
Description

The concept of Linked Data, the method of publishing and linking structured data on the web, is gaining widespread popularity and importance. The emergence of Linked Data has made it possible to make sense of the huge amount of data scattered all over the web and to link multiple heterogeneous sources. This leads to the challenge of maintaining the quality of Linked Data, i.e., ensuring that outdated data is removed and new data is included. The focus of this thesis is devising strategies to effectively integrate data from multiple sources, publish it as Linked Data, and maintain its quality. The domain used in the study is online education. With so many courses offered by Massive Open Online Course (MOOC) providers, it is becoming increasingly difficult for an end user to gauge which course best fits his or her needs.

Users are spoilt for choice. It would be very helpful if there were a single place where they could visually compare the offerings of various MOOC providers for the course they are interested in. Previous work in this area, the MOOCLink project, involved integrating data from Coursera, EdX, and Udacity and generating linked data, i.e., Resource Description Framework (RDF) triples.

The research objective of this thesis is to determine a methodology by which the quality of data available through the MOOCLink application is maintained, as new courses are constantly being added and old courses removed by data providers. This thesis presents the integration of data from various MOOC providers, along with algorithms for incrementally updating the linked data to maintain its quality, and compares them against a naïve approach so that users are constantly kept engaged with up-to-date data. A master threshold value, determined through experiments and analysis, quantifies when one algorithm outperforms the other in terms of time efficiency. An evaluation of the tool shows the effectiveness of the algorithms presented in this thesis.
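A minimal sketch of the incremental idea using rdflib follows: diff the freshly extracted course triples against the published graph and apply only the additions and removals, instead of regenerating everything as the naïve approach does. The course URIs and properties are illustrative assumptions, and the thesis's threshold-based choice between the incremental and naïve strategies is not shown.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/mooc/")

def incremental_update(published: Graph, fresh: Graph) -> Graph:
    """Apply only the difference between the new extraction and the live graph."""
    to_add = fresh - published       # triples newly provided by the MOOC APIs
    to_remove = published - fresh    # triples for courses that were dropped
    for triple in to_remove:
        published.remove(triple)
    for triple in to_add:
        published.add(triple)
    return published

# Toy data: one course unchanged, one removed, one added.
published, fresh = Graph(), Graph()
published.add((EX.course1, EX.title, Literal("Intro to Python")))
published.add((EX.course2, EX.title, Literal("Old Course")))
fresh.add((EX.course1, EX.title, Literal("Intro to Python")))
fresh.add((EX.course3, EX.title, Literal("New Course")))

incremental_update(published, fresh)
print(len(published))  # 2 triples remain after the update
```

The trade-off the threshold captures is that diffing pays off when only a small fraction of courses changes, while a full rebuild can be cheaper when most of the data has churned.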
Contributors: Dhekne, Chinmay (Author) / Bansal, Srividya (Thesis advisor) / Bansal, Ajay (Committee member) / Sohoni, Sohum (Committee member) / Arizona State University (Publisher)
Created: 2016