Search Content

A semantic triplet based story classifier

Description

Text classification, in the artificial intelligence domain, is an activity in which text documents are automatically classified into predefined categories using machine learning techniques. An example of this is classifying uncategorized news articles into different predefined categories such as "Business", "Politics", "Education", "Technology" , etc. In this thesis, supervised machine…

Text classification, in the artificial intelligence domain, is an activity in which text documents are automatically classified into predefined categories using machine learning techniques. An example of this is classifying uncategorized news articles into different predefined categories such as "Business", "Politics", "Education", "Technology" , etc. In this thesis, supervised machine learning approach is followed, in which a module is first trained with pre-classified training data and then class of test data is predicted. Good feature extraction is an important step in the machine learning approach and hence the main component of this text classifier is semantic triplet based features in addition to traditional features like standard keyword based features and statistical features based on shallow-parsing (such as density of POS tags and named entities). Triplet {Subject, Verb, Object} in a sentence is defined as a relation between subject and object, the relation being the predicate (verb). Triplet extraction process, is a 5 step process which takes input corpus as a web text document(s), each consisting of one or many paragraphs, from RSS feeds to lists of extremist website. Input corpus feeds into the "Pronoun Resolution" step, which uses an heuristic approach to identify the noun phrases referenced by the pronouns. The next step "SRL Parser" is a shallow semantic parser and converts the incoming pronoun resolved paragraphs into annotated predicate argument format. The output of SRL parser is processed by "Triplet Extractor" algorithm which forms the triplet in the form {Subject, Verb, Object}. Generalization and reduction of triplet features is the next step. Reduced feature representation reduces computing time, yields better discriminatory behavior and handles curse of dimensionality phenomena. For training and testing, a ten- fold cross validation approach is followed. In each round SVM classifier is trained with 90% of labeled (training) data and in the testing phase, classes of remaining 10% unlabeled (testing) data are predicted. Concluding, this paper proposes a model with semantic triplet based features for story classification. The effectiveness of the model is demonstrated against other traditional features used in the literature for text classification tasks.

ContributorsKarad, Ravi Chandravadan (Author) / Davulcu, Hasan (Thesis advisor) / Corman, Steven (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2013

Advancing biomedical named entity recognition with multivariate feature selection and semantically motivated features

Description

Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located…

Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located within natural-language text and their semantic type is determined. This step is critical for later tasks in an information extraction pipeline, including normalization and relationship extraction. BANNER is a benchmark biomedical NER system using linear-chain conditional random fields and the rich feature set approach. A case study with BANNER locating genes and proteins in biomedical literature is described. The first corpus for disease NER adequate for use as training data is introduced, and employed in a case study of disease NER. The first corpus locating adverse drug reactions (ADRs) in user posts to a health-related social website is also described, and a system to locate and identify ADRs in social media text is created and evaluated. The rich feature set approach to creating NER feature sets is argued to be subject to diminishing returns, implying that additional improvements may require more sophisticated methods for creating the feature set. This motivates the first application of multivariate feature selection with filters and false discovery rate analysis to biomedical NER, resulting in a feature set at least 3 orders of magnitude smaller than the set created by the rich feature set approach. Finally, two novel approaches to NER by modeling the semantics of token sequences are introduced. The first method focuses on the sequence content by using language models to determine whether a sequence resembles entries in a lexicon of entity names or text from an unlabeled corpus more closely. The second method models the distributional semantics of token sequences, determining the similarity between a potential mention and the token sequences from the training data by analyzing the contexts where each sequence appears in a large unlabeled corpus. The second method is shown to improve the performance of BANNER on multiple data sets.

ContributorsLeaman, James Robert (Author) / Gonzalez, Graciela (Thesis advisor) / Baral, Chitta (Thesis advisor) / Cohen, Kevin B (Committee member) / Liu, Huan (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)

Created2013

A high level language for human robot interaction

Description

While developing autonomous intelligent robots has been the goal of many research programs, a more practical application involving intelligent robots is the formation of teams consisting of both humans and robots. An example of such an application is search and rescue operations where robots commanded by humans are sent to…

While developing autonomous intelligent robots has been the goal of many research programs, a more practical application involving intelligent robots is the formation of teams consisting of both humans and robots. An example of such an application is search and rescue operations where robots commanded by humans are sent to environments too dangerous for humans. For such human-robot interaction, natural language is considered a good communication medium as it allows humans with less training about the robot's internal language to be able to command and interact with the robot. However, any natural language communication from the human needs to be translated to a formal language that the robot can understand. Similarly, before the robot can communicate (in natural language) with the human, it needs to formulate its communique in some formal language which then gets translated into natural language. In this paper, I develop a high level language for communication between humans and robots and demonstrate various aspects through a robotics simulation. These language constructs borrow some ideas from action execution languages and are grounded with respect to simulated human-robot interaction transcripts.

ContributorsLumpkin, Barry Thomas (Author) / Baral, Chitta (Thesis advisor) / Lee, Joohyung (Committee member) / Fainekos, Georgios (Committee member) / Arizona State University (Publisher)

Created2012

Learning the Initial Lexicon in Translating Natural Language to Formal Language

Description

The objective of this research is to determine an approach for automating the learning of the initial lexicon used in translating natural language sentences to their formal knowledge representations based on lambda-calculus expressions. Using a universal knowledge representation and its associated parser, this research attempts to use word alignment techniques…

The objective of this research is to determine an approach for automating the learning of the initial lexicon used in translating natural language sentences to their formal knowledge representations based on lambda-calculus expressions. Using a universal knowledge representation and its associated parser, this research attempts to use word alignment techniques to align natural language sentences to the linearized parses of their associated knowledge representations in order to learn the meanings of individual words. The work includes proposing and analyzing an approach that can be used to learn some of the initial lexicon.

ContributorsBaldwin, Amy Lynn (Author) / Baral, Chitta (Thesis director) / Vo, Nguyen (Committee member) / Industrial, Systems (Contributor) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2015-05

Using Machine Learning Models to Detect Fake News, Bots, and Rumors on Social Media

Description

In this paper, I introduce the fake news problem and detail how it has been exacerbated through social media. I explore current practices for fake news detection using natural language processing and current benchmarks in ranking the efficacy of various language models. Using a Twitter-specific benchmark, I attempt to reproduce the scores of…

In this paper, I introduce the fake news problem and detail how it has been exacerbated through social media. I explore current practices for fake news detection using natural language processing and current benchmarks in ranking the efficacy of various language models. Using a Twitter-specific benchmark, I attempt to reproduce the scores of six language models demonstrating their effectiveness in seven tweet classification tasks. I explain the successes and challenges in reproducing these results and provide analysis for the future implications of fake news research.

ContributorsChang, Ariz Bay (Author) / Liu, Huan (Thesis director) / Tahir, Anique (Committee member) / Computer Science and Engineering Program (Contributor, Contributor) / Barrett, The Honors College (Contributor)

Created2021-05

Exploring Financial Credit Contracts Using Natural Language Processing Techniques

Description

Natural Language Processing (NLP) techniques have increasingly been used in finance, accounting, and economics research to analyze text-based information more efficiently and effectively than primarily human-centered methods. The literature is rich with computational textual analysis techniques applied to consistent annual or quarterly financial fillings, with promising results to identify similarities…

Natural Language Processing (NLP) techniques have increasingly been used in finance, accounting, and economics research to analyze text-based information more efficiently and effectively than primarily human-centered methods. The literature is rich with computational textual analysis techniques applied to consistent annual or quarterly financial fillings, with promising results to identify similarities between documents and firms, in addition to further using this information in relation to other economic phenomena. Building upon the knowledge gained from previous research and extending the application of NLP methods to other categories of financial documents, this project explores financial credit contracts, better understanding the information provided through their textual data by assessing patterns and relationships between documents and firms. The main methods used throughout this project is Term Frequency-Inverse Document Frequency (to represent each document as a numerical vector), Cosine Similarity (to measure the similarity between contracts), and K-Means Clustering (to organically derive clusters of documents based on the text included in the contract itself). Using these methods, the dimensions analyzed are various grouping methodologies (external industry classifications and text derived classifications), various granularities (document-wise and firm-wise), various financial documents associated with a single firm (the relationship between credit contracts and 10-K product descriptions), and how various mean cosine similarity distributions change over time.

ContributorsLiu, Jeremy J (Author) / Wahal, Sunil (Thesis director) / Bharath, Sreedhar (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / School for the Future of Innovation in Society (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

Instructional Design with Natural Language Processing in a Virtual Reality Environment

Description

Natural Language Processing and Virtual Reality are hot topics in the present. How can we synthesize these together in order to make a cohesive experience? The game focuses on users using vocal commands, building structures, and memorizing spatial objects. In order to get proper vocal commands, the IBM Watson API…

Natural Language Processing and Virtual Reality are hot topics in the present. How can we synthesize these together in order to make a cohesive experience? The game focuses on users using vocal commands, building structures, and memorizing spatial objects. In order to get proper vocal commands, the IBM Watson API for Natural Language Processing was incorporated into our game system. User experience elements like gestures, UI color change, and images were used to help guide users in memorizing and building structures. The process to create these elements were streamlined through the VRTK library in Unity. The game has two segments. The first segment is a tutorial level where the user learns to perform motions and in-game actions. The second segment is a game where the user must correctly create a structure by utilizing vocal commands and spatial recognition. A standardized usability test, System Usability Scale, was used to evaluate the effectiveness of the game. A survey was also created in order to evaluate a more descriptive user opinion. Overall, users gave a positive score on the System Usability Scale and slightly positive reviews in the custom survey.

ContributorsOrtega, Excel (Co-author) / Ryan, Alexander (Co-author) / Kobayashi, Yoshihiro (Thesis director) / Nelson, Brian (Committee member) / Computing and Informatics Program (Contributor) / School of Art (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2018-05

Voice Reconfigurable Networks

Description

The software element of home and small business networking solutions has failed to keep pace with annual development of newer and faster hardware. The software running on these devices is an afterthought, oftentimes equipped with minimal features, an obtuse user interface, or both. At the same time, this past year…

The software element of home and small business networking solutions has failed to keep pace with annual development of newer and faster hardware. The software running on these devices is an afterthought, oftentimes equipped with minimal features, an obtuse user interface, or both. At the same time, this past year has seen the rise of smart home assistants that represent the next step in human-computer interaction with their advanced use of natural language processing. This project seeks to quell the issues with the former by exploring a possible fusion of a powerful, feature-rich software-defined networking stack and the incredible natural language processing tools of smart home assistants. To accomplish these ends, a piece of software was developed to leverage the powerful natural language processing capabilities of one such smart home assistant, the Amazon Echo. On one end, this software interacts with Amazon Web Services to retrieve information about a user's speech patterns and key information contained in their speech. On the other end, the software joins that information with its previous session state to intelligently translate speech into a series of commands for the separate components of a networking stack. The software developed for this project empowers a user to quickly make changes to several facets of their networking gear or acquire information about it with just their language \u2014 no terminals, java applets, or web configuration interfaces needed, thus circumventing clunky UI's or jumping from shell to shell. It is the author's hope that showing how networking equipment can be configured in this innovative way will draw more attention to the current failings of networking equipment and inspire a new series of intuitive user interfaces.

ContributorsHermens, Ryan Joseph (Author) / Meuth, Ryan (Thesis director) / Burger, Kevin (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2016-12

Conjugating Honorifics in English-to-Japanese Machine Translation

Description

This research lays down foundational work in the semantic reconstruction of linguistic politeness in English-to-Japanese machine translation and thereby advances semantic-based automated translation of English into other natural languages. I developed a Java project called the PoliteParser that is intended as a plug-in to existing semantic parsers to determine whether…

This research lays down foundational work in the semantic reconstruction of linguistic politeness in English-to-Japanese machine translation and thereby advances semantic-based automated translation of English into other natural languages. I developed a Java project called the PoliteParser that is intended as a plug-in to existing semantic parsers to determine whether verbs in dialogue in an English corpus should be conjugated into the plain or the polite honorific form when translated into Japanese. The PoliteParser bases this decision off of semantic information about the social relationships between the speaker and the listener, the speaker's personality, and the circumstances of the utterance. Testing undergone during the course of this research demonstrates that the PoliteParser can achieve levels of accuracy 31 percentage points higher than that of statistical translation systems when integrated with a semantic parser and 54 percentage points higher when used with pre-parsed data.

ContributorsGuiou, Jared Tyler (Author) / Baral, Chitta (Thesis director) / Tanno, Koji (Committee member) / School of International Letters and Cultures (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2016-12

Prescription Information Extraction from Electronic Health Records using BiLSTM-CRF and Word Embeddings

Description

Medical records are increasingly being recorded in the form of electronic health records (EHRs), with a significant amount of patient data recorded as unstructured natural language text. Consequently, being able to extract and utilize clinical data present within these records is an important step in furthering clinical care. One important…

Medical records are increasingly being recorded in the form of electronic health records (EHRs), with a significant amount of patient data recorded as unstructured natural language text. Consequently, being able to extract and utilize clinical data present within these records is an important step in furthering clinical care. One important aspect within these records is the presence of prescription information. Existing techniques for extracting prescription information — which includes medication names, dosages, frequencies, reasons for taking, and mode of administration — from unstructured text have focused on the application of rule- and classifier-based methods. While state-of-the-art systems can be effective in extracting many types of information, they require significant effort to develop hand-crafted rules and conduct effective feature engineering. This paper presents the use of a bidirectional LSTM with CRF tagging model initialized with precomputed word embeddings for extracting prescription information from sentences without requiring significant feature engineering. The experimental results, run on the i2b2 2009 dataset, achieve an F1 macro measure of 0.8562, and scores above 0.9449 on four of the six categories, indicating significant potential for this model.

ContributorsRawal, Samarth Chetan (Author) / Baral, Chitta (Thesis director) / Anwar, Saadat (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2018-05