Advancing biomedical named entity recognition with multivariate feature selection and semantically motivated features

Description

Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located within natural-language text and their semantic type is determined. This step is critical for later tasks in an information extraction pipeline, including normalization and relationship extraction. BANNER is a benchmark biomedical NER system using linear-chain conditional random fields and the rich feature set approach. A case study with BANNER locating genes and proteins in biomedical literature is described. The first corpus for disease NER adequate for use as training data is introduced, and employed in a case study of disease NER. The first corpus locating adverse drug reactions (ADRs) in user posts to a health-related social website is also described, and a system to locate and identify ADRs in social media text is created and evaluated. The rich feature set approach to creating NER feature sets is argued to be subject to diminishing returns, implying that additional improvements may require more sophisticated methods for creating the feature set. This motivates the first application of multivariate feature selection with filters and false discovery rate analysis to biomedical NER, resulting in a feature set at least 3 orders of magnitude smaller than the set created by the rich feature set approach. Finally, two novel approaches to NER by modeling the semantics of token sequences are introduced. The first method focuses on the sequence content by using language models to determine whether a sequence resembles entries in a lexicon of entity names or text from an unlabeled corpus more closely. 
The second method models the distributional semantics of token sequences, determining the similarity between a potential mention and the token sequences from the training data by analyzing the contexts where each sequence appears in a large unlabeled corpus. The second method is shown to improve the performance of BANNER on multiple data sets.
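The first semantic method described above compares how well two language models explain a candidate token sequence. A minimal sketch of that idea follows, assuming character-bigram models with add-one smoothing; the abstract does not state the model order or smoothing, and the function names and toy data here are hypothetical.

```python
import math
from collections import Counter

def train_char_bigram_model(texts):
    """Count character bigrams and left-context characters over the texts."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for text in texts:
        padded = "^" + text.lower() + "$"  # assumed start/end markers
        vocab.update(padded)
        for left, right in zip(padded, padded[1:]):
            bigrams[(left, right)] += 1
            contexts[left] += 1
    return bigrams, contexts, vocab

def avg_log_likelihood(model, text, vocab_size):
    """Average add-one-smoothed log-probability per bigram of `text`."""
    bigrams, contexts, _ = model
    padded = "^" + text.lower() + "$"
    total = 0.0
    for left, right in zip(padded, padded[1:]):
        prob = (bigrams[(left, right)] + 1) / (contexts[left] + vocab_size)
        total += math.log(prob)
    return total / (len(padded) - 1)

def resembles_lexicon(candidate, lexicon_model, background_model):
    """True if the lexicon model explains the candidate better."""
    vocab_size = len(lexicon_model[2] | background_model[2])
    return (avg_log_likelihood(lexicon_model, candidate, vocab_size)
            > avg_log_likelihood(background_model, candidate, vocab_size))

# Toy data, invented for illustration only.
lexicon_model = train_char_bigram_model(["brca1", "brca2", "tp53", "egfr"])
background_model = train_char_bigram_model(
    ["the patient was treated", "results are shown in the table"]
)
print(resembles_lexicon("brca1", lexicon_model, background_model))
print(resembles_lexicon("the results", lexicon_model, background_model))
```

In an NER system, a comparison like this would more plausibly feed a feature to the sequence tagger than act as a final decision.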

Contributors: Leaman, James Robert (Author) / Gonzalez, Graciela (Thesis advisor) / Baral, Chitta (Thesis advisor) / Cohen, Kevin B (Committee member) / Liu, Huan (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)

Created: 2013

A compact disc recording of three commissioned works featuring the clarinet by Portuguese composers, which include Portuguese folk music elements

Description

Despite the wealth of folk music traditions in Portugal and the importance of the clarinet in the music of bandas filarmonicas, it is uncommon to find works featuring the clarinet using Portuguese folk music elements. In the interest of expanding this type of repertoire, three new works were commissioned from three different composers. The resulting works are Seres Imaginarios 3 by Luis Cardoso; Delirio Barroco by Tiago Derrica; and Memória by Pedro Faria Gomes. In an effort to submit these new works for inclusion into mainstream performance literature, the author has recorded these works on compact disc. This document includes interview transcripts with each composer, providing first-person discussion of each composition, as well as detailed biographical information on each composer. To provide context, the author has included a brief discussion on Portuguese folk music, and in particular, the role that the clarinet plays in Portuguese folk music culture.

Contributors: Ferreira, Wesley (Contributor) / Spring, Robert S (Thesis advisor) / Bailey, Wayne (Committee member) / Gardner, Joshua (Committee member) / Hill, Gary (Committee member) / Schuring, Martin (Committee member) / Solis, Theodore (Committee member) / Arizona State University (Publisher)

Created: 2013

Charlotte Burton, clarinet

Contributors: Burton, Charlotte (Performer) / ASU Library. Music Library (Publisher)

Created: 2018-04-08

Elizabeth Druesedow, clarinet

Contributors: Druesedow, Elizabeth (Performer) / ASU Library. Music Library (Publisher)

Created: 2018-04-07

Sentimental bi-partite graph of political blogs

Description

Analysis of political texts, which contain a huge amount of personal political opinions, sentiments, and emotions towards powerful individuals, leaders, organizations, and a large number of people, is an interesting task that can lead to the discovery of interesting interactions between political parties and people. Recently, the political blogosphere has played an increasingly important role in politics as a forum for debating political issues. Most political weblogs are biased towards their political parties, and they generally express sentiments towards their own issues (e.g., leaders and topics) and also towards the issues of the opposing parties. In this thesis, I have modeled these interactions and debates as a sentimental bi-partite graph: a bi-partite graph with blogs forming the vertices of one disjoint set, the issues (e.g., leaders and topics) forming the other disjoint set, and the edges between the two sets representing the sentiment of the blogs towards the issues. I have used American political blog data to model the sentimental bi-partite graph, in particular a set of popular liberal and conservative political blogs that have clearly declared positions. These blogs contain discussion of social, political, and economic issues and related key individuals from their conservative or liberal viewpoint. To be more focused and more polarized, the 22 most popular liberal and conservative blogs of a particular time period, May 2008 to October 2008 (chosen because of the high intensity of debate and discussion just before the presidential election), were considered, involving around 23,800 articles. This thesis addresses the following questions: a) Which are the most liberal/conservative blogs on the web? b) Who is on which side of the debate, and what are the issues? c) Who are the important leaders? d) How do you model the relationship between the participants of the debate and the underlying issues?
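The graph structure described in this abstract can be sketched as a small data structure. The example below is illustrative only: the class name, the signed scores in [-1, 1], the agreement measure, and the blog and issue names are all assumptions, not details taken from the thesis.

```python
class SentimentBipartiteGraph:
    """Blogs on one side, issues on the other, sentiment on the edges."""

    def __init__(self):
        # blog -> {issue: sentiment score}; scores are assumed to be signed,
        # negative for hostile and positive for favorable sentiment.
        self.edges = {}

    def add_sentiment(self, blog, issue, score):
        self.edges.setdefault(blog, {})[issue] = score

    def blogs(self):
        return set(self.edges)

    def issues(self):
        return {issue for targets in self.edges.values() for issue in targets}

    def agreement(self, blog_a, blog_b):
        """Sum of sentiment products over issues both blogs discuss.

        Positive values suggest the blogs lean the same way; negative
        values suggest they sit on opposite sides of the debate.
        """
        edges_a = self.edges.get(blog_a, {})
        edges_b = self.edges.get(blog_b, {})
        shared = edges_a.keys() & edges_b.keys()
        return sum(edges_a[issue] * edges_b[issue] for issue in shared)

# Hypothetical blogs and issues, invented for illustration.
graph = SentimentBipartiteGraph()
graph.add_sentiment("blog_x", "issue_1", 1)
graph.add_sentiment("blog_x", "issue_2", -1)
graph.add_sentiment("blog_y", "issue_1", -1)
graph.add_sentiment("blog_y", "issue_2", 1)
graph.add_sentiment("blog_z", "issue_1", 1)
print(graph.agreement("blog_x", "blog_y"))
print(graph.agreement("blog_x", "blog_z"))
```

Because edges only run between the two vertex sets, any blog-to-blog comparison has to pass through shared issues, which is what the `agreement` helper does.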

Contributors: Thirumalai, Dananjayan (Author) / Davulcu, Hasan (Thesis advisor) / Sarjoughian, Hessam S. (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created: 2012

A recording and performance guide for three new works featuring clarinet and electronics, clarinet and piano, and clarinet, bass clarinet, and piano

Description

This project includes a recording and performance guide for three newly commissioned pieces for the clarinet. The first piece, shimmer, was written by Grant Jahn and is for B-flat clarinet and electronics. The second piece, Paragon, is for B-flat clarinet and piano and was composed by Dr. Theresa Martin. The third and final piece, Duality in the Eye of a Bovine, was written by Kurt Mehlenbacher and is for B-flat clarinet, bass clarinet, and piano. In addition to the performance guide, this document also includes background information and program notes for the compositions, as well as composer biographical information, a list of other works featuring the clarinet by each composer, and transcripts of composer and performer interviews. This document is accompanied by a recording of the three pieces.

Contributors: Poupard, Caitlin Marie (Author) / Spring, Robert (Thesis advisor) / Gardner, Joshua (Thesis advisor) / Hill, Gary (Committee member) / Oldani, Robert (Committee member) / Schuring, Martin (Committee member) / Arizona State University (Publisher)

Created: 2016

A recording project featuring four newly commissioned pieces for clarinet

Description

The primary objective of this research project is to expand the clarinet repertoire with the addition of four new pieces. Each of these new pieces uses contemporary clarinet techniques, including electronics, prerecorded sounds, multiphonics, circular breathing, multiple articulation, demi-clarinet, and the clari-flute. The repertoire composed includes Grant Jahn’s Duo for Two Clarinets, Reggie Berg’s Funkalicious for Clarinet and Piano, Rusty Banks’ Star Juice for Clarinet and Fixed Media, and Chris Malloy’s A Celestial Breath for Clarinet and Electronics. In addition to the musical commissions, this project includes interviews with the composers indicating how they wrote these works and what their influences were, along with any information pertinent to the performer, professional recordings of each piece, and performance notes and suggestions.

Contributors: Case-Ruchala, Celeste Ann (Contributor) / Gardner, Joshua (Thesis advisor) / Spring, Robert (Thesis advisor) / Hill, Gary (Committee member) / Rogers, Rodney (Committee member) / Schuring, Martin (Committee member) / Arizona State University (Publisher)

Created: 2016

Semantic feature extraction for narrative analysis

Description

A story is defined as "an actor(s) taking action(s) that culminates in a resolution(s)". I present novel sets of features to facilitate story detection in text via supervised classification and further reveal different forms within stories via unsupervised clustering. First, I investigate the utility of a new set of semantic features compared to standard keyword features combined with statistical features, such as density of part-of-speech (POS) tags and named entities, to develop a story classifier. The proposed semantic features are based on triplets that can be extracted using a shallow parser. Experimental results show that a model of memory-based semantic linguistic features alongside statistical features achieves better accuracy. Next, I further improve the performance of story detection with a novel algorithm that aggregates the triplets, producing generalized concepts and relations. A major challenge in automated text analysis is that different words are used for related concepts. Analyzing text at the surface level would treat related concepts (e.g., actors, actions, targets, and victims) as different objects, potentially missing common narrative patterns. The algorithm clusters triplets into generalized concepts by utilizing syntactic criteria based on common contexts and semantic corpus-based statistical criteria based on "contextual synonyms". The generalized concept representation of text (1) overcomes surface-level differences (which arise when different keywords are used for related concepts) without drift, (2) leads to a higher-level semantic network representation of related stories, and (3) yields, when used as features, a significant (36%) boost in performance for the story detection task. Finally, I implement co-clustering based on generalized concepts/relations to automatically detect story forms. Overlapping generalized concepts and relationships correspond to archetypes/targets and actions that characterize story forms. 
I perform co-clustering of stories using standard unigrams/bigrams and generalized concepts. I show that the residual error of factorization with concept-based features is significantly lower than the error with standard keyword-based features. I also present qualitative evaluations by a subject matter expert, which suggest that concept-based features yield more coherent, distinctive and interesting story forms compared to those produced by using standard keyword-based features.

Contributors: Ceran, Saadet Betul (Author) / Davulcu, Hasan (Thesis advisor) / Corman, Steven R. (Committee member) / Shakarian, Paulo (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)

Created: 2016

Katrina Clements, clarinet

Contributors: Clements, Katrina (Performer) / ASU Library. Music Library (Publisher)

Created: 2018-03-15

Health information extraction from social media

Description

Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks such as pharmacovigilance via the use of Natural Language Processing (NLP) techniques. One of the critical steps in information extraction pipelines is Named Entity Recognition (NER), where the mentions of entities such as diseases are located in text and their entity types are identified. However, the language in social media is highly informal, and user-expressed health-related concepts are often non-technical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and advanced machine learning-based NLP techniques have been underutilized. This work explores the effectiveness of different machine learning techniques, and particularly deep learning, to address the challenges associated with extraction of health-related concepts from social media. Deep learning has recently attracted a lot of attention in machine learning research and has shown remarkable success in several applications, particularly imaging and speech recognition. However, thus far, deep learning techniques are relatively unexplored for biomedical text mining and, in particular, this is the first attempt at applying deep learning to health information extraction from social media.

This work presents ADRMine, which uses a Conditional Random Field (CRF) sequence tagger for extraction of complex health-related concepts. It utilizes a large volume of unlabeled user posts for automatic learning of embedding cluster features, a novel application of deep learning to modeling the similarity between tokens. ADRMine significantly improved medical NER performance compared to the baseline systems.
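As an illustration of how embedding cluster features might enter a CRF tagger, the sketch below builds a per-token feature dictionary that includes a cluster identifier. It is not ADRMine's actual feature template: the function name, the feature keys, and the toy cluster assignments are hypothetical, and the clusters are assumed to come from clustering word embeddings learned on unlabeled posts.

```python
def token_features(tokens, index, cluster_of):
    """Build a feature dict for one token, including embedding-cluster IDs.

    `cluster_of` maps a lowercased token to a cluster ID; unseen tokens
    fall back to the placeholder "UNK".
    """
    token = tokens[index].lower()
    features = {
        "token": token,
        "cluster": cluster_of.get(token, "UNK"),
    }
    # Cluster IDs of neighboring tokens give the tagger some context.
    if index > 0:
        features["prev_cluster"] = cluster_of.get(tokens[index - 1].lower(), "UNK")
    if index < len(tokens) - 1:
        features["next_cluster"] = cluster_of.get(tokens[index + 1].lower(), "UNK")
    return features

# Toy cluster assignments, invented for illustration.
clusters = {"sleepy": "c7", "drowsy": "c7", "felt": "c2"}
seen = token_features(["felt", "sleepy"], 1, clusters)
unseen_in_training = token_features(["felt", "drowsy"], 1, clusters)
print(seen["cluster"] == unseen_in_training["cluster"])
```

The point of such a feature is that two informal expressions falling in the same cluster share it, so a tagger trained on one can generalize to the other even if the second never appears in the labeled data.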

This work also presents DeepHealthMiner, a deep learning pipeline for health-related concept extraction. Most machine learning methods require sophisticated task-specific manual feature design, which is a challenging step in processing the informal and noisy content of social media. DeepHealthMiner automatically learns classification features using neural networks and a large volume of unlabeled user posts. Using a relatively small labeled training set, DeepHealthMiner could accurately identify most of the concepts, including consumer expressions that were not observed in the training data or in the standard medical lexicons, outperforming the state-of-the-art baseline techniques.

Contributors: Nikfarjam, Azadeh (Author) / Gonzalez, Graciela (Thesis advisor) / Greenes, Robert (Committee member) / Scotch, Matthew (Committee member) / Arizona State University (Publisher)

Created: 2016
