Advancing biomedical named entity recognition with multivariate feature selection and semantically motivated features

Leaman, James Robert

Description

Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located within natural-language text and their semantic type is determined. This step is critical for later tasks in an information extraction pipeline, including normalization and relationship extraction. BANNER is a benchmark biomedical NER system using linear-chain conditional random fields and the rich feature set approach. A case study with BANNER locating genes and proteins in biomedical literature is described. The first corpus for disease NER adequate for use as training data is introduced and employed in a case study of disease NER. The first corpus locating adverse drug reactions (ADRs) in user posts to a health-related social website is also described, and a system to locate and identify ADRs in social media text is created and evaluated. The rich feature set approach to creating NER feature sets is argued to be subject to diminishing returns, implying that additional improvements may require more sophisticated methods for creating the feature set. This motivates the first application of multivariate feature selection with filters and false discovery rate analysis to biomedical NER, resulting in a feature set at least 3 orders of magnitude smaller than the set created by the rich feature set approach. Finally, two novel approaches to NER by modeling the semantics of token sequences are introduced. The first method focuses on the sequence content, using language models to determine whether a sequence more closely resembles entries in a lexicon of entity names or text from an unlabeled corpus. The second method models the distributional semantics of token sequences, determining the similarity between a potential mention and the token sequences from the training data by analyzing the contexts where each sequence appears in a large unlabeled corpus. The second method is shown to improve the performance of BANNER on multiple data sets.

Details

Contributors

Leaman, James Robert (Author)
Gonzalez, Graciela (Thesis advisor)
Baral, Chitta (Thesis advisor)
Cohen, Kevin B (Committee member)
Liu, Huan (Committee member)
Ye, Jieping (Committee member)
Arizona State University (Publisher)

Date Created

2013

Resource Type

Text

Language

eng

Note

thesis

Partial requirement for: Ph.D., Arizona State University, 2013
bibliography

Includes bibliographical references (p. 96-107)
Field of study: Computer science

Citation and reuse

Statement of Responsibility

by James Robert Leaman

Additional Information

Extent

xii, 107 p. : ill. (some col.)
