Matching Items (7)

Image Reference Extraction: Linking References to Radiology Images with Clinically Significant Findings

Description

Background: Pulmonary embolism is a deadly condition that is often diagnosed using a technique known as computed tomography pulmonary angiography (CTPA). CTPA reports are free-text, narrative-style forms of documentation conveying radiologist findings, both primary (regarding pulmonary embolism) and incidental. This project seeks to combine simple natural language processing (NLP) techniques, such as regular expressions and rules, to build upon and further process the output of a machine-learning-based named entity recognition (NER) tool for the purposes of (1) linking references to radiological images with the corresponding clinical findings and (2) extracting primary and incidental findings.

Methods: The project’s system utilized a regular expression to extract image references. All CTPA reports were first processed with NER software to obtain the text and spans of clinical findings. A heuristic was used to determine the appropriate clinical finding that should be linked with a particular image reference. Another regular expression was used to extract primary findings from NER output; the remaining findings were considered incidental. Performance was assessed against a gold standard, which was based upon a manually annotated version of the CTPA reports used in this project.
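The abstract does not reproduce the project's actual patterns, but the extraction-and-linking step can be sketched. In the snippet below, the image-reference pattern and the span format are assumptions; the linking heuristic is the one described above (take the finding immediately prior to the reference):

```python
import re

# Hypothetical pattern; the project's actual regular expression is not given.
# Intended to match references such as "(image 12)" or "(series 4, image 12)".
IMAGE_REF = re.compile(
    r"\(\s*(?:series|image)\s*\d+\s*(?:,\s*(?:series|image)\s*\d+)?\s*\)",
    re.IGNORECASE,
)

def link_findings(report_text, ner_spans):
    """Link each image reference to the clinical finding immediately prior to it.

    ner_spans: list of (start, end, text) tuples produced by the NER tool.
    Returns a list of (finding_text, reference_text) pairs.
    """
    links = []
    for match in IMAGE_REF.finditer(report_text):
        # Heuristic from the abstract: choose the finding whose span ends
        # closest before the image reference begins.
        preceding = [s for s in ner_spans if s[1] <= match.start()]
        if preceding:
            finding = max(preceding, key=lambda s: s[1])
            links.append((finding[2], match.group()))
    return links
```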

Results: Extraction of image references achieved 100% accuracy. Linkages between these references and exact gold standard spans of the clinical findings achieved a precision of 0.24, a recall of 0.22, and an F1 score of 0.23. Linkages with partial spans of clinical findings as determined by the gold standard achieved a precision of 0.71, a recall of 0.67, and an F1 score of 0.69. Primary and incidental finding extraction achieved a precision of 0.67, a recall of 0.80, and an F1 score of 0.73.
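The gap between the exact-span and partial-span scores comes down to how a predicted span is counted as a hit. A minimal sketch of the two matching regimes, assuming spans are (start, end) character offsets:

```python
def prf1(predicted, gold, partial=False):
    """Precision, recall, and F1 over spans.

    With partial=True, any character overlap between a predicted span and a
    gold span counts as a match; otherwise the offsets must match exactly.
    """
    def hit(p, g):
        return p[0] < g[1] and g[0] < p[1] if partial else p == g

    tp_pred = sum(any(hit(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(hit(p, g) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```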

Discussion: Several factors reduced system performance, such as the difficulty of exactly matching the spans of clinical findings from NER output with those in the gold standard. The heuristic linking clinical findings and image references was especially sensitive to NER false positives and false negatives because it assumed the appropriate clinical finding was the one immediately prior to the image reference. Although the system did not perform as well as hoped, lessons were learned, such as the need for clear research methodology and proper gold standard creation; without a proper gold standard, problem scope and system performance cannot be properly assessed. Possible improvements include creating a more robust heuristic, filtering out NER false positives, and training the NER tool on a dataset of CTPA reports.

Date Created
  • 2019-05

MosquitoDB

Description

Mosquito population data is a valuable resource for researchers and public health officials working to limit the spread of deadly zoonotic viruses such as Zika virus and West Nile virus. Unfortunately, this data is currently difficult to obtain and aggregate across the United States. Obtaining historical data often requires filing requests with individual states or counties and hoping for a response, and the online systems currently available for accessing aggregated data lack essential features or are limited in scope.

To make mosquito population data more accessible to United States researchers, epidemiologists, and public health officials, the MosquitoDB system has been developed. MosquitoDB consists of a JavaScript web application, connected to a SQL database, that makes submitting and retrieving United States mosquito population data much simpler and more straightforward than alternative systems. The MosquitoDB software project is open source and publicly available on GitHub, allowing community scrutiny and contributions that add or improve features.

For this Creative Project, the core MosquitoDB system was designed and developed with three main features:
  • a web interface for querying mosquito data,
  • a web interface for submitting mosquito data, and
  • web services for querying/retrieving and submitting mosquito data (see the sketch below).

The web interface is essential for common end users, such as researchers and public health officials, to access historical data or submit new data. The web services provide building blocks that other developers can use to incorporate the data into new applications. The current MosquitoDB system is live at https://zodo.asu.edu/mosquito and the public code repository is available at https://github.com/developerDemetri/mosquitodb.
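As an illustration of how a developer might consume those web services, the sketch below issues a query over HTTP. The route and parameter names here are hypothetical; the actual API is documented in the GitHub repository.

```python
import requests

BASE_URL = "https://zodo.asu.edu/mosquito"

def query_counts(state, county, year):
    """Retrieve aggregated mosquito population counts for one county and year.

    The "/api/counts" path and its parameters are assumptions for
    illustration; see the MosquitoDB repository for the real endpoints.
    """
    response = requests.get(
        f"{BASE_URL}/api/counts",
        params={"state": state, "county": county, "year": year},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example usage (hypothetical):
# counts = query_counts("AZ", "Maricopa", 2016)
```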

Date Created
  • 2016-12

My Contraceptive Choice: A Decision Support Tool for College Women

Description

Contraceptive methods are vital in maintaining women’s health and preventing unintended pregnancy. When a woman uses a method that reflects her personal preferences and lifestyle, the chances of low adoption and misuse decrease. The research aim of this project is to develop a web-based decision aid tailored to college women that assists in the selection of contraceptive methods. To that end, My Contraceptive Choice (MCC) was built using the gaps identified in existing resources provided by Planned Parenthood and Bedsider, along with feedback from a university student focus group. The tool is a short quiz followed by two pages of information and resources for a variety of contraceptive methods commonly used by college women. The evaluation phase of this project included simulated test cases, a Google Forms survey, and a second focus group to assess the tool for accuracy and usability. In the survey, 130 of the 150 respondents (86.7%) believed that the recommendations provided could help them select a birth control method, and 136 of the 150 (90.7%) believed that the layout of the tool made it easy to navigate. Feedback from the second focus group suggests that the MCC tool is perceived to be accurate, usable, and useful to the college population. Participants believed that the MCC tool performs better overall than the Planned Parenthood quiz at creating a customized recommendation and better than Bedsider in overall usability. The test cases revealed further improvements that could be made to produce more accurate recommendations. In conclusion, the new MCC tool accomplishes its aim of creating a beneficial resource that assists college women with the birth control selection process.
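The abstract does not describe MCC's recommendation logic. Purely as a sketch of how a short preference quiz can rank methods, the snippet below scores a hypothetical method list against two example preference questions; the keys, methods, and weights are illustrative assumptions, not the tool's actual rules.

```python
# Illustrative attributes only; not MCC's actual quiz items or data.
# typical_efficacy values are commonly cited typical-use rates.
METHODS = {
    "hormonal IUD": {"daily_effort": 0, "hormone_free": 0, "typical_efficacy": 0.99},
    "copper IUD":   {"daily_effort": 0, "hormone_free": 1, "typical_efficacy": 0.99},
    "pill":         {"daily_effort": 1, "hormone_free": 0, "typical_efficacy": 0.91},
    "condom":       {"daily_effort": 1, "hormone_free": 1, "typical_efficacy": 0.85},
}

def recommend(prefers_low_effort, prefers_hormone_free):
    """Rank methods by how well they match the user's stated preferences."""
    def score(attrs):
        s = attrs["typical_efficacy"]
        if prefers_low_effort:
            s += 1 - attrs["daily_effort"]
        if prefers_hormone_free:
            s += attrs["hormone_free"]
        return s
    return sorted(METHODS, key=lambda name: score(METHODS[name]), reverse=True)

# Example: a user who wants a low-maintenance, hormone-free method.
# recommend(True, True) -> ["copper IUD", "hormonal IUD", "condom", "pill"]
```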

Date Created
  • 2021-05

Context-aware adaptive hybrid semantic relatedness in biomedical science

Description

Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules of many Natural Language Processing (NLP) solutions. Methods for calculating the semantic relatedness of two concepts are useful in solving problems such as relationship extraction, ontology creation, and question answering [1–6]. Several techniques exist for calculating the semantic relatedness of two concepts, utilizing different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by manually combining semantic relatedness techniques and data sources. This work attempts to eliminate the need for such manual combination when targeting new contexts or resources by proposing an automated method that finds the combination of semantic relatedness techniques and resources yielding the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context given the available algorithms and resources.
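As a sketch of the idea (not the dissertation's actual algorithm), an automated hybrid can search over weighted combinations of candidate relatedness measures and keep the weighting that best agrees with reference judgments for the current context:

```python
from itertools import product

def best_hybrid(measures, pairs, reference, correlation, step=0.25):
    """Coarse grid search for the best convex combination of measures.

    measures: dict of name -> function(concept_a, concept_b) -> float
    pairs: concept pairs from the target context
    reference: human-judged relatedness scores for those pairs
    correlation: function(list, list) -> float, e.g. a Spearman correlation
    """
    names = list(measures)
    scores = {n: [measures[n](a, b) for a, b in pairs] for n in names}
    weights_axis = [i * step for i in range(int(1 / step) + 1)]
    # Keep only weight vectors that sum to 1 (a coarse grid, for illustration).
    grid = [w for w in product(weights_axis, repeat=len(names))
            if abs(sum(w) - 1.0) < 1e-9]

    def combined(weights):
        return [sum(w * scores[n][i] for w, n in zip(weights, names))
                for i in range(len(pairs))]

    best = max(grid, key=lambda w: correlation(combined(w), reference))
    return dict(zip(names, best))
```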

Date Created
  • 2016

Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning

Description

Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database, which contains 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts, opening avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest, even while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators of entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health.
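The abstract does not enumerate the proposed linguistic measures. One simple indicator in this spirit is the lexical diversity of an entity class in the annotated data: the more distinct surface forms per annotation, the harder the class tends to be to extract. A sketch (the pair format is an assumption):

```python
from collections import defaultdict

def lexical_diversity(annotations):
    """Ratio of distinct surface forms to total annotations, per entity type.

    annotations: iterable of (entity_type, surface_form) pairs from a labeled
    corpus. Values near 1.0 indicate a highly varied vocabulary (typically
    harder to learn); values near 0.0 indicate a small, repetitive one.
    """
    forms, counts = defaultdict(set), defaultdict(int)
    for etype, text in annotations:
        forms[etype].add(text.lower())
        counts[etype] += 1
    return {t: len(forms[t]) / counts[t] for t in counts}
```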

Date Created
  • 2019

A timeline extraction approach to derive drug usage patterns in pregnant women using social media

Description

Proliferation of social media websites and discussion forums in the last decade has resulted in social media mining emerging as an effective mechanism to extract consumer patterns. Most research on social media and pharmacovigilance has concentrated on Adverse Drug Reaction (ADR) identification. Such methods employ a drug-search step followed by classification of the associated text as containing an ADR or not. Although this method works efficiently for ADR classification, when ADR evidence is spread across a user's posts over time, searching around drug mentions fails to capture such ADRs. It also fails to record additional user information that would allow an in-depth analysis of lifestyle habits and possible reasons for any medical problems.

Pre-market clinical trials for drugs generally do not include pregnant women, so the effects of drugs on pregnancy outcomes are not discovered early. This thesis presents a thorough, alternative strategy for assessing the safety profiles of drugs during pregnancy by utilizing user timelines from social media. I explore the use of a variety of state-of-the-art social media mining techniques, including rule-based and machine learning techniques, to identify pregnant women, monitor their drug usage patterns, categorize their birth outcomes, and attempt to discover associations between drugs and adverse birth outcomes.
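As an illustration of the rule-based portion (the thesis's actual patterns are not reproduced in this abstract), a gestational-age announcement can anchor a user's pregnancy timeline:

```python
import re
from datetime import timedelta

# Hypothetical pattern for illustration; matches e.g. "20 weeks pregnant".
WEEKS = re.compile(r"\b(\d{1,2})\s*weeks?\s*(?:pregnant|along)\b", re.IGNORECASE)

def estimate_window(posts):
    """Estimate a pregnancy window from a user's timestamped posts.

    posts: list of (datetime, text) pairs. Returns an estimated
    (start, due_date) from the earliest matching announcement, or None.
    """
    for when, text in sorted(posts, key=lambda p: p[0]):
        match = WEEKS.search(text)
        if match:
            start = when - timedelta(weeks=int(match.group(1)))
            return start, start + timedelta(weeks=40)  # full term is ~40 weeks
    return None
```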

The technique models user timelines as longitudinal patient networks, which provide a variety of key information about pregnancy, drug usage, and post-birth reactions. I evaluate the distinct parts of the pipeline separately, validating the usefulness of each step. Using user timelines in this fashion has produced very encouraging results, and the approach can be employed for a range of other important tasks in which users or patients must be followed over time to derive population-based measures.

Date Created
  • 2016

An effective approach to biomedical information extraction with limited training data

Description

In the current millennium, extensive use of computers and the internet has caused an exponential increase in information. Few research areas are as important as information extraction, which primarily involves extracting concepts and the relations between them from free text. Limited training data, a lack of lexicons, and a lack of relationship patterns are major factors in poor information extraction performance: training data cannot possibly contain all concepts and their synonyms, and it contains only limited examples of relationship patterns between concepts. Creating training data, lexicons, and relationship patterns is expensive, especially in the biomedical domain (including clinical notes), because of the depth of domain knowledge required of the curators.

Dictionary-based approaches to concept extraction in this domain cannot effectively overcome the complexities that arise from the descriptive nature of human languages. For example, biomedical text contains relatively more abbreviations (not all of them present in lexicons) than everyday English text. Sometimes abbreviations are modifiers of an adjective (e.g. CD4-negative) rather than nouns (and hence not usually considered named entities). Many chemical names contain numbers, commas, hyphens, and parentheses (e.g. t(3;3)(q21;q26)), which most tokenizers will split apart. Partial words are used in place of full words (e.g. up- and downregulate), and some words are highly specialized for the domain. Clinical notes contain peculiar drug names, anatomical nomenclature, and other specialized names and phrases that are not standard in everyday English or in published articles (e.g. "l shoulder inj"). State-of-the-art concept extraction systems use machine learning algorithms to overcome some of these challenges, but they need a large annotated corpus for every concept class to be extracted.

A novel natural language processing approach to minimize this limitation in concept extraction is proposed here using distributional semantics. Distributional semantics is an emerging field arising from the notion that the meaning, or semantics, of a piece of text (discourse) depends on the distribution of the elements of that discourse in relation to its surroundings. Distributional information from large unlabeled data is used to automatically create lexicons for the concepts to be tagged, clusters of contextually similar words, and thesauri of distributionally similar words. These automatically generated lexical resources are shown here to be more useful than manually created lexicons for extracting concepts from both literature and narratives. Further, machine learning features based on distributional semantics are shown to improve the accuracy of BANNER, and could be used in other machine learning systems, such as cTAKES, to improve their performance.

In addition, to simplify sentence patterns and facilitate association extraction, a new algorithm using a "shotgun" approach is proposed. The goal of sentence simplification has traditionally been to reduce the grammatical complexity of sentences while retaining the relevant information content and meaning, enabling better readability for humans and enhanced processing by parsers. Sentence simplification is shown here to improve the performance of association extraction systems for both biomedical literature and clinical notes: it improves the accuracy of protein-protein interaction extraction from the literature and also improves relationship extraction from clinical notes (such as between medical problems, tests, and treatments).

Overall, the two main contributions of this work are the application of sentence simplification to association extraction and the use of distributional semantics for concept extraction. The proposed work on concept extraction amalgamates, for the first time, two diverse research areas: distributional semantics and information extraction. This approach retains all the advantages offered by other semi-supervised machine learning systems and, unlike other proposed semi-supervised approaches, can be used on top of different basic frameworks and algorithms.
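A minimal sketch of the lexicon-bootstrapping idea, with gensim's word2vec standing in for whichever distributional model was actually used; the seed terms and parameters are illustrative:

```python
from gensim.models import Word2Vec

def build_lexicon(sentences, seeds, per_seed=50):
    """Expand a small seed list into a lexicon of distributionally similar words.

    sentences: tokenized unlabeled corpus, e.g. [["the", "patient", ...], ...]
    seeds: a few known terms for the target concept class
    """
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)
    lexicon = set(seeds)
    for seed in seeds:
        if seed in model.wv:
            # Add the seed's nearest neighbors in the embedding space.
            lexicon.update(word for word, _ in model.wv.most_similar(seed, topn=per_seed))
    return lexicon

# Example usage (illustrative): bootstrap a disease lexicon.
# lexicon = build_lexicon(tokenized_abstracts, ["carcinoma", "lymphoma"])
```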

Date Created
  • 2011