Matching Items (8)

No Right Answer: A Feasibility Study of Essay Assessment with LDA

Description

Essay scoring is a difficult and contentious business. The problem is exacerbated when there are no “right” answers for the essay prompts. This research developed a simple toolset for essay analysis by integrating a freely available Latent Dirichlet Allocation (LDA) implementation into a homegrown assessment assistant. The complexity of the essay assessment problem is demonstrated and illustrated with a representative collection of open-ended essays. This research also explores the use of “expert vectors” or “keyword essays” for maximizing the utility of LDA with small corpora. While, by itself, LDA appears insufficient for adequately scoring essays, it is quite capable of classifying responses to open-ended essay prompts and providing insight into the responses. This research also reports some trends that might be useful in scoring essays once more data is available. Some observations are made about these insights and a discussion of the use of LDA in qualitative assessment results in proposals that may assist other researchers in developing more complete essay assessment software.

Computational interdisciplinarity: a study in the history of science

Description

This dissertation focuses on creating a pluralistic approach to understanding and measuring interdisciplinarity at various scales to further the study of the evolution of knowledge and innovation. Interdisciplinarity is considered an important research component and is closely linked to higher rates of innovation. If the goal is to create more innovative research, we must understand how interdisciplinarity operates.

I begin by examining interdisciplinarity at a small scope, the research university. This study uses metadata to create co-authorship networks and examine how a change in university policies to increase interdisciplinarity can be successful. The New American University Initiative (NAUI) at Arizona State University (ASU) set forth the goal of making ASU a world hub for interdisciplinary research. This kind of interdisciplinarity is produced by a deliberate, engineered reorganization of the individuals within the university and the knowledge they contain. Using a set of social network analysis measurements, I created an algorithm to measure the changes to the co-authorship networks that resulted from increased university support for interdisciplinary research.

The second case study increases the scope of interdisciplinarity from individual universities to a single scientific discourse, the Anthropocene. The Anthropocene began as an idea about the need for a new geological epoch and underwent unsupervised interdisciplinary expansion as climate change integrated itself into the core of the discourse. In contrast to the NAUI, which was specifically engineered to increase interdisciplinarity, the Anthropocene discourse grew through unsupervised expansion. I use keyword co-occurrence networks to measure how its interdisciplinarity increases after climate change becomes a core keyword within the network and behaves as an anchor point for new disciplines to connect and join the discourse.

The scope of interdisciplinarity increases again with the final case study about the field of evolutionary medicine. Evolutionary medicine is a case of engineered interdisciplinary integration between evolutionary biology and medicine. The primary goal of evolutionary medicine is to better understand "why we get sick" through the lens of evolutionary biology. This makes it an excellent candidate for understanding large-scale interdisciplinarity. I show through multiple types of network and metadata analyses that evolutionary medicine successfully integrates the concepts of evolutionary biology into medicine.

By increasing our knowledge of interdisciplinarity at various scales and how it behaves under different initial conditions, we are better able to understand the elusive nature of innovation. Interdisciplinarity can mean different things depending on how it is defined. I show that a pluralistic approach to defining and measuring interdisciplinarity is not only appropriate but necessary if our goal is to increase interdisciplinarity, the frequency of innovations, and our understanding of the evolution of knowledge.

Date Created
  • 2019

Mining signed social networks using unsupervised learning algorithms

Description

Due to the vast resources brought by social media services, social data mining has received increasing attention in recent years. The sheer amount of user-generated data presents data scientists with both opportunities and challenges. Opportunities are presented with additional data sources. The abundant link information in social networks could provide another rich source for deriving implicit information for social data mining. However, the vast majority of existing studies focus on positive links between users, while negative links are also prevalent in real-world social networks, such as distrust relations in Epinions and foe links in Slashdot. Though recent studies show that negative links have some added value over positive links, it is difficult to employ them directly because their characteristics are distinct from those of positive interactions. Another challenge is that label information is rather limited in social media, as the labeling process requires human attention and may be very expensive. Hence, alternative criteria are needed to guide the learning process for many tasks such as feature selection and sentiment analysis.

To address the above-mentioned issues, I study two novel problems for signed social network mining: (1) unsupervised feature selection in signed social networks; and (2) unsupervised sentiment analysis with signed social networks. To tackle the first problem, I propose a novel unsupervised feature selection framework, SignedFS. In particular, I model positive and negative links simultaneously for user preference learning, and then embed the user preference learning into feature selection. To study the second problem, I incorporate explicit sentiment signals in textual terms and implicit sentiment signals from signed social networks into a coherent model, SignedSenti. Empirical experiments on real-world datasets corroborate the effectiveness of these two frameworks on the tasks of feature selection and sentiment analysis.

Date Created
  • 2017

Examining the role of linguistic flexibility in the text production process

Description

A commonly held belief among educators, researchers, and students is that high-quality texts are easier to read than low-quality texts, as they contain more engaging narrative and story-like elements. Interestingly, these assumptions have typically failed to be supported by the writing literature. Although narrative elements may sometimes be associated with high-quality writing, the majority of research suggests that higher quality writing is associated with decreased levels of text narrativity and of readability in general. One potential explanation for this conflicting evidence lies in the situational influence of text elements on writing quality. In other words, it is possible that the frequency of specific linguistic or rhetorical text elements alone is not consistently indicative of essay quality. Rather, these effects may be largely driven by individual differences in students' ability to leverage the benefits of these elements in appropriate contexts. This dissertation presents the hypothesis that writing proficiency is associated with an individual's flexible use of text properties, rather than simply the consistent use of a particular set of properties. Across three experiments, this dissertation relies on a combination of natural language processing and dynamic methodologies to examine the role of linguistic flexibility in the text production process. Overall, the studies included in this dissertation provide important insights into the role of flexibility in writing skill and develop a strong foundation on which to conduct future research and educational interventions.

Date Created
  • 2017

Investigations of the role of high-level cognitive skills in the text production process

Description

Writing is an intricate cognitive and social process that involves the production of texts for the purpose of conveying meaning to others. The importance of lower level cognitive skills and language knowledge during this text production process has been well documented in the literature. However, the role of higher level skills (e.g., metacognition, strategy use, etc.) has been less strongly emphasized. This thesis proposal examines higher level cognitive skills in the context of persuasive essay writing. Specifically, two published manuscripts are presented, which both examine the role of higher level skills in the context of writing. The first manuscript investigates the role of metacognition in the writing process by examining the accuracy and characteristics of students' self-assessments of their essays. The second manuscript takes an individual differences approach and examines whether the higher level cognitive skills commonly associated with reading comprehension are also related to performance on writing tasks. Taken together, these manuscripts point towards a strong role of higher level skills in the writing process and provide a strong foundation on which to develop future research and educational interventions.

Date Created
  • 2014

Shibboleth: an automated foreign accent identification program

Description

The speech of non-native (L2) speakers of a language contains phonological rules that differentiate them from native speakers. These phonological rules characterize or distinguish accents in an L2. The Shibboleth program creates combinatorial rule-sets to describe the phonological pattern of these accents and classifies L2 speakers by their native language. The training and classification are done in Shibboleth by support vector machines using a Gaussian radial basis kernel. In one experiment run using Shibboleth, the program correctly identified the native language (L1) of a speaker of unknown origin 42% of the time when there were six possible L1s in which to classify the speaker. This rate is significantly better than the 17% chance classification rate, χ²(1, N = 24) = 10.800, p = .0010. In a second experiment, Shibboleth was not able to determine the native language family of a speaker of unknown origin at a rate better than chance (33-44%) when the L1 was not in the transcripts used for training the language family rule-set, χ²(1, N = 18) = 1.000, p = .3173. The 318 participants for both experiments were from the Speech Accent Archive (Weinberger, 2013), and ranged in age from 17 to 80 years old. Forty percent of the speakers were female and 60% were male. The factor that most influenced correct classification was higher age of onset for the L2. A higher number of years spent living in an English-speaking country did not have the expected positive effect on classification.
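
The two chi-squared statistics in the Shibboleth abstract can be checked with a one-degree-of-freedom goodness-of-fit test on correct versus incorrect classifications. The sketch below is only an illustration: the abstract does not give raw counts, so 10 correct out of 24 (about 42%) and 8 correct out of 18 (about 44%) are assumptions inferred from the reported percentages, as are the chance rates of 1/6 and 1/3. The function name `chi_square_gof` is hypothetical.

```python
import math


def chi_square_gof(correct, total, chance):
    """Goodness-of-fit test (df = 1) of correct/incorrect counts against a chance rate."""
    observed = [correct, total - correct]
    expected = [total * chance, total * (1 - chance)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, the chi-squared upper-tail probability
    # equals erfc(sqrt(stat / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# ASSUMED counts (not stated in the abstract): 10 of 24 correct, six candidate L1s.
stat1, p1 = chi_square_gof(correct=10, total=24, chance=1 / 6)

# ASSUMED counts (not stated in the abstract): 8 of 18 correct, three language families.
stat2, p2 = chi_square_gof(correct=8, total=18, chance=1 / 3)

print(f"Experiment 1: chi2 = {stat1:.3f}, p = {p1:.4f}")
print(f"Experiment 2: chi2 = {stat2:.3f}, p = {p2:.4f}")
```

Under these assumed counts the statistics come out to 10.8 and 1.0, which agrees with the values reported in the abstract; different underlying counts would give different results.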

Date Created
  • 2013

Advancing biomedical named entity recognition with multivariate feature selection and semantically motivated features

Description

Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located within natural-language text and their semantic type is determined. This step is critical for later tasks in an information extraction pipeline, including normalization and relationship extraction. BANNER is a benchmark biomedical NER system using linear-chain conditional random fields and the rich feature set approach. A case study with BANNER locating genes and proteins in biomedical literature is described. The first corpus for disease NER adequate for use as training data is introduced, and employed in a case study of disease NER. The first corpus locating adverse drug reactions (ADRs) in user posts to a health-related social website is also described, and a system to locate and identify ADRs in social media text is created and evaluated. The rich feature set approach to creating NER feature sets is argued to be subject to diminishing returns, implying that additional improvements may require more sophisticated methods for creating the feature set. This motivates the first application of multivariate feature selection with filters and false discovery rate analysis to biomedical NER, resulting in a feature set at least 3 orders of magnitude smaller than the set created by the rich feature set approach. Finally, two novel approaches to NER by modeling the semantics of token sequences are introduced. The first method focuses on the sequence content by using language models to determine whether a sequence resembles entries in a lexicon of entity names or text from an unlabeled corpus more closely. 
The second method models the distributional semantics of token sequences, determining the similarity between a potential mention and the token sequences from the training data by analyzing the contexts where each sequence appears in a large unlabeled corpus. The second method is shown to improve the performance of BANNER on multiple data sets.

Date Created
  • 2013

An effective approach to biomedical information extraction with limited training data

Description

In the current millennium, extensive use of computers and the internet has caused an exponential increase in information. Few research areas are as important as information extraction, which primarily involves extracting concepts and the relations between them from free text. Limitations in the size of training data, lack of lexicons and lack of relationship patterns are major factors in poor information extraction performance. This is because the training data cannot possibly contain all concepts and their synonyms, and it contains only limited examples of relationship patterns between concepts. Creating training data, lexicons and relationship patterns is expensive, especially in the biomedical domain (including clinical notes) because of the depth of domain knowledge required of the curators. Dictionary-based approaches for concept extraction in this domain are not sufficient to effectively overcome the complexities that arise because of the descriptive nature of human languages. For example, there is a relatively higher number of abbreviations (not all of them present in lexicons) compared to everyday English text. Sometimes abbreviations are modifiers of an adjective (e.g. CD4-negative) rather than nouns (and hence, not usually considered named entities). There are many chemical names with numbers, commas, hyphens and parentheses (e.g. t(3;3)(q21;q26)), which will be separated by most tokenizers. In addition, partial words are used in place of full words (e.g. up- and downregulate), and some of the words used are highly specialized for the domain. Clinical notes contain peculiar drug names, anatomical nomenclature, other specialized names and phrases that are not standard in everyday English or in published articles (e.g. "l shoulder inj"). State-of-the-art concept extraction systems use machine learning algorithms to overcome some of these challenges. However, they need a large annotated corpus for every concept class that needs to be extracted.
A novel natural language processing approach to minimize this limitation in concept extraction is proposed here using distributional semantics. Distributional semantics is an emerging field arising from the notion that the meaning or semantics of a piece of text (discourse) depends on the distribution of the elements of that discourse in relation to its surroundings. Distributional information from large unlabeled data is used to automatically create lexicons for the concepts to be tagged, clusters of contextually similar words, and thesauri of distributionally similar words. These automatically generated lexical resources are shown here to be more useful than manually created lexicons for extracting concepts from both literature and narratives. Further, machine learning features based on distributional semantics are shown to improve the accuracy of BANNER, and could be used in other machine learning systems such as cTakes to improve their performance. In addition, in order to simplify the sentence patterns and facilitate association extraction, a new algorithm using a "shotgun" approach is proposed. The goal of sentence simplification has traditionally been to reduce the grammatical complexity of sentences while retaining the relevant information content and meaning to enable better readability for humans and enhanced processing by parsers. Sentence simplification is shown here to improve the performance of association extraction systems for both biomedical literature and clinical notes. It helps improve the accuracy of protein-protein interaction extraction from the literature and also improves relationship extraction from clinical notes (such as between medical problems, tests and treatments). Overall, the two main contributions of this work include the application of sentence simplification to association extraction as described above, and the use of distributional semantics for concept extraction. 
The proposed work on concept extraction amalgamates for the first time two diverse research areas: distributional semantics and information extraction. This approach offers all the advantages of other semi-supervised machine learning systems and, unlike other proposed semi-supervised approaches, can be used on top of different basic frameworks and algorithms.

Date Created
  • 2011