Association based prioritization of genes

Description

Genes differ widely in their pertinence to the etiology and pathology of diseases. Thus, they can be ranked according to their disease significance on a genomic scale, which is the subject of gene prioritization. Given a set of genes known to be related to a disease, it is reasonable to use them as a basis to determine the significance of other candidate genes, which are then ranked based on the association they exhibit with the given set of known genes. Experimental and computational data of various kinds have different reliability and relevance to a disease under study. This work presents a gene prioritization method based on integrated biological networks that incorporates and models the various levels of relevance and reliability of diverse sources. The method is shown to achieve significantly higher performance than two well-known gene prioritization algorithms. Essentially no bias in performance was seen when it was applied to diseases of diverse etiology, e.g., monogenic, polygenic and cancer. The method was highly stable and robust against significant levels of noise in the data.

Biological networks are often sparse, which can impede the operation of association-based gene prioritization algorithms such as the one presented here from a computational perspective. As a potential approach to overcome this limitation, we explore the value that transcription factor binding sites can have in elucidating suitable targets. Transcription factors are needed for the expression of most genes, especially in higher organisms, and hence genes can be associated via their genetic regulatory properties. While each transcription factor recognizes specific DNA sequence patterns, such patterns are unknown for many transcription factors. Even those that are known are inconsistently reported in the literature, implying a potentially high level of inaccuracy. We developed computational methods for prediction and improvement of transcription factor binding patterns. Tests of the improvement method using synthetic patterns under various conditions showed that the method is very robust and that the patterns produced invariably converge to nearly identical series of patterns. Preliminary tests were conducted to incorporate knowledge from transcription factor binding sites into our network-based model for prioritization, with encouraging results. To validate these approaches in a disease-specific context, we built a schizophrenia-specific network based on the inferred associations and performed a comprehensive prioritization of human genes with respect to the disease. These results are expected to be validated empirically, but computational validation using known targets is very positive.
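The abstract does not specify the scoring scheme, so the following is only a minimal sketch of the general idea of association-based prioritization over an integrated network. It assumes each data source contributes weighted gene-gene edges and has a single reliability weight; the function names, source names, gene identifiers, and weights are all hypothetical and chosen for illustration, not taken from the thesis.

```python
from collections import defaultdict


def integrate_networks(networks, reliability):
    """Merge per-source edge weights into one network.

    Each source's edges are scaled by that source's (assumed) reliability
    and summed, so agreement between sources strengthens an edge.
    """
    combined = defaultdict(float)
    for source, edges in networks.items():
        for (a, b), weight in edges.items():
            key = tuple(sorted((a, b)))  # treat edges as undirected
            combined[key] += reliability[source] * weight
    return dict(combined)


def prioritize(combined, seeds, candidates):
    """Rank candidates by total edge weight connecting them to seed genes."""
    scores = {gene: 0.0 for gene in candidates}
    for (a, b), weight in combined.items():
        if a in seeds and b in scores:
            scores[b] += weight
        if b in seeds and a in scores:
            scores[a] += weight
    # Highest score first; ties broken by gene name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: G1 and G2 are known disease genes, C1..C3 candidates.
networks = {
    "ppi": {("G1", "C1"): 1.0},
    "coexpression": {("G1", "C2"): 1.0, ("G2", "C2"): 1.0, ("G2", "C3"): 0.5},
}
reliability = {"ppi": 0.9, "coexpression": 0.4}  # assumed source weights

combined = integrate_networks(networks, reliability)
ranking = prioritize(combined, seeds={"G1", "G2"}, candidates=["C1", "C2", "C3"])
for gene, score in ranking:
    print(gene, round(score, 2))
```

In this toy example C1 outranks C2 even though C2 has more links to known genes, because C1's single link comes from the source assumed to be more reliable. A real system would likely propagate scores beyond direct neighbors, which is one way sparse networks become a limiting factor.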

Contributors: Lee, Jang (Author) / Gonzalez, Graciela (Thesis advisor) / Ye, Jieping (Committee member) / Davulcu, Hasan (Committee member) / Gallitano-Mendel, Amelia (Committee member) / Arizona State University (Publisher)

Created: 2011

Advancing biomedical named entity recognition with multivariate feature selection and semantically motivated features

Description

Automating aspects of biocuration through biomedical information extraction could significantly impact biomedical research by enabling greater biocuration throughput and improving the feasibility of a wider scope. An important step in biomedical information extraction systems is named entity recognition (NER), where mentions of entities such as proteins and diseases are located within natural-language text and their semantic type is determined. This step is critical for later tasks in an information extraction pipeline, including normalization and relationship extraction. BANNER is a benchmark biomedical NER system using linear-chain conditional random fields and the rich feature set approach. A case study with BANNER locating genes and proteins in biomedical literature is described. The first corpus for disease NER adequate for use as training data is introduced, and employed in a case study of disease NER. The first corpus locating adverse drug reactions (ADRs) in user posts to a health-related social website is also described, and a system to locate and identify ADRs in social media text is created and evaluated. The rich feature set approach to creating NER feature sets is argued to be subject to diminishing returns, implying that additional improvements may require more sophisticated methods for creating the feature set. This motivates the first application of multivariate feature selection with filters and false discovery rate analysis to biomedical NER, resulting in a feature set at least 3 orders of magnitude smaller than the set created by the rich feature set approach. Finally, two novel approaches to NER by modeling the semantics of token sequences are introduced. The first method focuses on the sequence content by using language models to determine whether a sequence resembles entries in a lexicon of entity names or text from an unlabeled corpus more closely. 
The second method models the distributional semantics of token sequences, determining the similarity between a potential mention and the token sequences from the training data by analyzing the contexts where each sequence appears in a large unlabeled corpus. The second method is shown to improve the performance of BANNER on multiple data sets.
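As a rough illustration of the distributional idea behind the second method, the sketch below compares two token sequences by the cosine similarity of the words that surround them in an unlabeled corpus. This is an assumption-laden toy, not BANNER's implementation: the window size, the raw-count weighting, the helper names, and the three example sentences are all hypothetical.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count tokens within `window` positions of each occurrence of `target`.

    `target` is a tuple of tokens; `sentences` is a list of token lists.
    """
    n = len(target)
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical unlabeled corpus, already tokenized.
corpus = [
    "patients with breast cancer were treated".split(),
    "patients with lung cancer were treated".split(),
    "the blue car was parked".split(),
]

known = context_vector(("breast", "cancer"), corpus)      # sequence seen in training data
candidate = context_vector(("lung", "cancer"), corpus)    # potential mention
unrelated = context_vector(("blue", "car"), corpus)

similar_score = cosine(known, candidate)
unrelated_score = cosine(known, unrelated)
print(similar_score, unrelated_score)
```

Sequences that appear in the same contexts score high, while a sequence with no shared context words scores zero; a similarity of this kind could then be supplied to the NER model as an additional feature.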

Contributors: Leaman, James Robert (Author) / Gonzalez, Graciela (Thesis advisor) / Baral, Chitta (Thesis advisor) / Cohen, Kevin B (Committee member) / Liu, Huan (Committee member) / Ye, Jieping (Committee member) / Arizona State University (Publisher)

Created: 2013

Empowering Women in Zambia through Computational Thinking Curriculum

Description

The nonprofit organization I Am Zambia works to give supplemental education to young women in Lusaka. I Am Zambia is creating sustainable change by educating these women, who can then lift their families and communities out of poverty. The ultimate goal of this thesis was to explore and implement high-level systematic problem solving through a basic and specialized computational thinking curriculum at I Am Zambia in order to give these women an even larger stepping stone into a successful future.

To do this, a 4-week-long pilot curriculum was created, implemented, and tested through an optional class at I Am Zambia, available to women who had already graduated from the year-long I Am Zambia Academy program. A total of 18 women ages 18-24 chose to enroll in the course. There were a total of 10 lessons, taught over 20 class periods. These lessons covered four main computational thinking frameworks: introduction to computational thinking, algorithmic thinking, pseudocode, and debugging. Knowledge retention was tested through the use of a CS educational tool, QuizIt, created by the CSI Lab of the School of Computing, Informatics and Decision Systems Engineering at Arizona State University. Furthermore, pre- and post-tests were given to assess the effectiveness of the curriculum in teaching students the aforementioned concepts. Fourteen of the 18 students completed both the pre- and post-tests.

Limitations of this study and suggestions for how to improve this curriculum in order to extend it into a year-long course are also presented at the conclusion of this paper.

Contributors: Griffin, Hadley Meryl (Author) / Hsiao, Sharon (Thesis director) / Mutsumi, Nakamura (Committee member) / Arts, Media and Engineering Sch T (Contributor) / Computer Science and Engineering Program (Contributor) / Dean, W.P. Carey School of Business (Contributor) / Barrett, The Honors College (Contributor)

Created: 2019-05

STUDY GENIE: An Analysis of a Web-based Note-Sharing and Cheat Sheet Tool

Description

Research has shown that the cheat sheet preparation process helps students with performance in exams. However, results have been inconclusive in determining the most effective guiding principles in creating and using cheat sheets. The traditional method of collecting and annotating cheat sheets is time consuming and exhaustive, and fails to capture students' preparation process. This thesis examines the development and usage of a new web-based cheat sheet creation tool, Study Genie, and its effects on student performance in an introductory computer science and programming course. Results suggest that actions associated with editing and organizing cheat sheets are positively correlated with exam performance, and that there is a significant difference between the activity of high-performing and low-performing students. Through these results, Study Genie presents itself as an opportunity for mass data collection and to provide insight into the assembly process rather than just the finished product in cheat sheet creation.

Contributors: Wu, Jiaqi (Co-author) / Wen, Terry (Co-author) / Hsiao, Sharon (Thesis director) / Walker, Erin (Committee member) / Computer Science and Engineering Program (Contributor) / School of Life Sciences (Contributor) / Department of Finance (Contributor) / Barrett, The Honors College (Contributor)

Created: 2016-12

Personalized Learning in a Virtual Hands-on Lab Platform for Computer Science Education

Description

Personalized learning is gaining popularity in online computer science education because it paces the learning progress and adapts the instructional approach to each individual learner from a diverse background. Among various instructional methods in computer science education, hands-on labs have unique requirements for understanding learners' behavior and assessing learners' performance for personalization. Hands-on labs are a critical learning approach for cybersecurity education. They provide real-world complex problem scenarios and help learners develop a deeper understanding of knowledge and concepts while solving real-world problems. But there are unique challenges when using hands-on labs for cybersecurity education. Existing hands-on lab exercise materials are usually managed in a problem-centric fashion, and there is no coherent way to manage existing labs and provide productive lab exercise plans for cybersecurity learners. To address these challenges, a personalized learning platform called ThoTh Lab, specifically designed for computer science hands-on labs in a cloud environment, is established. ThoTh Lab can identify the learning style from student activities and adapt learning material accordingly. With awareness of student learning styles, instructors are able to use techniques more suitable for the specific student and hence improve the speed and quality of the learning process. ThoTh Lab also provides student performance prediction, which allows instructors to adjust the learning progress and take other measures to help students in a timely manner. A knowledge graph in the cybersecurity domain is also constructed using natural language processing (NLP) technologies including word embedding and hyperlink-based concept mining. 
This knowledge graph is then utilized during the regular learning process to build a personalized lab recommendation system by suggesting relevant labs based on students' past learning history to maximize their learning outcomes. To evaluate ThoTh Lab, several in-class experiments were carried out in cybersecurity classes for both graduate and undergraduate students at Arizona State University and data was collected over several semesters. The case studies show that, by leveraging the personalized lab platform, students tend to be more absorbed in a lab project, show more interest in the cybersecurity area, spend more effort on the project and gain enhanced learning outcomes.

Contributors: Deng, Yuli (Author) / Huang, Dijiang (Thesis advisor) / Li, Baoxin (Committee member) / Zhao, Ming (Committee member) / Hsiao, Sharon (Committee member) / Arizona State University (Publisher)

Created: 2021

Towards Addressing Key Visual Processing Challenges in Social Media Computing

Description

Visual processing in social media platforms is a key step in gathering and understanding information in the era of the Internet and big data. Online data is rich in content, but its processing faces many challenges, including varying scales for objects of interest, unreliable and/or missing labels, the inadequacy of single-modal data, and difficulty in analyzing high-dimensional data. Towards facilitating the processing and understanding of online data, this dissertation primarily focuses on three challenges that I feel are of great practical importance: handling scale differences in computer vision tasks, such as facial component detection and face retrieval; developing efficient classifiers using partially labeled data and noisy data; and employing multi-modal models and feature selection to improve multi-view data analysis. For the first challenge, I propose a scale-insensitive algorithm to detect facial landmarks quickly and accurately. For the second challenge, I propose two algorithms that can be used to learn from partially labeled data and noisy data, respectively. For the third challenge, I propose a new framework that incorporates feature selection modules into LDA models.

Contributors: Zhou, Xu (Author) / Li, Baoxin (Thesis advisor) / Hsiao, Sharon (Committee member) / Davulcu, Hasan (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created: 2018

Advancing Large-Scale Creativity through Adaptive Inspirations and Research in Context

Description

An old proverb claims that “two heads are better than one”. Crowdsourcing research and practice have taken this to heart, attempting to show that thousands of heads can be even better. This is not limited to leveraging a crowd’s knowledge, but also their creativity—the ability to generate something not only useful, but also novel. In practice, there are initiatives such as Free and Open Source Software communities developing innovative software. In research, the field of crowdsourced creativity, which attempts to design scalable support mechanisms, is blooming. However, both contexts still present many opportunities for advancement.

In this dissertation, I seek to advance both the knowledge of limitations in current technologies used in practice as well as the mechanisms that can be used for large-scale support. The overall research question I explore is: “How can we support large-scale creative collaboration in distributed online communities?” I first advance existing support techniques by evaluating the impact of active support in brainstorming performance. Furthermore, I leverage existing theoretical models of individual idea generation as well as recommender system techniques to design CrowdMuse, a novel adaptive large-scale idea generation system. CrowdMuse models users in order to adapt itself to each individual. I evaluate the system’s efficacy through two large-scale studies. I also advance knowledge of current large-scale practices by examining common communication channels under the lens of Creativity Support Tools, yielding a list of creativity bottlenecks brought about by the affordances of these channels. Finally, I connect both ends of this dissertation by deploying CrowdMuse in an Open Source online community for two weeks. I evaluate their usage of the system as well as its perceived benefits and issues compared to traditional communication tools.

This dissertation makes the following contributions to the field of large-scale creativity: 1) the design and evaluation of a first-of-its-kind adaptive brainstorming system; 2) the evaluation of the effects of active inspirations compared to simple idea exposure; 3) the development and application of a set of creativity support design heuristics to uncover creativity bottlenecks; and 4) an exploration of large-scale brainstorming systems’ usefulness to online communities.

Contributors: da Silva Girotto, Victor Augusto (Author) / Walker, Erin A (Thesis advisor) / Burleson, Winslow (Thesis advisor) / Maciejewski, Ross (Committee member) / Hsiao, Sharon (Committee member) / Bigham, Jeffrey (Committee member) / Arizona State University (Publisher)

Created: 2019

Context-aware adaptive hybrid semantic relatedness in biomedical science

Description

Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules of many Natural Language Processing (NLP) solutions. Methods for calculating the semantic relatedness of two concepts can be very useful in solutions to problems such as relationship extraction, ontology creation and question answering [1–6]. Several techniques exist for calculating the semantic relatedness of two concepts. These techniques utilize different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by combining semantic relatedness techniques and data sources manually. This work attempts to eliminate the need to manually combine semantic relatedness methods for each new context or resource by proposing an automated method that searches for the combination of semantic relatedness techniques and resources yielding the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context given the available algorithms and resources.

Contributors: Emadzadeh, Ehsan (Author) / Gonzalez, Graciela (Thesis advisor) / Greenes, Robert (Committee member) / Scotch, Matthew (Committee member) / Arizona State University (Publisher)

Created: 2016

A timeline extraction approach to derive drug usage patterns in pregnant women using social media

Description

Proliferation of social media websites and discussion forums in the last decade has resulted in social media mining emerging as an effective mechanism to extract consumer patterns. Most research on social media and pharmacovigilance has concentrated on Adverse Drug Reaction (ADR) identification. Such methods employ a step of drug search followed by classification of the associated text as containing an ADR or not. Although this method works efficiently for ADR classification, if ADR evidence is present in users' posts over time, drug mentions fail to capture such ADRs. It also fails to record additional user information that may provide an opportunity for in-depth analysis of lifestyle habits and possible reasons for any medical problems.

Pre-market clinical trials for drugs generally do not include pregnant women, and so their effects on pregnancy outcomes are not discovered early. This thesis presents a thorough, alternative strategy for assessing the safety profiles of drugs during pregnancy by utilizing user timelines from social media. I explore the use of a variety of state-of-the-art social media mining techniques, including rule-based and machine learning techniques, to identify pregnant women, monitor their drug usage patterns, categorize their birth outcomes, and attempt to discover associations between drugs and bad birth outcomes.

The technique used models user timelines as longitudinal patient networks, which provide us with a variety of key information about pregnancy, drug usage, and post-birth reactions. I evaluate the distinct parts of the pipeline separately, validating the usefulness of each step. The approach of using user timelines in this fashion has produced very encouraging results, and can be employed for a range of other important tasks where users/patients need to be followed over time to derive population-based measures.

Contributors: Chandrashekar, Pramod Bharadwaj (Author) / Davulcu, Hasan (Thesis advisor) / Gonzalez, Graciela (Thesis advisor) / Hsiao, Sharon (Committee member) / Arizona State University (Publisher)

Created: 2016

Detecting Frames and Causal Relationships in Climate Change Related Text Databases Based on Semantic Features

Description

The subliminal impact of framing of social, political and environmental issues such as climate change has been studied for decades in political science and communications research. Media framing offers an “interpretative package” for average citizens on how to make sense of climate change and its consequences for their livelihoods, how to deal with its negative impacts, and which mitigation or adaptation policies to support. A line of related work has used bag-of-words and word-level features to detect frames automatically in text. Such works face limitations since standard keyword-based features may not generalize well to accommodate surface variations in text when different keywords are used for similar concepts.

This thesis develops a unique type of textual feature that generalizes triplets extracted from text by clustering them into high-level concepts. These concepts are utilized as features to detect frames in text. Compared to unigram- and bigram-based models, classification and clustering using generalized concepts yield better discriminating features and higher classification accuracy, with a 12% relative boost (i.e., from 74% to 83% F-measure) and 0.91 clustering purity for Frame/Non-Frame detection.
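For readers unfamiliar with the two metrics quoted above, the following sketch shows their standard definitions: F-measure as the harmonic mean of precision and recall, and clustering purity as the fraction of items that belong to the majority class of their cluster. The toy cluster assignments and labels are hypothetical; only the 74% and 83% figures come from the abstract.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def purity(cluster_ids, labels):
    """Fraction of items whose label matches their cluster's majority label."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical toy clustering of six sentences into two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["frame", "frame", "non-frame", "non-frame", "non-frame", "non-frame"]
toy_purity = purity(cluster_ids, labels)  # 5 of 6 items match their cluster majority

# Relative improvement implied by the reported F-measures (74% -> 83%).
relative_gain = (0.83 - 0.74) / 0.74

print(round(toy_purity, 3), round(relative_gain, 2))
```

The relative gain works out to roughly 12%, which matches the reported boost when it is read as a relative rather than an absolute improvement (the absolute difference is 9 percentage points).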

The automatic discovery of complex causal chains among interlinked events and their participating actors has not yet been thoroughly studied. Previous studies on extracting causal relationships from text were based on laborious and incomplete hand-developed lists of explicit causal verbs, such as “causes” and “results in.” Such approaches result in limited recall because standard causal verbs may not generalize well to accommodate surface variations in texts when different keywords and phrases are used to express similar causal effects. Therefore, I present a system that utilizes generalized concepts to extract causal relationships. The proposed algorithms overcome surface variations in written expressions of causal relationships and discover the domino effects between climate events and human security. This semi-supervised approach alleviates the need for labor-intensive keyword list development and annotated datasets. Experimental evaluations by domain experts show an average precision of 82%. Qualitative assessments of causal chains show that results are consistent with the 2014 IPCC report illuminating causal mechanisms underlying the linkages between climatic stresses and social instability.

Contributors: Alashri, Saud (Author) / Davulcu, Hasan (Thesis advisor) / Desouza, Kevin C. (Committee member) / Maciejewski, Ross (Committee member) / Hsiao, Sharon (Committee member) / Arizona State University (Publisher)

Created: 2018
