Matching Items (34)

CodeReco - A Semantic Java Method Recommender

Description

The increasing volume and complexity of software systems and the growing demand for programming skills call for efficient information retrieval techniques for source code documents. Programming-related information seeking is often challenging for users facing constraints in knowledge and experience. Source code documents contain multi-faceted, semi-structured text with different levels of semantic information, such as syntax, blueprints, interfaces, flow graphs, dependencies, and design patterns. Matching user queries optimally across these levels is a major challenge for information retrieval systems. Code recommendations can aid information seeking and retrieval by pro-actively sampling similar examples based on the user's context. Such recommendations can improve learning via examples or improve code quality by surfacing best practices or alternative implementations.

In this thesis, an attempt is made to support programming-related information seeking via pro-active code recommendations, and information retrieval via structural-semantic information extracted from source code. I present CodeReco, a system that recommends semantically similar Java method samples. Conventional code recommendations in integrated development environments are primarily driven by syntactic compliance and auto-completion, whereas CodeReco is driven by similarities in language use and structure-semantics. Methods are transformed into a vector space model, and a novel similarity metric is designed. Features in this vector space are categorized into the types signature, structure, concept, and language for user personalization.
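
The vector-space idea can be sketched minimally as follows. The four feature categories come from the abstract, but the token-count featurization, category weights, and example "methods" are invented for illustration; the real system derives its features from parsed Java methods.

```python
import math
from collections import Counter

# Toy featurizer: stands in for CodeReco's extraction of signature,
# structure, concept, and language features from parsed Java methods.
def featurize(method, weights):
    vec = Counter()
    for category, tokens in method.items():
        for token in tokens:
            vec[(category, token)] += weights.get(category, 1.0)
    return vec

def cosine(a, b):
    # Cosine similarity between sparse feature vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two hypothetical methods described by per-category tokens.
m1 = {"signature": ["int", "List"], "language": ["sort", "index"]}
m2 = {"signature": ["int", "List"], "language": ["sort", "swap"]}
weights = {"signature": 2.0, "language": 1.0}  # personalization knob
print(cosine(featurize(m1, weights), featurize(m2, weights)))
```

Reweighting a category (here, signature over language) is one way the per-user personalization described above could bias the similarity metric.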

Offline tests show that CodeReco recommendations cover broader programming concepts and have higher conceptual similarity with their samples. A user study was conducted in which users rated Java method recommendations that helped them complete two programming problems. 61.5% of users agreed that real-time method recommendations are helpful, and 50% reported they would reduce time spent in web searches. The empirical utility of CodeReco's similarity metric on these problems was compared with a purely language-based similarity metric (the baseline). The baseline received higher ratings from novices, arguably due to the lack of structure-semantics in their samples while seeking recommendations.

Date Created
  • 2017

Understanding, Analyzing and Predicting Online User Behavior

Description

Due to the growing popularity of the Internet and smart mobile devices, massive amounts of data are produced every day; in particular, more and more of users' online behavior and activities have been digitized. Making better use of this massive data and better understanding user behavior have become central concerns for industry as well as academia. However, the large size and unstructured format of user behavioral data, together with the heterogeneous nature of individuals, make it difficult to identify the specific behavior under study, how to distinguish it, and what results from it. Differences in user behavior arise from different causes; in my dissertation, I study three circumstances of behavior that can bring turbulent or detrimental effects, ranging from precursory culture to preparatory strategy and delusory fraudulence. I draw on a versatile analytical toolkit: econometrics and quasi-experiments, together with machine learning techniques such as text mining, sentiment analysis, and predictive analytics. This study creatively leverages the power of these combined methodologies and applies them beyond individual-level data to network data. This dissertation takes a first step toward discovering user behavior in these newly emerging contexts. My study conceptualizes theoretically and tests empirically the effect of cultural values on rating, finding that an individualist cultural background is more likely to lead to deviation and more expression in review behaviors. I also find evidence of strategic behavior in which users leverage reporting to increase the likelihood of maximizing their benefits. Moreover, the dissertation identifies features that moderate this preparation behavior. Finally, it introduces a unified and scalable framework for delusory-behavior detection that meets the current need to fully utilize multiple data sources.

Date Created
  • 2019

Generating Vocabulary Sets for Implicit Language Learning using Masked Language Modeling

Description

Globalization is driving a rapid increase in motivation for learning new languages, with online and mobile language learning applications being an extremely popular method of doing so. Many language learning applications focus almost exclusively on aiding students in acquiring vocabulary, one of the most important elements in achieving fluency in a language. A well-balanced language curriculum must include both explicit vocabulary instruction and implicit vocabulary learning through interaction with authentic language materials. However, most language learning applications focus only on explicit instruction, providing little support for implicit learning. Students require support with implicit vocabulary learning because they need enough context to guess and acquire new words. Traditional techniques aim to teach students enough vocabulary to comprehend the text, thus enabling them to acquire new words. Despite the wide variety of support for vocabulary learning offered by learning applications today, few offer guidance on how to select an optimal vocabulary study set.

This thesis proposes a novel method of student modeling which uses pre-trained masked language models to model a student's reading comprehension abilities and detect words which are required for comprehension of a text. It explores the efficacy of using pre-trained masked language models to model human reading comprehension and presents a vocabulary study set generation pipeline using this method. This pipeline creates vocabulary study sets for explicit language learning that enable comprehension while still leaving some words to be acquired implicitly. Promising results show that masked language modeling can be used to model human comprehension and that the pipeline produces reasonably sized vocabulary study sets.
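
As a rough illustration of the recoverability idea (not the thesis's actual pipeline), the sketch below asks, for each word, whether a scorer could recover it from context, and puts hard-to-recover words in the explicit study set, leaving the rest for implicit acquisition. A toy frequency score stands in for the pre-trained masked language model; the text and threshold are invented.

```python
from collections import Counter

def recoverability(word, counts, total):
    # Toy proxy for a masked-LM score: frequent words are assumed
    # easy to infer from context when masked.
    return counts[word] / total

def build_study_set(text, threshold):
    # Words the "model" cannot recover go into the explicit study set;
    # the remainder are left to be acquired implicitly while reading.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return sorted({w for w in words
                   if recoverability(w, counts, total) < threshold})

text = "the cat sat on the mat the cat saw the dog"
print(build_study_set(text, threshold=0.15))
```

A real pipeline would replace `recoverability` with the probability a pre-trained masked language model assigns to the true word at a masked position.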

Date Created
  • 2020

Diversifying Relevant Search Results from Social Media Using Community Contributed Images

Description

The availability of affordable image and video capturing devices, as well as the rapid development of social networking and content sharing websites, has led to the creation of a new type of content: social media. Any system serving an end user's search query should not only take relevant images into consideration; the results also need to be divergent to give a well-rounded description of the query. As a result, the automated optimization of image retrieval results that are also divergent becomes exceedingly important.

The main focus of this thesis is building a visual description of a landmark by choosing, from community-contributed datasets, the most diverse pictures that best describe all the details of the queried location. For this, an end-to-end framework has been built to retrieve relevant results that are also diverse. Different retrieval re-ranking and diversification strategies are evaluated to find a balance between relevance and diversification. Clustering techniques are employed to improve divergence. A unique fusion approach has been adopted to overcome the dilemma of selecting an appropriate clustering technique and the corresponding parameters for a given set of data. Extensive experiments have been conducted on the Flickr Div150Cred dataset, which covers 30 different landmark locations. The results obtained are promising when evaluated on metrics for relevance and diversification.
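
One common way to balance relevance against diversity is greedy MMR-style re-ranking, sketched below; the thesis's actual fusion of clustering strategies is more involved, and the candidate items, relevance scores, and pairwise similarities here are made up.

```python
def rerank(candidates, relevance, similarity, k=3, lam=0.7):
    """Greedily pick k items, scoring each candidate by
    lam * relevance minus (1 - lam) * max similarity to already-selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: lam * relevance[c]
                   - (1 - lam) * max((similarity[c][s] for s in selected),
                                     default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Invented example: "a" and "b" are near-duplicates; "c" is distinct.
candidates = ["a", "b", "c", "d"]
relevance = {"a": 0.9, "b": 0.85, "c": 0.5, "d": 0.4}
similarity = {
    "a": {"b": 0.95, "c": 0.1, "d": 0.1},
    "b": {"a": 0.95, "c": 0.1, "d": 0.1},
    "c": {"a": 0.1, "b": 0.1, "d": 0.2},
    "d": {"a": 0.1, "b": 0.1, "c": 0.2},
}
print(rerank(candidates, relevance, similarity))
```

With `lam=0.7`, the distinct-but-less-relevant "c" is chosen ahead of "b", which is nearly a duplicate of the top result, illustrating the relevance/diversity trade-off.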

Date Created
  • 2020

An informatics approach to establishing a sustainable public health community

Description

This work involved the analysis of a public health system and the design, development, and deployment of an enterprise informatics architecture and sustainable community methods to address problems with the current public health system. Specifically, an assessment of the Nationally Notifiable Disease Surveillance System (NNDSS) was instrumental in shaping the design of the current implementation at the Southern Nevada Health District (SNHD). The result of the system deployment at SNHD was used as a basis for projecting the practical application and benefits of an enterprise architecture. This approach has resulted in a sustainable platform that enhances the practice of public health by improving the quality and timeliness of data, the effectiveness of investigations, and reporting across the continuum.

Date Created
  • 2012

Small blob detection in medical images

Description

Recent advances in medical imaging technology have greatly enhanced imaging-based diagnosis, which requires computationally efficient and accurate algorithms to process the images (e.g., measure the objects) for quantitative assessment. In this dissertation, one type of imaging object is of interest: small blobs. Examples of small blob objects are cells in histopathology images, small breast lesions in ultrasound images, and glomeruli in kidney MR images. This problem is particularly challenging because small blobs often have an inhomogeneous intensity distribution and an indistinct boundary against the background.

This research develops a generalized four-phase system for small blob detection. The system includes (1) raw image transformation, (2) Hessian pre-segmentation, (3) feature extraction and (4) unsupervised clustering for post-pruning. First, detecting blobs in 2D images is studied, and a Hessian-based Laplacian of Gaussian (HLoG) detector is proposed. Using scale space theory as its foundation, the image is smoothed via LoG; Hessian analysis is then launched to identify the single optimal scale, based on which a pre-segmentation is conducted. Novel regional features are extracted from pre-segmented blob candidates and fed to variational Bayesian Gaussian mixture models (VBGMM) for post-pruning. Sixteen cell histology images and two hundred cell fluorescence images are tested to demonstrate the performance of HLoG. Next, as an extension, a Hessian-based Difference of Gaussians (HDoG) detector is proposed, which is capable of identifying small blobs in 3D images. Specifically, kidney glomeruli segmentation from 3D MRI (6 rats, 3 humans) is investigated. The experimental results show that HDoG has the potential to automatically detect glomeruli, enabling new measurements of renal microstructures and pathology in preclinical and clinical studies. Recognizing that computation time is a key factor impacting clinical adoption, the last phase of this research investigates data reduction techniques for VBGMM in HDoG to handle large-scale datasets. A new coreset algorithm is developed for variational Bayesian mixture models. Using the same MRI dataset, it is observed that the four-phase system with coreset-VBGMM has similar performance to using the full dataset but is about 20 times faster.
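
The Laplacian-of-Gaussian response at the heart of HLoG/HDoG can be illustrated in one dimension: a blob (here a synthetic Gaussian bump) produces an extremum of the LoG-filtered signal at its center when the filter scale matches the blob size. The signal, scale, and sizes below are all synthetic, and this one-dimensional sketch omits the Hessian analysis entirely.

```python
import math

def log_kernel(sigma, radius):
    # 1-D Laplacian of Gaussian: second derivative of a Gaussian.
    return [(x * x / sigma**4 - 1 / sigma**2)
            * math.exp(-x * x / (2 * sigma * sigma))
            for x in range(-radius, radius + 1)]

def convolve(signal, kernel):
    # Naive truncated convolution (zero outside the signal).
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(signal):
                acc += signal[idx] * k
        out.append(acc)
    return out

# Synthetic "blob": a Gaussian bump centered at position 20, width 3.
signal = [math.exp(-((i - 20) ** 2) / (2 * 3.0 ** 2)) for i in range(40)]
response = convolve(signal, log_kernel(sigma=3.0, radius=9))
center = min(range(len(response)), key=lambda i: response[i])  # LoG minimum
print(center)
```

For a bright blob on a dark background the LoG response is most negative at the blob center, which is why the detector looks for extrema of the filtered image across scales.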

Date Created
  • 2015

Modeling time series data for supervised learning

Description

Temporal data are increasingly prevalent and important in analytics. Time series (TS) data are chronological sequences of observations and an important class of temporal data. Fields such as medicine, finance, learning science, and multimedia naturally generate TS data. Each series provides a high-dimensional data vector that challenges the learning of the relevant patterns. This dissertation proposes TS representations and methods for supervised TS analysis. The approaches combine new representations that handle translations and dilations of patterns with bag-of-features strategies and tree-based ensemble learning. This provides flexibility in handling time-warped patterns in a computationally efficient way. The ensemble learners provide a classification framework that can handle high-dimensional feature spaces, multiple classes, and interactions between features. The proposed representations are useful for classification and interpretation of TS data of varying complexity.

The first contribution handles the problem of time warping with a feature-based approach. An interval selection and local feature extraction strategy is proposed to learn a bag-of-features representation. This is distinctly different from common similarity-based time warping, and it allows additional features (such as pattern location) to be easily integrated into the models. The learners account for temporal information through the recursive partitioning method. The second contribution focuses on the comprehensibility of the models. A new representation is integrated with local feature importance measures from tree-based ensembles to diagnose and interpret the time intervals that are important to the model. Multivariate time series (MTS) are especially challenging because the input consists of a collection of TS, and both features within a TS and interactions between TS can be important to models. Another contribution uses a different representation to produce computationally efficient strategies that learn a symbolic representation for MTS. Relationships between the multiple TS, as well as nominal and missing values, are handled with tree-based learners. Applications such as speech recognition, medical diagnosis, and gesture recognition are used to illustrate the methods. Experimental results show that the TS representations and methods provide better results than competitive methods on a comprehensive collection of benchmark datasets. Moreover, the proposed approaches naturally provide solutions to similarity analysis, predictive pattern discovery, and feature selection.
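
The interval-based bag-of-features idea can be illustrated with a toy sketch: slice a series into intervals and summarize each with simple local statistics. The statistics chosen here (mean, slope, variance), the fixed equal-width slicing, and the example series are illustrative stand-ins; the dissertation's actual interval selection and ensemble learning are omitted.

```python
def interval_features(series, n_intervals):
    # Equal-width intervals; each becomes one "feature bag" entry.
    width = len(series) // n_intervals
    feats = []
    for i in range(n_intervals):
        seg = series[i * width:(i + 1) * width]
        mean = sum(seg) / len(seg)
        slope = (seg[-1] - seg[0]) / (len(seg) - 1)
        var = sum((v - mean) ** 2 for v in seg) / len(seg)
        # Keeping the interval index i retains pattern-location information,
        # one of the extra features the feature-based approach can integrate.
        feats.append((i, mean, slope, var))
    return feats

series = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 4]
for f in interval_features(series, 3):
    print(f)
```

A downstream tree-based ensemble would then consume these per-interval tuples, so a pattern shifted in time changes only the interval index rather than every input dimension.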

Date Created
  • 2012

Informatics Approaches to Understand Data Sensitivity Perspectives of Patients with Behavioral Health Conditions

Description

Sensitive data sharing presents many challenges in the case of unauthorized disclosures, including stigma and discrimination for patients with behavioral health conditions (BHCs). Sensitive information (e.g. mental health) warrants consent-based sharing to achieve integrated care. As many patients with BHCs receive cross-organizational behavioral and physical health care, data sharing can improve care quality, patient-provider experiences, and outcomes, and reduce costs. Granularity in data sharing further allows patients' privacy preferences to be satisfied. However, the subjectivity in what information patients consider sensitive, and their related sharing preferences, are rarely investigated. Research, federal policies, and recommendations demand a better understanding of patient perspectives on data sensitivity and sharing.

The goal of this research is to enhance the understanding of data sensitivity and related sharing preferences of patients with BHCs. The hypotheses are that 1) there is a diversity in medical record sensitivity and sharing preferences of patients with BHCs concerning the type of information, information recipients, and purpose of sharing; and 2) there is a mismatch between the existing sensitive data categories and the desires of patients with BHCs.

A systematic literature review of methods for assessing sensitivity perspectives showed a lack of methodologies for characterizing patient perceptions of sensitivity and for assessing how those perceptions vary from clinical interpretations. Novel informatics approaches were proposed and applied to patients' medical records to assess data sensitivity and sharing perspectives and to compare them with healthcare providers' views. Findings showed variations in perceived sensitivity and sharing preferences. Patients' sensitivity perspectives often varied from standard clinical interpretations, and a comparison of patients' and providers' views on data sensitivity also found differences in perception. Patients' experiences (e.g., family history as genetic data), stigma toward category definitions or labels (drug "abuse"), and self-perceptions of information applicability (alcohol dependency) were influential factors in patients' sensitivity determinations.

This clinical informatics research innovation introduces new methods using medical records to study data sensitivity and sharing. The outcomes of this research can guide the development of effective data sharing consent processes, education materials to inform patients and providers, granular technologies segmenting electronic health data, and policies and recommendations on sensitive data sharing.

Date Created
  • 2020

Three Facets of Online Political Networks: Communities, Antagonisms, and Polarization

Description

Millions of users leave digital traces of their political engagements on social media platforms every day. Users form networks of interactions, produce textual content, and like and share each other's content. This creates an invaluable opportunity to better understand the political engagements of internet users. In this proposal, I present three algorithmic solutions to three facets of online political networks: the detection of communities, the detection of antagonisms, and the impact of certain types of accounts on political polarization. First, I develop a multi-view community detection algorithm to find politically pure communities. I find that, among the content types considered (i.e. hashtags, URLs, words), word usage complements user interactions best in accurately detecting communities.
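
A toy sketch of the multi-view intuition: fuse per-view user-pair similarities (e.g. one view from interactions, one from word usage) into a single weighted score, keep pairs above a threshold, and read communities off the resulting graph. The users, scores, view weights, threshold, and the trivial connected-components clustering are all illustrative stand-ins for the actual algorithm.

```python
def communities(sim_views, weights, threshold):
    # Union-find over user pairs whose fused similarity clears the threshold.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    pairs = {p for view in sim_views for p in view}
    for p in pairs:
        fused = sum(w * view.get(p, 0.0)
                    for view, w in zip(sim_views, weights))
        if fused >= threshold:
            parent[find(p[0])] = find(p[1])  # merge the two users
        else:
            find(p[0]); find(p[1])  # register nodes without merging
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

interactions = {("u1", "u2"): 0.9, ("u3", "u4"): 0.8, ("u2", "u3"): 0.2}
words = {("u1", "u2"): 0.7, ("u3", "u4"): 0.9, ("u2", "u3"): 0.1}
print(communities([interactions, words], [0.5, 0.5], 0.5))
```

The fusion weights play the role the abstract describes for complementing views: raising the weight of the word-usage view lets textual similarity rescue pairs that interact rarely.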

Second, I focus on detecting negative linkages between politically motivated social media users. Major social media platforms do not provide their users with built-in negative interaction options; however, many political network analysis tasks rely on not only positive but also negative linkages. Here, I present the SocLSFact framework to detect negative linkages among social media users. It utilizes three pieces of information: sentiment cues from textual interactions, positive interactions, and socially balanced triads. I evaluate the contribution of each of these three aspects to negative link detection performance on multiple tasks.
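
The structural-balance cue can be illustrated in isolation: in a balanced triad the product of the three edge signs is positive, so two known signs suggest the third. SocLSFact itself combines this cue with sentiment and positive-interaction signals in a factorization framework; the sketch below only applies the triad rule by majority vote over common neighbors, with invented users and edge signs.

```python
def infer_sign(signs, u, v):
    """Guess sign(u, v) by voting over common neighbors with known signs;
    balance theory implies sign(u, v) ~ sign(u, w) * sign(w, v)."""
    votes = 0
    neighbors = {n for edge in signs for n in edge} - {u, v}
    for w in neighbors:
        su = signs.get(frozenset((u, w)))
        sv = signs.get(frozenset((v, w)))
        if su is not None and sv is not None:
            votes += su * sv
    return 1 if votes > 0 else -1 if votes < 0 else 0

# Invented edges: +1 for friendly cues, -1 for hostile sentiment cues.
signs = {
    frozenset(("alice", "carol")): 1,
    frozenset(("bob", "carol")): -1,
    frozenset(("alice", "dave")): -1,
    frozenset(("bob", "dave")): 1,
}
print(infer_sign(signs, "alice", "bob"))
```

Both common neighbors vote the same way here, so the missing alice-bob link is inferred to be negative even though no direct negative interaction between them was observed.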

Third, I propose an experimental setup that quantifies the polarization impact of automated accounts on Twitter retweet networks. I focus on a dataset of the tragic Parkland shooting event and its aftermath. I show that when automated accounts are removed from the retweet network, network polarization decreases significantly, whereas when the same number of accounts is removed at random, the difference is not significant. I also find that the prominent predictors of engagement with automatically generated content are not very different from what previous studies identify for engaging content on social media in general. Last but not least, I identify accounts that self-disclose their automated nature in their profiles using expressions such as bot, chat-bot, or robot. I find that human engagement with self-disclosing accounts is much smaller than with non-disclosing automated accounts. This observational finding can motivate further efforts in automated account detection research to prevent their unintended impact.
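
The shape of the removal experiment can be sketched with a crude echo-chamber score: one minus the fraction of cross-group retweet edges, compared with and without suspected automated accounts. The edges, group labels, bot list, and the score itself are invented for illustration and are not the study's actual polarization measure.

```python
def polarization(edges, group, exclude=frozenset()):
    # 1 - (cross-group edge fraction): higher means more echo-chambered.
    kept = [(u, v) for u, v in edges if u not in exclude and v not in exclude]
    if not kept:
        return 0.0
    cross = sum(1 for u, v in kept if group[u] != group[v])
    return 1 - cross / len(kept)

# Invented retweet network: two camps (0 and 1), one bot per camp that
# only retweets within its own camp, and one organic cross-camp retweet.
group = {"a": 0, "b": 0, "bot1": 0, "c": 1, "d": 1, "bot2": 1}
edges = [("a", "b"), ("c", "d"), ("bot1", "a"), ("bot2", "d"), ("a", "c")]
print(polarization(edges, group))                            # with bots
print(polarization(edges, group, exclude={"bot1", "bot2"}))  # bots removed
```

Because the bots in this toy network retweet only within their own camp, dropping them lowers the score, mirroring the direction of the reported finding; the significance testing against random removals is omitted.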

Date Created
  • 2019

The role of teamwork in predicting movie earnings

Description

Intelligence analysts’ work has become progressively complex due to increasing security threats and data availability. In order to study “big” data exploration within the intelligence domain, the intelligence analyst task was abstracted and replicated in a laboratory (controlled environment). Participants used a computer interface and a movie database to determine the opening weekend gross earnings of three pre-selected movies. The data consisted of Twitter tweets and predictive models, displayed in various formats such as graphs, charts, and text. Participants used these data to make their predictions. It was expected that teams (a team is a group whose members have different specialties and work interdependently) would outperform individuals and groups; that is, teams would be significantly better at predicting opening weekend gross than individuals or groups. Results indicated that teams outperformed individuals and groups in the first prediction, underperformed in the second prediction, and performed better than individuals in the third prediction (but not better than groups). Insights and future directions are discussed.

Date Created
  • 2016