The increasing volume and complexity of software systems and the growing demand of programming skills calls for efficient information retrieval techniques from source code documents. Programming related information seeking is often challenging for users facing constraints in knowledge and experience. Source code documents contain multi-faceted semi-structured text, having different levels of semantic information like syntax, blueprints, interfaces, flow graphs, dependencies and design patterns. Matching user queries optimally across these levels is a major challenge for information retrieval systems. Code recommendations can help information seeking and retrieval by pro-actively sampling similar examples based on the users context. These recommendations can be beneficial in improving learning via examples or improving code quality by sampling best practices or alternative implementations.
In this thesis, an attempt is made to help programming related information seeking processes via pro-active code recommendations, and information retrieval processes by extracting structural-semantic information from source code. I present CodeReco, a system that recommends semantically similar Java method samples. Conventional code recommendations found in integrated development environments are primarily driven by syntactical compliance and auto-completion, whereas CodeReco is driven by similarities in use of language and structure-semantics. Methods are transformed to a vector space model and a novel metric of similarity is designed. Features in this vector space are categorized as belonging to types signature, structure, concept and language for user personalization.
Offline tests show that CodeReco recommendations cover broader programming concepts and have higher conceptual similarity with their samples. A user study was conducted where users rated Java method recommendations that helped them icomplete two programming problems. 61.5% users were positive that real time method recommendations are helpful, and 50% reported this would reduce time spent in web searches. The empirical utility of CodeReco’s similarity metric on those problems was compared with a purely language based similarity metric (baseline). Baseline received higher ratings from novices, arguably due to lack of structure-semantics in their samples while seeking recommendations.