The classification of domain concepts in object-oriented systems
The complexity of the systems that software engineers build has continuously grown since the inception of the field. What has not changed is the engineers' mental capacity to operate on about seven distinct pieces of information at a time. The widespread use of UML has led to more abstract software design activities, however the same cannot be said for reverse engineering activities. The introduction of abstraction to reverse engineering will allow the engineer to move farther away from the details of the system, increasing his ability to see the role that domain level concepts play in the system. In this thesis, we present a technique that facilitates filtering of classes from existing systems at the source level based on their relationship to concepts in the domain via a classification method using machine learning. We showed that concepts can be identified using a machine learning classifier based on source level metrics. We developed an Eclipse plugin to assist with the process of manually classifying Java source code, and collecting metrics and classifications into a standard file format. We developed an Eclipse plugin to act as a concept identifier that visually indicates a class as a domain concept or not. We minimized the size of training sets to ensure a useful approach in practice. This allowed us to determine that a training set of 7:5 to 10% is nearly as effective as a training set representing 50% of the system. We showed that random selection is the most consistent and effective means of selecting a training set. We found that KNN is the most consistent performer among the learning algorithms tested. We determined the optimal feature set for this classification problem. We discussed two possible structures besides a one to one mapping of domain knowledge to implementation. We showed that classes representing more than one concept are simply concepts at differing levels of abstraction. We also discussed composite concepts representing a domain concept implemented by more than one class. We showed that these composite concepts are difficult to detect because the problem is NP-complete.