As Big Data becomes more relevant, existing grouping and clustering algorithms need to be evaluated for their effectiveness on large amounts of data. Previous work on Similarity Grouping proposes an alternative to existing data analytics tools that acts as a hybrid between fast grouping and insightful clustering. We, the SimCloud Team, proposed Distributed Similarity Group-by (DSG), a distributed implementation of Similarity Group By. Experimental results show that DSG is effective at generating meaningful clusters and has a lower runtime than K-Means, a commonly used clustering algorithm. This document presents my personal contributions to this team effort: the multi-dimensional synthetic data generator, the execution of the Increasing Scale Factor experiment, and presentations at the NCURIE Symposium and the SISAP 2019 Conference.