Corporations invest considerable resources to create, preserve and analyze
their data; yet while organizations are interested in protecting against
unauthorized data transfer, there lacks a comprehensive metric to discriminate
what data are at risk of leaking.
This thesis motivates the need for a quantitative leakage risk metric, and
provides a risk assessment system, called Whispers, for computing it. Using
unsupervised machine learning techniques, Whispers uncovers themes in an
organization's document corpus, including previously unknown or unclassified
data. Then, by correlating the document with its authors, Whispers can
identify which data are easier to contain, and conversely which are at risk.
Using the Enron email database, Whispers constructs a social network segmented
by topic themes. This graph uncovers communication channels within the
organization. Using this social network, Whispers determines the risk of each
topic by measuring the rate at which simulated leaks are not detected. For the
Enron set, Whispers identified 18 separate topic themes between January 1999
and December 2000. The highest risk data emanated from the legal department
with a leakage risk as high as 60%.