Topic chains for determining risk of unauthorized information transfer

Document
Description
Corporations invest considerable resources to create, preserve and analyze

their data; yet while organizations are interested in protecting against

unauthorized data transfer, there lacks a comprehensive metric to discriminate

what data are at risk of leaking.

This thesis motivates the need for

Corporations invest considerable resources to create, preserve and analyze

their data; yet while organizations are interested in protecting against

unauthorized data transfer, there lacks a comprehensive metric to discriminate

what data are at risk of leaking.

This thesis motivates the need for a quantitative leakage risk metric, and

provides a risk assessment system, called Whispers, for computing it. Using

unsupervised machine learning techniques, Whispers uncovers themes in an

organization's document corpus, including previously unknown or unclassified

data. Then, by correlating the document with its authors, Whispers can

identify which data are easier to contain, and conversely which are at risk.

Using the Enron email database, Whispers constructs a social network segmented

by topic themes. This graph uncovers communication channels within the

organization. Using this social network, Whispers determines the risk of each

topic by measuring the rate at which simulated leaks are not detected. For the

Enron set, Whispers identified 18 separate topic themes between January 1999

and December 2000. The highest risk data emanated from the legal department

with a leakage risk as high as 60%.