Remember when the CEO of LifeLock posted his real Social Security number in the company's advertisements? It turns out that was not a great idea. The stunt was meant to demonstrate confidence in LifeLock's ability to prevent identity theft, but instead it resulted in the CEO's identity being stolen multiple times.

Analyzing sensitive datasets can yield many beneficial results, such as improving our understanding of prescription medications and treatments. However, the consequences of mishandling sensitive data are serious, so it is natural to wonder whether there are techniques that reduce the risk of non-compliance. Whether the proposed use of the dataset is research, analysis, or a public release, we will need to understand anonymization: modifying the dataset so that individuals can no longer be identified. Properly anonymized datasets are not subject to GDPR[1] or HIPAA[2], so in this issue we will examine some of the standard techniques for anonymizing sensitive data.

Anonymizing a dataset may seem like a relatively easy goal, but producing a dataset that is properly anonymized yet still useful is not an easy task. Let's start with an example medical dataset that we would like to anonymize before releasing it publicly. Using this example, we will try a few different anonymization techniques.

From a cursory review, the dataset obviously contains information that links each record to an individual: the patient's full name. A natural first step is to remove any obvious personally identifiable information (PII), such as the patients' names.

The updated table looks much safer to release, consisting only of anonymous patients along with their age, gender, and the medical condition being treated.
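As a concrete illustration, here is a minimal sketch of this first step in pandas. The records below are hypothetical stand-ins for the example table; the names, exact ages, and conditions are assumptions chosen to be consistent with the rest of the discussion.

```python
import pandas as pd

# Hypothetical stand-ins for the article's example table.
patients = pd.DataFrame([
    {"name": "Alice Smith", "age": 44, "gender": "F", "condition": "Coronary Artery Disease"},
    {"name": "Bob Jones",   "age": 52, "gender": "M", "condition": "Heart Attack"},
    {"name": "Carl White",  "age": 57, "gender": "M", "condition": "Heart Attack"},
    {"name": "Dana Brown",  "age": 49, "gender": "F", "condition": "Arrhythmia"},
    {"name": "Ed Davis",    "age": 63, "gender": "M", "condition": "Hepatitis C"},
    {"name": "Frank Moore", "age": 71, "gender": "M", "condition": "Lung Cancer"},
])

# Drop the directly identifying column before any further processing.
de_identified = patients.drop(columns=["name"])
print(de_identified)
```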

Unfortunately, we can still learn quite a bit from this dataset. For example, consider a news article that reports a politician being taken to the hospital. Knowing only the politician's age would reveal their precise medical condition, as each age occurs only once in this example dataset. This type of disclosure isn't just hypothetical; it happened to the governor of Massachusetts. We will have to refine our anonymization techniques a bit further to prevent these types of disclosures.
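To make the linkage concrete, here is a sketch of the attack against the hypothetical de-identified table from the previous sketch. The politician's age of 52 is an assumption used purely for illustration.

```python
import pandas as pd

# The de-identified table from the previous step (names removed, exact ages kept).
de_identified = pd.DataFrame([
    {"age": 44, "gender": "F", "condition": "Coronary Artery Disease"},
    {"age": 52, "gender": "M", "condition": "Heart Attack"},
    {"age": 57, "gender": "M", "condition": "Heart Attack"},
    {"age": 49, "gender": "F", "condition": "Arrhythmia"},
    {"age": 63, "gender": "M", "condition": "Hepatitis C"},
    {"age": 71, "gender": "M", "condition": "Lung Cancer"},
])

# Auxiliary knowledge from the news report: the politician's (assumed) age.
politician_age = 52

# Because every age is unique, the filter returns exactly one record,
# revealing the politician's condition despite the missing name.
match = de_identified[de_identified["age"] == politician_age]
print(match)
```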

k-Anonymity

We have seen that the initial step of simply suppressing the names of individuals wasn't sufficient to guarantee their privacy. In our working example, the ages included in the dataset are distinctive and allow us to uniquely identify the politician. Combinations of these non-sensitive quasi-identifiers (QI-values) can be used to circumvent privacy: approximately 87% of the United States population can be identified from only their zip code, gender, and date of birth.

The next approach we can try is to ensure that an individual's record can't be distinguished from at least k-1 other records. To achieve this, we need to generalize the QI-values, replacing them with ranges wide enough that (taking k = 2) there are at least two entries for each name-age-gender combination.

Now we have at least two entries for each of the 40-60 male, 40-60 female, and 60-80 male groups. This is an example of 2-anonymity, as no single record can be distinguished from at least one other record based on QI-values. Interestingly, the notion of k-anonymity was recently used to improve the privacy of a popular password-leak-check website, which allows you to securely check whether or not your password has been compromised.
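Here is a minimal sketch of the generalization step and the corresponding k-anonymity check, again using the hypothetical records from earlier; the 20-year age bins are an assumption chosen to match the ranges in the example.

```python
import pandas as pd

de_identified = pd.DataFrame([
    {"age": 44, "gender": "F", "condition": "Coronary Artery Disease"},
    {"age": 52, "gender": "M", "condition": "Heart Attack"},
    {"age": 57, "gender": "M", "condition": "Heart Attack"},
    {"age": 49, "gender": "F", "condition": "Arrhythmia"},
    {"age": 63, "gender": "M", "condition": "Hepatitis C"},
    {"age": 71, "gender": "M", "condition": "Lung Cancer"},
])

# Generalize exact ages into coarse ranges so records become indistinguishable.
de_identified["age_range"] = pd.cut(
    de_identified["age"], bins=[40, 60, 80], labels=["40-60", "60-80"]
)
generalized = de_identified.drop(columns=["age"])

# A table is k-anonymous if every combination of QI-values occurs at least k times.
group_sizes = generalized.groupby(["age_range", "gender"], observed=True).size()
print(group_sizes)
print(f"The generalized table is {int(group_sizes.min())}-anonymous")
```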

ℓ-Diversity

Unfortunately, this 2-anonymous dataset still leaks information. If we know that our target individual is a 40-60 year old male, then even though there are two records matching that information, both have the same medical condition: heart attack. This inference attack is possible because the dataset does not have enough diversity in the sensitive value field (medical condition), allowing us to infer the sensitive value even though we can't identify which record corresponds to the target. To address this in our example dataset, let's suppress the gender field.

Our new dataset is now both:

  • 2-anonymous: no single record can be distinguished from at least one other record based on QI-values (name, age, gender)
  • 2-diverse: every group of records sharing the same QI-values contains at least two distinct values of the sensitive attribute (medical condition), as the sketch below verifies
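Here is a minimal sketch, still using the hypothetical records, that shows the homogeneous (40-60, male) group in the 2-anonymous table and then verifies both properties once gender is suppressed.

```python
import pandas as pd

# The 2-anonymous table from the previous step.
table = pd.DataFrame([
    {"age_range": "40-60", "gender": "F", "condition": "Coronary Artery Disease"},
    {"age_range": "40-60", "gender": "M", "condition": "Heart Attack"},
    {"age_range": "40-60", "gender": "M", "condition": "Heart Attack"},
    {"age_range": "40-60", "gender": "F", "condition": "Arrhythmia"},
    {"age_range": "60-80", "gender": "M", "condition": "Hepatitis C"},
    {"age_range": "60-80", "gender": "M", "condition": "Lung Cancer"},
])

# Before suppressing gender, the (40-60, M) equivalence class is homogeneous:
# it contains only one distinct condition, so the condition leaks.
print(table.groupby(["age_range", "gender"])["condition"].nunique())

# After suppressing gender, every equivalence class has at least two records
# and at least two distinct conditions, so the table is 2-anonymous and 2-diverse.
suppressed = table.drop(columns=["gender"])
diversity = suppressed.groupby("age_range")["condition"].nunique()
print(diversity)
print(f"The suppressed table is {int(diversity.min())}-diverse")
```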

t-Closeness

While our 2-anonymous and 2-diverse dataset no longer discloses the precise medical condition of our target 40-60 year old male, let's look at the remaining medical conditions for the (*, 40-60, *) QI-set records: coronary artery disease, heart attack, and arrhythmia. Although we don't know which of these is our target's medical condition, we can infer that he is suffering from a heart condition. This matters because the dataset as a whole contains a broader set of medical conditions, so the distribution of sensitive values in the group containing our target differs from the distribution across the entire dataset.

A further refinement of ℓ-diversity considers the distribution of sensitive values within each group with the same QI-values, called an equivalence class. Ideally, the distribution of a sensitive value within an equivalence class should not differ substantially from the distribution of the sensitive value in the entire dataset. Let's look at a very generalized relationship mapping between the medical conditions in our dataset:

For these categorical data, we can define a distance measure between two elements based on their relationship to each other in the taxonomy. Let level( x, y ) represent the number of levels between two elements and their lowest common ancestor in the taxonomy, and H represent the total height of the tree. We will define the distance between two nodes x, y to be level( x, y ) / H. In our example, the height of the tree diagram is 2, and level( x, y ) for any pair of medical conditions in the (*, 40-60, *) QI-set records is 1, as Heart Disease is their lowest common ancestor. This gives a distance of level( x, y ) / H = 1/2 for each pair of medical conditions in that group.

However, the broader dataset contains Hepatitis C and Lung Cancer, which are a distance of 1 away from each of the elements in the (*, 40-60, *) QI-set records. That is, the lowest common ancestor between (Coronary Artery Disease, Heart Attack, Arrhythmia) and (Hepatitis C, Lung Cancer) is Organ Diseases, so level( x, y ) = 2 for any such pair, and the distance is level( x, y ) / H = 2/2 = 1. This difference in "closeness" between the distribution of sensitive values in the (*, 40-60, *) QI-set records and the distribution in the dataset as a whole can also leak substantial information. Ideally, the distribution of sensitive values within each QI-set will be within some small distance, the threshold t, of the distribution in the dataset as a whole - such a dataset is "t-close".
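Here is a minimal sketch of this hierarchical distance. The taxonomy is an assumed reconstruction from the diagram: the intermediate Liver Disease and Respiratory Disease nodes are hypothetical, but they keep the tree height at H = 2.

```python
# Parent links for an assumed reconstruction of the condition taxonomy; the
# "Liver Disease" and "Respiratory Disease" intermediate nodes are hypothetical.
parent = {
    "Coronary Artery Disease": "Heart Disease",
    "Heart Attack": "Heart Disease",
    "Arrhythmia": "Heart Disease",
    "Hepatitis C": "Liver Disease",
    "Lung Cancer": "Respiratory Disease",
    "Heart Disease": "Organ Diseases",
    "Liver Disease": "Organ Diseases",
    "Respiratory Disease": "Organ Diseases",
}
H = 2  # total height of the taxonomy (leaf to root)

def ancestors(node):
    """Return the chain of ancestors from the node up to the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def distance(x, y):
    """Hierarchical distance: levels climbed to the lowest common ancestor, divided by H."""
    if x == y:
        return 0.0
    y_chain = {y, *ancestors(y)}
    for level, ancestor in enumerate(ancestors(x), start=1):
        if ancestor in y_chain:
            return level / H
    raise ValueError("nodes share no common ancestor")

print(distance("Heart Attack", "Arrhythmia"))   # 0.5: LCA is Heart Disease
print(distance("Heart Attack", "Hepatitis C"))  # 1.0: LCA is Organ Diseases
```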

In our next blog post, we will cover differential privacy: an anonymization method that builds on this idea of ensuring that the difference between two distributions remains small and bounded.

[1] "The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes." GDPR, Recital 26

[2] "§164.502(d) of the Privacy Rule permits a covered entity or its business associate to create information that is not individually identifiable by following the de-identification standard and implementation specifications in §164.514(a)-(b). These provisions allow the entity to use and disclose information that neither identifies nor provides a reasonable basis to identify an individual." HHS Guidance on HIPAA
