Are You There or Not? Differential Privacy Says You Shouldn't Be Able To Tell

From the healthcare example in our previous post on data anonymization, we saw that private information can leak even when all of the obvious personally identifiable information (PII) has been removed. Ideally, there would be a way to capture the broader characteristics of the dataset without revealing private information about any of the individuals it contains: any one individual could be removed from the dataset without affecting the resulting analysis. However, many of the techniques we have seen so far can over-suppress and over-generalize the data, rendering the information in the modified dataset essentially useless. We ran into this problem when anonymizing our very small healthcare dataset. After implementing 2-anonymity and 2-diversity, the information was reduced to only an age range and the medical condition. The politician could no longer be identified, but the granularity of the data was all but lost.

A different approach is to change how queries over the private database are performed, rather than modifying the database itself so that it can be disclosed publicly.

The term differential privacy refers to the difference in information between a dataset that contains a given individual and one that does not. If the descriptive statistics and analyses performed on the dataset remain essentially the same whether or not that individual is included, each individual enjoys the same privacy as if they had never been included in the dataset at all. This idea has been so impactful that the 2020 US census used differential privacy to safeguard the statistics it released. The diagram below illustrates the goal of differential privacy:

Consider the two datasets on the left of the diagram, where the first contains the individual highlighted in red while the second does not. We would like to find a method to query the complete dataset such that the result does not overly depend on any particular individual. A method satisfying that requirement would guarantee that the results of queries over the reduced dataset (with a single individual removed) and over the complete dataset are statistically indistinguishable, up to a factor controlled by a small privacy parameter ε.
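To make that guarantee precise: the standard definition of differential privacy bounds probabilities of outputs rather than the raw query results. A randomized query mechanism M is ε-differentially private if, for every pair of datasets D and D′ differing in a single individual and every set of possible outputs S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

For small ε, e^ε ≈ 1 + ε, which recovers the intuition that the two result distributions differ by at most a small amount.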

Differential privacy achieves this property by introducing an appropriate amount of randomness into the output of each query. How much noise is "appropriate" depends on how sensitive the query is to any single entry in the database. For example, in our very small medical dataset (see our previous post) each entry is 1/6 of the total dataset, so we would have to add much more noise to a query result than if each entry were only 1/1000000 of the total dataset. The degree to which a particular query q depends on any single entry is called its sensitivity. As sensitivity considers the most a query's output could differ across different reduced datasets, we will denote query sensitivity as Δq:

Δq = max |q(D) − q(D′)|

where the maximum is taken over all pairs of datasets D and D′ that differ in a single entry.
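To make sensitivity concrete, here is a minimal sketch that measures it empirically for a mean query over a tiny dataset by removing each entry in turn (the dataset and function names here are illustrative, not from the original post):

```python
def mean_query(data):
    """A simple query: the mean of the dataset."""
    return sum(data) / len(data)

def empirical_sensitivity(query, data):
    """Largest change in the query's output when any single entry is removed."""
    full = query(data)
    return max(abs(full - query(data[:i] + data[i + 1:]))
               for i in range(len(data)))

# Toy stand-in for our six-row medical dataset
ages = [34, 45, 29, 61, 50, 38]
print(empirical_sensitivity(mean_query, ages))  # ~3.63: removing the 61 shifts the mean most
```

With only six entries, removing one person moves the mean by several years; over a dataset of a million entries the same removal would barely register, which is why a small dataset needs far more noise for the same privacy guarantee.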

If we modify the query to depend not just on the entire dataset, but also on a noise function N that introduces noise proportional to the query sensitivity Δq, we have a differentially private query:

q̃(D) = q(D) + N(Δq / ε)

where N(Δq / ε) denotes random noise, commonly drawn from a Laplace distribution, with scale proportional to Δq / ε.
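A minimal sketch of this idea is the Laplace mechanism, shown below with a counting query over a toy dataset (the dataset, function names, and parameter choices are illustrative assumptions, not from the original post):

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Answer a query with differential privacy by adding Laplace noise
    whose scale is sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Illustrative counting query: "how many patients are 40 or older?"
# A count changes by at most 1 when one person is added or removed,
# so its sensitivity is 1.
ages = [34, 45, 29, 61, 50, 38]               # toy stand-in for the medical dataset
true_count = sum(1 for a in ages if a >= 40)  # 3
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(noisy_count)  # a randomized answer near 3; smaller epsilon means more noise
```

Each call returns a slightly different answer, and that randomness is exactly what prevents an observer from telling whether any one person's record contributed to the result.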

That is, differentially private queries protect the privacy of each individual in a dataset because the result of a query over the entire dataset can’t be meaningfully distinguished from the result of the same query over a dataset that doesn’t contain that individual at all.

Closing Thoughts

When we think of privacy regulations, we tend to think of the Health Insurance Portability and Accountability Act of 1996 (HIPAA). While HIPAA only applies to covered entities, which include health plans, health care clearinghouses, and health care providers who transmit health information electronically, other privacy laws and standards are swiftly being adopted. NIST, for instance, recently released its first Privacy Framework, which aims to help organizations protect individual privacy by identifying and managing privacy risk while developing products and services. ISO has also recently released standards related to the preservation of privacy: ISO 27701 and ISO/TS 25237. ISO 27701 describes requirements for building a Privacy Information Management System (PIMS), while ISO/TS 25237 describes requirements for the use of pseudonymization techniques for health data.

When dealing with datasets that contain PII, it is important to understand which privacy regulations govern their use, and which techniques are available to ensure that any analysis or disclosure of the data is properly anonymized.