Data De-Identification: Using a Risk-Based Methodology

So you have a data set that you want to release, either publicly or to a known recipient. It contains personal data that you know you can’t share outside your walls. 

How do you de-identify the data and also defend the decisions you made?

To de-identify the data effectively in the context of its release, you’ll need to use a risk-based de-identification methodology.

And to be explicit: yes, re-identification risk can be quantified, using commonly agreed-to standards and assumptions. By comparing the measured risk against a risk threshold, you can then determine whether a data set is sufficiently de-identified.
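To make "quantified" concrete, here is a minimal Python sketch of one commonly used measure: the worst-case risk is taken as 1 divided by the size of the smallest group of records that share the same quasi-identifier values. The article does not prescribe this particular metric; the function name, the sample records and the threshold value are all assumptions made for illustration.

```python
from collections import Counter

def max_reid_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest group of records
    that share the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1 / min(groups.values())

# Made-up records, for illustration only.
records = [
    {"age_band": "30-39", "region": "North"},
    {"age_band": "30-39", "region": "North"},
    {"age_band": "30-39", "region": "North"},
    {"age_band": "40-49", "region": "South"},
    {"age_band": "40-49", "region": "South"},
    {"age_band": "40-49", "region": "South"},
]

THRESHOLD = 0.5  # hypothetical threshold, not a recommendation

risk = max_reid_risk(records, ["age_band", "region"])
sufficiently_deidentified = risk < THRESHOLD
print(f"risk={risk:.2f}, below threshold: {sufficiently_deidentified}")
```

Both groups here contain three records, so the measured risk is one in three, which sits below the example threshold.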

Note that the risk of re-identification changes with the release scenario, so the de-identification performed will change too. For example, you will de-identify to different levels depending on whether you are releasing the data publicly or to a trusted third party.

And by documenting all of these steps, supported by a proven methodology, you will be better able to stake a defensible position as to how and why you released the data.

What are the steps in this process?

1.  Select the Direct and Indirect (Quasi) Identifiers

It is often assumed that if you take out the Direct Identifiers (name, SSN/SIN, email address, etc.) then you have removed any uniquely identifying components of the data. However, it’s the Quasi Identifiers (age, location, profession, etc.) that still allow for the possibility of re-identification, particularly when strung together.
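The "strung together" effect is easy to demonstrate. The Python sketch below uses made-up records with the direct identifiers already removed and a hypothetical helper, unique_fraction, to count how many records are unique on a given set of columns. No single column singles anyone out, but the combination does.

```python
from collections import Counter

def unique_fraction(records, columns):
    """Fraction of records whose values in `columns` match no other record."""
    keys = [tuple(r[c] for c in columns) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)

# Made-up records; names, SSNs and emails are assumed to be gone already.
records = [
    {"age": 34, "city": "Ottawa", "profession": "nurse"},
    {"age": 34, "city": "Ottawa", "profession": "teacher"},
    {"age": 34, "city": "Toronto", "profession": "nurse"},
    {"age": 51, "city": "Toronto", "profession": "teacher"},
    {"age": 51, "city": "Ottawa", "profession": "nurse"},
    {"age": 51, "city": "Toronto", "profession": "nurse"},
]

quasi_identifiers = ["age", "city", "profession"]

# Uniqueness of each quasi-identifier on its own.
single = {q: unique_fraction(records, [q]) for q in quasi_identifiers}

# Uniqueness once all three are combined.
combined = unique_fraction(records, quasi_identifiers)

print(single)
print(combined)
```

In this toy data every individual column has zero unique records, yet every record is unique once age, city and profession are combined.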

2.  Set the Risk Threshold

The risk threshold will vary depending on who the data is being shared with and how sensitive the data is. Setting the threshold for each release is critical to avoiding a one-size-fits-all de-identification, which will almost certainly create problems and increase risk.
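As a rough sketch of what a scenario-dependent threshold could look like in Python: the recipient categories, the numeric values and the rule of halving the threshold for highly sensitive data are all placeholders invented for this example, not values taken from the article or from any standard.

```python
# Placeholder thresholds: maximum acceptable probability of re-identification.
# Hypothetical numbers chosen for illustration, not recommendations.
BASE_THRESHOLDS = {
    "public": 0.05,
    "trusted_third_party": 0.20,
}

def risk_threshold(recipient, highly_sensitive):
    """Look up the threshold for a release scenario.

    Halving the threshold for highly sensitive data is an arbitrary rule
    used only to show that sensitivity tightens the threshold.
    """
    base = BASE_THRESHOLDS[recipient]
    return base / 2 if highly_sensitive else base

for recipient in BASE_THRESHOLDS:
    for sensitive in (False, True):
        print(recipient, sensitive, risk_threshold(recipient, sensitive))
```

The point of the lookup is only that the threshold is chosen per release rather than fixed once for every data set.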

3.  Measure the Data Risk

Different types of attack need to be factored in, as they affect the probability of re-identification.

  • Might the data recipient deliberately attempt to re-identify? 
  • Is there the possibility of inadvertent re-identification (“Hey, that’s my cousin”)?
  • Does the data recipient have the motives and capacity to re-identify the data?
  • What security and privacy controls are in place?
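One way to fold the questions above into a number is sketched below in Python. It assumes that the overall risk for each type of attack is the probability that the attack happens multiplied by the risk in the data itself, and that the worst case across attacks is what gets compared to the threshold. That model, the attack names and every probability shown are assumptions for illustration; the article does not specify how the factors are combined.

```python
def overall_risk(data_risk, attack_probabilities):
    """Scale the data risk by the likelihood of each attack type
    and keep the worst case."""
    return max(p * data_risk for p in attack_probabilities.values())

# Hypothetical likelihoods for a release to a known recipient.
attack_probabilities = {
    # Driven by the recipient's motives and capacity, and by the
    # security and privacy controls in place.
    "deliberate_attempt": 0.25,
    # Chance the recipient happens to recognize someone in the data.
    "inadvertent_recognition": 0.5,
}

data_risk = 0.2  # e.g. smallest group of 5 matching records

risk = overall_risk(data_risk, attack_probabilities)
print(f"overall risk: {risk:.2f}")
```

Under this model a public release would set the probability of an attempt to 1, so the overall risk collapses to the data risk alone.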

4.  De-Identify the Data

If the measured data risk exceeds the risk threshold, de-identification is then performed using techniques such as generalization, suppression and subsampling to bring the risk of re-identification below the threshold.
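Two of these techniques can be sketched in a few lines of Python. The example generalizes exact ages into ten-year bands and then suppresses any record whose group of matching quasi-identifiers is smaller than a minimum size k. The function names, the band width, the value of k and the sample records are assumptions for illustration; subsampling is not shown.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def deidentify(records, quasi_identifiers, k):
    """Generalize ages, then suppress records in groups smaller than k."""
    generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in generalized)
    # Suppression: drop records that remain too distinctive.
    return [r for r in generalized if sizes[key(r)] >= k]

# Made-up records, for illustration only.
records = [
    {"age": 31, "region": "North"},
    {"age": 34, "region": "North"},
    {"age": 38, "region": "North"},
    {"age": 42, "region": "South"},
    {"age": 47, "region": "South"},
    {"age": 65, "region": "South"},
]

released = deidentify(records, ["age", "region"], k=2)
for row in released:
    print(row)
```

Here the single 65-year-old is still unique after generalization and is suppressed, while the remaining five records each share their age band and region with at least one other record.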

Data de-identification can seem a daunting task, but approached systematically with the right methodology and tools, it is achievable and, most importantly, defensible.

