Measuring Privacy - How to Compute the DPIA Threshold

Companies collect data from people (their users, their employees, etc). Some of this data can be used to trace (and target) individuals, and inevitably, no matter how much care is taken in protecting that data, there will be incidents and breaches, leading to that personal information being corrupted, destroyed or accessed by unauthorised (and potentially malicious) parties.

This is the main problem data protection laws (such as the GDPR in the EU or the LGPD in Brazil) aim to solve. They require companies to outline why they collect data, from whom, who can access it, and how it is protected. They ask companies to disclose that information to people who ask for it (and to delete it if someone requests that). For “high-risk” data processing activities, they even require companies to outline precisely the risks of potential incidents (data leaks, fire in paper archives, etc.), the severity of the consequences of such incidents (who is affected and how much their privacy is compromised), and to establish that they are taking appropriate measures to mitigate those risks. This last requirement is typically referred to as a “Data Protection Impact Assessment”, DPIA for short.

Conducting a DPIA for every data processing activity can be a particularly daunting task. This is why regulations typically only require a DPIA for “high-risk” data processing activities, though a company can choose to conduct one for lower-risk activities as well for further safety (for example, here you can find the criteria for high-risk activities in the GDPR). This means, however, that companies must conduct a “threshold analysis” for all their data processing activities to identify those that are in fact “high-risk” and need a proper DPIA. This practice, in turn, is commonly referred to as DPIA Threshold Analysis.

Calculating the riskiness of a data processing activity, in terms of data privacy, can be confusing though. How can one calculate how much privacy is lost if some particular set of data is compromised? Well, it turns out that there is a lot of theoretical work that answers just that, and it can be used to create heuristic models that are precise enough, efficient, and can even be automated to a large degree.

DPIA Threshold Analysis

As mentioned above, the goal of DPIA Threshold Analysis is to find out which data processing activities carry a particularly high risk. That risk can be broken down into two major components: the actual risk of something going wrong during the processing of the personal data, and the potential harm that would befall the people whose data was involved if something does go wrong.

This model results in a chart like the following (which is a screenshot of how threshold analysis is done in ECOMPLY), where risk is a rough estimate of the probability of incidents (data breaches, data being destroyed, etc.), and severity is a rough estimate of how bad it would be if an incident did happen.

The overall riskiness of a processing activity, i.e. the expected damage resulting from incidents, can now be roughly estimated as the product of the aforementioned risk and severity:

Expected Damage = Risk of Incident * Severity of Ensuing Damages

This estimate roughly corresponds to the color coding of the threshold table. The redder a cell in the table is, the higher the expected damage, i.e. the overall riskiness of the processing activity, and the more likely it is that a proper DPIA is necessary. Note that for practical and legal reasons this correspondence is not precise. For example, if the severity of incidents is high, then even when the risk is negligible you would be better off taking a deeper look into the process: although not many people might be affected by an incident, the repercussions will be severe for those who are.
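
The relationship between the formula and the color-coded table can be sketched in a few lines of Python. The three-level scale, the numeric scores and the cut-off value are illustrative assumptions, not the thresholds used by ECOMPLY or prescribed by any regulation. The only rule taken from the text is that high severity warrants a closer look even when the risk is low.

```python
# Illustrative ordinal scale; level names and scores are assumptions.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def expected_damage(risk: str, severity: str) -> int:
    """Expected Damage = Risk of Incident * Severity of Ensuing Damages."""
    return LEVELS[risk] * LEVELS[severity]


def needs_dpia(risk: str, severity: str, cutoff: int = 4) -> bool:
    """Flag a processing activity for a full DPIA.

    `cutoff` is a hypothetical threshold on the risk * severity product.
    High severity always triggers a closer look, regardless of risk.
    """
    if severity == "high":
        return True
    return expected_damage(risk, severity) >= cutoff


print(needs_dpia("low", "high"))       # flagged by the severity override
print(needs_dpia("high", "low"))       # product is 3, below the cutoff
print(needs_dpia("medium", "medium"))  # product is 4, reaches the cutoff
```

The explicit override is what makes the table asymmetric: a "low risk, high severity" cell and a "high risk, low severity" cell have the same product, but only the former is flagged.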

This model allows us to break down the problem of estimating potential privacy loss into two simpler problems: how much a process is at risk, i.e. how probable incidents are in it, and how bad the damage of such incidents would be. Each of these problems does have nuances and complexities of its own if the corresponding factors are to be calculated precisely, but it is much easier to provide rough estimates for them. This in turn opens the door to automated DPIA threshold analysis that can act as a good baseline, though corner cases would still require manual adjustments.

Estimating Risk of Incidents

While a precise analysis of the probability of incidents in data processing can be tough (can you estimate precisely the chance of getting hacked, or of a disgruntled employee leaking some database, for example?), obtaining a rough estimate for the purpose of a baseline threshold analysis is actually not that difficult. The key observation is that incidents happen at the organizational nodes (teams, departments, external vendors, etc.) that take part in processing the data, so the overall probability of an incident can be calculated by combining the risks of incident at each such node.

The risk of incident at each organizational node differs based on its own data protection practices and measures. For a rough baseline estimation, however, we can assume the same risk factor for all nodes of the same type. For example, we can assume that, on average, the risk of incident in an internal department of the organization is about 2%, and the risk of incident at an external data processor is about 5%. With that simplification, we can quickly get a baseline estimate for the risk of incident of a processing activity by simply considering the number of internal departments and the number of external vendors involved. With this level of simplicity, we can automatically fill in the baseline risk assessment simply by looking at the specified data flow of the processing activity, which is how such automatic assessment is in fact conducted in

Note that this is merely a baseline estimate that can be further improved by considering more individual risk factors for specific organizational nodes: for example a higher risk factor can be assumed for an external vendor that does not have a credible track record in data protection practices.
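
As a minimal sketch of this baseline, the per-node risks can be combined under the assumption that incidents at different nodes are independent, so the chance of at least one incident is one minus the chance that no node has one. The independence assumption, the function name and the example data flow are illustrative; only the 2% and 5% figures come from the text above.

```python
from math import prod

# Baseline per-node incident probabilities from the text (rough averages).
BASELINE_RISK = {"internal": 0.02, "external": 0.05}


def incident_risk(nodes, overrides=None):
    """Probability of at least one incident across all nodes.

    nodes:     list of (name, kind) pairs, kind being "internal" or "external".
    overrides: optional dict mapping a node name to its own probability,
               e.g. for a vendor without a credible track record.
    Assumes incidents at different nodes are independent.
    """
    overrides = overrides or {}
    p_no_incident = prod(
        1.0 - overrides.get(name, BASELINE_RISK[kind]) for name, kind in nodes
    )
    return 1.0 - p_no_incident


# Hypothetical data flow: three internal departments, two external vendors.
flow = [
    ("HR", "internal"),
    ("IT", "internal"),
    ("Finance", "internal"),
    ("Payroll vendor", "external"),
    ("Cloud host", "external"),
]

print(round(incident_risk(flow), 4))  # roughly 0.15
# Assume a higher risk for one vendor with a weak track record.
print(round(incident_risk(flow, {"Payroll vendor": 0.10}), 4))
```

For small probabilities the result is close to the plain sum of the per-node risks (here 3 * 2% + 2 * 5% = 16%), so simply adding them up is a reasonable shortcut when only a few nodes are involved.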

Estimating Severity of Incidents

Severity of data incidents can be thought of as expected loss of privacy of individuals whose personal data was involved in data processing activities affected by the incident. 

While loss of privacy might feel like an abstract, subjective concept that cannot be properly measured, there is in fact a well-developed mathematical framework that provides a precise definition of privacy loss. This framework, called differential privacy, was developed for situations where data, or at least some aggregates of it, need to be published (for example, statistical surveys or census data). For our purposes, you can think of privacy loss as how precisely an individual can be identified using data that was leaked or published. For example, if the leaked data includes zip code and age, then privacy loss depends on how many people living in that zip code are of the same age. If there is only one person of age 92 living in a particular zip code, then once that data becomes publicly available, that person’s privacy is completely lost. You can learn more about differential privacy here or here.
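
The zip code and age example can be made concrete with a small sketch. It simply counts how many records share the same leaked values and treats one over that count as the chance of singling out a person. This is a simplified counting heuristic for the intuition above, not an implementation of formal differential privacy, and the records and the function name are made up for illustration.

```python
from collections import Counter


def reidentification_risk(records, leaked_fields):
    """For each record, return 1/k, where k is the number of records
    that share the same values in the leaked fields."""
    keys = [tuple(record[field] for field in leaked_fields) for record in records]
    group_sizes = Counter(keys)
    return [1.0 / group_sizes[key] for key in keys]


# Made-up records for illustration only.
people = [
    {"zip": "80331", "age": 34},
    {"zip": "80331", "age": 34},
    {"zip": "80331", "age": 92},
    {"zip": "10115", "age": 34},
]

print(reidentification_risk(people, ["zip", "age"]))
print(reidentification_risk(people, ["age"]))
```

With both fields leaked, the 92-year-old is the only match for their combination and gets a value of 1.0, i.e. they are fully identifiable, while the two 34-year-olds in the same zip code each get 0.5. Leaking fewer fields makes the groups larger and the values smaller.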

Now that we have an objective, measurable definition of privacy loss, we can also formulate a heuristic for getting a baseline estimate of privacy loss in case of an incident. We can assign a sensitivity factor to each type of personal data collected and processed in the processing activity, based on how much it would help in precisely identifying a particular person, and on how sensitive it is considered legally. For example, GPS data or passport IDs are of high sensitivity since they can facilitate pinpointing an individual, while criminal records or information regarding participation in trade unions are of high sensitivity since they have a high potential for abuse if they are linked to individuals via other leaked data (which is commonly the case and is called a linkage attack).

Moreover, this privacy loss approximation can be further contextualized by using a scaling factor based on the relationship the data processor has with the individual. For example, data of patients requires more sensitive handling than data of employees, which in turn requires more sensitivity than data of managers.

Using this heuristic approach allows us to automatically calculate a baseline severity group by considering the types of data processed and the relationship between the data processor and the individuals whose data is being processed. Combining this with the automatic baseline estimation for risk, we can (and in fact, at ECOMPLY, do) conduct a DPIA threshold assessment fully automatically, based on other data that is documented for each processing activity. Needless to say, these estimates are again mere baselines that can be improved by taking into consideration the nuances and specifics of each processing activity.
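
A baseline severity estimate along these lines could look like the following sketch. All numeric factors, the data type names, the group boundaries and the choice to let the most sensitive data type dominate are hypothetical and chosen only for illustration. They are not ECOMPLY's actual values. Only the ordering (patients above employees above managers, and the high sensitivity of GPS data, passport IDs, criminal records and trade union membership) follows the text.

```python
# Hypothetical sensitivity factors per data type (higher = more sensitive).
SENSITIVITY = {
    "email": 1,
    "zip_code": 1,
    "age": 1,
    "gps_location": 3,
    "passport_id": 3,
    "criminal_record": 3,
    "trade_union_membership": 3,
}

# Hypothetical scaling factors for the relationship with the individual.
RELATIONSHIP_SCALE = {"manager": 1.0, "employee": 1.5, "patient": 2.0}


def severity_score(data_types, relationship):
    """Most sensitive data type involved, scaled by the relationship."""
    base = max((SENSITIVITY[t] for t in data_types), default=0)
    return base * RELATIONSHIP_SCALE[relationship]


def severity_group(score):
    """Map a score to a coarse group; the boundaries are assumptions."""
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"


score = severity_score(["email", "gps_location"], "patient")
print(score, severity_group(score))
```

Taking the maximum rather than the sum reflects the idea that a single highly sensitive field is enough to make a leak severe. A sum or a weighted combination would be an equally valid design for a baseline.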

With these heuristic approaches and rough estimations, we can now roughly measure how much we are affecting the privacy of our customers, employees, partners, etc., without needing a ton of work just to find out which of our processes need further focus and which don’t. These automated estimations, as rough as they might be, do act as proper baselines for creating a big picture: they outline where the data protection team needs to direct further attention, which allows us to act more swiftly in covering the gaps in our data protection efforts and measures, ideally resulting in an environment where privacy is maintained in an efficient and painless manner for businesses.