We need to revise our approach to anonymised data

Data is a complex, dynamic issue. We often like to sort it into large buckets. The Personal Data Protection Bill does this by creating five broad categories: personal data, sensitive personal data, critical personal data, non-personal data, and anonymised data. While it is nice to have classifications that help us make sense of how data operates, it is important to remember that the real world does not operate this way.

For instance, think about surnames. If you had a list of Indian surnames in a dataset, they alone would not be enough to identify people. So, you would put that dataset under the ambit of personal data. But since this is India, and context matters, surnames can tell you a lot more about a person, such as their caste. As a result, surnames alone might not identify individuals, but they can identify whole communities. That makes surnames more sensitive than ordinary personal data, so you could make a case for including them in the sensitive personal data category.

And that is the larger point here: data is dynamic, because it can be used alone or combined with other data in varying contexts. As a result, it is not always easy to pin it down to broad categories.

This is something that is often not appreciated enough in policy making, especially in the case of anonymised or non-personal data. Before I go on, let me explain the difference between the two, as there is a tendency to use them interchangeably.

Anonymised data refers to a dataset where the immediate identifiers (such as names or phone numbers) are stripped from the rest of the dataset. Non-personal data, on the other hand, is a broader, negative term. Anything that is not personal data can technically come under this umbrella: think of anything from traffic signal data to a company’s growth projections for the next decade.

Not only is there a tendency to use the terms interchangeably, but there is also a false underlying belief that data, once anonymised, cannot be deanonymised. The assumption is false because data is essentially like puzzle pieces. Even if each piece is anonymised, having enough anonymised data can lead to deanonymisation and the identification of individuals or even whole communities. For instance, if a malicious hacker has access to a history of your location through Google Maps, and can combine that with a history of your payments from your bank account (or Google Pay), they do not need your name to identify you.
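To make the puzzle-piece idea concrete, here is a minimal sketch in Python of how two "anonymised" datasets could be linked. Everything in it is an assumption made for illustration: the data is made up, the pseudonyms, account labels and the `link` function are hypothetical, and matching on exact hour and place is far cruder than what a real attacker would use.

```python
# Hypothetical "anonymised" location trace: (pseudonym, hour of day, place).
location_pings = [
    ("user_A", 9, "cafe"),
    ("user_A", 13, "pharmacy"),
    ("user_A", 19, "bookshop"),
    ("user_B", 9, "cafe"),
    ("user_B", 13, "gym"),
    ("user_B", 19, "grocery"),
]

# Hypothetical "anonymised" payment history: (account label, hour of day, place).
payments = [
    ("acct_17", 9, "cafe"),
    ("acct_17", 13, "pharmacy"),
    ("acct_17", 19, "bookshop"),
    ("acct_42", 9, "cafe"),
    ("acct_42", 19, "grocery"),
]


def link(location_pings, payments):
    """For each payment account, list the location pseudonyms that were
    present at every (hour, place) where that account made a payment."""
    # Group the location trace by pseudonym.
    visits = {}
    for pseudonym, hour, place in location_pings:
        visits.setdefault(pseudonym, set()).add((hour, place))

    matches = {}
    for account in sorted({a for a, _, _ in payments}):
        spent = {(h, p) for a, h, p in payments if a == account}
        # A pseudonym is a candidate only if its trace covers all payments.
        matches[account] = [u for u, seen in visits.items() if spent <= seen]
    return matches


if __name__ == "__main__":
    print(link(location_pings, payments))
    # A single shared data point is not enough to tell the two traces apart.
    print(link(location_pings, [("acct_x", 9, "cafe")]))
```

Neither dataset contains a name, yet each payment account ends up tied to exactly one location trace. With only a single shared data point (the 9 am cafe visit), both traces remain candidates, which is why the amount of data available matters so much.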

In the Indian policy making context, there does not seem to be a realisation that anonymisation can be reversed once you have enough data. The recently introduced Personal Data Protection Bill seems to rest on this assumption.

Through Section 91, it allows “the central government to direct any data fiduciary or data processor to provide any personal data anonymised or other non-personal data to enable better targeting of delivery of services or formulation of evidence-based policies by the Central government”.

There are two major concerns here. Firstly, Section 91 gives the Government power to gather and process non-personal data. In addition, multiple other sections ensure that this power is largely unchecked. For instance, Section 35 gives the Government the power to exempt itself from the constraints of the bill. Also, Section 42 ensures that, instead of being independent, the Data Protection Authority is made up of members selected by the Government. Such unchecked power to collect and process data is problematic, especially because it could give the Government the ability to use this data to identify minorities.

Secondly, it just does not make sense to address non-personal data under a personal data protection bill. Even before this version of the bill came out, there had been multiple calls to appoint a separate committee to come up with recommendations in this space. It would have been ideal to have a different bill that looks at non-personal data. Because the subject is so vast, it does not make sense for it to be governed by a few lines in Section 91 for the foreseeable future.

So the bottom line is that anonymised data and non-personal data can be used to identify people. The Government having unchecked powers to collect and process these kinds of data could lead to severely negative consequences. It would be better, instead, to rethink the approach to non-personal and anonymised data and to have a separate committee and regulation for this.

This article was first published in Deccan Chronicle.

(The writer is a technology policy analyst at the Takshashila Institution. Views are personal)