Identity in India: Are We Entering an Era of Public Databases of Personal Data?

One of the changes that we expect to see in the coming years is the creation of more public repositories of personal data. By personal, I mean data like Name, Date of Birth, Address, Bank Account Number, Aadhaar Number, Health records, Education records, Employment information, DNA data, credit history, biometrics, iris scans, among other things: i.e. data pertaining to the person, and including, but not limited to, sensitive personal data.

If you consider Aadhaar, it performs two functions as of now:

First, and the one that the state considers most important, is the de-duplication of identification in a database. This was the original intent of Aadhaar – to return a yes or no answer – allowing someone to get an answer to a simple question: are you who you say you are? This de-duplication is meant to help with multiple things: mass surveillance (with NATGRID), addressing the ghost beneficiary issue of government welfare (even though that is overstated and is failing), documenting attendance, among other things.
Second, and what is more important for the industry, is the function of KYC, or “Know Your Customer”, which is authentic personal identification information, and a legal requirement that the government of India has imposed on services like Banking (for the money trail) and Telecom (for surveillance). Back in December 2009, in a submission to the government of India on scaling up Mobile Payments, Nokia had estimated that the cost of KYC then was Rs 200 per customer. This would have gone up significantly by now. What the Aadhaar database has done, by forcibly collecting a large amount of personal identification data, is create a KYC database of almost everyone in India. With the UIDAI’s eKYC system, businesses can digitally acquire user data that they can trust, more quickly and cheaply than ever before, with user consent, of course.

However, the eKYC database needs to be looked at separately from the authentication function of Aadhaar, because it is essentially a Public KYC Database of personal data. There will be more: we’re heading into an era of more such databases, which intend to structurally change the way our economy functions.

To my knowledge, two other databases have already been planned:

National Public Credit Registry: This will be a public repository of lending history for individuals and companies. RBI Deputy Governor Viral Acharya spoke recently about the setting up of a public credit registry, which will track data on borrowers’ economic and financial health, bank-borrower loan level data and terms, terms of restructuring, borrower debt level default and recovery data, and maybe even transactional data like payments to utilities like power and telecom for retail customers and trade credit data for businesses.

Health Records Register(s): The National Health Policy suggests the creation of Aadhaar-linked registries – for example, of patients, health service providers, service, diseases, among other things, for “enhanced public health/big data analytics”. There’s the idea of the creation of a health exchange, which “necessitates private sector participation in developing and linking systems into a common network/grid which can be accessed by both public and private healthcare providers.” The specifics of how the government intends to approach this are not yet clear, but needless to say, electronic health records will be Aadhaar-linked (de-duplication) and shared across information exchanges, with the setting up of a National Health Information Network.

What else?

While we haven’t noticed anything specific, the signals are there, though we don’t know the extent of sharing of this data:

Education: Schools are making Aadhaar mandatory for admission, while the UGC has also made it mandatory for degrees.
Employment: Nothing significant so far, but Aadhaar is also being made mandatory by employers for linking to the Employee Provident Fund.
Travel: Airports are currently pushing for Aadhaar-linked entry.
DNA: There’s a DNA profiling bill planned.

Much of this is personal data, especially health records, DNA and credit information, but to some extent, even granular educational information and employment records. We’re in the have-access-will-monitor-and-collect-data era: apps may be copying all your SMSes once you grant them SMS permission to read OTPs. This is essentially data that you would keep private, and share on a case by case basis, and often not in its entirety. At the same time, public registries of this personal data would make sharing of data easier and more reliable.

Public registries, as Acharya pointed out with the Public Credit Registry, essentially address issues of:

Fragmentation of datasets, which leads to information asymmetry. Different organisations have different data sets, and this leads to inefficient (and expensive) decision making, and costly acquisition of data. It creates an uneven playing field, and lack of access to data, compared to larger players, becomes a barrier to entry.
Helps bring in regulatory intervention: for example, in case of ascertaining diseases in an area, based on macro information on health records, obtained from doctors. Education interventions are possible in areas where, for example, grades in a particular subject are poor.
Allows private entities to make better decisions: for example, banks in case of lending, doctors in case of health, schools in case of education.
Addresses falsification of information.

More on Acharya’s points here.

Apart from this,

It reduces costs of gathering information, with the state taking on the onus of collecting and making accessible such personal data. It means that this data will become a commodity, instead of proprietary, which reduces cost of access. For example, lending startups would now have access to base credit information, not just other data sets like bill payments, to improve their ability to give loans without collateral (flow-based lending).

Concerns remain

Mandatoriness: The data is being collected through coercion, leaving citizens with no choice but to share it. Children are being forced to share data for admission in schools, citizens for subsidies, and even for private services such as mobile connections. The government is then in a position to get vast amounts of data on each individual, and this can fundamentally change the relationship between the citizen and the state. Revocation risks destroying a person’s access to all services.

If you’re not on record, do you even exist?

Sharing can be coerced and consent is broken: Sharing of data might be seen as mandatory, but consent is broken. Just as people don’t read the terms and conditions while signing up for an app, given a false choice between going without medical treatment (or a prepaid mobile connection) and sharing their data, they tend to choose the latter.

We’re in an “all that you can collect” era of data collection.

An application may not allow you to install it if you don’t allow it access to all your contacts, even if it doesn’t need that data to run. Also, as Bhairav Acharya pointed out last year, the notice and choice model does not work.

Discriminatory pricing: Access to data allows businesses (and potentially, the state) access to a citizen’s vulnerabilities and preferences. While at one level, the preferences can lead to better customisation of services, it can also lend itself to discriminatory pricing. I’d explained the perils of predatory, dynamic pricing here, from a telecom and transportation (Uber/Ola) perspective, but consider how this data could, in the absence of regulation, impact pricing of medicines, if someone does what Martin Shkreli did, knowing there is a crisis.
Data is a toxic asset: Bruce Schneier put it best: “What all these data breaches are teaching us is that data is a toxic asset and saving it is dangerous. Saving it is dangerous because it’s highly personal”…”Saving it is dangerous because many people want it”…”It’s hard for companies to secure. For a lot of reasons, computer and network security is very difficult. Attackers have an inherent advantage over defenders, and a sufficiently skilled, funded and motivated attacker will always get in”…”And saving it is dangerous because failing to secure it is damaging. It will reduce a company’s profits, reduce its market share, hurt its stock price, cause it public embarrassment, and — in some cases — result in expensive lawsuits and occasionally, criminal charges.”

In particular, fingerprints are toxic assets. They can be duplicated from photographs.

Sharing ecosystem is broken: As we’re seeing with the eKYC ecosystem, the design of data sharing is flawed and poorly thought out: because of the scale of the data sharing, and the number of end points that get created, it is impossible to secure this data. You might manage to secure your KYC User Agencies (and even that has failed), but who manages the vendors of the KYC User Agencies? Once the data is out of your system, it is gone. Once it is in their system, it is gone. With consent broken, the data is gone.
Surveillance can be all-encompassing: Aadhaar is for de-duplication, a core function of which is authentication. NATGRID is a government framework for data collection from 10 user agencies and 21 service providers in phase 1, and subsequently as many as 950 organisations, according to Business Standard. The scale of surveillance has major implications for citizens. Data is power, and this changes the relationship between the state and the citizen. It also impacts the course of democracy, with bureaucrats and politicians susceptible to surveillance.

Data is power, not oil.

What Next?

There are clear benefits of having access to this data, but fairly significant risks of collecting, storing, transmitting and sharing that data. Is data more secure if it’s distributed, instead of having a single point of failure, in single registries? How does that impact the efficiency of the system? What checks and balances should be put into place? Should we even be collecting some types of data? Should we even allow sharing of some types of data? What should be allowed, restricted and disallowed?

As you can tell, there are no easy answers here, but there’s a need for a framework.

Author Nikhil Pahwa ( @nixxin , +NikhilPahwa , nikhil@medianama.com ) Published August 23, 2017 in Medianama