The exploitation of personal data has become crucial for companies, but it cannot be done unconditionally. Pseudonymization and anonymization methods make it possible to strike a balance between unlocking new uses, security and protection.
The importance of protecting personal data (DCP) no longer needs to be demonstrated: on one side, companies need it to better know their customers and thus sustain their economic model; on the other, citizens whose privacy must be preserved in order to avoid abuses that endanger individual freedoms.
The use cases for data are numerous, whether it is profiling or geolocating customers for marketing purposes, analyzing health data to advance research, exploiting data for application development and testing, and much more. Unlocking these uses of data while strengthening its protection: that is the objective to be achieved.
Beyond the essential safeguard of the European GDPR, which everyone knows today and which sets the limits not to be crossed, companies have for several years been seeking to build consumer trust in their brand, which undoubtedly requires good control of personal data.
Many companies believe, often wrongly, that they know how to manage this protection, contenting themselves with erasing certain information from their databases. The end result is that they kill the value of the data without even strengthening its security. Yet there are desensitization methods with proven effectiveness.
Protect personal data
Before detecting and desensitizing the personal data held by the company, we must first define what we are talking about. Personal data is any information that makes it possible to identify a natural person: a name, a photo, a postal or email address, a telephone or social security number, a fingerprint, an IP address, etc.
To preserve the privacy of individuals, companies must respect several commitments, such as transparency in the processing of personal data, the possibility of intervening on this data (modifying or deleting it), but also unlinkability, which guarantees that personal data cannot be linked across domains, for example between a bank account and a medical record.
When we want to process real data as part of a professional project, such as testing an application to validate its relevance, the aim is to make this data anonymous to people who are not supposed to have access to it. A telling example is that of the health platforms used to book anti-Covid vaccination appointments. The context of the use case and its sensitive nature must therefore be taken into account in order to control the risks.
Detect personal data
The first step logically consists in mapping all the DCPs that the company stores across its databases, which are often heterogeneous. Proceeding manually via metadata quickly proves time-consuming and opens the door to approximations, raising the question of this method's reliability on large volumes of data. All the more so since confidentiality is not always guaranteed when data is processed manually.
The approach is therefore to rely on an ontology that categorizes DCPs according to defined attributes. Concretely, two methods of analysis are used: the first, regular expressions, automatically identifies the forms of specific values such as an email address or a telephone number; the second, when the first is not applicable, detects personal data by comparing it against reference databases, such as the list of first names in France or the nomenclature of known diseases. All this knowledge enriches the ontology and refines DCP detection.
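As a minimal sketch of these two detection methods, assuming hypothetical patterns and a toy reference list (a real system would load national first-name registries or disease nomenclatures):

```python
import re

# Hypothetical detection rules; pattern names and reference lists are
# illustrative assumptions, not an exhaustive catalogue.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    # French phone format, e.g. "06 12 34 56 78"
    "phone_fr": re.compile(r"\b0[1-9](?:[ .-]?\d{2}){4}\b"),
}

# Toy reference list standing in for a real first-name registry.
FIRST_NAMES = {"Alice", "Camille", "Mohamed"}

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (category, value) pairs of personal data found in free text."""
    findings = [(cat, m) for cat, rx in PATTERNS.items() for m in rx.findall(text)]
    findings += [("first_name", w) for w in text.split() if w.strip(",.") in FIRST_NAMES]
    return findings

print(detect_pii("Contact Alice at alice.d@example.com or 06 12 34 56 78"))
# [('email', 'alice.d@example.com'), ('phone_fr', '06 12 34 56 78'), ('first_name', 'Alice')]
```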
We thus obtain a list of attributes for each DCP, classified into three types: identifier (directly identifying a person), quasi-identifier (identifying a group of people) and sensitive (non-identifying but requiring protection).
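This three-way classification can be represented as a simple mapping; the attribute names below are assumptions made for the sake of the example:

```python
# Illustrative mapping of detected attributes to the three DCP types;
# attribute names are hypothetical examples.
ATTRIBUTE_TYPES = {
    "name":            "identifier",        # directly identifies a person
    "social_security": "identifier",
    "postal_code":     "quasi-identifier",  # identifies a group of people
    "birth_year":      "quasi-identifier",
    "diagnosis":       "sensitive",         # non-identifying but to protect
}

def attributes_by_type(kind: str) -> list[str]:
    """List the attributes classified under a given DCP type."""
    return [attr for attr, t in ATTRIBUTE_TYPES.items() if t == kind]

print(attributes_by_type("quasi-identifier"))  # ['postal_code', 'birth_year']
```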
Desensitization by pseudonymization or anonymization
Once detected, personal data must be “transformed” so that it can no longer be used to identify a person or reveal some of their attributes. We must nevertheless ensure that this desensitization does not overly degrade the quality of the data and therefore its usefulness. Depending on the needs of the different use cases, we can call on two main types of methodologies, then check their effectiveness.
Pseudonymization consists of replacing an identifier (such as a name) with an artificial identifier, or pseudonym. This process, which masks people's identities using a symmetric encryption system, is completely reversible as long as we hold the decryption keys, stored separately and securely. This automatic and confidential method preserves all the precision, and therefore the quality, of the data for AI use cases, for example.
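The article describes pseudonymization through symmetric encryption; the self-contained sketch below illustrates the same reversibility principle with a lookup table of random tokens, where the mapping plays the role of the decryption key and must likewise be stored separately and securely:

```python
import secrets

class Pseudonymizer:
    """Reversible pseudonymization sketch using a secret lookup table.

    The mapping plays the role the text assigns to the decryption keys:
    whoever holds it can re-identify, so it must be stored separately
    and securely.
    """

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}  # identifier -> pseudonym
        self._reverse: dict[str, str] = {}  # pseudonym -> identifier

    def pseudonymize(self, identifier: str) -> str:
        """Return a stable artificial identifier for a real one."""
        if identifier not in self._forward:
            token = "ID-" + secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, pseudonym: str) -> str:
        """Reverse the process, possible only with the stored mapping."""
        return self._reverse[pseudonym]

p = Pseudonymizer()
alias = p.pseudonymize("Jean Dupont")
assert p.reidentify(alias) == "Jean Dupont"    # fully reversible
assert p.pseudonymize("Jean Dupont") == alias  # stable across calls
```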
Anonymization, on the other hand, aims to change the content or structure of the data irreversibly, so that it is impossible to identify a person. As the quality of the data is affected, the right balance between legal constraints and practical needs must be found by consulting a DPO, the database administrator and the business lines. Some use cases, however, require strong anonymization by default, such as the publication of public data in Open Data.
Anonymize the data without emptying it of its substance
More widely used, anonymization can be carried out through several methods that must be selected, applied, evaluated and then validated, bearing in mind that continuous monitoring of regulatory and technological developments remains essential in order to adapt periodically. Careful adjustments are also required as soon as new types of data or identifiable attributes are added to the database.
Among the most common anonymization methods is generalization, which replaces a precise value with a more generic one, such as a postal address with a region or an age with an age bracket, making it possible to maintain correlations between the data. We can also apply local suppression to handle rare values in the database. The aggregation method consists in grouping data to obtain an average, admittedly less faithful, but one that fulfills its role. Finally, the random permutation method shuffles the data; it is not very robust but is useful in a testing context.
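The four methods above can be sketched on a toy dataset (records, field names and thresholds are illustrative assumptions):

```python
import random

# Toy dataset; field names and values are illustrative assumptions.
records = [
    {"name": "A", "age": 34, "city": "Lyon",  "salary": 42000},
    {"name": "B", "age": 37, "city": "Lyon",  "salary": 45000},
    {"name": "C", "age": 52, "city": "Brest", "salary": 39000},
]

def generalize_age(age: int) -> str:
    """Generalization: replace a precise age with a ten-year bracket."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def suppress_rare(values: list, min_count: int = 2) -> list:
    """Local suppression: blank out values too rare to hide in a crowd."""
    counts = {v: values.count(v) for v in values}
    return [v if counts[v] >= min_count else None for v in values]

# Aggregation: keep only an average instead of individual values.
avg_salary = sum(r["salary"] for r in records) / len(records)

# Random permutation: shuffle a column, mainly useful for test datasets.
shuffled_names = [r["name"] for r in records]
random.shuffle(shuffled_names)

print([generalize_age(r["age"]) for r in records])  # ['30-39', '30-39', '50-59']
print(suppress_rare([r["city"] for r in records]))  # ['Lyon', 'Lyon', None]
print(avg_salary)                                   # 42000.0
```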
Whatever methods are used, privacy protection models must be applied to validate the effectiveness of the anonymization. This involves, among other things, verifying in the database that each value of the quasi-identifiers is shared by a minimum number of individuals and that these values cannot be linked to sensitive attributes. Take as an example a study of the impact of pesticides on farms. To protect the identity of the operators, we would require at least 5 farms in each department (the quasi-identifier) and ensure that they are not all listed as using the same pesticide (the sensitive attribute). Constraining but essential precautions to guarantee anonymity.
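These checks correspond to the well-known k-anonymity and l-diversity models; here is a minimal sketch on hypothetical farm records matching the example above:

```python
from collections import defaultdict

# Hypothetical study records: the department is the quasi-identifier,
# the pesticide the sensitive attribute.
farms = [
    {"department": "29", "pesticide": "P1"},
    {"department": "29", "pesticide": "P2"},
    {"department": "29", "pesticide": "P1"},
    {"department": "29", "pesticide": "P3"},
    {"department": "29", "pesticide": "P2"},
    {"department": "56", "pesticide": "P1"},
]

def check_anonymity(rows, quasi, sensitive, k=5, l=2):
    """Flag quasi-identifier groups violating k-anonymity or l-diversity."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[quasi]].append(row[sensitive])
    violations = {}
    for key, values in groups.items():
        if len(values) < k:             # k-anonymity: at least k records
            violations[key] = "fewer than k records"
        elif len(set(values)) < l:      # l-diversity: varied sensitive values
            violations[key] = "sensitive attribute not diverse enough"
    return violations

print(check_anonymity(farms, "department", "pesticide"))
# {'56': 'fewer than k records'}: department 56 lists only one farm
```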
A project that is more organizational than technical
As we can see, desensitizing DCPs while retaining their usefulness is neither an easy exercise nor one to be taken lightly. While technical skills are of course a prerequisite, it is the uses sought by the business lines and the scope of action that will determine the procedure to follow in the short and long term, in consultation with the IT department, the RSSI and a DPO.
Bringing together all the stakeholders and setting up effective change management: this is where the main difficulty lies in this type of project, which concerns virtually all companies. Calling on a desensitization specialist who masters the approach end to end, from legal and organizational aspects to technological monitoring, will therefore quickly prove judicious in obtaining the approval of the CNIL.