What is data anonymization and how does it work?

Training artificial intelligence (AI) models requires massive amounts of data, especially when these models are used in the healthcare industry. The output of these models must be very precise, as it directly influences the health of patients.

First and foremost, health data used for training AI models should be anonymized to protect patients’ confidentiality. Anonymization makes it possible to share health data for secondary purposes such as analysis, research, development, training, and quality control of AI algorithms. So how should data anonymization be performed so that it does not compromise patients’ privacy?

What is data anonymization?

Let’s start with the definition of ‘data anonymization’. Data anonymization is the process of removing personally identifiable information from data sets (e.g., imaging such as CT, MRI, and X-ray scans, or videos such as operating room (OR) or colonoscopy recordings), so that the people whom the data describes, or who appear in the images or videos, remain anonymous.

People are identifiable if the imaging or video data includes any reference to an identifier such as a name, an identification number, a personnel number, account data, a customer number, or any other personal data that can directly or indirectly help identify the person.

Hospitals and clinics must share only anonymized data with third parties like research organizations or healthcare software development companies. Any sensitive metadata like the patient’s name, social security number, the hospital’s name, and address should be erased. Direct identifiers must be removed or rewritten with random values.
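
As a rough illustration of removing or rewriting direct identifiers, the Python sketch below drops some fields from a metadata record and replaces another with a random value. The field names (patient_name, ssn, hospital_name, hospital_address, patient_id) are hypothetical, chosen only for this example, and do not come from any particular imaging standard.

```python
import secrets

# Hypothetical field names, chosen only for illustration.
FIELDS_TO_REMOVE = {"patient_name", "ssn", "hospital_name", "hospital_address"}
FIELDS_TO_RANDOMIZE = {"patient_id"}


def anonymize_metadata(metadata: dict) -> dict:
    """Return a copy of metadata with direct identifiers removed or randomized."""
    cleaned = {}
    for key, value in metadata.items():
        if key in FIELDS_TO_REMOVE:
            continue  # drop the identifier entirely
        if key in FIELDS_TO_RANDOMIZE:
            # Replace with a random token that has no link to the original value.
            cleaned[key] = secrets.token_hex(8)
        else:
            cleaned[key] = value
    return cleaned


record = {
    "patient_name": "Jane Doe",
    "ssn": "000-00-0000",
    "patient_id": "MRN-12345",
    "hospital_name": "Example Hospital",
    "hospital_address": "1 Example Street",
    "modality": "CT",
}
anonymized = anonymize_metadata(record)
print(anonymized)
```

A deny-list like this only protects the fields it names, so any identifier missing from the list passes through unchanged.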

GDPR and HIPAA compliance

Data anonymization, storage, and transfer are regulated by the GDPR in the EU and by HIPAA in the US. A good example of a concrete anonymization standard is the Safe Harbor method in the HIPAA Privacy Rule. It specifies 18 types of identifiers that must be removed. If this is done properly, the data is considered de-identified (anonymized) in accordance with HIPAA.

This list includes:

Names of patients, nurses, doctors
Geographic locations
All elements of dates (except the year) that are related to an individual
Telephone, cellphone, and/or fax numbers
Email addresses
IP addresses
Social Security Numbers
Medical record numbers
Health plan beneficiary numbers
Device identifiers and serial numbers
Certificate/license numbers
Account numbers
Vehicle identifiers and serial numbers, including license plates
Website URLs
Full-face photos
Biometric identifiers (e.g., fingerprints, voice prints, and retinal images)
Any unique identifying numbers, characteristics, or codes
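
Two items from the list above can be sketched in Python: reducing a date to its year, and masking email and IP addresses in free text. This is a minimal sketch under stated assumptions, not a complete Safe Harbor implementation. The function names, the placeholder tokens, and the simplified regular expressions are all assumptions made for this example, and the patterns will not catch every real-world format.

```python
import re
from datetime import date

# Simplified patterns for illustration; they do not cover every valid format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def keep_year_only(d: date) -> int:
    """Discard the day and month of a date, keeping only the year."""
    return d.year


def scrub_text(text: str) -> str:
    """Mask email addresses and IPv4 addresses found in free text."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IPV4_RE.sub("[IP]", text)
    return text


study_year = keep_year_only(date(2021, 3, 14))
note = scrub_text("Contact jane.doe@example.org, uploaded from 192.168.0.12.")
print(study_year)
print(note)
```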

Should all patients’ metadata be anonymized?

To comply with GDPR and/or HIPAA, not every field associated with imaging or video data has to be removed. Medical research often focuses on a specific gender, pathology, age group, or geography. This means that some information in the metadata can be left as is, but only if it does not identify the people in the images or videos in any way.
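
One way to keep research-relevant metadata while dropping everything else is an allow-list, sketched below in Python. The field names (sex, pathology, region, age) and the ten-year age bands are assumptions made for this example. Whether a given combination of retained fields is truly non-identifying depends on the data set and has to be assessed case by case.

```python
# Hypothetical allow-list of fields considered useful for research.
ALLOWED_FIELDS = {"sex", "pathology", "region"}


def age_group(age: int, width: int = 10) -> str:
    """Generalize an exact age into a band, e.g. 47 becomes '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def research_view(metadata: dict) -> dict:
    """Keep only allow-listed fields and replace the exact age with an age band."""
    view = {k: v for k, v in metadata.items() if k in ALLOWED_FIELDS}
    if "age" in metadata:
        view["age_group"] = age_group(metadata["age"])
    return view


record = {
    "patient_name": "Jane Doe",
    "birth_date": "1974-05-02",
    "age": 47,
    "sex": "F",
    "pathology": "polyp",
    "region": "Bavaria",
}
view = research_view(record)
print(view)
```

Unlike a deny-list, an allow-list drops any field it does not explicitly name, so newly added metadata fields are excluded by default.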

Patient’s consent vs data anonymization

Anonymized data is no longer considered personal health data, because the people in the images or videos can’t be identified. Thus, if the data is anonymized, no patient consent is required. On the other hand, if any details might reveal the patient’s identity, the patient’s consent is required.
