What is data anonymization and how does it work?



January 26, 2023 | 6 min read

Training artificial intelligence (AI) models requires massive amounts of data, especially when these models are used in the healthcare industry. The output of these models must be very precise, as it directly influences the health of patients.

First and foremost, health data used for training AI models should be anonymized to protect patient confidentiality. Anonymization makes it possible to share health data for secondary purposes such as analysis, research, development, training, and/or quality control of AI algorithms. So how should data anonymization be performed so that it does not compromise patients’ privacy?

What is data anonymization?

Let’s start with the definition of ‘data anonymization’. Data anonymization is the process of removing personally identifiable information from data sets (e.g., imaging such as CTs, MRIs, and X-rays, or videos such as operating room (OR) or colonoscopy recordings), so that the people the data describes, or who appear in the images or videos, remain anonymous.

People are identifiable if imaging or video data includes any identifier, such as a name, an identification number, a personnel number, account data, a customer number, or any other personal data that can directly or indirectly help identify the person.

Hospitals and clinics must share only anonymized data with third parties such as research organizations or healthcare software development companies. Any sensitive metadata, such as the patient’s name and social security number or the hospital’s name and address, should be erased. Direct identifiers must be removed or overwritten with random values.
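As a rough illustration of erasing sensitive metadata and overwriting direct identifiers with random values, here is a minimal Python sketch. The field names (`patient_name`, `hospital_name`, and so on) and the `anonymize_metadata` helper are hypothetical and chosen only for this example; real imaging metadata uses its own schema, and a production pipeline would need a far more complete field list.

```python
import secrets

# Hypothetical field names for illustration; real metadata schemas differ.
FIELDS_TO_ERASE = {"hospital_name", "hospital_address"}
FIELDS_TO_RANDOMIZE = {"patient_name", "social_security_number"}


def anonymize_metadata(metadata):
    """Return a copy of `metadata` with direct identifiers erased or randomized."""
    cleaned = {}
    for key, value in metadata.items():
        if key in FIELDS_TO_ERASE:
            # Drop the field entirely.
            continue
        if key in FIELDS_TO_RANDOMIZE:
            # Overwrite the identifier with a random 16-character hex string.
            cleaned[key] = secrets.token_hex(8)
        else:
            cleaned[key] = value
    return cleaned


record = {
    "patient_name": "Jane Doe",
    "social_security_number": "000-00-0000",
    "hospital_name": "Example Clinic",
    "hospital_address": "1 Example Street",
    "modality": "CT",
}
anonymized = anonymize_metadata(record)
```

The function returns a new dictionary instead of modifying the input, so the original record is never partially overwritten if processing fails midway.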

GDPR and HIPAA compliance

Data anonymization, storage, and transfer are regulated by the GDPR in the EU and by HIPAA in the US. A good example of a concrete rule set is the Safe Harbor standard in the HIPAA Privacy Rule, which specifies 18 data elements that need to be removed or encrypted. If this is done properly, the data is considered anonymized (de-identified, in HIPAA’s terms) in accordance with HIPAA.

This list includes:

  • Names of patients, nurses, doctors
  • Geographic locations
  • All elements of dates (except the year) that are related to an individual.
  • Telephone, cellphone, and/or fax numbers
  • Email addresses
  • IP addresses
  • Social Security Numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Device identifiers and serial numbers
  • Certificate/license numbers
  • Account numbers
  • Vehicle identifiers and serial numbers, including license plates
  • Website URLs
  • Full-face photos
  • Biometric identifiers (e.g., fingerprints, voice prints, and retinal images)
  • Any unique identifying numbers, characteristics, or codes.
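A few of the elements above lend themselves to simple automated handling. The Python sketch below reduces a date to its year and redacts email addresses and IPv4 addresses from free text. The helper names and the regular expressions are assumptions made for this example; the patterns are deliberately simple, cover only three of the listed elements, and would not be sufficient for compliance on their own.

```python
import re
from datetime import date

# Deliberately simple patterns for illustration; they will miss edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def keep_year_only(d: date) -> int:
    """Drop the day and month of a date, keeping only the year."""
    return d.year


def redact_free_text(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub("[IP]", text)


year = keep_year_only(date(1984, 3, 17))
note = redact_free_text("Contact jane.doe@example.org from 192.168.0.12")
```

Identifiers such as names, full-face photos, or biometric data cannot be caught with patterns like these and need dedicated handling.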

Should all patients’ metadata be anonymized?

To comply with GDPR and/or HIPAA, not every field associated with imaging or video data has to be removed. Medical research often focuses on a specific gender, pathology, age group, or geography. This means that some metadata may be left as is, but only if it does not identify the people in the data in any way.
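One way to keep only research-relevant metadata is an allowlist: every field is dropped unless it has been explicitly approved. The Python sketch below assumes hypothetical field names (`sex`, `pathology`, `age_group`, `region`) and a hypothetical `keep_research_fields` helper; which fields are actually safe to keep depends on the data set and has to be decided case by case.

```python
# Hypothetical allowlist of non-identifying fields; adjust per study.
RESEARCH_FIELDS = frozenset({"sex", "pathology", "age_group", "region"})


def keep_research_fields(metadata, allowed=RESEARCH_FIELDS):
    """Return only the metadata fields that appear in the allowlist."""
    return {key: value for key, value in metadata.items() if key in allowed}


record = {
    "patient_name": "Jane Doe",
    "sex": "F",
    "age_group": "40-49",
    "pathology": "polyp",
    "device_serial_number": "SN-0000",
}
research_view = keep_research_fields(record)
```

An allowlist fails safe: a new or unexpected field is removed by default, whereas a list of fields to delete would let it slip through.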

Patient consent vs. data anonymization

Anonymized data is no longer considered personal health data, because the people in the images or videos can’t be identified. Thus, if the data is anonymized, no patient consent is required. On the other hand, if any details might reveal the patient’s identity, patient consent is obligatory.
