2anonymity

Accessible data anonymization using a naive k-anonymity algorithm

Over one weekend, we developed a web tool for anonymizing data. It was our response to one of the hackathon's problem statements, which highlighted that anonymization is often a hard topic to approach for researchers working with sensitive data.

Working with my team, I developed an algorithm for finding potentially identifying values and determining the best way to handle them.
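As a rough illustration of what "potentially identifying" means here: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The sketch below flags the rows that fall short of that. It is a minimal example and not the project's actual code; the function name `find_identifying_rows`, the column names, and the sample rows are all made up for illustration.

```python
from collections import Counter


def find_identifying_rows(rows, quasi_identifiers, k):
    """Return indices of rows whose quasi-identifier combination
    is shared by fewer than k rows (hypothetical helper)."""
    # Build one key per row from the chosen quasi-identifier columns.
    keys = [tuple(row[col] for col in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


# Made-up sample data, purely for illustration.
rows = [
    {"age": 30, "city": "Leeds"},
    {"age": 30, "city": "Leeds"},
    {"age": 31, "city": "Leeds"},
    {"age": 60, "city": "York"},
]

risky = find_identifying_rows(rows, ["age", "city"], k=2)
print(risky)  # [2, 3]
```

Rows flagged this way are the ones that would need to be suppressed or generalized before the data could be called k-anonymous.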

Under the k-anonymity model, sensitive data can be handled in one of two ways:

Data Suppression: The data is removed, or ‘redacted’. This should be done for direct personal identifiers such as names, email addresses, and phone numbers. These can often identify a person on their own, and there is usually little value in generalizing them.
Data Generalization: The data is replaced with a less specific value. This should be done for quasi-identifiers such as weight, height, and age. It is often referred to as ‘binning’, where values are placed into wider ranges, like grouping years into decades. Under our algorithm, bin ranges are variable and are calculated dynamically according to the user's anonymization preferences. This approach is particularly useful when an attribute is not uniformly distributed, as a fixed bin width will often over-generalize the denser areas of the distribution while leaving values in the sparser areas easy to identify. For example, if the ages of study participants are normally distributed around 30 years, then knowing someone is 60 is much more identifying than knowing they are 30. Our approach deals with this by keeping the number of entries in each bin constant, rather than keeping the width of the bin constant.
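The idea of fixing the number of entries per bin, rather than the bin width, can be sketched as follows. This is a simplified illustration and not the code behind the web app. It assumes the user's preference is expressed as a minimum bin size `k`, keeps equal values together so that ranges never overlap, and folds a short tail into the last bin, so bins hold at least `k` entries rather than exactly `k`. The names `equal_frequency_bins` and `generalize`, and the sample ages, are hypothetical.

```python
def equal_frequency_bins(values, k):
    """Split values into (low, high) ranges that each hold at least k entries."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    if n < k:
        raise ValueError("fewer than k values; suppress the column instead")

    bins = []
    start = 0
    while start < n:
        end = start + k
        # Keep equal values in the same bin so ranges do not overlap.
        while end < n and ordered[end] == ordered[end - 1]:
            end += 1
        # Fold a tail shorter than k into this bin.
        if n - end < k:
            end = n
        bins.append((ordered[start], ordered[end - 1]))
        start = end
    return bins


def generalize(value, bins):
    """Replace a value with the label of the bin that contains it."""
    for low, high in bins:
        if low <= value <= high:
            return f"{low}-{high}"
    raise ValueError(f"{value} is outside every bin")


# Made-up ages clustered around 30, with a sparse upper tail.
ages = [24, 27, 28, 29, 30, 30, 31, 32, 33, 36, 45, 60]
bins = equal_frequency_bins(ages, k=3)
print(bins)                  # [(24, 28), (29, 30), (31, 33), (36, 60)]
print(generalize(30, bins))  # 29-30
print(generalize(60, bins))  # 36-60
```

In this toy example the bins near 30 stay narrow, while the 60-year-old is placed in a much wider range along with at least two other people.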

The team then wrote code to implement this approach and incorporated it into an easy-to-use web app.

The web app, source code, and final presentation can be seen below.

March, 2023