[DP#1] Database anonymization – k-anonymity, l-diversity, t-closeness

k-anonymity

ID (identifiers)      QID (quasi-identifiers)
Name: Bob             ZIP: 12345
Address: Bobway 42    Sex: male
IP: 127.0.0.1         Age: 17
Name    Age  Sex  ZIP    Disease
Alice   28   F    23467  Cancer
Bob     17   M    12345  Heart disease
Charly  34   M    65490  Flu
Dave    41   M    84933  Bronchitis

Age    Sex  ZIP    Disease
21-30  F    23467  Cancer
10-19  M    12345  Heart disease
31-40  M    65490  Flu
41-50  M    84933  Bronchitis

Still unique!

k-anonymity: an individual’s quasi-identifiers have to be identical to those of at least k-1 other individuals; together these records form an equivalence class.

Generalization

Age    Sex  ZIP    Disease
10-29  F    23467  Cancer
10-29  M    12345  Heart disease
30-49  M    65490  Flu
30-49  M    84933  Bronchitis

e.g., generalize the age to intervals of 20 years: the first two rows now share their Age value, and so do the last two rows.

This gives k = 2 anonymity in Age.

Suppression

Age    Sex  ZIP            Disease
10-29  *    [10000-29999]  Cancer
10-29  *    [10000-29999]  Heart disease
30-49  M    [60000-89999]  Flu
30-49  M    [60000-89999]  Bronchitis

We always have to weigh the utility of the data against the disclosure risk.
k-anonymity is an important concept to understand before dealing with more sophisticated concepts, such as differential privacy.
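Checking k-anonymity is straightforward; the sketch below (a hypothetical helper, not from the post) groups rows by their quasi-identifier tuple and verifies that every equivalence class has at least k members:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every combination of quasi-identifier values appears
    in at least k rows, i.e., every equivalence class has size >= k."""
    classes = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in classes.values())

# Generalized table from above: age bucketed into 20-year intervals.
table = [
    {"Age": "10-29", "Sex": "F", "ZIP": "23467", "Disease": "Cancer"},
    {"Age": "10-29", "Sex": "M", "ZIP": "12345", "Disease": "Heart disease"},
    {"Age": "30-49", "Sex": "M", "ZIP": "65490", "Disease": "Flu"},
    {"Age": "30-49", "Sex": "M", "ZIP": "84933", "Disease": "Bronchitis"},
]

print(is_k_anonymous(table, ["Age"], 2))               # True: k = 2 in Age alone
print(is_k_anonymous(table, ["Age", "Sex", "ZIP"], 2)) # False: full QIDs still unique
```

Note that k-anonymity is a property of the chosen quasi-identifier set: the same table passes for Age alone but fails once Sex and ZIP are included.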

l-diversity

l-diversity extends the concept of k-anonymity and addresses some privacy issues that remain after k-anonymity is applied to protect a database from attacks.
If the sensitive values within an equivalence class are not diverse, an individual’s sensitive attribute can still be inferred (a homogeneity attack).

Name    Age  ZIP    Disease
Alice   29   47677  Heart Disease
Bob     22   47602  Heart Disease
Charly  27   47678  Heart Disease
Dave    43   47905  Flu
Eve     52   47909  Heart Disease
Ferris  47   47906  Cancer
George  30   47605  Heart Disease
Harvey  36   47673  Cancer
Iris    32   47607  Cancer

Age    ZIP    Disease
2*     476**  Heart Disease
2*     476**  Heart Disease
2*     476**  Heart Disease
40-50  4790*  Flu
40-50  4790*  Heart Disease
40-50  4790*  Cancer
3*     476**  Heart Disease
3*     476**  Cancer
3*     476**  Cancer

k = 3 (every equivalence class has three members)

The second equivalence class is 3-diverse: three distinct sensitive values. The third class is 2-diverse: two distinct values.
But the first class contains only one sensitive value, and we cannot do anything about it: to get 2-diversity for the whole database, we would have to eliminate this equivalence class.
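Measuring the l of a table is just as simple as checking k (again a hypothetical helper): collect the distinct sensitive values per equivalence class and take the minimum.

```python
from collections import defaultdict

def l_diversity(rows, quasi_ids, sensitive):
    """The l of the table: the minimum number of distinct sensitive
    values over all equivalence classes."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].add(row[sensitive])
    return min(len(values) for values in classes.values())

# Anonymized table from above (QIDs: Age, ZIP; sensitive attribute: Disease).
table = [
    {"Age": "2*",    "ZIP": "476**", "Disease": "Heart Disease"},
    {"Age": "2*",    "ZIP": "476**", "Disease": "Heart Disease"},
    {"Age": "2*",    "ZIP": "476**", "Disease": "Heart Disease"},
    {"Age": "40-50", "ZIP": "4790*", "Disease": "Flu"},
    {"Age": "40-50", "ZIP": "4790*", "Disease": "Heart Disease"},
    {"Age": "40-50", "ZIP": "4790*", "Disease": "Cancer"},
    {"Age": "3*",    "ZIP": "476**", "Disease": "Heart Disease"},
    {"Age": "3*",    "ZIP": "476**", "Disease": "Cancer"},
    {"Age": "3*",    "ZIP": "476**", "Disease": "Cancer"},
]

print(l_diversity(table, ["Age", "ZIP"], "Disease"))  # 1: the first class is homogeneous
```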

Name    Age  ZIP    Salary  Disease
Alice   29   47677  3K      Gastric ulcer
Bob     22   47602  4K      Gastritis
Charly  27   47678  5K      Stomach cancer
Dave    43   47905  6K      Gastritis
Eve     52   47909  11K     Flu
Ferris  47   47906  8K      Bronchitis
George  30   47605  7K      Bronchitis
Harvey  36   47673  9K      Pneumonia
Iris    32   47607  10K     Stomach cancer

Age  ZIP    Salary  Disease
2*   476**  3K      Gastric ulcer
2*   476**  4K      Gastritis
2*   476**  5K      Stomach cancer
>40  4790*  6K      Gastritis
>40  4790*  11K     Flu
>40  4790*  8K      Bronchitis
3*   476**  7K      Bronchitis
3*   476**  9K      Pneumonia
3*   476**  10K     Stomach cancer

Without the Salary column, k = 3.

3-anonymity, 3-diversity

But l-diversity does not care about semantics: in the first equivalence class, all diseases are stomach-related and all salaries are low (3K-5K), so an attacker still learns something sensitive even though the values are distinct.

t-closeness

t-closeness requires that the distribution of the sensitive attribute within each equivalence class is close (within a threshold t) to its distribution in the whole table.

Salary over the whole table: Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k}
Two of the equivalence classes: P1 = {3k, 4k, 5k}, P2 = {6k, 8k, 11k}

Earth Mover’s distance

Ordered distance: moving one unit of probability mass from position i to position j costs \(D_O = \frac{|i-j|}{n-1}\).

For P1 vs. Q, one optimal flow moves 1/9 of the probability mass between these positions (written as salary values in k):
(11-5) + (10-5) + (9-5) + (8-4) + (7-4) + (6-4) + (5-3) + (4-3) = 27
27 / 8 = 3.375 (normalized by n-1 = 8)
3.375 / 9 = 0.375 (each move carries 1/9 of the mass): the cost of the optimal mass flow

D[P1, Q] = 0.375, D[P2, Q] = 0.167
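For an ordered attribute, this EMD has a closed form: the sum of absolute cumulative differences between the two distributions, divided by n-1. A small sketch (helper names are illustrative) reproduces both distances:

```python
from itertools import accumulate

def ordered_emd(p, q):
    """Earth Mover's Distance between two distributions over an ordered
    attribute with n values, where moving mass from position i to j costs
    |i - j| / (n - 1): the optimal cost equals the normalized sum of
    absolute cumulative differences."""
    n = len(p)
    return sum(abs(c) for c in accumulate(pi - qi for pi, qi in zip(p, q))) / (n - 1)

# Salary domain {3k, ..., 11k}; Q is the overall (uniform) distribution.
q = [1 / 9] * 9
p1 = [1 / 3, 1 / 3, 1 / 3, 0, 0, 0, 0, 0, 0]  # {3k, 4k, 5k}
p2 = [0, 0, 0, 1 / 3, 0, 1 / 3, 0, 0, 1 / 3]  # {6k, 8k, 11k}

print(round(ordered_emd(p1, q), 3))  # 0.375
print(round(ordered_emd(p2, q), 3))  # 0.167
```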

For the categorical Disease attribute over {Gastric ulcer, Gastritis, Stomach cancer, Flu, Pneumonia, Bronchitis}:
P1 = {Gastric ulcer, Gastritis, Stomach cancer}: (3 × 1)/6 = 0.5
P2 = {Gastric ulcer, Stomach cancer, Pneumonia}: (1/3 + 1/3 + 1)/6 = 0.278

Age    ZIP    Salary  Disease
20-40  4767*  3K      Gastric ulcer
20-40  4767*  9K      Pneumonia
20-40  4767*  5K      Stomach cancer
40-60  4790*  6K      Gastritis
40-60  4790*  11K     Flu
40-60  4790*  8K      Bronchitis
20-40  4760*  7K      Bronchitis
20-40  4760*  10K     Stomach cancer
20-40  4760*  4K      Gastritis

3-anonymity, 3-diversity
0.167-close (salary), 0.278-close (disease)

Differential Privacy explained

With differential privacy, the data gets perturbed: noise is added to the data. When the noise is added on the server, after the data has been collected, this is called global or centralized differential privacy.

Name    Age  ZIP    Salary  Disease
Alice   29   47677  3K      Gastric ulcer
Bob     22   47602  4K      Gastritis
Charly  27   47678  5K      Stomach cancer
Dave    43   47905  6K      Gastritis
Eve     52   47909  11K     Flu
Ferris  47   47906  8K      Bronchitis
George  30   47605  7K      Bronchitis
Harvey  36   47673  9K      Pneumonia
Iris    32   47607  10K     Stomach cancer

Age    ZIP    Salary  Disease
20-40  4767*  3K      Gastric ulcer
20-40  4767*  9K      Pneumonia
20-40  4767*  5K      Stomach cancer
40-60  4790*  6K      Gastritis
40-60  4790*  11K     Flu
40-60  4790*  8K      Bronchitis
20-40  4760*  7K      Bronchitis
20-40  4760*  10K     Stomach cancer
20-40  4760*  4K      Gastritis

We improve our data subjects’ privacy by applying differential privacy to the salary. But how do we apply the noise?

epsilon: the privacy parameter of differential privacy; the smaller epsilon is, the more noise is added and the stronger the privacy guarantee.

Age    ZIP    Salary  Noisy salary  Disease         Noise
20-40  4767*  3K      4K            Gastric ulcer   +1
20-40  4767*  9K      11K           Pneumonia       +2
20-40  4767*  5K      5K            Stomach cancer  0
40-60  4790*  6K      3K            Gastritis       -3
40-60  4790*  11K     11K           Flu             0
40-60  4790*  8K      8K            Bronchitis      0
20-40  4760*  7K      6K            Bronchitis      -1
20-40  4760*  10K     15K           Stomach cancer  +5
20-40  4760*  4K      4K            Gastritis       0

The noise can also be so large that values become negative.
Global or Centralized DP: noisy mean salary = 7.44K (the true mean is 7K).
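The post does not specify which noise distribution is used; a classic choice for releasing a numeric statistic under epsilon-DP is the Laplace mechanism. A minimal sketch (the clamping bounds, epsilon, and function names below are illustrative assumptions, not taken from the post):

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_mean(values, epsilon, lower, upper, rng):
    """epsilon-DP mean via the Laplace mechanism: each value is clamped
    to [lower, upper], so a single record can change the mean by at most
    (upper - lower) / n; that bound is the query's sensitivity."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)

salaries = [3, 4, 5, 6, 11, 8, 7, 9, 10]  # in K; the true mean is 7.0
rng = random.Random(0)
print(round(dp_mean(salaries, epsilon=1.0, lower=0, upper=15, rng=rng), 2))
```

Each run returns a different noisy mean; averaged over many runs it concentrates around the true mean of 7K, which is the utility-versus-privacy trade-off in action.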

Randomized Response

The idea is simple: the participant flips a coin. On heads, they answer truthfully; on tails, they flip a second coin and answer “yes” on heads and “no” on tails.
Randomized response therefore gives a 75% chance of the answer being the actual answer and a 25% chance of it being the wrong one.
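Because the noise process is known, the true rate can still be estimated from the noisy answers. A quick simulation (plain Python; all names are illustrative):

```python
import random

def randomized_response(truth, rng):
    """First flip: heads -> answer truthfully. Tails -> a second flip
    decides the answer (heads 'yes', tails 'no'). Each answer matches
    the truth with probability 3/4."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(answers):
    """Unbiased estimator: P(yes) = 0.5 * p_true + 0.25,
    so p_true = 2 * (observed yes-rate - 0.25)."""
    yes_rate = sum(answers) / len(answers)
    return 2 * (yes_rate - 0.25)

rng = random.Random(42)
truths = [rng.random() < 0.3 for _ in range(100_000)]  # 30% true 'yes'
answers = [randomized_response(t, rng) for t in truths]
print(round(estimate_true_rate(answers), 2))  # close to 0.30
```

No individual answer can be trusted, yet the aggregate statistic survives: exactly the trade-off differential privacy formalizes.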

Differential privacy: a technique that walks the fine line between utility and privacy.

Crypto Shredding explained

Name: Bob
Address: Bobway 42
ZIP: 12345
Sex: male
Age: 17

Name    Age  Sex  ZIP    Address
Alice   28   F    23467  Alice road 20
Bob     17   M    12345  Bobway 42
Charly  34   M    65490  Charly Avenue 137
Dave    41   M    84933  Dave street 98

Name    Key
Alice   209fwefjs0f9fSEwf0f8h
Bob     ae09ffjvnxcvdsgertEWE
Charly  99dfjsd9f0safjssfaaWEEF
Dave    sdfoicnvnynvre8u8WEW

Architecture: analyst </> – Cache DB – Cache builder – (Encrypted DB – Keys)
The cache builder first reads the encrypted data from the encrypted database, decrypts it with the per-person keys, and fills the cache DB.
The analyst queries and reads the cache DB, writes new data to the encrypted DB, and uses the keys for encryption.
To delete a person’s data, we only delete their key: every encrypted copy of their data becomes unreadable. That is crypto shredding.
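The shredding idea can be sketched in a few lines. The cipher below is a toy (a SHA-256 keystream) used only to keep the example self-contained; a real system would use a vetted authenticated cipher such as AES-GCM:

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256-in-counter-mode keystream.
    Symmetric, so the same call encrypts and decrypts. Illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# One key per data subject; records are stored only in encrypted form.
keys = {name: secrets.token_bytes(32) for name in ["Alice", "Bob"]}
encrypted_db = {
    "Alice": keystream_xor(keys["Alice"], b"Alice road 20"),
    "Bob": keystream_xor(keys["Bob"], b"Bobway 42"),
}

# Normal read path: decrypt with the subject's key.
print(keystream_xor(keys["Bob"], encrypted_db["Bob"]))  # b'Bobway 42'

# Crypto shredding: deleting Bob's key makes his ciphertext permanently
# unreadable, with no need to find and erase every encrypted copy.
del keys["Bob"]
```

The appeal is operational: backups and caches may still hold Bob’s ciphertext, but without the key it is just random bytes.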
