k-anonymity
Identifier (ID) | Value | Quasi-identifier (QID) | Value |
Name | Bob | ZIP | 12345 |
Address | Bobway 42 | Sex | male |
IP | 127.0.0.1 | Age | 17 |
Name | Age | Sex | ZIP | Disease |
Alice | 28 | F | 23467 | Cancer |
Bob | 17 | M | 12345 | Heart disease |
Charly | 34 | M | 65490 | Flu |
Dave | 41 | M | 84933 | Bronchitis |
Age | Sex | ZIP | Disease |
21-30 | F | 23467 | Cancer |
10-19 | M | 12345 | Heart disease |
31-40 | M | 65490 | Flu |
41-50 | M | 84933 | Bronchitis |
k-anonymity: each individual's quasi-identifiers have to be identical to those of at least k-1 other individuals in the table
records that share the same quasi-identifier values form an equivalence class
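The definition can be checked mechanically by counting how many rows share each combination of quasi-identifier values. The following is a minimal sketch, assuming the table is held as a list of dictionaries; the function name `k_anonymity` is hypothetical and chosen only for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class, i.e. the k of the table."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())

# Table from the notes: names removed, ages generalized to decades.
table = [
    {"Age": "21-30", "Sex": "F", "ZIP": "23467", "Disease": "Cancer"},
    {"Age": "10-19", "Sex": "M", "ZIP": "12345", "Disease": "Heart disease"},
    {"Age": "31-40", "Sex": "M", "ZIP": "65490", "Disease": "Flu"},
    {"Age": "41-50", "Sex": "M", "ZIP": "84933", "Disease": "Bronchitis"},
]

print(k_anonymity(table, ["Age", "Sex", "ZIP"]))  # 1: every row is still unique
print(k_anonymity(table, ["Sex"]))                # 1: only one row has Sex = F
```

A result of 1 means at least one record is alone in its equivalence class, so the table offers no k-anonymity protection yet.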
Generalization
Age | Sex | ZIP | Disease |
10-29 | F | 23467 | Cancer |
10-29 | M | 12345 | Heart disease |
30-49 | M | 65490 | Flu |
30-49 | M | 84933 | Bronchitis |
e.g., generalize the age to intervals of 20 years
the first two rows now share the same age value, and so do the last two rows
the table is 2-anonymous (k = 2) with respect to Age; Sex and ZIP still distinguish the rows
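The age generalization can be sketched as a small bucketing function. The parameter names `width` and `origin` are assumptions made for this example; they are chosen so that the intervals match the ones in the table (10-29, 30-49).

```python
def generalize_age(age, width=20, origin=10):
    """Map an exact age to a closed interval label such as '10-29'."""
    low = origin + ((age - origin) // width) * width
    return f"{low}-{low + width - 1}"

ages = [28, 17, 34, 41]  # Alice, Bob, Charly, Dave
print([generalize_age(a) for a in ages])
# ['10-29', '10-29', '30-49', '30-49']
```

Wider intervals create larger equivalence classes but also remove more information from the data.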
Suppression
Age | Sex | ZIP | Disease |
10-29 | * | [10000-29999] | Cancer |
10-29 | * | [10000-29999] | Heart disease |
30-49 | M | [60000-89999] | Flu |
30-49 | M | [60000-89999] | Bronchitis |
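The table above combines two steps: Sex is suppressed (replaced by `*`) where it differs within a class, and ZIP is generalized to a range. A minimal sketch of both steps follows, assuming dictionary rows; the helper names `suppress` and `zip_range` are hypothetical.

```python
from collections import Counter

def suppress(row, columns):
    """Replace the given columns with '*' (the value is withheld entirely)."""
    return {key: ("*" if key in columns else value) for key, value in row.items()}

def zip_range(zip_code, ranges):
    """Replace a ZIP code by the first (low, high) range that contains it."""
    for low, high in ranges:
        if low <= int(zip_code) <= high:
            return f"[{low}-{high}]"
    return "*"  # no range fits: suppress the value

rows = [
    {"Age": "10-29", "Sex": "F", "ZIP": "23467"},
    {"Age": "10-29", "Sex": "M", "ZIP": "12345"},
    {"Age": "30-49", "Sex": "M", "ZIP": "65490"},
    {"Age": "30-49", "Sex": "M", "ZIP": "84933"},
]
ranges = [(10000, 29999), (60000, 89999)]

released = []
for row in rows:
    row = dict(row, ZIP=zip_range(row["ZIP"], ranges))
    if row["Age"] == "10-29":  # Sex differs only inside this class
        row = suppress(row, {"Sex"})
    released.append(row)

class_sizes = Counter(tuple(r.values()) for r in released)
print(min(class_sizes.values()))  # 2
```

After both steps every combination of Age, Sex and ZIP occurs twice, so the released table is 2-anonymous over all three quasi-identifiers.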
we always have to weigh the utility of the data against the disclosure risk
k-anonymity is an important concept to understand before dealing with more sophisticated ones, such as differential privacy
l-diversity
extends the concept of k-anonymity and addresses some privacy issues that remain after k-anonymity has been applied to protect a database from attacks
if the sensitive values within an equivalence class are not diverse, sensitive information about an individual can still be disclosed
Name | Age | ZIP | Disease |
Alice | 29 | 47677 | Heart Disease |
Bob | 22 | 47602 | Heart Disease |
Charly | 27 | 47678 | Heart Disease |
Dave | 43 | 47905 | Flu |
Eve | 52 | 47909 | Heart Disease |
Ferris | 47 | 47906 | Cancer |
George | 30 | 47605 | Heart Disease |
Harvey | 36 | 47673 | Cancer |
Iris | 32 | 47607 | Cancer |
Age | ZIP | Disease |
2* | 476** | Heart Disease |
2* | 476** | Heart Disease |
2* | 476** | Heart Disease |
>40 | 4790* | Flu |
>40 | 4790* | Heart Disease |
>40 | 4790* | Cancer |
3* | 476** | Heart Disease |
3* | 476** | Cancer |
3* | 476** | Cancer |
second equivalence class: 3-diverse (three distinct sensitive values: Flu, Heart Disease, Cancer)
third equivalence class: 2-diverse (two distinct sensitive values: Heart Disease, Cancer)
first equivalence class: contains only Heart Disease, so there is nothing we can do for it; to make the whole table 2-diverse we would have to eliminate this equivalence class
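Counting the distinct sensitive values per equivalence class can be sketched as below. This assumes the simplest variant of the idea (distinct l-diversity) and uses a hypothetical helper name; the age label of the second class is written as `>40` because Eve is 52.

```python
from collections import defaultdict

def distinct_sensitive_values(rows, quasi_identifiers, sensitive):
    """Return the number of distinct sensitive values per equivalence class."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return {cls: len(vals) for cls, vals in values.items()}

columns = ("Age", "ZIP", "Disease")
data = [
    ("2*", "476**", "Heart Disease"),
    ("2*", "476**", "Heart Disease"),
    ("2*", "476**", "Heart Disease"),
    (">40", "4790*", "Flu"),
    (">40", "4790*", "Heart Disease"),
    (">40", "4790*", "Cancer"),
    ("3*", "476**", "Heart Disease"),
    ("3*", "476**", "Cancer"),
    ("3*", "476**", "Cancer"),
]
rows = [dict(zip(columns, r)) for r in data]

per_class = distinct_sensitive_values(rows, ["Age", "ZIP"], "Disease")
print(per_class)
# {('2*', '476**'): 1, ('>40', '4790*'): 3, ('3*', '476**'): 2}
print(min(per_class.values()))  # 1: the table as a whole is only 1-diverse
```

The l of the whole table is the minimum over its classes, which is why a single homogeneous class is enough to break l-diversity.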
Name | Age | ZIP | Salary | Disease |
Alice | 29 | 47677 | 3K | Gastric ulcer |
Bob | 22 | 47602 | 4K | Gastritis |
Charly | 27 | 47678 | 5K | Stomach cancer |
Dave | 43 | 47905 | 6K | Gastritis |
Eve | 52 | 47909 | 11K | Flu |
Ferris | 47 | 47906 | 8K | Bronchitis |
George | 30 | 47605 | 7K | Bronchitis |
Harvey | 36 | 47673 | 9K | Pneumonia |
Iris | 32 | 47607 | 10K | Stomach cancer |
Age | ZIP | Salary | Disease |
2* | 476** | 3K | Gastric ulcer |
2* | 476** | 4K | Gastritis |
2* | 476** | 5K | Stomach cancer |
>40 | 4790* | 6K | Gastritis |
>40 | 4790* | 11K | Flu |
>40 | 4790* | 8K | Bronchitis |
3* | 476** | 7K | Bronchitis |
3* | 476** | 9K | Pneumonia |
3* | 476** | 10K | Stomach cancer |
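this table is 3-anonymous and 3-diverse
l-diversity does not care about the semantics of the sensitive values: in the first class all salaries are low (3K-5K) and all diseases are stomach-related, so an attacker still learns something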
t-closeness
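Salary: Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k} (distribution in the whole table)
P1 = {3k, 4k, 5k}, P2 = {6k, 8k, 11k} (distributions within one equivalence class)
Earth Mover's Distance (EMD): the minimal amount of work needed to transform one distribution into the other
Ordered distance between the i-th and the j-th of n ordered values: \(D_O = \frac{|i-j|}{n-1}\), here n = 9
one optimal mass flow from P1 to Q moves 1/9 of the probability mass along each of: 5k→11k, 5k→10k, 5k→9k, 4k→8k, 4k→7k, 4k→6k, 3k→5k, 3k→4k
(11-5) + (10-5) + (9-5) + (8-4) + (7-4) + (6-4) + (5-3) + (4-3) = 27 steps in total
27 / 8 = 3.375 (normalize by n-1 = 8)
3.375 / 9 = 0.375 (each move carries 1/9 of the mass): cost of the optimal mass flow
D[P1, Q] = 0.375, D[P2, Q] = 0.167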
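The two distances can be reproduced with a few lines of code. The sketch below does not search for a mass flow; it uses the running-sum shortcut for one-dimensional EMD (the mass that has to cross each gap between neighbouring values is the cumulative difference of the two distributions), which is an assumption of this example rather than something the notes spell out. The function name `ordered_emd` is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

def ordered_emd(class_values, all_values):
    """EMD between a class and the whole table, using the ordered distance."""
    domain = sorted(set(all_values))
    # Difference in probability mass at each value, in ascending order.
    diffs = [
        Fraction(class_values.count(v), len(class_values))
        - Fraction(all_values.count(v), len(all_values))
        for v in domain
    ]
    # The running sum is the mass that must cross each gap between neighbours.
    return sum(abs(s) for s in accumulate(diffs)) / (len(domain) - 1)

Q = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # salaries in K
P1 = [3, 4, 5]
P2 = [6, 8, 11]

print(float(ordered_emd(P1, Q)))            # 0.375
print(round(float(ordered_emd(P2, Q)), 3))  # 0.167
```

Exact fractions are used so that the results (3/8 and 1/6) are not blurred by floating-point rounding.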
Disease = {Gastric ulcer, Gastritis, Stomach cancer, Flu, Pneumonia, Bronchitis}
P1 = {Gastric ulcer, Gastritis, Stomach cancer}: (3 × 1) / 6 = 0.5
P2 = {Gastric ulcer, Stomach cancer, Pneumonia}: (1/3 + 1/3 + 1) / 6 = 0.278
Age | ZIP | Salary | Disease |
20-40 | 4767* | 3K | Gastric ulcer |
20-40 | 4767* | 9K | Pneumonia |
20-40 | 4767* | 5K | Stomach cancer |
40-60 | 4790* | 6K | Gastritis |
40-60 | 4790* | 11K | Flu |
40-60 | 4790* | 8K | Bronchitis |
20-40 | 4760* | 7K | Bronchitis |
20-40 | 4760* | 10K | Stomach cancer |
20-40 | 4760* | 4K | Gastritis |
the table is 0.167-close with respect to Salary and 0.278-close with respect to Disease
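The salary figure can be verified per equivalence class: t is the largest distance between any class and the whole table. The sketch below restates a small ordered-distance EMD helper so that it runs on its own; the helper name and the running-sum shortcut are assumptions of this example.

```python
from itertools import accumulate

def ordered_emd(class_values, all_values):
    """One-dimensional EMD with the ordered distance, via running sums."""
    domain = sorted(set(all_values))
    diffs = [
        class_values.count(v) / len(class_values)
        - all_values.count(v) / len(all_values)
        for v in domain
    ]
    return sum(abs(s) for s in accumulate(diffs)) / (len(domain) - 1)

# Salaries (in K) per equivalence class of the table above, keyed by ZIP prefix.
classes = {
    "4767*": [3, 9, 5],
    "4790*": [6, 11, 8],
    "4760*": [7, 10, 4],
}
all_salaries = [s for salaries in classes.values() for s in salaries]

distances = {z: ordered_emd(s, all_salaries) for z, s in classes.items()}
t = max(distances.values())
print({z: round(d, 3) for z, d in distances.items()})
# {'4767*': 0.167, '4790*': 0.167, '4760*': 0.083}
print(round(t, 3))  # 0.167
```

Compared with the earlier grouping, where the class {3k, 4k, 5k} had a distance of 0.375, mixing low and high salaries in every class brings each class closer to the overall distribution.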
Differential Privacy explained
the data gets perturbed: random noise is added to it
global (centralized) differential privacy: the noise is added on the server, which holds the original data
Name | Age | ZIP | Salary | Disease |
Alice | 29 | 47677 | 3K | Gastric ulcer |
Bob | 22 | 47602 | 4K | Gastritis |
Charly | 27 | 47678 | 5K | Stomach cancer |
Dave | 43 | 47905 | 6K | Gastritis |
Eve | 52 | 47909 | 11K | Flu |
Ferris | 47 | 47906 | 8K | Bronchitis |
George | 30 | 47605 | 7K | Bronchitis |
Harvey | 36 | 47673 | 9K | Pneumonia |
Iris | 32 | 47607 | 10K | Stomach cancer |
Age | ZIP | Salary | Disease |
20-40 | 4767* | 3K | Gastric ulcer |
20-40 | 4767* | 9K | Pneumonia |
20-40 | 4767* | 5K | Stomach cancer |
40-60 | 4790* | 6K | Gastritis |
40-60 | 4790* | 11K | Flu |
40-60 | 4790* | 8K | Bronchitis |
20-40 | 4760* | 7K | Bronchitis |
20-40 | 4760* | 10K | Stomach cancer |
20-40 | 4760* | 4K | Gastritis |
improve our data subjects’ privacy by applying differential privacy to the salary
how do we apply the noise?

epsilon (ε): the privacy parameter of differential privacy; a smaller ε means more noise and stronger privacy
Age | ZIP | Salary | Noisy salary | Disease | Noise (K) |
20-40 | 4767* | 3K | 4K | Gastric ulcer | 1 |
20-40 | 4767* | 9K | 11K | Pneumonia | 2 |
20-40 | 4767* | 5K | 5K | Stomach cancer | 0 |
40-60 | 4790* | 6K | 3K | Gastritis | -3 |
40-60 | 4790* | 11K | 11K | Flu | 0 |
40-60 | 4790* | 8K | 8K | Bronchitis | 0 |
20-40 | 4760* | 7K | 6K | Bronchitis | -1 |
20-40 | 4760* | 10K | 15K | Stomach cancer | 5 |
20-40 | 4760* | 4K | 4K | Gastritis | 0 |
Noise can also be so large that values can become negative
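The notes do not name a noise distribution. A common choice is the Laplace mechanism, which draws noise with scale sensitivity / ε; the sketch below assumes that mechanism, and the value `sensitivity=8.0` is an assumption as well (the spread between the lowest and highest salary in the table, in K). The function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def perturb(values, epsilon, sensitivity, rng):
    """Add independent Laplace noise with scale sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
salaries = [3, 9, 5, 6, 11, 8, 7, 10, 4]  # in K, from the table above
noisy = perturb(salaries, epsilon=1.0, sensitivity=8.0, rng=rng)
print([round(v, 1) for v in noisy])  # individual values may fall below zero
```

Because the Laplace distribution is unbounded, nothing prevents a noisy salary from becoming negative; clamping the output afterwards is possible but biases the result.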
Global or Centralized DP
noisy mean salary = 67K / 9 ≈ 7.44K (true mean: 63K / 9 = 7K)
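In the global model the server can also keep the records untouched and add noise only to the query result. The following is a sketch of a noisy mean under stated assumptions: Laplace noise, a publicly known number of records, and salaries clipped to hypothetical bounds `lower` and `upper` (0K and 12K here), none of which come from the notes.

```python
import random

def noisy_mean(values, lower, upper, epsilon, rng):
    """Release the mean with Laplace noise calibrated to the mean's sensitivity."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Changing one record moves the mean by at most (upper - lower) / n.
    scale = (upper - lower) / len(clipped) / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise

rng = random.Random(1)  # fixed seed only to make the sketch reproducible
salaries = [3, 9, 5, 6, 11, 8, 7, 10, 4]  # in K, true mean is 7
print(noisy_mean(salaries, lower=0, upper=12, epsilon=1.0, rng=rng))
```

Adding noise once to the aggregate usually costs far less accuracy than perturbing every record, because the sensitivity of the mean shrinks with the number of records.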
Randomized Response
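the idea is simple: before answering a yes/no question, the participant flips a coin; in the common variant, heads means answering truthfully, tails means flipping again and answering "yes" on heads and "no" on tails
randomized response therefore gives a 75% chance that the reported answer is the actual answer and a 25% chance that it is the wrong one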
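The 75% / 25% split and the way an analyst can still recover the overall rate can be sketched as follows. The two-coin protocol used here is the common textbook variant and is an assumption of this example; the function names are hypothetical.

```python
import random

def randomized_response(truth, rng):
    """First coin: heads -> tell the truth; tails -> answer with a second coin."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(observed_yes_rate):
    """Invert P(yes) = 0.5 * p + 0.25 to recover the true 'yes' rate p."""
    return 2 * (observed_yes_rate - 0.25)

rng = random.Random(42)  # fixed seed only to make the sketch reproducible
truths = [i < 3000 for i in range(10000)]  # 30% true "yes" answers
reports = [randomized_response(t, rng) for t in truths]
observed = sum(reports) / len(reports)
print(estimate_true_rate(observed))  # should land near 0.3
```

No single report reveals a participant's true answer with certainty, yet the aggregate estimate becomes more accurate as the number of participants grows.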
Differential privacy: a technique that balances utility against privacy; more noise means stronger privacy but less accurate results
Crypto Shredding explained
Name: Bob
Address: Bobway 42
ZIP: 12345
Sex: male
Age: 17
Name | Age | Sex | ZIP | Address |
Alice | 28 | F | 23467 | Alice road 20 |
Bob | 17 | M | 12345 | Bobway 42 |
Charly | 34 | M | 65490 | Charly Avenue 137 |
Dave | 41 | M | 84933 | Dave street 98 |
Name | Key |
Alice | 209fwefjs0f9fSEwf0f8h |
Bob | ae09ffjvnxcvdsgertEWE |
Charly | 99dfjsd9f0safjssfaaWEEF |
Dave | sdfoicnvnynvre8u8WEW |
Components: analyst, Cache DB, Encrypted DB, Keys, Cache builder
Cache builder: first fetches the keys and the encrypted data from the encrypted database, then decrypts the data
Cache builder – Keys
Cache DB – analyst: query + read
analyst → Encrypted DB: write
analyst – Keys
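The core idea, one key per person and deletion of that key to make the person's data unreadable, can be sketched with in-memory dictionaries. Everything here is an assumption made for illustration: the function names, the data layout, and especially the toy cipher (SHA-256 in counter mode, XORed with the data), which stands in for a vetted authenticated cipher and must not be used to protect real data.

```python
import hashlib
import secrets

def keystream(key, nonce, length):
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_crypt(key, nonce, data):
    """XOR with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))

keys = {}          # per-person key store (the "Name | Key" table)
encrypted_db = {}  # per-person (nonce, ciphertext)

def write(name, record):
    """Encrypt a record with the person's own key and store the ciphertext."""
    key = keys.setdefault(name, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    encrypted_db[name] = (nonce, xor_crypt(key, nonce, record.encode()))

def build_cache():
    """Decrypt every record whose key still exists; shredded records are skipped."""
    return {
        name: xor_crypt(keys[name], nonce, blob).decode()
        for name, (nonce, blob) in encrypted_db.items()
        if name in keys
    }

def shred(name):
    """Crypto shredding: delete the key, leaving the ciphertext unreadable."""
    keys.pop(name, None)

write("Bob", "17 | M | 12345 | Bobway 42")
write("Alice", "28 | F | 23467 | Alice road 20")
shred("Bob")
print(build_cache())  # {'Alice': '28 | F | 23467 | Alice road 20'}
```

Bob's ciphertext is still present in `encrypted_db`, including any copies or backups of it, but without the key it can no longer be turned back into plaintext; the cache has to be rebuilt after a key is deleted so that no decrypted copy survives there.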
References
- list: https://www.youtube.com/playlist?list=PLZeK3TZueogEhGK0kTztL5ALQ_MkxgFCv
- Security and Privacy Academy, (2/11) k-anonymity explained, Jan 20, 2023, https://youtu.be/Q0DNOIGUzMc?si=JO9VpoVWDuyZuKIw
- Security and Privacy Academy, (1/11) L-Diversity explained, Feb 1, 2023, https://youtu.be/GNhb3PcmjmA?si=vyo4HKOBtQUz3Pe8
- Security and Privacy Academy, (3/11) t-closeness explained, Feb 3, 2023, https://youtu.be/Upb8jqlsbFM?si=68z96b1NNGVS0_1v
- Security and Privacy Academy, (4/11) Differential Privacy explained, Feb 6, 2023, https://youtu.be/XgotQQpXwio?si=8qL8KzW5l8Gm0NC4
- Security and Privacy Academy, (5/11) Crypto Shredding explained, May 8, 2023, https://youtu.be/iBg8OC8MzIQ?si=lozjNozvpQFU0_Zx