## The Mathematics Behind Differential Privacy

\(\Pr[\mathcal{M}(D) \in \mathcal{S}] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in \mathcal{S}]\)

#### Sensitivity and Query Function

Sensitivity of a query function: \(\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert\)

D and D' are neighboring datasets: they differ in exactly one element.

Dataset D:

| Name | Alice | Bob | Charly | Dave | Eve | Ferris | George | Harvey | Iris |
|------|-------|-----|--------|------|-----|--------|--------|--------|------|
| Age  | 29    | 22  | 27     | 43   | 52  | 47     | 30     | 36     | 32   |

Dataset D' (Eve removed):

| Name | Alice | Bob | Charly | Dave | Ferris | George | Harvey | Iris |
|------|-------|-----|--------|------|--------|--------|--------|------|
| Age  | 29    | 22  | 27     | 43   | 47     | 30     | 36     | 32   |

Examples:

- count query: count(D) = 9, count(D') = 8. The counts of neighboring datasets always differ by exactly one, so Δf = 1.
- mean query: mean_age(D) ≈ 35.3, mean_age(D') ≈ 33.3, so in this example Δf ≈ 2 (the worst-case sensitivity of the mean depends on the range of the values and the dataset size).
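A quick check of these numbers (a sketch; the ages are taken from the example tables above):

```python
import numpy as np

# Ages from the example tables; D' removes Eve (52)
ages_D = [29, 22, 27, 43, 52, 47, 30, 36, 32]
ages_D2 = [29, 22, 27, 43, 47, 30, 36, 32]

# Count query: neighboring datasets always differ by exactly 1
print(len(ages_D) - len(ages_D2))   # 1

# Mean query: the change depends on which value was removed
print(round(float(np.mean(ages_D)), 2))    # 35.33
print(round(float(np.mean(ages_D2)), 2))   # 33.25
```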

#### Formal Definition of Differential Privacy

Differential privacy provides a mathematical framework to quantify and control the privacy guarantees of a data analysis system.

It is defined in terms of a parameter called ε, which represents the privacy budget or the level of privacy protection provided.

#### Detailed Explanation of the Differential-Privacy Equation

\(\Pr[\mathcal{M}(D) \in \mathcal{S}] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in \mathcal{S}]\)

The probability that a mechanism \(\mathcal{M}\), applied to a database D, produces an output in a set \(\mathcal{S}\) is approximately the same as when it is applied to a neighboring database D', with the level of similarity controlled by the privacy parameter \(\varepsilon\).

The bound is written as \(e^{\varepsilon}\) rather than \(\varepsilon\) itself; this makes \(\varepsilon\) easier to calculate, because it can be recovered from the ratio of the two probabilities simply by taking the logarithm.

The probability distribution of the output remains nearly unchanged whether an individual’s data is included or excluded.
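This guarantee can be checked numerically for the Laplace mechanism on a count query: the ratio of the output densities for two neighboring counts never exceeds \(e^{\varepsilon}\). A minimal sketch, assuming the example counts 9 and 8 from above:

```python
import numpy as np

epsilon = 0.5
b = 1.0 / epsilon          # Laplace scale for a count query (Δf = 1)

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution Lap(mu, b)."""
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

# Neighboring databases: count(D) = 9, count(D') = 8
xs = np.linspace(-20, 40, 10001)
ratio = laplace_pdf(xs, 9, b) / laplace_pdf(xs, 8, b)

# The density ratio never exceeds e^epsilon, as the definition requires
print(ratio.max() <= np.exp(epsilon) + 1e-12)   # True
```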

## How to Add Noise to Achieve Differential Privacy

#### Sensitivity and Noise

The Laplace mechanism is the more straightforward way to understand how noise addition works. Gaussian noise achieves essentially the same thing, but the math behind it is a little more complicated. The following shows how the Laplace mechanism can be implemented in Python.

The Laplace distribution \(\text{Lap}(\mu, b)\) has the probability density function

\(f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)\)

It consists of two exponential functions mirrored at \(x = \mu\). For differential privacy:

- \(\mu = 0\): the noise is centered at zero, so the two exponentials are mirrored at the y-axis
- \(b = \Delta f / \varepsilon\): the scale parameter, which is where the sensitivity comes into play

Recall that \(\varepsilon\) is the privacy parameter from the differential-privacy equation.

For a count query, Δf = 1, so b = 1/ε.

Different values of ε produce different shapes of the Laplace density:

- ε = 1.0: the largest peak, i.e. noise is tightly concentrated around zero (less noise, weaker privacy)
- ε = 0.2: a much lower peak; the density is flatter and wider, so more probability mass sits far from zero (more noise, stronger privacy)
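The peak heights follow directly from the density formula: at x = 0 the Laplace density equals 1/(2b) = ε/2 for a count query. A quick check:

```python
# Peak of the Laplace density at x = 0 is 1/(2b) = epsilon/2 (for Δf = 1)
peaks = {}
for epsilon in [1.0, 0.5, 0.2]:
    b = 1.0 / epsilon
    peaks[epsilon] = 1.0 / (2.0 * b)
    print(f"epsilon={epsilon}: b={b:.1f}, peak height={peaks[epsilon]:.2f}")
```

So ε = 1.0 yields the tallest peak (0.50) and ε = 0.2 the flattest (0.10), matching the description above.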

#### Python

```python
import numpy as np

sensitivity = 1.0          # Δf = 1 for a count query
epsilon = 0.5
b = sensitivity / epsilon  # Laplace scale parameter
n = 200                    # number of individuals
noise = np.random.laplace(0, b, n)
print(noise)
```
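Building on this, a single differentially private count query can be answered by adding one draw of Laplace noise to the true count (a sketch; `true_count` uses the example value from above, and the seed is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)

epsilon = 0.5
sensitivity = 1.0          # count query: Δf = 1
b = sensitivity / epsilon

true_count = 9             # count(D) from the example above
noisy_count = true_count + rng.laplace(0, b)
print(noisy_count)
```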

## Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python

```
pip install pandas
python mondrian.py
```

```python
import pandas as pd

k = 3
data = pd.read_csv('data.csv')
print(data.head())

def mondrian(data, k):
    # Base case: this partition cannot be split into two parts of size >= k
    if len(data) <= 2 * k - 1:
        return [data]
    partition = []
    # Define the quasi-identifiers (QIs)
    qis = ['Age']
    # Sort by QIs
    data = data.sort_values(by=qis)
    # Split at the middle into two halves and recurse
    si = data['Age'].count()
    mid = si // 2
    lhs = data[:mid]
    rhs = data[mid:]
    partition.extend(mondrian(lhs, k))
    partition.extend(mondrian(rhs, k))
    return partition

result_partitions = mondrian(data, k)
results_final = []
for i, partition in enumerate(result_partitions, start=1):
    partition = partition.copy()
    # Generalize the QI to the partition's value range
    part_min = partition['Age'].min()
    part_max = partition['Age'].max()
    partition['Age'] = f"|{part_min} - {part_max}|"
    results_final.append(partition)
    print(f"Partition {i}, length={len(partition)}:")
    if len(partition) < k:
        print("Error: Partition too small")
    else:
        print(partition)

anonymized_data = pd.concat(results_final, ignore_index=True)
anonymized_data.to_csv('anon.csv', index=False)

# Example with k = 3:
# a partition of length 6 (= 2*k) is split at mid = 3
# into two partitions of 3 rows each, both of size >= k
```
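Whatever the partitioning produces, the k-anonymity property itself is easy to verify: every group sharing the same generalized quasi-identifier must contain at least k rows. A sketch using illustrative generalized age ranges for the nine example individuals (no `data.csv` needed):

```python
import pandas as pd

k = 3
# Illustrative output of a Mondrian-style generalization of the example ages
anonymized = pd.DataFrame({
    'Age': ['|22 - 30|'] * 4 + ['|32 - 52|'] * 5,
})

# k-anonymity holds if every quasi-identifier group has at least k rows
group_sizes = anonymized.groupby('Age').size()
print(group_sizes.min() >= k)   # True
```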

### References

- list: https://www.youtube.com/playlist?list=PLZeK3TZueogEhGK0kTztL5ALQ_MkxgFCv
- Security and Privacy Academy, (6/11) The Mathematics Behind Differential Privacy, Jun 20, 2023, https://youtu.be/QJ3D4koSc6A?si=9ngmz8CZvS5pis81
- Security and Privacy Academy, (7/11) How to Add Noise to Achieve Differential Privacy, Jul 11, 2023, https://www.youtube.com/watch?v=mO0NmkVGapM&t=4s
- Security and Privacy Academy, (8/11) Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python, Sep 30, 2023, https://youtu.be/aMDXM5HlB64?si=xAG5qt_xUbGYF6RC