The Mathematics Behind Differential Privacy
\(Pr[\mathcal{M}(D) \in \mathcal{S}] \leq e^{\varepsilon} \cdot Pr[\mathcal{M}(D') \in \mathcal{S}]\)
Sensitivity and Query Function
\(\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert\)
D and D' are neighboring datasets, i.e., they differ in exactly one element.
Dataset D:

| Name | Alice | Bob | Charly | Dave | Eve | Ferris | George | Harvey | Iris |
|------|-------|-----|--------|------|-----|--------|--------|--------|------|
| Age  | 29    | 22  | 27     | 43   | 52  | 47     | 30     | 36     | 32   |

Dataset D' (Eve removed):

| Name | Alice | Bob | Charly | Dave | Ferris | George | Harvey | Iris |
|------|-------|-----|--------|------|--------|--------|--------|------|
| Age  | 29    | 22  | 27     | 43   | 47     | 30     | 36     | 32   |
- count query: count(D) = 9, count(D') = 8; the counts of neighboring datasets always differ by exactly one, so Δf = 1
- mean query: mean_age(D) ≈ 35.3, mean_age(D') ≈ 33.3; for this pair of datasets the means differ by about 2, so Δf = 2
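As a quick sanity check, both example sensitivities can be reproduced from the tables above; this is just a sketch with the ages typed in by hand:

Python

# ages from the example tables: D contains Eve, D' does not
ages_D       = [29, 22, 27, 43, 52, 47, 30, 36, 32]
ages_D_prime = [29, 22, 27, 43, 47, 30, 36, 32]

# count query: neighboring datasets differ by exactly one record
print(len(ages_D), len(ages_D_prime))                  # 9 8 -> Δf = 1

# mean query: for this concrete pair the means differ by about 2
mean_D = sum(ages_D) / len(ages_D)                     # ≈ 35.3
mean_D_prime = sum(ages_D_prime) / len(ages_D_prime)   # 33.25 ≈ 33.3
print(mean_D - mean_D_prime)                           # ≈ 2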
Formal Definition of Differential Privacy
Differential privacy provides a mathematical framework to quantify and control the privacy guarantees of a data analysis system.
It is defined in terms of a parameter called ε, which represents the privacy budget or the level of privacy protection provided.
Detailed Explanation of the Differential-Privacy Equation
\(Pr[\mathcal{M}(D) \in \mathcal{S}] \leq e^{\varepsilon} \cdot Pr[\mathcal{M}(D') \in \mathcal{S}]\)
The probability that a mechanism M, applied to a database D, produces an output in a set S is approximately the same as when M is applied to a neighboring database D', with the level of similarity controlled by the privacy parameter \(\varepsilon\).
The bound is written as \(e^{\varepsilon}\) rather than \(\varepsilon\) itself, which makes \(\varepsilon\) easy to compute from the ratio of the two probabilities by taking the logarithm.
The probability distribution of the output remains nearly unchanged whether an individual’s data is included or excluded.
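Dividing both sides by \(Pr[\mathcal{M}(D') \in \mathcal{S}]\) (assuming it is non-zero) and taking the logarithm gives an equivalent form in which \(\varepsilon\) can be read off directly:

\(\ln\dfrac{Pr[\mathcal{M}(D) \in \mathcal{S}]}{Pr[\mathcal{M}(D') \in \mathcal{S}]} \leq \varepsilon\)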
How to Add Noise to Achieve Differential Privacy
Sensitivity and Noise
Laplace noise is easier and more straightforward to understand than Gaussian noise; Gaussian noise works in essentially the same way, the math behind it is just a little more complicated.
How to implement it in Python:
Laplace distribution: probability density function
- composed of two exponential functions
- Lap(µ, b), with µ = 0 and b = Δf/ε
- µ: the location (mean), which in our case is always zero
- the two exponential branches are mirrored at the y-axis, i.e., at the point zero
- b: the scale parameter, where the sensitivity comes into play: b = Δf/ε
- remember that ε is the privacy parameter from the differential-privacy equation
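For reference, the Laplace density in its standard form is

\(Lap(x \mid \mu, b) = \frac{1}{2b}\, e^{-\frac{|x - \mu|}{b}}\)

which with \(\mu = 0\) and \(b = \Delta f / \varepsilon\) becomes \(\frac{\varepsilon}{2\Delta f}\, e^{-\varepsilon |x| / \Delta f}\).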
count query: Δf = 1, so b = 1/ε

Different values of ε lead to different shapes of the probability density function of the Laplace distribution.
ε is the privacy parameter of differential privacy.
- ε = 1.0: the highest peak (the noise is concentrated close to zero)
- ε = 0.2: a much lower peak; the probability mass is spread over a wider range, so large noise values become more likely
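A quick check of the peak heights confirms this: the density at \(x = 0\) is \(\frac{1}{2b} = \frac{\varepsilon}{2\Delta f}\), so with \(\Delta f = 1\) (count query) the peak is 0.5 for ε = 1.0 and 0.1 for ε = 0.2.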
Python
import numpy as np

# scale of the Laplace noise: b = sensitivity / epsilon
sensitivity = 1.00   # count query: Δf = 1
epsilon = 0.5        # privacy parameter
b = sensitivity / epsilon

n = 200  # number of individuals
# draw n samples of Laplace noise with mean 0 and scale b
noise = np.random.laplace(0, b, n)
print(noise)
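As a small usage sketch (the variable names and the true_count value, taken from the example table, are illustrative), a single Laplace sample is added to the true result of a count query to obtain the differentially private answer:

Python

import numpy as np

b = 1.0 / 0.5                                         # sensitivity / epsilon, as above
true_count = 9                                        # count(D) from the example table
noisy_count = true_count + np.random.laplace(0, b)    # add a single Laplace sample
print(noisy_count)                                    # noisy answer, varies per run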
Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python
pip install pandas
python mondrian.py
import pandas as pd

k = 3

# load the dataset and show the first rows
data = pd.read_csv('data.csv')
print(data.head())

def mondrian(data, k):
    partition = []
    # a partition with at most 2k-1 records cannot be split into
    # two parts that both contain at least k records -> stop here
    if len(data) <= 2 * k - 1:
        return [data]
    # define QIs (quasi-identifiers)
    qis = ['Age']
    # sort by QIs
    data = data.sort_values(by=qis)
    # number of total values
    si = data['Age'].count()
    mid = si // 2            # integer division, so the slice indices are ints
    lhs = data[:mid]
    rhs = data[mid:]
    partition.extend(mondrian(lhs, k))
    partition.extend(mondrian(rhs, k))
    return partition

result_partitions = mondrian(data, k)

results_final = []
for i, partition in enumerate(result_partitions, start=1):
    partition = partition.copy()   # work on a copy, not a slice of the original frame
    # generalize the QI to the range covered by this partition
    part_min = partition['Age'].min()
    part_max = partition['Age'].max()
    gen = f"|{part_min} - {part_max}|"
    partition['Age'] = gen
    results_final.append(partition)
    print(f"Partition {i}, length={len(partition)}:")
    if len(partition) < k:
        print("Error: Partition too small")
    else:
        print(partition)

anonymized_data = pd.concat(results_final, ignore_index=True)
anonymized_data.to_csv('anon.csv', index=False)

# Example with k = 3:
# a partition of length 6 (= 2*k) is still split, at mid = 3,
# into two partitions of 3 records each - both satisfy k
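To try the script end to end, a minimal data.csv can be created from the example table near the top of these notes; this is just a sketch whose column names match what mondrian.py expects:

Python

import pandas as pd

# names and ages taken from the example dataset D above; note that the script
# generalizes only 'Age', so a direct identifier like 'Name' would normally be dropped
pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charly', 'Dave', 'Eve', 'Ferris', 'George', 'Harvey', 'Iris'],
    'Age':  [29, 22, 27, 43, 52, 47, 30, 36, 32],
}).to_csv('data.csv', index=False)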
References
- Playlist: https://www.youtube.com/playlist?list=PLZeK3TZueogEhGK0kTztL5ALQ_MkxgFCv
- Security and Privacy Academy, (6/11) The Mathematics Behind Differential Privacy, Jun 20, 2023, https://youtu.be/QJ3D4koSc6A?si=9ngmz8CZvS5pis81
- Security and Privacy Academy, (7/11) How to Add Noise to Achieve Differential Privacy, Jul 11, 2023, https://www.youtube.com/watch?v=mO0NmkVGapM&t=4s
- Security and Privacy Academy, (8/11) Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python, Sep 30, 2023, https://youtu.be/aMDXM5HlB64?si=xAG5qt_xUbGYF6RC