[DP#2] Database anonymization – differential privacy

The Mathematics Behind Differential Privacy

\(Pr[\mathcal{M}(D) \in \mathcal{S}] \leq e^{\varepsilon} \cdot Pr[\mathcal{M}(D’) \in \mathcal{S}]\)

Sensitivity and Query Function

Sensitivity and Query Function: \(\Delta f = max_{D, D^{‘}} ||f(D) – f(D^{‘})||\)

D and D’ are neighboring datasets
D and D’ differ in exactly one element

NameAliceBobCharlyDaveEveFerrisGeorgeHarveyIris
Age292227435247303632
D: mean_age(D) = 35.3, max change = 52
NameAliceBobCharlyDaveFerrisGeorgeHarveyIris
Age2922274347303632
D’

(e.g., count(D) = 9, count(D’) = 8 – difference will always be one – Δf = 1
mean_age(D) = 35.3, mean_age(D’) = 33.3 – Δf = 2

Formal Definition of Differential Privacy

Differential privacy provides a mathematical framework to quantify and control the privacy guarantees of a data analysis system.
It is defined in terms of a parameter called ε, which represents the privacy budget or the level of privacy protection provided.

Detailed Explanation of the Differential-Privacy Equation

\(Pr[\mathcal{M}(D) \in \mathcal{S}] \leq e^{\varepsilon} \cdot Pr[\mathcal{M}(D’) \in \mathcal{S}]\)

The probability that a mechanism M, applied on a database D,
produces a certain output S is approximately the same
if applied to a neighboring database, with the level of similarity controlled by the privacy parameter \(\varepsilon\).
it is with the level of similarity controlled by the privacy parameter Epsilon
and this is now well this is e to the power of Epsilon,
this way it’s a bit easier to calculate the Epsilon because it’s a we cat just use logarithm

The probability distribution of the output remains nearly unchanged whether an individual’s data is included or excluded.

How to Add Noise to Achieve Differential Privacy

Sensitivity and Noise

Laplacian Noise, Gaussian Noise

it’s a bit more easy and I think it’s more straightforward to understand how it works and it’s basically the same thing with gaussian noises the math behind it is just a little bit more complicated
how to implement it in Python

laplacian function: Probability Distribution function
Two exponential functions

Lap(µ, b), µ = 0, b = Δf /ε

µ: the value which luckily for us is always going to be zero
laplacian function these two exponentials are always mirrored at the y-axis so at point zero

b: the parameter where the sensitivity comes into place so b equals the sensitivity Delta F divided by Epsilon
obviously I should remember that Epsilon is the Privacy parameter of the differential privacy equation

count-query: Δf = 1
b = 1/ε

different values of Epsilon I have different values or different outcomes of the probability distribution function of the laplacian function

Epsilon(ε) is the privacy parameter of differential privacy
ε = 1.0 – biggest Peak or the largest Peak
ε = 0.2 – very low (the total probability is higher however the peak is lower)

Python

import numpy as np

sensitivity = 1.00
epsilon = 0.5
b = sensitivity / epsilon

n = 200 # number of individuals
noise = np.random.laplace(0, b, n)
print(noise)

Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python

pip install pandas
python mondrian.py

import pandas as pd

k = 3
data = pd.read_csv('data.csv')
print(data.head())

def mondrian(data, k):
    partition[]

    if(len(data)) <= (2*k-1):
        partition.append(data)
        return [data]

    # define QIs
    qis = ['Age']

    # sort by QIs
    data = data.sort_values(by=qis)

    # number of total values
    si = data['Age'].count()

    mid = si/2

    lhs = data[:mid]
    rhs = data[mid:]

    partition.extend(mondrian(lhs, k))
    partition.extend(mondrian(rhs, k))

    return partition

result_partitions = mondrian(data, k)
results_final = []

for i, partition in enumerate(result_partitions, start=1):
    part_min = partition['Age'].min()
    part_max = partition['Age'].max()

    gen = f"|{part_min} - {part_max}|"

    partitions['Age'] = gen
    results_final.append(partition)

    print(f"Partition, length={len(partition)}:")

    if(len(partition)<k):
        print("Error: Partition too small")
    else:
        print(partition)

anonymized_data = pd.concat(results_final, ignore_index=True)
anonymized_data.to_csv('anon.csv', index=False)

#    k = 3
#    len - partition = 6 <- 2*k
#    partition1 = 3
#    partition2 = 2

References

Leave a Comment