{"id":3297,"date":"2023-12-17T18:14:35","date_gmt":"2023-12-17T09:14:35","guid":{"rendered":"https:\/\/saraheee.com\/?p=3297"},"modified":"2023-12-17T18:14:38","modified_gmt":"2023-12-17T09:14:38","slug":"dp2-database-anonymization-differential-privacy","status":"publish","type":"post","link":"https:\/\/saraheee.com\/ko\/2023\/12\/dp2-database-anonymization-differential-privacy\/","title":{"rendered":"[DP#2] Database anonymization &#8211; differential privacy"},"content":{"rendered":"<h2 class=\"wp-block-heading\">The Mathematics Behind Differential Privacy<\/h2>\n\n\n\n<p>\\(Pr[\\mathcal{M}(D) \\in \\mathcal{S}] \\leq e^{\\varepsilon} \\cdot Pr[\\mathcal{M}(D&#8217;) \\in \\mathcal{S}]\\)<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sensitivity and Query Function<\/h4>\n\n\n\n<p>Sensitivity and Query Function: \\(\\Delta f = max_{D, D^{&#8216;}} ||f(D) &#8211; f(D^{&#8216;})||\\)<\/p>\n\n\n\n<p>D and D&#8217; are neighboring datasets<br>D and D&#8217; differ in exactly one element<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Name<\/td><td>Alice<\/td><td>Bob<\/td><td>Charly<\/td><td>Dave<\/td><td>Eve<\/td><td>Ferris<\/td><td>George<\/td><td>Harvey<\/td><td>Iris<\/td><\/tr><tr><td>Age<\/td><td>29<\/td><td>22<\/td><td>27<\/td><td>43<\/td><td>52<\/td><td>47<\/td><td>30<\/td><td>36<\/td><td>32<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">D: mean_age(D) = 35.3, max change = 52<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Name<\/td><td>Alice<\/td><td>Bob<\/td><td>Charly<\/td><td>Dave<\/td><td><\/td><td>Ferris<\/td><td>George<\/td><td>Harvey<\/td><td>Iris<\/td><\/tr><tr><td>Age<\/td><td>29<\/td><td>22<\/td><td>27<\/td><td>43<\/td><td><\/td><td>47<\/td><td>30<\/td><td>36<\/td><td>32<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">D&#8217;<\/figcaption><\/figure>\n\n\n\n<p>(e.g., count(D) = 9, count(D&#8217;) = 8 &#8211; difference will always be one &#8211; \u0394f = 1<br>mean_age(D) = 35.3, mean_age(D&#8217;) = 33.3 &#8211; \u0394f = 2<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Formal Definition of Differential Privacy<\/h4>\n\n\n\n<p>Differential privacy provides a mathematical framework to quantify and control the privacy guarantees of a data analysis system.<br>It is defined in terms of a parameter called \u03b5, which represents the privacy budget or the level of privacy protection provided.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Detailed Explanation of the Differential-Privacy Equation<\/h4>\n\n\n\n<p>\\(Pr[\\mathcal{M}(D) \\in \\mathcal{S}] \\leq e^{\\varepsilon} \\cdot Pr[\\mathcal{M}(D&#8217;) \\in \\mathcal{S}]\\)<\/p>\n\n\n\n<p>The probability that a mechanism M, applied on a database D,<br>produces a certain output S is approximately the same<br>if applied to a neighboring database, with the level of similarity controlled by the privacy parameter \\(\\varepsilon\\).<br><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-book-reviews-quotation-color\">it is with the level of similarity controlled by the privacy parameter Epsilon<br>and this is now well this is e to the power of Epsilon,<br>this way it&#8217;s a bit easier to calculate the Epsilon because it&#8217;s a we cat just use logarithm<\/mark><\/p>\n\n\n\n<p>The probability distribution of the output remains nearly unchanged whether an individual&#8217;s data is included or excluded.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to Add Noise to Achieve Differential Privacy<\/h2>\n\n\n\n<h4 class=\"wp-block-heading\">Sensitivity and Noise<\/h4>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><img loading=\"lazy\" decoding=\"async\" width=\"391\" height=\"460\" src=\"https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-9.png\" alt=\"\"><\/td><td><img loading=\"lazy\" decoding=\"async\" width=\"369\" height=\"469\" src=\"https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-10.png\" alt=\"\"><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Laplacian Noise, Gaussian Noise<\/figcaption><\/figure>\n\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-book-reviews-quotation-color\">it&#8217;s a bit more easy and I think it&#8217;s more straightforward to understand how it works and it&#8217;s basically the same thing with gaussian noises the math behind it is just a little bit more complicated<br>how to implement it in Python<\/mark><\/p>\n\n\n\n<p>laplacian function: Probability Distribution function<br>Two exponential functions<\/p>\n\n\n\n<p>Lap(\u00b5, b), \u00b5 = 0, <mark style=\"background-color:var(--base)\" class=\"has-inline-color\">b = \u0394f \/\u03b5<\/mark><\/p>\n\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-book-reviews-quotation-color\">\u00b5: the value which luckily for us is always going to be zero<br>laplacian function these two exponentials are always mirrored at the y-axis so at point zero<\/mark><\/p>\n\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-book-reviews-quotation-color\">b: the parameter where the sensitivity comes into place so b equals the sensitivity Delta F divided by Epsilon<br>obviously I should remember that Epsilon is the Privacy parameter of the differential privacy equation<\/mark><\/p>\n\n\n\n<p>count-query: \u0394f = 1<br>b = 1\/\u03b5<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-11-1024x775.png\" alt=\"\" class=\"wp-image-3294\" width=\"512\" height=\"388\" srcset=\"https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-11-1024x775.png 1024w, https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-11-300x227.png 300w, https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-11-768x582.png 768w, https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-11-1536x1163.png 1536w, https:\/\/saraheee.com\/wp-content\/uploads\/2023\/12\/image-11.png 1672w\" sizes=\"(max-width: 512px) 100vw, 512px\" \/><\/figure><\/div>\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-book-reviews-quotation-color\">different values of Epsilon I have different values or different outcomes of the probability distribution function of the laplacian function<\/mark><\/p>\n\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-book-reviews-quotation-color\">Epsilon(\u03b5) is the privacy parameter of differential privacy<br>\u03b5 = 1.0 &#8211; biggest Peak or the largest Peak<br>\u03b5 = 0.2 &#8211; very low (the total probability is higher however the peak is lower)<\/mark><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Python<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nsensitivity = 1.00\nepsilon = 0.5\nb = sensitivity \/ epsilon\n\nn = 200 # number of individuals\nnoise = np.random.laplace(0, b, n)\nprint(noise)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python<\/h2>\n\n\n\n<p>pip install pandas<br>python mondrian.py<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\nk = 3\ndata = pd.read_csv('data.csv')\nprint(data.head())\n\ndef mondrian(data, k):\n    partition&#091;]\n\n    if(len(data)) &lt;= (2*k-1):\n        partition.append(data)\n        return &#091;data]\n\n    # define QIs\n    qis = &#091;'Age']\n\n    # sort by QIs\n    data = data.sort_values(by=qis)\n\n    # number of total values\n    si = data&#091;'Age'].count()\n\n    mid = si\/2\n\n    lhs = data&#091;:mid]\n    rhs = data&#091;mid:]\n\n    partition.extend(mondrian(lhs, k))\n    partition.extend(mondrian(rhs, k))\n\n    return partition\n\nresult_partitions = mondrian(data, k)\nresults_final = &#091;]\n\nfor i, partition in enumerate(result_partitions, start=1):\n    part_min = partition&#091;'Age'].min()\n    part_max = partition&#091;'Age'].max()\n\n    gen = f\"|{part_min} - {part_max}|\"\n\n    partitions&#091;'Age'] = gen\n    results_final.append(partition)\n\n    print(f\"Partition, length={len(partition)}:\")\n\n    if(len(partition)&lt;k):\n        print(\"Error: Partition too small\")\n    else:\n        print(partition)\n\nanonymized_data = pd.concat(results_final, ignore_index=True)\nanonymized_data.to_csv('anon.csv', index=False)\n\n#    k = 3\n#    len - partition = 6 &lt;- 2*k\n#    partition1 = 3\n#    partition2 = 2<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>list: <a href=\"https:\/\/www.youtube.com\/playlist?list=PLZeK3TZueogEhGK0kTztL5ALQ_MkxgFCv\" rel=\"noopener\">https:\/\/www.youtube.com\/playlist?list=PLZeK3TZueogEhGK0kTztL5ALQ_MkxgFCv<\/a><\/li>\n\n\n\n<li>Security and Privacy Academy, (6\/11) The Mathematics Behind Differential Privacy, Jun 20, 2023, <a href=\"https:\/\/youtu.be\/QJ3D4koSc6A?si=9ngmz8CZvS5pis81\" rel=\"noopener\">https:\/\/youtu.be\/QJ3D4koSc6A?si=9ngmz8CZvS5pis81<\/a><\/li>\n\n\n\n<li>Security and Privacy Academy, (7\/11) How to Add Noise to Achieve Differential Privacy, Jul 11, 2023, <a href=\"https:\/\/www.youtube.com\/watch?v=mO0NmkVGapM&amp;t=4s\" rel=\"noopener\">https:\/\/www.youtube.com\/watch?v=mO0NmkVGapM&amp;t=4s<\/a><\/li>\n\n\n\n<li>Security and Privacy Academy, (8\/11) Step-by-Step implementation of k-anonymity with the Mondrian Algorithm in Python, Sep 30, 2023, <a href=\"https:\/\/youtu.be\/aMDXM5HlB64?si=xAG5qt_xUbGYF6RC\" rel=\"noopener\">https:\/\/youtu.be\/aMDXM5HlB64?si=xAG5qt_xUbGYF6RC<\/a><\/li>\n<\/ul>\n\n\n\n<p><\/p>","protected":false},"excerpt":{"rendered":"<p>I explain the mathematical foundations and Laplacian noise of differential privacy.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[154],"tags":[160,164,155,157,161,162,163],"class_list":["post-3297","post","type-post","status-publish","format-standard","hentry","category-dp","tag-database-anonymization","tag-dec-17-2023","tag-differential-privacy","tag-k-anonymity","tag-mathematics","tag-mondrian","tag-python"],"_links":{"self":[{"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/posts\/3297"}],"collection":[{"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/comments?post=3297"}],"version-history":[{"count":52,"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/posts\/3297\/revisions"}],"predecessor-version":[{"id":3367,"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/posts\/3297\/revisions\/3367"}],"wp:attachment":[{"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/media?parent=3297"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/categories?post=3297"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/saraheee.com\/ko\/wp-json\/wp\/v2\/tags?post=3297"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}