A quick guide to differential privacy
One method for privacy-preserving machine learning
TL;DR
This newsletter is about differential privacy. It looks at what it is, how it works, and some of its practical applications, including the development of machine learning models.
Here are the key takeaways:
Differential privacy is about quantifying the information leakage of an algorithm from the computation it performs over an underlying dataset.
Differential privacy works by adding random noise to a computation, making it more difficult to identify any specific individual in the dataset without unduly degrading the accuracy of the computation's output.
Differential privacy has two main parameters:
Sensitivity, which determines how much random noise is added to the output.
Privacy budget, which is about the deviation between an output with an individual's data and an output without such data.
Differential privacy can be used for a variety of use cases, ranging from the collection of answers to survey questions to the development of machine learning models.
What is differential privacy?
Differential privacy (DP) is about quantifying "the information leakage of an algorithm from the computation over an underlying dataset."1
But what exactly is DP designed to protect against? And how does it do so?
Let's consider the following scenario:
Company has a database containing personal information about its employees.
User A queries the database at the beginning of January to learn the average salary at Company. User A learns that the average salary among the 100 employees was $55,000.
At the end of January, Charlie joins the company and his information is added to the employee database.
In February, User B queries the database to learn the average salary at Company. User B learns that the average salary among the 101 employees was $56,000.
Attacker A learns Charlie's salary by comparing the reports from User A and User B:
Total salaries in January: 100 × $55,000 = $5,500,000
Total salaries in February: 101 × $56,000 = $5,656,000
The only change between the two queries is Charlie joining the company. From this information, Attacker A was able to deduce that Charlie's salary is the difference between the two totals: $5,656,000 − $5,500,000 = $156,000.
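To make the arithmetic concrete, here is the same differencing attack written out as a small Python sketch (the figures are the ones from the scenario above):
# The differencing attack from the scenario above.
jan_avg, jan_count = 55_000, 100   # report obtained by User A in January
feb_avg, feb_count = 56_000, 101   # report obtained by User B in February

jan_total = jan_avg * jan_count    # total payroll before Charlie joined: $5,500,000
feb_total = feb_avg * feb_count    # total payroll after Charlie joined: $5,656,000

# The difference between the two totals is exactly Charlie's salary.
charlie_salary = feb_total - jan_total
print(charlie_salary)              # 156000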
The privacy leakage that Attacker A was able to exploit is the kind of vulnerability that DP is designed to protect against:
Even though the study, analysis, or computation only releases aggregated statistical information from a dataset, that information can still lead to meaningful but sensitive conclusions about individuals.2
DP does this by performing calculations on the dataset to produce the information that is queried without disclosing the private information contained in the dataset. This enables the generation of "aggregated or statistical information from a database or dataset without revealing the presence of any individual in that database."3
When applying DP to the above scenario, the outcome would be the following:
Company has a database containing personal information about its employees.
User A queries the database at the beginning of January to learn the average salary at Company. The true average salary among the 100 employees is $55,000.
At the end of January, Charlie joins the company and his information is added to the database.
In February, User B queries the database to learn the average salary at Company. The true average salary among the 101 employees is $56,000.
For the queries of both User A and User B, DP is applied by adding random noise. However, the random noise is applied in such a way that the aggregated information produced from the database in January and February is essentially the same.
The random noise works like this:
The output from the query of User A is an average salary of $55,500 from 103 employees (the actual average is $55,000, so the noise added to the salary is $500; the true number of employees is 100, so the noise added to the number of employees is 3).
The output from the query of User B is an average salary of $55,400 from 99 employees (the actual average is $56,000, so the noise added to the salary is -$600; the true number of employees is 101, so the noise added to the number of employees is -2).
Because of this, Attacker A cannot use the outputs to infer information about an individual. The noisy outputs are not too different from the actual outputs, but they are different enough to prevent Attacker A from concluding that a new employee joined Company and figuring out their salary.
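The noise values in this scenario are made up for exposition, but a minimal sketch of how such noisy answers could be produced, using Laplace-distributed noise from numpy, might look like the following (the function name and noise scales are illustrative, not calibrated DP parameters):
import numpy as np

rng = np.random.default_rng()

def noisy_report(true_avg_salary, true_count, salary_scale=600, count_scale=3):
    # Add independent random noise to each released statistic. The scales here
    # are purely illustrative; calibrating them properly is what the sensitivity
    # and privacy budget parameters (discussed below) are for.
    noisy_avg = true_avg_salary + rng.laplace(0, salary_scale)
    noisy_count = true_count + int(round(rng.laplace(0, count_scale)))
    return int(round(noisy_avg)), noisy_count

print(noisy_report(55_000, 100))  # e.g. (55500, 103) for the January query
print(noisy_report(56_000, 101))  # e.g. (55400, 99) for the February query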
In a nutshell, differential privacy is about differences - a system is differentially private if it hardly makes any difference whether your data is in the system or not. This difference is why we use the word differential in the term differential privacy.4
The mechanics of differential privacy
There are two key parameters of any DP solution: sensitivity and privacy budget.
Sensitivity
Sensitivity is about how much random noise is added to the output.
If the noise is too small, then it may not provide sufficient protection of private information. Conversely, if the noise is too large, then it renders the output essentially meaningless (i.e., too inaccurate).
So how do you determine how much random noise to add? In essence, "the amount of random noise to be added should be proportional to the largest possible difference that one individual's private information could make to that aggregated data."5
To explain further:
...in our private company scenario, we have two aggregated datasets to be published: the total number of employees and the average salary. Since an old employee leaving or a new employee joining the company could at most make a +1 or -1 difference to the total number of employees, its sensitivity is 1. For the average salary, since different employees (having different salaries) leaving or joining the company could have different influences on the average salary, the largest possible difference would come from the employee who has the highest possible salary. Thus, the sensitivity of the average salary should be proportional to the highest salary.6
Privacy budget
Privacy budget refers to the amount of deviation applied by the DP solution "between the output of the analyses with or without any individual's information."7
With a DP solution, there are two outputs: an output with an individual's information, and an output without an individual's information. Privacy budget is about how close these outputs are.
The closer the outputs are, the less deviation there is when an individual's data is included, meaning less privacy leakage and less of the privacy budget spent. Conversely, the further apart the outputs are, the more deviation there is when an individual's data is included, meaning more privacy leakage and more of the privacy budget spent.
The privacy budget is represented by the Greek letter ϵ (epsilon). The smaller the value of ϵ, the smaller the deviation (and the smaller the privacy budget).
Setting ϵ lower gives you better privacy but less accuracy. So if you set ϵ to 0, you would get high privacy but meaningless results from querying the database.
ϵ is usually a small value:
For statistical analysis tasks such as mean or frequency estimation, ϵ is generally set between 0.001 and 1.0. For ML or deep learning tasks, ϵ is usually set somewhere between 0.1 and 10.0.8
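To see how the two parameters interact, here is a minimal sketch of the Laplace mechanism applied to a counting query (sensitivity 1): the noise is drawn with scale sensitivity / ϵ, so a smaller ϵ produces noisier, more private answers. The function name and values are illustrative only:
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Noise scale grows with the sensitivity and shrinks as the privacy budget grows.
    return true_value + rng.laplace(0, sensitivity / epsilon)

true_count = 101  # one individual changes an employee count by at most 1
for epsilon in (0.01, 0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=epsilon)
    print(f"epsilon={epsilon}: noisy count = {noisy:.1f}")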
A simple application of differential privacy
Let's walk through a simple application of DP known as the binary mechanism.
The binary mechanism is one of the oldest DP techniques, originating as far back as the 1970s. It has been used by social scientists in survey studies.
Consider the following study:
You want to survey a group of 100 people on whether they have smoked weed in the last 6 months.
The answers will either be 'yes' or 'no', which is a binary response (0 = no and 1 = yes).
To protect privacy, you add noise to the survey answers by using a balanced coin (i.e., the chance of it landing on heads or tails when flipped is 50%, or p = 0.5).
You use this coin to implement the following DP solution:
Flip the coin
If it's heads, the submitted answer is the same as the real answer (0 or 1)
If it's tails, flip the coin again and the submitted answer is 1 if it's heads or 0 if it's tails
The diagram below outlines the process for implementing the binary mechanism for the survey:
On the sensitivity parameter of the binary mechanism:
The randomization in this algorithm comes from the two coin flips. This randomization creates uncertainty about the true answer, which provides the source of privacy. In this case, each data owner has a 3/4 probability of submitting the real answer and 1/4 chance of submitting the wrong answer. For a single data owner, their privacy will be preserved, since we will never be sure whether they are telling the truth. But the data user that conducts the survey will still get the desired answers, since 3/4 of the participants are expected to tell the truth.9
On the privacy budget:
...the privacy budget can be considered a measure of tolerance to privacy leakage. In this example, a higher p value means less noise is added (since it will generate the true answer with a higher probability), leading to more privacy leakage and a higher privacy budget. In essence, with this binary mechanism, the users can adjust the p value to accommodate their own privacy budget.10
In essence, for the binary mechanism:
The random noise comes from the two coin flips, which create uncertainty about any individual's true answer.
The privacy budget is controlled by the probability p of the first coin flip landing heads: a higher p means less noise and a higher privacy budget.
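A minimal numpy simulation of this binary mechanism (the 'true' survey answers below are randomly generated just for illustration) shows that individual answers are noisy while the overall proportion can still be estimated:
import numpy as np

rng = np.random.default_rng(0)

def binary_mechanism(true_answer, p=0.5):
    # First flip: with probability p, submit the true answer.
    if rng.random() < p:
        return true_answer
    # Second flip: otherwise submit 1 (heads) or 0 (tails) at random.
    return int(rng.random() < 0.5)

# Simulated true answers for 100 participants (1 = yes, 0 = no).
true_answers = rng.integers(0, 2, size=100)
reported = np.array([binary_mechanism(a) for a in true_answers])

# With p = 0.5, each person reports the truth with probability 3/4, so
# P(report = 1) = 0.5 * true_proportion + 0.25, which we can invert
# to recover an unbiased estimate of the true proportion.
estimate = (reported.mean() - 0.25) / 0.5
print(f"true: {true_answers.mean():.2f}, reported: {reported.mean():.2f}, estimated: {estimate:.2f}")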
DP in machine learning
In the context of machine learning (ML), DP solutions can be used to achieve what is called input perturbation.
Input perturbation is about adding noise to the training data used to train the ML model. The application of the DP solution to the training data results in a ML model that is differentially private.
Such solutions therefore help protect against membership inference attacks, in which an adversary attempts to "infer a sample based on the ML model output to identify whether it was in the original dataset."11 The aim of such an attack is to determine whether a given sample (i.e., input) was in the training dataset used to develop the model.
A short primer on linear regression
For example, DP solutions can be applied to linear regression models. This is a type of machine learning model that "describes the relationship between input and output as a straight line."12
Such a model applies the following mathematical formula:
y = f[x, φ]
To break this down:
x is the input for the model.
y is the output (or prediction) of the model.
The model contains parameters φ, which define the different relationships between the input and the output.
f is the machine learning model that applies the parameters to the input to produce the output (the prediction).
Learning or training the model is about finding the parameters φ that produce the best predictions from the input x. For a linear regression model, training means finding the line of best fit for the dataset when plotted on a graph.
This line of best fit has two parameters:
The y-intercept (where the line crosses the y-axis when displayed on a graph)
The slope (the steepness of the line when displayed on a graph)
Changing these parameters determines the relation between input and output. This means that the equation being applied by the model "defines a family of possible input-output relations (all possible lines), and the choices of parameters determines the member of this family (the particular line)."13
The training data for supervised learning models like linear regression consist of input/output pairs {xi, yi}. This means that each input (data feature) in the dataset is paired with the correct output (data label).
A loss function is used to "assign a numerical value to each choice of parameters that quantifies the degree of mismatch between the model and the data." The mismatch is determined by the deviation "between the model predictions f[xi, φ] (height of the line at xi) and the ground truth outputs yi."14
Accordingly, the "loss L is a function of the parameters φ; it will be larger when the model fit is poor and smaller when it is good."15 Simply put, the lower this value (i.e., the lower the loss) the better the fit.
When training the model, the initial parameters are randomly set. During the training process, the parameters are improved by "walking down" the loss function to the lowest point possible.
This can be done in different ways. One of the most popular ways is gradient descent:
...measure the gradient of the surface at the current position and take a step in the direction that is most steeply downhill. Then we repeat this process until the gradient is flat and we can improve no further.16
So training a model involves the following components:
A prediction f[xi, φ] produced by the model, which is generated by applying the parameters φ to inputs xi.
A loss function that measures the deviation between the prediction f[xi, φ] and the ground truth yi, producing a loss value L.
A function for adjusting parameters φ based on loss score L, which is repeated for every input/output pair {xi, yi} in the training dataset.
This process is shown in the diagram below:
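To make this loop concrete, here is a minimal numpy sketch that fits a one-variable linear regression (intercept and slope) to synthetic data by gradient descent on a least-squares loss; the data and learning rate are chosen purely for illustration:
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training pairs {xi, yi} scattered around the line y = 2x + 1.
x = rng.uniform(0, 1, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=50)

phi0, phi1 = 0.0, 0.0          # initial (arbitrary) intercept and slope
learning_rate = 0.1

for step in range(1000):
    pred = phi0 + phi1 * x               # model prediction f[x, φ]
    loss = np.mean((pred - y) ** 2)      # least-squares loss L
    # Gradient of the loss with respect to each parameter.
    grad_phi0 = 2 * np.mean(pred - y)
    grad_phi1 = 2 * np.mean((pred - y) * x)
    # Take a step in the most steeply downhill direction.
    phi0 -= learning_rate * grad_phi0
    phi1 -= learning_rate * grad_phi1

print(f"intercept={phi0:.2f}, slope={phi1:.2f}, loss={loss:.4f}")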
For linear regression models, we can calculate the mean absolute error (MAE) to evaluate the performance of the model. This metric measures the average absolute difference between the actual values and the predicted values, so the lower the MAE, the better the model.
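As a quick sketch of what this metric computes (equivalent to sklearn's mean_absolute_error, which we use below):
import numpy as np

def mae(y_true, y_pred):
    # MAE: the average absolute difference between actual and predicted values.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))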
Applying DP to linear regression models
The IBM Differential Privacy Library provides open-source tools for implementing DP. We can use this library to build and test differentially-private linear regression models.
We can start by importing the IBM library along with the other Python libraries and modules we will need:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
from sklearn.linear_model import LinearRegression as sk_LinearRegression
from diffprivlib.models import LinearRegression
Next, we can load the dataset we want to train the model on. For this we can use the diabetes dataset from the sklearn library, which contains information on 442 diabetes patients.
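Assuming we keep the variable name dataset used in the snippets that follow, the data can be loaded with sklearn's built-in loader:
dataset = datasets.load_diabetes()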
Using dataset.feature_names, we can see the information included in each patient record. This consists of their age, sex, body mass index (BMI), average blood pressure, and six different blood serum measurements.
We will need to split the dataset into a training set and a testing set. The training set will be used to train the model, and the testing set will be used to evaluate the performance of the trained model:
# Use the first two features of each record and hold out 20% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(dataset.data[:, :2], dataset.target, test_size=0.2)
Using an 80:20 split, we have 353 records in the training set and 89 records in the testing set. We can then set up our standard linear regression model that does not use differential privacy:
regr = sk_LinearRegression()
regr.fit(X_train, y_train)
We can then use MAE to evaluate the performance of the model:
mae = mean_absolute_error(y_test, regr.predict(X_test))
Our standard model achieves an MAE of 63.5640. Now let's set up the model with differential privacy and compare the difference:
regr = LinearRegression()
regr.fit(X_train, y_train)
mae = mean_absolute_error(y_test, regr.predict(X_test))
Our differentially-private model produces an MAE of 69.8480, which is higher than our standard model. This indicates a slightly worse performing model, showing the potential tradeoff in performance when using DP for developing machine learning models.
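If we wanted to explore this tradeoff further, diffprivlib's LinearRegression takes the privacy budget as its epsilon parameter (it defaults to 1.0, which is what the model above used), so we could sketch the privacy-accuracy tradeoff by varying it. Note that diffprivlib will also warn that the data bounds were not specified and will infer them from the data, which itself leaks some privacy; in a real deployment the bounds should be set explicitly.
# Sketch: vary the privacy budget and observe how accuracy changes.
for eps in (0.1, 1.0, 10.0):
    dp_regr = LinearRegression(epsilon=eps)
    dp_regr.fit(X_train, y_train)
    dp_mae = mean_absolute_error(y_test, dp_regr.predict(X_test))
    print(f"epsilon={eps}: MAE={dp_mae:.4f}")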
1. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.28.
2. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.28.
3. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.29.
4. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.29.
5. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.31.
6. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.31.
7. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.32.
8. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.32.
9. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.35.
10. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.37.
11. J. Morris Chang et al, Privacy-Preserving Machine Learning (Manning Publications 2023), p.13.
12. Simon JD Prince, Understanding Deep Learning (MIT Press 2023), p.18.
13. Simon JD Prince, Understanding Deep Learning (MIT Press 2023), p.19.
14. Simon JD Prince, Understanding Deep Learning (MIT Press 2023), p.19.
15. Simon JD Prince, Understanding Deep Learning (MIT Press 2023), p.21.
16. Simon JD Prince, Understanding Deep Learning (MIT Press 2023), p.22.