Notes on LLMs and privacy leakage
A paper that demonstrates the importance of privacy-preserving machine learning
TL;DR
These notes are on attacks against large language models (LLMs) that can reveal personal data contained in their training data. They come from a 2020 paper authored by researchers and engineers from Google, OpenAI, Apple and several universities.
The experiment conducted for this paper was performed on GPT-2, an older iteration of OpenAI's LLMs; its latest model is GPT-4, which apparently powers the new Bing search product (more on that here). The paper shows it is possible to "perform a training data extraction attack to recover individual training examples by querying the language model." This is shown in the diagram below:
The paper focuses on training data extraction attacks using the following methodology:
The researchers used the LLM to generate a large number of samples using certain prompts. The outputs were then fed back into the model to find specific natural language that the model consistently considers the most likely response to the prompt, thus indicating the kind of data it may have memorized during training. 200,000 such samples were generated.
These were narrowed down to 1,800 samples by identifying those most likely to have been memorized by the model.
Using limited access to the training data for GPT-2, the researchers validated these outputs to confirm how much was in fact contained in the training dataset. 604 unique memorized training examples were identified from the 1,800 potentially-memorized samples generated by GPT-2.
The results of this experiment are quite interesting. The extraction attacks managed to reveal names, phone numbers, addresses and social media accounts existing in the training data. Furthermore, the attack can be carried out with only 'black-box input-output' access to an LLM, meaning that the pool of potential attackers is wide.
It seems possible that such extraction attacks could be carried out against current LLMs like ChatGPT. This is because the model was probably trained on a dataset similar to that used to train GPT-2 and GPT-3.
The paper highlights how (i) privacy-preserving machine learning is important for avoiding such privacy leakages in deep learning models, and (ii) how the training datasets for these LLMs could trigger issues with GDPR compliance.
The structure of these notes is as follows:
How are LLMs trained?
What is overfitting and how can this lead to privacy leakage?
What attacks can result in privacy leakage?
How much privacy leakage was in GPT-2?
PETs (privacy-enhancing technologies) in machine learning and data protection issues
If you find this interesting, consider subscribing to be notified of updates.
1. How are LLMs trained?
LLMs are machine learning models that use neural networks. A simple explanation of how LLMs work can be found in my post on LLMs and influence operations here:
This post is more focused on how they are trained.
Neural networks are trained using a feedback loop that consists of the following:
A prediction from the model
A loss function
A loss score
An optimizer
The prediction from the model is a prediction made on the input it receives. In the case of LLMs, this is the model's prediction of the next word in a sentence when given part of a sentence as an input (or a prompt). The loss function then measures whether the model was correct in its prediction using the 'ground truth' (i.e., the actual correct next word as derived from the training data) and produces a loss score. This loss score is then processed by an optimizer, which uses the score to adjust the parameters of the model to improve the accuracy of its predictions. The model does this repeatedly for all the data in the training dataset; an epoch refers to one complete pass through the entire training dataset, and a model may be trained for several epochs.
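To make this loop concrete, here is a minimal sketch in PyTorch (my own illustration of the prediction, loss and optimizer cycle, not how GPT-2 itself was trained; the toy model and numbers are made up):

```python
# A minimal, self-contained sketch of the training loop described above
# (prediction -> loss score -> optimizer step), using a toy next-token model.
# Illustration only; real LLM training is far larger and more elaborate.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()                             # the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # the optimizer

# Toy "corpus": each token should predict the token that follows it.
tokens = torch.randint(0, vocab_size, (1000,))
inputs, targets = tokens[:-1], tokens[1:]

for epoch in range(3):                  # one epoch = one full pass over the data
    logits = model(inputs)              # the model's predictions
    loss = loss_fn(logits, targets)     # the loss score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                    # the optimizer adjusts the parameters
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```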
The data used to train LLMs consists of bodies of text, in the form of documents, articles or even webpages. Lots of data is required for LLMs to perform well. GPT-2, for instance, contains 1.5 billion parameters "trained on a dataset of 8 million web pages." This data was scraped from the internet by following outbound links from Reddit, and the documents were then cleaned to keep only the raw text. This process was used to create "a final dataset of 40GB of text data, over which the model was trained for approximately 12 epochs." (Carlini et al 2020, 3) GPT-3 has 175 billion parameters trained on a much larger dataset consisting of webpages, books and Wikipedia articles.
Because these models are so large and are trained on such large datasets, they are often trained for only a single epoch (though, exceptionally, GPT-2 went through about 12 epochs). The reason for using fewer epochs is simply that training these models is very expensive in terms of time and money. On the money side in particular, LLMs are expensive to run generally: in a post here on Substack, SemiAnalysis estimates that ChatGPT costs $694,444 per day to run and requires almost 30,000 GPUs.
The number of epochs used to train a model is important because it affects the degree to which that model could leak private information. The epochs and the training loop can give rise to a phenomenon called overfitting, which is a precursor to several problems, including an increased propensity for the model to leak training data when subjected to certain extraction attacks.
2. What is overfitting and how can this lead to privacy leakage?
Before training commences, the model will be set up with random parameters. Therefore, when the model begins training, the loss score will likely be quite high. The goal of training the model, however, is to ensure that it learns the optimal values for the parameters so that it produces the correct outputs and reduces the loss score.
This means that, during the training process, the model will go through different states. At the beginning, the model will be underfit, meaning it has not yet recognised the relevant patterns in the training data and reflected them in the values of its parameters. Underfitting can be identified by a high loss score. As the model undergoes training, the loss score should, hopefully, begin to fall as the model better learns the relevant patterns of the training data. At this point, the model will approach a robust fit. However, developers must ensure that the model does not then start to overfit on its training data. This is where the model begins to put too much weight on the specific patterns of the training data and therefore loses the ability to recognise the more general patterns that could be observed in new data. In other words, overfitting means that the model loses the ability to generalise and therefore correctly interpret data it has not seen before.
The graph below shows the different stages of a model's training journey, and also shows how overfitting can be identified when the loss score on the training data is much lower than the loss score on the validation data (i.e., data not included in the training dataset but instead used to test the model's performance):
The diagram below provides a visual demonstration of what happens when a model overfits and loses the bigger picture. The image on the left shows a robust fit that is able to generalise and therefore identify the general patterns that exist in the training data. Conversely, the image on the right shows a model that is overfitting, i.e., it is going out of its way to include outliers in the data and therefore ignores the general patterns in the training data.
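The same effect can be reproduced in a few lines of Python (my own toy example, not from the paper): fit a low-degree and a high-degree polynomial to noisy data and compare the error on the training points with the error on held-out validation points.

```python
# Toy illustration of a robust fit vs. an overfit model: the high-degree
# polynomial chases the noise in the training points and does worse on
# held-out validation points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, x_train.shape)
x_val = np.linspace(-0.95, 0.95, 15)
y_val = np.sin(np.pi * x_val) + rng.normal(0, 0.2, x_val.shape)

for degree in (3, 12):  # a reasonable fit vs. a likely overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # A validation error far above the training error is the overfitting
    # signal described in the graph above.
    print(f"degree {degree}: train MSE {train_mse:.4f}, validation MSE {val_mse:.4f}")
```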
Sometimes "overfitting often indicates that a model has memorized examples from its training set." (Carlini et al 2020, 1) In fact, "overfitting is a sufficient condition for privacy leakage and many attacks work by exploiting overfitting." (Carlini et al 2020, 1) Another paper from 2017 found that there can be "a clear relationship" between overfitting the privacy leakage and that "overfitting is sufficient to allow an attacker to perform [attacks on the model]." (Yeom et al 2017)
So when a model overfits, it is more likely to memorize specific data in its training data, and certain attacks could be levied against the model to reveal this memorization and therefore reveal personal data that exists in the training data.
However, one of the interesting aspects of this paper on LLMs and privacy leakage is that LLMs do not tend to overfit. This is because of the expense of training these models given their size and complexity. Accordingly, as noted above, these models are usually trained for only a few epochs. In fact, GPT-2 "does not overfit: the training loss is only roughly 10% smaller than the test loss across all model sizes." (Carlini et al 2020, 3) Even so, one of the findings of the paper is that, even if an LLM does not overfit, it can still memorize specific examples in its training data, which can be revealed through specific attacks.
3. What attacks can result in privacy leakage?
The paper references three different types of privacy attacks against models that have memorized their training data:
Membership inference attacks:
This is where "an adversary can predict whether or not a particular example was used to train the model." (Carlini et al 2020, 3)
This can be carried out by an attacker without access to the model's parameters, merely by observing the model's outputs (a minimal sketch of a simpler loss-based variant of this idea appears after this list).
To explain: "In this technique, an attacker creates random records for a target machine learning model served on a cloud service. The attacker feeds each record into the model. Based on the confidence score the model returns, the attacker tunes the record’s features and reruns it by the model. The process continues until the model reaches a very high confidence score. At this point, the record is identical or very similar to one of the examples used to train the model. After gathering enough high confidence records, the attacker uses the dataset to train a set of “shadow models” to predict whether a data record was part of the target model’s training data. This creates an ensemble of models that can train a membership inference attack model. The final model can then predict whether a data record was included in the training dataset of the target machine learning model."
Model inversion attacks:
This is where the attacker recreates "representative views of a subset of examples (e.g., a model inversion attack on a face recognition classifier might recover a fuzzy image of a particular person that the classifier can recognize)." (Carlini et al 2020, 3)
For example, a paper by researchers at DeepMind and Microsoft shows how this kind of attack can be used against models that process image data, such as image classifiers.
Training data extraction attacks (TDEAs):
This is similar to model inversion attacks except this attack attempts to recreate training data points.
In particular, the aim of such attacks is to "reconstruct verbatim training examples and not just representative “fuzzy” examples." (Carlini et al 2020, 3)
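To make the membership-inference idea above concrete, here is a minimal sketch of its simplest form, a loss-threshold test in the spirit of Yeom et al (my own illustration; the shadow-model pipeline quoted above is considerably more elaborate):

```python
# Hypothetical sketch of the simplest membership-inference test: predict that
# a record was part of the training set if the model's loss on it is below a
# chosen threshold (e.g., the model's average training loss). Models tend to
# assign lower loss to examples they were trained on, especially when they
# overfit, which is what this test exploits. Illustration only.
import torch
import torch.nn as nn

def membership_guess(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                     threshold: float) -> bool:
    """Return True if (x, y) is predicted to have been a training example."""
    model.eval()
    with torch.no_grad():
        loss = nn.functional.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    return loss.item() < threshold
```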
The focus of the paper is TDEAs and how these can be used against GPT-2. It attempts to show how such attacks are not merely "theoretical or academic" and are actually "practical". (Carlini et al 2020, 3)
4. How much privacy leakage was in GPT-2?
The threat model
The paper sets out the threat model used for its experiment with TDEAs and GPT-2. This includes explaining (i) the adversary's capabilities, (ii) the adversary's objectives, and (iii) the attack target:
Capabilities:
The paper considers an adversary with only "black-box input-output access to a language model." (Carlini et al 2020, 4)
This could include anybody, for example, that has signed up to use ChatGPT and can input prompts and receive the model's outputs.
This means that the adversary can "obtain next-word predictions, but does not allow the adversary to inspect individual weights or hidden states (e.g., attention vectors) of the language model." (Carlini et al 2020, 4)
Objectives:
The objective of a TDEA is to "extract memorized training data from the model." (Carlini et al 2020, 4)
However, the experiment conducted for the paper did not "aim to extract targeted pieces of training data, but rather indiscriminately extract training data." (Carlini et al 2020, 4)
Target:
The target here was GPT-2.
Initial extraction attack
This involved two steps:
Text generation: The researchers first fed the model with a "one-token prompt containing a special start-of-sentence token." (Carlini et al 2020, 5) They then sampled tokens from the model's outputs and fed them back into the model to generate new outputs. By repeating this process, the researchers sought to "sample sequences that the model considers "highly likely", and that likely sequences correspond to memorized text." (Carlini et al 2020, 5) This generated 200,000 samples.
Predicting which outputs contained memorized text: This involved sifting through the samples generated by the model and choosing those examples "that are assigned the highest likelihood by the model." (Carlini et al 2020, 5) Thus, the 200,000 samples generated were sorted by measuring "how well the LM "predicts" the tokens" in a given sequence. (Carlini et al 2020, 6) A rough sketch of both steps follows below.
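As a rough sketch of these two steps (using the Hugging Face transformers library rather than the paper's original code, and with illustrative sampling settings), the attack could look something like this:

```python
# Rough sketch of the initial attack: (1) sample text from GPT-2 starting from
# the start-of-sentence token, (2) rank the samples by perplexity, i.e. how
# confidently the model predicts their tokens. Settings are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def generate_samples(n=5, max_length=64):
    start = tokenizer(tokenizer.bos_token, return_tensors="pt").input_ids
    outputs = model.generate(start, do_sample=True, top_k=40,
                             max_length=max_length, num_return_sequences=n,
                             pad_token_id=tokenizer.eos_token_id)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood
    return torch.exp(loss).item()

# Samples with the lowest perplexity are the ones the model considers most
# likely, and therefore the candidates for memorized training data.
candidates = sorted(generate_samples(), key=perplexity)
```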
The findings of this initial attack were as follows:
The extraction attacks found a variety of memorized content. For instance, "GPT-2 memorizes the entire text of the MIT public license." (Carlini et al 2020, 6) It also memorized "popular individuals' Twitter handles or email addresses." The researchers found that all the memorized content identified was "likely to have appeared in the training dataset many times." (Carlini et al 2020, 6)
There were two key weaknesses of the attack. Firstly, it produced a low diversity of outputs (among the 200,000 samples generated). Secondly, there was a high false positive rate, namely "content that is assigned high likelihood but is not memorized." (Carlini et al 2020, 6) Many of these contained repeated strings, such as the same phrase repeated multiple times.
Improved extraction attack
The researchers launched a second, improved extraction attack on GPT-2, which involved the following changes:
Adjustments were made to increase the diversity of the samples generated by the model. This involved the researchers using their own internet scrapes in order to "generate samples with a diverse set of prefixes that are similar in nature to the type of data GPT-2 was trained on." (Carlini et al 2020, 6)
The use of a second model: by comparing the probabilities that the two models assign to the same sample, the researchers could find "more diverse and rare forms of memorization." (Carlini et al 2020, 7) A rough sketch of this comparison appears below.
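Assuming the same Hugging Face transformers setup as in the earlier sketch, a toy version of that second-model comparison (not the paper's exact metric) might look like this:

```python
# Rough sketch of comparison-based filtering: a sample that the large model
# finds far more likely than a smaller reference model does is a stronger
# memorization candidate than one that is simply generic, fluent text.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
target = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()   # large model under attack
reference = GPT2LMHeadModel.from_pretrained("gpt2").eval()   # smaller reference model

def avg_log_likelihood(model, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -model(ids, labels=ids).loss.item()

def memorization_score(text):
    # Large positive score: the big model is unusually confident about this
    # exact text relative to the reference, hinting at memorization.
    return avg_log_likelihood(target, text) - avg_log_likelihood(reference, text)
```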
Of the 200,000 generated samples, the researchers narrowed this down to 1,800 samples of content that had been potentially memorized by GPT-2. Then, working with the original creators of GPT-2 at OpenAI, the researchers obtained limited access to its training dataset to validate results.
Some of the interesting findings of this improved attack were as follows:
604 unique memorized training examples were identified from the 1,800 samples generated by GPT-2.
The researchers identified "numerous examples of individual peoples' names, phone numbers, addresses and social media accounts." (Carlini et al 2020, 9) Furthermore, some of this personal data only appeared in a few documents in the training dataset for GPT-2:
"We find 46 examples that contain individual peoples' names. When counting occurrences of named individuals, we omit memorized samples that relate to national and international news (e.g., if GPT-2 emits the name of a famous politician, we do not count this as a named individual here). We further find 32 examples that contain some form of contact information (e.g., a phone number or social media handle). Of these, 16 contain contact information for businesses, and 16 contain private individuals' contact details." (Carlini et al 2020, 10)
According to the researchers, the results of their extraction attacks "vastly underestimate the true amount of content that GPT-2 memorized" and that "There are likely prompts that would identify much more memorized content, but because we stick to simple prompts we do not find this memorized content." (Carlini et al 2020, 11)
5. PETs (privacy-enhancing technologies) in machine learning and data protection issues
The vulnerabilities exposed in an LLM like GPT-2 demonstrate the importance of privacy-preserving machine learning and the privacy-enhancing technologies (PETs) that could be used to achieve this. Differential privacy is mentioned in the paper as a possible PET that could mitigate privacy leakage in LLMs, although it has limitations and needs further exploration:
Large companies have...used DP in production machine learning models to protect users' sensitive information. The tradeoff between privacy and utility of models have been studied extensively: differentially-private training typically prevents models from capturing the long tails of the data distribution and thus hurts utility...In the context of language modeling, recent work demonstrates the privacy benefits of user-level DP models. (Carlini et al 2020, 12)
(Differential privacy is a topic that I might explore further in future posts.)
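For a flavour of how differential privacy enters model training, below is a minimal sketch of the core DP-SGD step: clip each example's gradient so that no single record can dominate an update, then add Gaussian noise calibrated to that clipping bound. This is my own illustration with made-up hyperparameters, not the paper's method or a production recipe.

```python
# Minimal sketch of one DP-SGD update: per-example gradient clipping plus
# Gaussian noise. Hyperparameters (clip_norm, noise_multiplier, lr) are
# illustrative only; real deployments also track a formal privacy budget.
import torch
import torch.nn as nn

def dp_sgd_step(model: nn.Module, loss_fn, xs, ys, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                       # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, clip_norm / (norm.item() + 1e-6))   # clip to clip_norm
        for acc, p in zip(summed, model.parameters()):
            acc.add_(p.grad * scale)
    with torch.no_grad():
        for acc, p in zip(summed, model.parameters()):
            noise = torch.randn_like(acc) * noise_multiplier * clip_norm
            p -= lr * (acc + noise) / len(xs)      # noisy, averaged update
```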
The attacks in this paper also relate to an interesting point about LLM training data and data protection compliance. Looking at the GDPR specifically:
As mentioned before, LLMs require large training datasets of natural language. As such, the datasets for models like GPT-2 or ChatGPT are compiled in part by scraping lots of natural language from the internet.
Accordingly, these LLM training datasets likely contain personal data, as the extraction attacks demonstrate.
To the extent that such personal data relates to individuals in the EU (which seems likely if the scraping is done indiscriminately across the web), the relevant requirements of the GDPR would apply (as per Article 2).
A few particularly important requirements may apply in the case of data scraping for training an LLM:
Purpose Limitation (Article 5(1)(b)): Personal data can only be collected for "explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes." Scraping data to build a training dataset for an LLM is certainly an identifiable purpose; the question is whether it is a legitimate one.
Data Minimisation (Article 5(1)(c)): The personal data collected must be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." Data scraping can result in large amounts of data being collected. For complex LLMs with billions of parameters designed for quite general tasks, vast quantities of natural language data are required for effective training, but the data minimisation principle could make training such a model difficult.
Legal Basis for Processing (Article 6(1)): Of the six lawful bases available under the GDPR, 'legitimate interest' would seem to be the most appropriate. However, relying on this basis requires a balancing test whereby the controller's legitimate interests must not be overridden by "the interests or fundamental rights and freedoms of the data subject." This test needs to consider the reasonable expectations of data subjects based on their relationship with the controller. In the case of data scraping for an LLM, the developer realistically has no relationship with the data subjects whose data are collected, so it would be hard to argue that such processing falls within the reasonable expectations of those data subjects.
Transparency (Article 14(1)): Where personal data is not collected directly from the data subject, certain information needs to be provided to the data subject detailing the nature of such data processing. In March 2019, the Polish data protection authority fined a company involved in data scraping activities for failing to meet the requirements of transparency under the GDPR.
Data protection impact assessments (Article 35): Due to the above, data scraping could easily be considered 'high risk' processing and therefore require the completion of a DPIA. This exercise may also require the controller to consult a data protection authority depending on the level of risk identified by the assessment.
Sources
Nicholas Carlini et al, Extracting Training Data from Large Language Models (2020)
François Chollet, Deep Learning with Python (2nd edition, Manning Publications 2021)
Samuel Yeom et al, Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting (2018)