More Notes on LLMs and privacy leakage
A paper exploring the causes of privacy leakage in language models
TL;DR
These notes are on a 2023 paper by Carlini et al. examining the factors that cause large language models (LLMs) to memorize their training data verbatim and thus increase the risk of privacy leakage where that data includes personal data.
I explored the risk of privacy leakage in a previous post on a 2020 paper by Carlini et al. That paper showed that GPT-2 (and possibly other GPT models) can memorize personal data included in its training dataset and reproduce it verbatim when subjected to training data extraction attacks. Personal data may end up in these datasets because they are often built from text indiscriminately scraped from large numbers of webpages. LLMs are then trained on this data and, if they memorise specific extracts from their training data, may reproduce that personal data in their outputs when given the right prompt. You can read more about this here:
The researchers of this 2023 paper found that memorization significantly increases in models when increasing three properties:
The size of the model
The number of times data is duplicated in the training dataset
The number of tokens of context used to prompt the model
The overall finding presented in the paper is that "memorization in [language models] is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations." (Carlini et al 2023, 1)
Increased memorization is undesirable for several reasons. It interferes with privacy, decreases the utility of the model (text that is easy to memorize is often of low quality, e.g., open source licenses) and hurts fairness (some texts are memorized while others are not).
Furthermore, regarding the interference with privacy, if the data memorised relates to people who are not necessarily public figures, then EU law is likely to prioritise their right to privacy over the economic and/or public interest in using their data for training an open source LLM where there is a risk of privacy leakage. See the Google Spain Case for evidence of this (while this case was about including personal data in search engine results, the principle could also be applicable to LLMs):
As the data subject may, in the light of his fundamental rights under Articles 7 and 8 of the Charter, request that the information in question no longer be made available to the general public by its inclusion in such a list of results, it should be held, as follows in particular from paragraph 81 of the present judgment, that those rights override, as a rule, not only the economic interest of the operator of the search engine but also the interest of the general public in finding that information upon a search relating to the data subject’s name. However, that would not be the case if it appeared, for particular reasons, such as the role played by the data subject in public life, that the interference with his fundamental rights is justified by the preponderant interest of the general public in having, on account of inclusion in the list of results, access to the information in question. (at para. 97 of the judgment)
If you enjoy this content, consider subscribing!
How was the research conducted?
The paper looks to quantify memorization across different language models with their respective training datasets. The main model that the researchers tested on was GPT-Neo. This model is a text generation model that comes in four sizes: 125 million, 1.3 billion, 2.7 billion and 6 billion parameters. It is trained on the Pile dataset, which consists of books, scraped webpages and open source code.
The method used to determine whether a model has memorized its training data was as follows:
Memorization is where a model is able to reproduce training data when given certain inputs with a given context length. The input here would be a prefix of the data in the training dataset. For example, let's say that the training dataset for the model contains the sequence 'Bob's phone number is 123456'. If you give the model a prompt that only contains the start of the sequence, such as 'Bob's phone number is', and the model completes this by outputting '123456' in its response, then the original sequence has likely been memorized by the model and is therefore extractable.
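To make this concrete, below is a minimal sketch of such a check (my own illustration, not code from the paper) using the Hugging Face transformers library with a GPT-Neo checkpoint; the prefix and suffix are the made-up example above.

```python
# Minimal sketch of a verbatim-extraction check (illustrative only, not the paper's code).
# Assumes the Hugging Face `transformers` library and the 1.3B GPT-Neo checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def is_memorized(prefix: str, true_suffix: str, max_new_tokens: int = 50) -> bool:
    """Greedily continue `prefix` and check whether the model reproduces `true_suffix`."""
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, as in verbatim-extraction tests
    )
    # Decode only the newly generated continuation (drop the prompt tokens)
    generated = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return generated.strip().startswith(true_suffix.strip())

# The toy example from the text: a positive result would suggest the training
# sequence "Bob's phone number is 123456" has been memorized.
print(is_memorized("Bob's phone number is", "123456"))
```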
The researchers then took a subset of the training dataset (50,000 sequences) so that they could run the tests efficiently. Using the whole dataset, which is hundreds of gigabytes in size, would be very computationally demanding and thus expensive and time-consuming. The researchers made sure to select samples that were representative of the broader dataset.
For each of the 50,000 sequences, the final 50 tokens were held out. The remaining part of the sequence formed the prefix included in the test prompt for the model. If the model correctly reproduced the 50 tokens removed from the original sequence, that sequence was deemed to be memorized.
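A rough sketch of how that evaluation loop might look, reusing the model from the snippet above and assuming a hypothetical list sampled_token_ids holding the token IDs of the sampled training sequences (this is a reconstruction under those assumptions, not the paper's released code):

```python
# Sketch of the evaluation described above (a reconstruction, not the paper's code).
# `sampled_token_ids` is a hypothetical list of token-ID sequences drawn from the
# training set; `model` is reused from the previous snippet.
import torch

SUFFIX_LEN = 50  # the final 50 tokens of each sequence are held out as the target

def fraction_memorized(sampled_token_ids) -> float:
    memorized = 0
    for ids in sampled_token_ids:
        prefix_ids = torch.tensor([ids[:-SUFFIX_LEN]])
        true_suffix = list(ids[-SUFFIX_LEN:])
        output_ids = model.generate(
            prefix_ids, max_new_tokens=SUFFIX_LEN, do_sample=False
        )
        generated_suffix = output_ids[0, prefix_ids.shape[1]:].tolist()
        # A sequence counts as memorized only if all 50 held-out tokens match exactly
        if generated_suffix == true_suffix:
            memorized += 1
    return memorized / len(sampled_token_ids)
```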
What were the results?
The researchers looked at the impact of three properties on a model's ability to memorise its training data verbatim: model size, the number of duplicates in the training dataset, and the size of the context window. The context window is the span of text (measured in tokens) that the model considers when producing responses to a given sequence. For instance, if the model is prompted with the sequence 'The cat sat on the' and the context window is set to 3, then it will only consider the words 'sat', 'on' and 'the' when predicting the next word in the sentence.
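As a toy illustration of that truncation (my own example, reusing the tokenizer from the snippets above; nothing here comes from the paper):

```python
# Toy illustration of limiting the context a model sees: keep only the last
# `context_window` tokens of the prompt. Reuses the `tokenizer` from above.
def truncate_to_context(prompt: str, context_window: int) -> str:
    token_ids = tokenizer(prompt)["input_ids"]
    return tokenizer.decode(token_ids[-context_window:])

print(truncate_to_context("The cat sat on the", 3))  # roughly " sat on the"
```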
See the post below for a more detailed explanation of how LLMs work:
Model Size
On this, the researchers found that "larger models memorize significantly more than smaller models do." In particular, "a ten fold increase in model size corresponds to an increase in memorization of 19 percentage points." (Carlini et al 2023, 4)
The researchers also tested their curated dataset on GPT-2. The purpose of this was to see whether the inputs in their dataset were merely easy for any LLM to predict or whether they were actually being memorized by GPT-Neo. If the former were true, then GPT-2 should reproduce the test inputs to an extent similar to GPT-Neo. If the latter were true, then GPT-2 should not reproduce very many of the inputs. It turned out that the latter was true; GPT-Neo was actually memorizing its training data and not merely reproducing data that is universally memorable across different models:
We find that GPT-2 correctly completes approximately 6% of the examples in our evaluation set, compared to 40% for the similarly sized 1.3B parameter GPT-Neo model. (Carlini et al 2023, 5)
Accordingly, the paper finds that "larger models have a higher fraction of extractable training data because they have actually memorized the data; it is not simply that the larger models are more accurate." (Carlini et al 2023, 5)
Number of duplicates
The researchers found that "memorization in language models increases with the number of times sequences are repeated in the training set." (Carlini et al 2023, 5) This happens with even a few duplicates. This may mean that deduplication applied to a dataset before it is used for training may not fully mitigate the capacity for a model to memorize data (though see further below).
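As a very rough sketch of the kind of duplicate counting such an analysis relies on (a simplification of my own, not the paper's method, which has to scale to hundreds of gigabytes):

```python
# Simplified sketch of counting how often fixed-length token windows repeat
# across a corpus (illustrative only). `corpus_token_ids` is a hypothetical
# list of tokenized documents.
from collections import Counter

WINDOW = 50  # count duplicates at the granularity of 50-token windows

def window_counts(corpus_token_ids):
    counts = Counter()
    for doc in corpus_token_ids:
        for i in range(len(doc) - WINDOW + 1):
            counts[tuple(doc[i:i + WINDOW])] += 1
    return counts

# Windows that appear many times correspond to the heavily duplicated sequences
# that the paper finds are most at risk of being memorized.
```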
Context window
The researchers found that "it is possible to vary the amount of extractable training data by controlling the length of the prefix passed to the model." (Carlini et al 2023, 5) In particular, "the fraction of extractable sequences increases log-linearly with the number of tokens of context." (Carlini et al 2023, 5)
Accordingly, the researchers describe a 'discoverability phenomenon': "some memorization only becomes apparent under certain conditions, such as when the model is prompted with a sufficiently long context." (Carlini et al 2023, 5) For example, if the model is conditioned on 100 tokens of context rather than 50, it will "estimate the probability of the training data as higher." (Carlini et al 2023, 5)
As such, developers deploying LLMs via APIs can "significantly reduce extraction risk by restricting the maximum prompt length available to users." (Carlini et al 2023, 6) Additionally, this finding "suggests that correctly auditing [LLMs] likely requires prompting the model with the training data, as there are no known techniques to identify the tail of memorized data without conditioning the model with a large context." (Carlini et al 2023, 6)
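A sketch of what such an audit might look like, sweeping over prefix lengths and reusing the hypothetical model and token-ID sample from the earlier snippets (the lengths chosen are illustrative, not the paper's exact setup):

```python
# Sketch of auditing extraction at different prompt lengths (illustrative only).
# Reuses `model` and the hypothetical `sampled_token_ids` from earlier snippets.
import torch

def extraction_rate(sampled_token_ids, prefix_len: int, suffix_len: int = 50) -> float:
    hits = 0
    for ids in sampled_token_ids:
        # Condition on only the last `prefix_len` tokens before the held-out suffix
        prefix_ids = torch.tensor([ids[-(prefix_len + suffix_len):-suffix_len]])
        true_suffix = list(ids[-suffix_len:])
        out = model.generate(prefix_ids, max_new_tokens=suffix_len, do_sample=False)
        if out[0, prefix_ids.shape[1]:].tolist() == true_suffix:
            hits += 1
    return hits / len(sampled_token_ids)

# The log-linear trend reported in the paper would show up here as a fraction
# that rises steadily with the prefix length.
for prefix_len in (50, 100, 200, 450):
    print(prefix_len, extraction_rate(sampled_token_ids, prefix_len))
```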
What if you deduplicate the training data?
The researchers tested the GPT-Neo models on deduplicated data to see what impact this would have on memorization. Two findings were made.
The first (and perhaps expected):
...models trained on deduplicated datasets memorize less data than models trained without deduplication. For example, for sequences repeated below 35 times, the exact deduplicated model memorizes an average of 1.2% of sequences, compared to 3.6% without deduplication. (Carlini et al 2023, 9)
The second (and a very interesting one):
...while deduplication does help for sequences repeated up to ~100 times, it does not help for sequences repeated more often! The extractability of examples repeated at least 408 times is statistically significantly higher than any other number of repeats before this. We hypothesize that this is due to the fact that any deduplication strategy is necessarily imperfect in order to efficiently scale to hundreds of gigabytes of training data. (emphasis added) (Carlini et al 2023, 9)
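To see why deduplication is hard to get perfect at that scale, here is a toy sketch of exact whole-sequence deduplication by hashing (my own illustration; real pipelines must also handle overlapping substrings across hundreds of gigabytes, which is where the imperfections creep in):

```python
# Toy sketch of exact deduplication by hashing whole sequences (illustrative only).
# Real pipelines must deduplicate at the substring level across hundreds of
# gigabytes, which is why, as the quote above notes, they are necessarily imperfect.
import hashlib

def deduplicate(sequences):
    seen = set()
    kept = []
    for seq in sequences:
        digest = hashlib.sha256(seq.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(seq)
    return kept

print(deduplicate(["same licence text", "same licence text", "unique article"]))
```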
Does the scaling trend apply to all model sizes?
This would seem to be the case. The tests revealed evidence of sequences memorized across all model sizes:
We found most of these universally-memorized sequences to be "unconventional" texts such as code snippets or highly duplicated texts such as open source licences. (Carlini et al 2023, 7)
Does the scaling trend apply to other models?
The researchers also tested the OPT family of models. These models are trained on a dataset that partially overlaps with the Pile and that was deduplicated before training.
They found "nearly identical scaling trends." (Carlini et al 2023, 9)
Source: Carlini et al 2023, 14
This could mean two things:
Careful curation of the training data can mitigate memorization.
Small shifts in the data distribution can greatly alter the data that the model memorises.
Conclusion
Overall, this paper finds that if the distribution of the training data is slightly skewed due to, for instance, duplicated sequences, then larger models are "likely to learn these unintended dataset peculiarities." (Carlini et al 2023, 9)
Larger models will likely remember more training data verbatim than smaller models; memorization scales with model size. The biggest model tested in the paper had 6 billion parameters, while state-of-the-art models are far larger: OpenAI's GPT-3, for example, has 175 billion parameters (roughly 30 times more), and GPT-4 may have many more. It is therefore likely that these larger models will memorise even more of their training data. For smaller models, however, sequences appearing just once in the training dataset are rarely memorised, so deduplication prior to training could be an effective measure in this context.