TL;DR
This newsletter is about whether large language models (LLMs) store personal data they may have been trained on. It looks at how LLMs work, the definition of personal data under the GDPR and the various arguments around this issue.
Here are the key takeaways:
In July 2024, the Hamburg DPA published a discussion paper arguing that LLMs do not store personal data. That paper contends that LLMs only store the correlations between fragments of words (i.e., tokens) in a numerical form, which does not constitute information that can be used to identify a person.
The Hamburg DPA explains how the complex architecture of LLMs transforms personal data into these numerical representations. Such architectures process their training data in such a way that the information stored by the LLM cannot, on its own, be linked back to specific individuals.
However, if the model comes across certain data in its training data frequently enough, it can create strong correlations between the tokens that make up that data. Those strong correlations may be learned and retained by the model such that the correct prompt can be used to extract this data.
An LLM may therefore be regarded as storing personal data that it has memorised in this way. Accordingly, such personal data would constitute pseudonymised data when stored in the model, becoming readable when the correct prompt is used to generate it from the model.
European data protection authorities appear to have diverging views on this issue. This could be addressed by the European Data Protection Board in a future opinion it may publish on AI models and the GDPR.
Summary of the Hamburg paper
Back in July, the Hamburg data protection authority (DPA) published a discussion paper looking at whether LLMs store personal data.
The argument presented by the Hamburg DPA in their paper is as follows:
The text data collected for the training of the LLM (which may contain personal data) go through a process of tokenisation. With this process, the text data are broken up into chunks (either individual words or fragments of words) and converted into numerical values (i.e., tokens).1
These tokens are then processed into word embeddings. These embeddings consist of information about the correlations between the different tokens.2
LLMs do not store whole words or sentences. Instead, they only store "linguistic fragments" as "abstract mathematical representations" and learn the correlations and patterns between these fragments during training, which are captured in their parameters.3
So when an LLM produces its output, it uses the correlations and patterns contained in its parameters to combine the different fragments together. This is used to produce whole sentences and paragraphs in a response to a given prompt.
Accordingly, LLMs produce probabilistic outputs (i.e., the outputs are predictions of what the model thinks the best response to the prompt is, based on what it learned during training). This is different from the deterministic outputs that a database query would return, where a certain input always produces the same output.4
Based on this argument, the Hamburg DPA concludes that LLMs do not store personal data as per the GDPR.
The paper distinguishes the information stored in LLMs from "IP addresses, exam responses, legal memos by public offices, vehicle identification numbers or other coded character strings such as the [transparency and control] string."5 In doing so, it cites the different CJEU cases touching on the processing of each type of personal data.
The common thread running through all these cases is that these types of data "indicate a reference to the identification of a specific person or to objects assigned to persons".6 This makes such data "identifiers" or information "relating" to a data subject.
This stems from the function that such data performs. For example, the purpose of an IP address is to specify the location of an internet-enabled device so that it can receive and send information via the internet.
The tokens stored in LLMs, on the other hand, "lack individual information content and do not function as placeholders for such."7 The models only store "highly abstracted and aggregated data points from training data and their relationships to each other, without concrete characteristics or references that "relate" to individuals."8
Even the existence of LLM privacy attacks does not change this finding. I have written previously on how OpenAI's GPT models can be subject to training data extraction attacks.
The Hamburg DPA contends that such attacks do not evidence the storage of personal data in LLMs for two main reasons:
The presence of personal data in an LLM's output "is not conclusive evidence that personal data has been memorized, as LLMs are capable of generating texts that coincidentally matches training data."9
As per CJEU case law, "data can only be classified as personal data if identification is possible through means of the controller or third parties that are not prohibited by law and do not require disproportionate effort in terms of time, cost and manpower."10
The implication of the Hamburg DPA's arguments is that, if an organisation takes a copy of an LLM produced by a developer, the mere hosting of that LLM in its own environment does not mean that it is storing the personal data the model may have been trained on. All that is contained in the model are tokens, not personal data.
But I think the other, more significant implication relates to data subject rights, which the Hamburg DPA notes in its paper:
A person enters their name into a company's or authority's LLM-based chatbot. The LLM-based chatbot provides incorrect information about them. Which data subject rights can they assert in relation to which subject matter?
As LLMs don't store personal data, they can't be the direct subject of data subject rights under articles 12 et seq. GDPR. However, when an AI system processes personal data, particularly in its outputs or database queries, the controller must fulfil data subject rights.
In the case outlined above, this means that the data subject can request the organization to provide, at least regarding the input and the output of the LLM chatbot,
that information is provided in accordance with article 15 GDPR,
that their personal data is rectified in accordance with article 16 GDPR,
if applicable, that their personal data is erased in accordance with article 17 GDPR.11
To be clear, the Hamburg DPA does still argue that personal data may be contained in the training data used to develop the LLM. But after that training data has been used for this purpose, the resulting trained model does not itself store that personal data.
Accordingly, it is implied that organisations wanting to fine-tune LLMs in their own environment are exempt from certain GDPR obligations, such as responding to data subjects' requests for rectification of their personal data in the model.
How do LLMs work?
There are many different LLMs out there at this point. This post just focuses on the GPT models by OpenAI, although generally other LLMs work in a similar way.
The GPT model consists of the following components:
Tokenisation
Word embeddings
Positional encoding
Transformer blocks
Probability distributions
Fine-tuning
Tokenisation
For an LLM to work with a text input, it needs that text to be converted into a sequence of numbers that it can interpret and perform functions on.
This is what tokenisation is about. When an LLM takes in text data as an input, it converts that data into a string of numbers using a tokeniser.
This is a mechanism that "splits the text into a vocabulary of smaller constituent units (tokens) that can be processed by the [LLM]."12 This vocabulary consists of "both common words and word fragments from which larger and less frequent words can be composed."13
For example, if we took OpenAI's tiktoken, which is a tokeniser that it uses for its GPT models, and applied it to the sentence "With great power comes great responsibility", each word in that sentence would be represented as values that would look like the following:
[2886, 2212, 3470, 5124, 2212, 16143]
Accordingly, from the perspective of the LLM:
"With" turns into
2886
"great" turns into
2212
"power" turns into
3470
"comes" turns into
5124
"responsibility" turns into
16143
As the Hamburg DPA points out in its paper, though, the tokenisation for LLMs will convert both whole words and fragments of words into tokens. So in reality, tokenising the above sentence could result in more tokens representing fragments of the words contained in that sentence.
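For readers who want to see this in practice, here is a minimal sketch using the tiktoken library (the cl100k_base encoding is used purely for illustration; the exact token IDs depend on the encoding, so they may not match the numbers quoted above):

import tiktoken  # pip install tiktoken

# Load one of OpenAI's published encodings. The IDs it produces may differ
# from the example IDs above, which depend on the encoding used.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("With great power comes great responsibility")
print(token_ids)  # a list of integers, one per word or word fragment

# Each ID maps back to a word or word fragment:
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))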
Word embeddings
After tokenisation, the tokens go through an embedding layer whereby the tokens are mapped to a word embedding.
The purpose of this is to identify the relationships between the tokens. This helps to separate tokens that are closely related to each other from those tokens that are not related to each other.
The embedding layer encodes each token in a matrix so that each token is represented as a vector. In this context, a vector is essentially a list of numbers where each value is 0, except for the value corresponding to the token, which is set to 1.14
For example, the vectors for the tokens in the sentence "With great power comes great responsibility" would look something like the following:
[[1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0.]]
The above is a 2-dimensional array (or matrix) where each row represents a word in the sentence. For example, the first row [1. 0. 0. 0. 0. 0.] represents the word 'With'. The row [0. 1. 0. 0. 0. 0.] represents 'great' and appears twice in the matrix since that word appears twice in the sentence.
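As an illustration (my own toy example, not code from any particular model), the matrix above can be built with a few lines of NumPy:

import numpy as np

sentence = "With great power comes great responsibility".split()
vocab = {"With": 0, "great": 1, "power": 2, "comes": 3, "responsibility": 4}

# One row per word in the sentence, one column per vocabulary slot
# (six columns here to match the example above).
one_hot = np.zeros((len(sentence), 6))
for row, word in enumerate(sentence):
    one_hot[row, vocab[word]] = 1.0

print(one_hot)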
When these embeddings are projected onto a graph, you can visualise how close or far away the tokens are from each other. The screenshot below is from 3Blue1Brown's video on YouTube on LLMs that shows what these word embeddings look like:
In the screenshot you can see the connections the LLM picks up from the sentence "All data in deep learning must be represented as vectors" by looking at where it places each word from that sentence on the graph. However, the screenshot only shows the vectors projected onto a geometric space with a few dimensions.
For LLMs, the vectors for the word embeddings will have thousands of dimensions that capture several different relationships between the tokens. For example, GPT-3 processes word embeddings with over 12,000 dimensions.15
Positional encoding
Following the embedding layer is the positional encoding mechanism. The purpose of positional encoding is to identify the position of each word in the sentence and capture this information within a vector.
This enables the LLM to have a better understanding of the text data it processes, including grammar and syntax. The vector representing the positional encoding of a word in a sentence might look something like this:
[5.40302306e-01 3.87674234e-01 9.87466836e-01 6.30538780e-02 9.99684538e-01 9.99983333e-03 9.99992076e-01 1.58489253e-03 9.99999801e-01 2.51188641e-04 9.99999995e-01 3.98107170e-05 1.00000000e+00 6.30957344e-06 1.00000000e+00 1.00000000e-06 1.00000000e+00 1.58489319e-07 1.00000000e+00 2.51188643e-08]
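As a rough sketch, the commonly used sinusoidal scheme from the original Transformer paper produces vectors like the one above (the exact values and layout printed above may come from a slightly different implementation):

import numpy as np

def positional_encoding(position: int, d_model: int) -> np.ndarray:
    # Even indices use sine, odd indices use cosine, with wavelengths that
    # grow geometrically across the dimensions (assumes d_model is even).
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe[i] = np.sin(angle)
        pe[i + 1] = np.cos(angle)
    return pe

print(positional_encoding(position=1, d_model=20))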
Transformer blocks
After the text data goes through the tokeniser, embedding layer and positional encoder of the LLM, it is transformed into a new input comprising three sets of information: the token, the embedding vector and the positional encoding vector.
This input then goes through the transformer blocks. This is the crux of the LLM.
The transformers used for the GPT models consist of what are known as decoders. These decoders consist of multiple layers of (i) masked self-attention mechanisms, and (ii) feed forward neural networks.
Attention is essentially about looking at every word in the given text data to understand the relevant context. But the GPT models use a specific technique called self-attention, where the model looks only at the words in the given input text data to gauge the context, which in turn helps to identify the correct meaning of the sentence.
Take the following two sentences:
"Server, can I have the check please?"
"Looks like I just crashed the server."
Both sentences include the word 'server'. However, that word has a different meaning in each sentence. The way that we can tell this is by observing the other words in the sentence.
In the first sentence, the word 'check' indicates that the type of server being used in that sentence describes an employee in a restaurant. In the second sentence, the word 'crashed' indicates that the server being referred to is a type of computer.
This is what self-attention is about; the model is observing the words in the given text data input to identify the relevant context and therefore interpret the sentence correctly. However, the GPT models add one more step to this, which is where the 'masking' comes in:16
When training the model, we could feed it the whole sentence, for example: "With great power comes great responsibility."
However, the GPT models are autoregressive, meaning that the LLM is trained to generate an output word-by-word, where the predicted next word is conditioned on the previous words.
Accordingly, if we gave the model the full sentence, it would essentially see the answers that it is supposed to be predicting and the model would not be learning properly.
So instead of instructing the model to look at all the words in the sentence, we can instruct the model to only look at the current or previous words.
For example, we would give the model the words "With great power comes" so that it can apply the self-attention mechanism to those words in order to predict the next word, which in this case would be "great".
The masked self-attention mechanism ultimately gives each word in the input a score that essentially represents the level of attention that each word should be given. After being assigned a masked self-attention score, the text data then flows through a feed forward neural network.
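To make the masking step concrete, here is a toy NumPy sketch (my own illustration, not code from any GPT model) showing a causal mask applied to a matrix of attention scores:

import numpy as np

words = ["With", "great", "power", "comes"]
n = len(words)

# Pretend these are raw attention scores between every pair of words.
scores = np.random.randn(n, n)

# Causal ("masked") self-attention: each word may only attend to itself and
# to earlier words, so everything above the diagonal is blocked out.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

# A softmax turns each row of scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # the upper triangle is 0: no peeking at future words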
The role of the neural network is essentially to find a path from the input text data to the correct next word in the sentence. This is done by activating the correct neurons in each layer in the network and applying the correct values for the weights and bias (i.e., the parameters of the neural network).
The network learns to find these pathways by:
Making a prediction on the input
Measuring the distance between the prediction and the true output using a loss function and recording this within a loss score
Feeding the loss score into an optimizer that then adjusts the parameters based on this loss score (using a process called gradient descent)
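As a rough illustration of that loop (a toy example with a single parameter and a squared-error loss, nothing like the scale of a real LLM), gradient descent looks something like this:

# Toy gradient descent: learn a single weight w so that prediction = w * x
# matches the true output for each training pair.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, true output) pairs
w = 0.0              # the single "parameter" being learned
learning_rate = 0.05

for step in range(100):
    for x, y_true in data:
        y_pred = w * x                        # 1. make a prediction
        loss = (y_pred - y_true) ** 2         # 2. measure the loss
        gradient = 2 * (y_pred - y_true) * x  # 3. work out how to adjust w
        w -= learning_rate * gradient         #    and adjust it

print(round(w, 3))  # converges towards 2.0, the weight that best fits the data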
The ultimate goal of all this is to find the optimal set of parameters that make the most accurate predictions. The GPT models are trained to do this in an unsupervised manner, as Stephen Wolfram explains in his blog post:
Recall that the basic task for ChatGPT is to figure out how to continue a piece of text that it’s been given. So to get it “training examples” all one has to do is get a piece of text, and mask out the end of it, and then use this as the “input to train from”—with the “output” being the complete, unmasked piece of text...the main point is that—unlike, say, for learning what’s in images—there’s no “explicit tagging” needed; ChatGPT can in effect just learn directly from whatever examples of text it’s given.
Probability distributions
After the text data in a prompt goes through the several layers of decoder blocks in the transformer architecture, the LLM will map each word embedding to all the tokens in its vocabulary (i.e., all the tokens that the model has learned from its training data).17 A function is then applied to convert the mappings into probabilities for each word.
What the model is doing here is looking at all the tokens that it knows and working out the probability of each token being the next word in the given input text data. This is what is meant by a probability distribution.
For example, if we provided the model with the words "With great power comes", it might produce the following probability distribution:
Top 10 predicted words:
great: 0.24902514
with: 0.0548905
a: 0.0316601
in: 0.02712334
after: 0.025105478
lies: 0.02287713
to: 0.020820534
that: 0.016439462
and: 0.015829766
an: 0.01466654
At this point, there are a few different ways to select a word from this probability distribution. The simplest method would be to select the word with the highest probability.
Alternatively, to make the LLM generate more nuanced outputs, you could instruct the model to select from a sample of the distribution. For example, with top-k sampling, the model will select the next word "from only the top-K most likely possibilities."18
If k is set to 10, let's say, then the model will randomly select one of the 10 most probable words from the probability distribution. The purpose of using a method like top-k sampling is to prevent the model "from accidentally choosing from the long tail of low probability tokens and leading to an unnecessary linguistic dead end."19
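Here is a small sketch of top-k sampling (the words and probabilities are illustrative, loosely modelled on the example distribution above; renormalising among the top k is a common implementation choice rather than something specified in the sources):

import numpy as np

# An illustrative probability distribution over candidate next words.
words = ["great", "with", "a", "in", "after", "lies", "to", "that", "and", "an", "the", "power"]
probs = np.array([0.25, 0.055, 0.032, 0.027, 0.025, 0.023, 0.021, 0.016, 0.016, 0.015, 0.01, 0.01])
probs = probs / probs.sum()  # normalise so the probabilities sum to 1

def sample_top_k(words, probs, k=10):
    # Keep only the k most likely candidates, renormalise their probabilities
    # and sample among them, ignoring the long tail of unlikely tokens.
    top = np.argsort(probs)[::-1][:k]
    top_probs = probs[top] / probs[top].sum()
    return words[np.random.choice(top, p=top_probs)]

print(sample_top_k(words, probs))  # usually "great", occasionally another top-10 word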
How LLMs generate text
Putting all the above together, the GPT models generate new text in the following way:20
1. A tokeniser is applied to the input text data to convert the natural language into tokens (i.e., a string of numbers).
2. The tokens go through an embedding layer where each token is mapped to a word embedding, resulting in a matrix of vectors where each vector represents a token.
3. A positional encoder adds another vector to each word in the input text data representing the position of that word in the sentence.
4. These vectors then go through multiple layers of decoders in a transformer architecture consisting of a masked self-attention mechanism and a feed forward neural network:
The masked self-attention mechanism gives each token a score representing how much attention should be given to that word in the context of the input.
The feed forward neural network finds the best path between the tokens and the correct next word in the sentence.
5. The model produces a probability distribution and selects a token from this distribution.
But in order for the model to produce longer bodies of text, there are two more steps involved:
6. The model adds the predicted token to the original input to create a new input.
7. The LLM then takes in the new input and repeats steps 1-6 until it creates a complete sentence or even a paragraph.
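Putting the loop into code, a heavily simplified sketch might look like this (predict_next_token is a hypothetical stand-in for the whole pipeline described above, not an actual OpenAI function):

import random

def predict_next_token(text: str) -> str:
    # Hypothetical stand-in for steps 1-5 above: tokenisation, embeddings,
    # positional encoding, the decoder blocks and sampling from the resulting
    # probability distribution. Here it just picks a word at random so the
    # loop is runnable.
    return random.choice([" great", " power", " responsibility", "<end_of_text>"])

def generate(prompt: str, max_new_tokens: int = 20) -> str:
    text = prompt
    for _ in range(max_new_tokens):
        next_token = predict_next_token(text)
        if next_token == "<end_of_text>":
            break
        text += next_token  # step 6: append the prediction and feed it back in
    return text

print(generate("With"))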
Fine-tuning
I won't go into how fine-tuning works since it is not too pertinent to the topic of this post. However, you can read my previous explanations of fine-tuning, both a simple explanation in a post on OpenAI's o1 model and a more detailed explanation in a post on Google's Gemini model.
What constitutes the 'processing' of 'personal data' under the GDPR?
Under Article 4.1 of the GDPR, 'personal data' is defined as:
any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
This definition can be broken down into four main elements:
'any information'
'relating to'
'identified or identifiable'
'natural person'
'any information'
Personal data can include information that may be objective or subjective in nature. This means that opinions or assessments about a person can constitute their personal data.21
The content of personal data could include information about one's private life that could be regarded as sensitive, but it is not just limited to this.22 Recital (30) GDPR states that online identifiers, such as IP addresses and cookie identifiers, can also be classed as personal data.
This information can also come in a variety of forms. It could be in a written form or even a sound or image.23
'relating to'
For information to be personal data under the GDPR, that information must be about an individual. This means that either the content, purpose or effect must be linked to a particular person:24
The content element is satisfied if the information itself is about an individual, such as the exam result of a student.
The purpose element is satisfied if the information can be used to evaluate or analyse an individual.
The effect element is satisfied if the use of the information has an impact on an individual's rights or interests.
'identified or identifiable'
Under the definition of personal data, identifiability means a data controller or another person being able to single out an individual either directly or indirectly.
Whether a piece of information can be used to identify a person must be assessed from the perspective of those parties with access to that information.25 This could be the data controller or a third party that manages to gain access to the data.
However, Recital (26) states that account must be taken of "means reasonably likely to be used" to identify a person:
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
Accordingly, the CJEU has held that:
...if the identification of the data subject was prohibited by law or practically impossible on account of the fact that it requires a disproportionate effort in terms of time, cost and man-power...the risk of identification appears in reality to be insignificant.26
'natural person'
To be personal data, the information must relate to a natural person, and therefore not corporations, partnerships or other legal persons. Additionally, Recital (27) states that the GDPR does not apply to deceased persons.
Processing of personal data
Under Article 4.2 GDPR, the 'processing' of personal data means any operation performed on that data. This includes storing personal data, but also includes, among other things, its collection, organisation, structuring, alteration, transmission or dissemination.
Furthermore, the CJEU has previously found that processing:
...may consist in one or a number of operations, each of which relates to one of the different stages that the processing of personal data may involve.27
LLMs do not store personal data
I think the Hamburg paper does a good job of covering the arguments for why there is no personal data stored in LLMs after training. But I want to expand on this a little further.
The Hamburg DPA's explanation of LLM architectures only looks at the tokeniser and the embedding layer. While these are important components, it does not provide the full picture.
As mentioned before, when an LLM is being trained, it constructs a vocabulary (or vocab) consisting of all the tokens it comes across in its training. When making its predictions about the next word in a sentence, it refers to this vocab to produce a probability distribution from which it selects the appropriate tokens.
This vocab consists of a matrix of word embeddings that are learned during training. Such a matrix essentially looks like a table:
For example, GPT-3 has a vocab consisting of 50,257 tokens with an embedding size of 12,288. So the table would have 50,257 columns, one for each token, and each column would have 12,288 rows of values representing the word embedding for that token.
But this table IS NOT a database. The information in the vocab consists of values for the parameters that are learned during training, and those values codify the relationships between all the tokens that the model comes across during training (and this is in addition to the weights stored in the neural networks of each transformer block).
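To picture what that stored information looks like, here is a small NumPy sketch (illustrative only: the embedding size is scaled down from GPT-3's 12,288 dimensions, and the values are random rather than learned):

import numpy as np

vocab_size = 50_257   # GPT-3's vocabulary size
embedding_dim = 128   # scaled down here; GPT-3 uses 12,288 dimensions

# In a trained model these values are learned parameters; random numbers are
# used here simply to show the shape of what is stored.
embedding_table = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

# "Looking up" a token just returns a row of floats: no names, no records,
# nothing that reads like a database entry.
print(embedding_table[2886][:5])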
So altogether then, an LLM stores:
All the tokens it comes across during training
The embedding vectors for each token
The weights for the neural networks in each transformer block
This information does not itself relate to an identifiable natural person. An LLM, as the Hamburg paper contends, merely consists of "abstract mathematical representations" of "linguistic fragments" and the patterns between these fragments.
So when you look under the hood of an LLM, this is the information you see. And this information cannot ordinarily be linked to specific individuals whose personal data may be in the training data, for it is just a bunch of numbers.
Or is it?
LLMs do store personal data
Perhaps the most confusing part of this topic is how we are to conclude that LLMs do not store personal data when they are capable of producing personal data in their outputs.
Following the Hamburg DPA's thinking, the answer to this is that the LLM uses the patterns and correlations learned from training to predict the best response to the prompt. Sometimes this results in personal data being constructed from the word fragments derived from its vocab.
But when I consider the research on the tendency of LLMs to memorise data they have seen verbatim in their training data, I start to have doubts about the conclusion that LLMs never store personal data.
After training on a large dataset of text, the LLM has an understanding of language in the sense that it understands the correlations and patterns between words. However, the model does not necessarily learn factual information about the language it learns.
Any "factual knowledge" that the model is able to produce is merely derived from its understanding of language and the correlations between different words (or tokens). So by understanding language, it is possible for LLMs to "learn" factual information.
In particular, such factual information could be learned if the relevant patterns appear frequently enough in the training dataset. The more often the model comes across a certain pattern during training, the more prominently that pattern will be represented in its parameters.
Whenever the model then receives an input containing text relating to that more frequent pattern, it will rely on that pattern to produce its response. In doing so, the model could produce data that it has essentially 'memorised' from its training data.
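As a highly simplified illustration of that effect (a toy frequency count over a made-up corpus about a fictional person, nothing like a real transformer), frequent patterns come to dominate the prediction:

from collections import Counter

# A made-up "training corpus" in which one (fictional) fact appears far more
# often than anything else about the same person.
corpus = (
    ["alice example was born on 1 april 1980"] * 50
    + ["alice example was born in springfield"] * 3
)

# Count which continuation follows the context "alice example was born".
context = "alice example was born"
continuations = Counter(
    sentence[len(context):].strip()
    for sentence in corpus
    if sentence.startswith(context)
)

# The most frequent continuation dominates, so a model trained on this corpus
# would be strongly pushed towards reproducing it verbatim.
print(continuations.most_common(2))
# [('on 1 april 1980', 50), ('in springfield', 3)]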
This is an observation made by David Rosenthal of the Swiss law firm Vischer in his blog post exploring the question of whether LLMs store personal data. He uses the example of Donald Trump's birthday to illustrate how LLMs can sometimes memorise bits of factual information from its training data if it appears often enough:
If you ask a common LLM when Donald Trump was born, the answer will be "On 14 June 1946". If you ask who was born on 14 June 1946, the answer will be "Donald Trump". On the other hand, if you ask who was born in June 1946, you will sometimes get Trump as the answer, sometimes other people who were not born in June 1946 (but had their birthday then), sometimes others who were actually born in June 1946, depending on the model. The models have seen the date 14 June 1946, the word "birthday" or "born" and the name Donald Trump associated with each other very frequently during training and therefore assume that this pairing of words in a text is most likely to match, i.e. they have a certain affinity with each other and therefore form the most likely response to the prompt. The models do not "know" more than this.
With this being the case, you could argue that, ultimately, at least some personal data is being represented in the parameters of the model, and are therefore stored in the model:
...even if language models are not technically designed to store factual knowledge about individual, identifiable natural persons and therefore personal data, this can still occur, as shown in the example of Donald Trump. Even the association between Trump, his birthday and 14 June 1946 will not be a 100% probability in the respective models, but ultimately this is not necessary for it to be the most probable choice for the output. It is clear to everyone that this information, and therefore personal data, is somehow "inside" the models in question.
It should be stressed though that not all personal data will end up being memorised in this way. This phenomenon would only be reserved for the most frequent or the most unique pieces of information in the training data.
But this does still nevertheless imply that certain personal data is, in a way, stored in the LLM. Or at least this would most likely be the case with respect to public figures or those with a heavy online presence.
One could say, though, that the form of this memorised data, as the Hamburg DPA contends, is only in the form of word fragments represented as a bunch of numbers that lack "concrete characteristics or references that "relate" to individuals." However, this might not matter.
Rosenthal argues in his blog post that while individuals cannot be identified, even indirectly, from the tokens or word embeddings, identification becomes possible when these pieces of information are combined together by the model. That we "do not understand exactly where and how it is stored in an LLM or how the model retrieves it, does not mean that the information does not exist."
The main argument being made by Rosenthal is that personal data that has been memorised by the model can be transformed from an unreadable to a readable format by using the right prompt. He compares this with the way computers render images from a JPEG file:
The file itself only contains the data of the pixels. Nobody will be able to recognise personal data in this data. For us to read the text and thus access the personal data, the file needs to be opened by an image viewer software that will convert the data into pixels and use the pixels to generate an image. This is also the case with a language model, with the difference that putting the information together as a kind of key requires the prompt and the process is much more complex.
This thinking could also be applied to encrypted data. Such data is only considered personal data if the key is available to decrypt the cipher text; otherwise, the cipher text is completely unintelligible.
Only with access to the correct cryptographic key can the cipher text be transformed back into plaintext that can be read. Under EU data protection law, encrypted data is still personal data so long as you have the key; the cipher text is only considered pseudonymised, and not anonymised, data.28
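For example, a quick sketch with the Python cryptography library (any symmetric encryption scheme would make the same point):

from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"Donald Trump was born on 14 June 1946")
print(ciphertext)                  # unintelligible bytes on their own
print(cipher.decrypt(ciphertext))  # readable again, but only with the key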
Similarly, personal data memorised by the LLM is pseudonymised data. With the right prompt, you can get the model to produce that personal data.
However, even if this is the case, as per CJEU caselaw, we also need to consider the means reasonably likely to be used to identify the person by the party with access to the data. For Rosenthal, this triggers two key questions:
Is it reasonably likely that somebody will use the LLM with a prompt that results in the generation of personal data that is not included in the prompt itself?
Is it reasonably likely that those with access to these outputs will be able to identify the data subject concerned?
Rosenthal does state that most LLM use cases will probably not result in personal data being generated. But to the extent this does happen, the controller will need to comply with the relevant provisions of the GDPR, including, for instance, responding to data subject rights requests with respect to the model itself.
So do LLMs store personal data?
Truthfully, I am not sure where I stand on this.
The Hamburg DPA's position is quite attractive because of its simpler consequences. If LLMs do not store personal data, then organisations wanting to take copies of trained models to fine-tune in their own environment do not need to be concerned with GDPR obligations that would have applied to personal data contained in those models.
Conversely though, Rosenthal's explanation also makes sense. Under his relative approach, LLMs do not store personal data unless it can be shown that there are prompts that produce personal data about individuals.
However, the consequences of this seem a little messier from a data protection perspective.
On 5 November, the EDPB is organising a stakeholder event involving various experts to discuss issues pertaining to AI models. This is in light of the Irish DPC's request to the EDPB for an opinion on the use of personal data in these models.
The EDPB could take the opportunity to clarify the issue of whether personal data is stored in LLMs. This might be especially welcomed given that the (preliminary) report from its ChatGPT Taskforce did not cover this topic.
Renowned DP expert Gabriela Zanfir-Fortuna highlights this omission in the report in her blog post, which she thinks might be due to the diverging views among European data protection authorities:
[This] could either indicate that whether personal data is processed within the model is one of the contentious points at EDPB level, or that the DPAs agree to only focus on the five stages of processing personal data listed in the Report.
Let's see if we ever get clarity on this issue.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), pp.2-3.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.3.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), pp.3-4.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.4.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.5.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.5.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.6.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.6.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.7.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), p.7. See also Case C‑582/14, Patrick Breyer v Bundesrepublik Deutschland (19 October 2016), para. 56.
Hamburg DPA, ‘Discussion Paper: Large Language Models and Personal Data’ (July 2024), pp.8-9.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 218.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 220.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 218.
Brown et al, ‘Language Models are Few Shot Learners’ (2020), 8.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 224.
GPT-3 processes around 300 billion tokens consisting of roughly 50,000 unique words.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 225.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 225.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 223.
Case C‑434/16, Nowak v Data Protection Commissioner (20 December 2017), para. 34.
Joined Cases C-141/12 and C-372/12, YS (AG Opinion), (12 December 2013), para. 45.
Joined Cases C-141/12 and C-372/12, YS (AG Opinion), (12 December 2013), para. 45.
Case C‑434/16, Nowak v Data Protection Commissioner (20 December 2017), para. 35.
Case C-319/22, Gesamtverband Autoteile-Handel eV v Scania CV AB (9 November 2023), para. 45.
Case C‑582/14, Patrick Breyer v Bundesrepublik Deutschland (19 October 2016), para. 43.
Case C-40/17, Fashion ID GmbH & Co. KG v Verbraucherzentrale NRW e.V (AG Opinion), (19 December 2018), para. 99.
See this from the EDPB and this 2014 opinion from the former Article 29 Working Party (on p.20).