The EDPB's thoughts on web scraping for AI development (#2)
A look at the Board's opinion on the data protection aspects of AI models
TL;DR
This newsletter covers the European Data Protection Board's opinion on data protection and AI models. It looks at the context of the opinion, its specific analysis of legitimate interest as a legal basis for web scraping for AI development, and the consequences for those in the AI ecosystem.
Here are the key takeaways:
On 17 December 2024, the European Data Protection Board (EDPB) adopted 'Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models'. The opinion addresses:
The application of the concept of personal data
The principle of lawfulness, with specific regard to legitimate interest in the context of AI models
The consequences of unlawful processing of personal data in the development phase of AI models
The EDPB provided this Opinion due to a request made by the Irish Data Protection Commissioner (DPC) in September 2024. This request was made after the DPC concluded proceedings against X for using the public posts of its users to train its AI model Grok.
Among the issues the Opinion addresses is the use of web-scraped datasets for model training. The EDPB addresses how legitimate interest could be used as a legal basis under the GDPR for using such datasets to the extent that they contain personal data.
The Opinion does not necessarily conclude that legitimate interest would be an appropriate legal basis for developers to rely on. Rather, it sets out the requirements that would need to be met for developers to rely on such a basis under the GDPR.
Developers that engage in web scraping to construct datasets to develop AI models will be impacted by this Opinion. However, given that building foundation models requires massive amounts of training data that often come from web-scraped datasets, frontier model developers will probably be most affected by this opinion and how European data protection regulators use it to guide their enforcement actions.
Intro
On 17 December 2024, the EDPB adopted 'Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models'.
The opinion concerns the following:
...questions on (i) the application of the concept of personal data; (ii) the principle of lawfulness, with specific regard to the legal basis of legitimate interest, in the context of AI models; as well as, on (iii) the consequences of unlawful processing of personal data in the development phase of AI models, on the subsequent processing or operation of the model.1
Under Article 64.2 of the GDPR, any of the data protection regulators in the EU can request the EDPB to give an Opinion on "any matter of general application or producing effects in more than one Member State."
Such a request was made of the EDPB by the Irish Data Protection Commissioner (DPC) after the latter concluded proceedings against X concerning its AI model Grok. The DPC had commenced proceedings against X over "significant concerns that the processing of personal data contained in the public posts of X’s EU/EEA users for the purpose of training its AI ‘Grok’ gave rise to a risk to the fundamental rights and freedoms of individuals." When these proceedings came to an end in September 2024 after X agreed to suspend such processing, the DPC announced that it would be making a request to the EDPB to provide an opinion on:
...the extent to which personal data is processed at various stages of the training and operation of an AI model, including both first party and third party data and the related question of what particular considerations arise, in relation to the assessment of the legal basis being relied upon by the data controller to ground that processing.
The EDPB's Opinion is not legally binding on AI developers. Rather, the Opinion should be treated as guidance addressed to European data protection regulators on how they should assess the issues addressed in the Opinion:
This Opinion provides a framework for competent SAs to assess specific cases where (some of) the questions raised in the Request would arise. This Opinion does not aim to be exhaustive, but rather to provide general considerations on the interpretation of the relevant provisions, which competent SAs should take utmost account of when using their investigative powers. While this Opinion is addressed to competent SAs and relates to their activities and powers, it is without prejudice to the obligations of controllers and processors under the GDPR. In particular, pursuant to the accountability principle enshrined in Article 5(2) GDPR, controllers shall be responsible for, and be able to demonstrate compliance with, all the principles relating to their processing of personal data.2
In this post, I focus only on the parts of the Opinion pertaining to web scraping for AI development.
Web scraping and legitimate interest
An important part of the development of LLMs, and what has seemingly become almost standard practice for curating the training data for these models, is web scraping. For the purposes of the EDPB's Opinion, web scraping is given the following definition:
"Web scraping" is a commonly used technique for collecting information from publicly available online sources. Information scraped from, for example, services such as news outlets, social media, forum discussions and personal websites, many contain personal data.3
Cf. from 3 reasons why web scraping for AI development may be coming to an end:
Web-scraped datasets have been crucial to the development of large language models.
This has particularly been the case with the CommonCrawl datasets. Containing petabytes of text data from millions of webpages gathered over several years, they provide a useful corpus to feed LLMs during pre-training.
With pre-training, the models learn to predict the next word in a sentence. The aim is to build a model capable of parsing and generating language derived from a large reservoir of text.
The CommonCrawl datasets have proven crucial for this purpose. The paper for GPT-3 lists it as one of the several datasets used to train the model, and since 2019 many more LLMs have been trained using the same.
Web-scraped data has therefore become standard for the development of modern AI models. But this might not be the case for much longer.
The pertinent question here from a data protection perspective is one of lawful basis. Which of the legal bases can developers rely on to carry out web scraping to build training datasets for their models to the extent that personal data are involved?
The EDPB's Opinion explores the possibility of legitimate interest as an appropriate legal basis for such activity. I say possibility since the Opinion does not state that such a legal basis would be appropriate for web scraping, and instead provides the various requirements that would need to be fulfilled if an AI developer were to rely on legitimate interests.
The relevant provision here is Article 6.1(f) GDPR, which states that personal data may be processed if the processing is:
...necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
Accordingly, the question addressed in the opinion is:
Question 2: Where a data controller is relying on legitimate interests as a legal basis for personal data processing to create, update and/or develop an AI model, how should that controller demonstrate the appropriateness of legitimate interest as a legal basis, both in relation to the processing of third-party and first-party data?
i. What considerations should that controller take into account to ensure that the interests of the data subjects, whose personal data are being processed, are appropriately balanced against the interests of that controller in the context of:
(a) Third-party data
(b) First-party data4
To rely on legitimate interest, the EDPB sets out three requirements that the developer must meet:5
Identify a legitimate interest
Meet the necessity test
Meet the balancing test
Is there an interest and is it legitimate?
In its guidelines for legitimate interest (adopted in October 2024), the EDPB defines an 'interest' for the purposes of Article 6.1(f) as "the broader stake or benefit that a controller or third party may have in engaging in a specific processing activity."6
But for an interest to be legitimate, it needs to meet three cumulative criteria:
The interest needs to be lawful. Simply put, the interest pursued cannot be illegal. However, while this does not necessarily mean that the interest must be explicitly recognised by law, legal frameworks can be taken into account to determine the legitimacy of the interest. For instance, Article 16 of the EU Charter recognises the freedom to conduct business.
The interest needs to be clearly and precisely articulated. If the interest being relied on cannot be clearly identified, then meeting the other relevant requirements is not possible.7
The interest needs to be real and present, and not speculative. This means that the interest cannot be merely hypothetical.8
One example that the EDPB provides in its Opinion of a legitimate interest in the AI context is "developing the service of a conversational agent to assist users." Such an agent could be developed by fine-tuning or otherwise augmenting a foundation model, which in turn may involve web scraping to build the dataset needed for such an engineering project.
The necessity test
Legitimate interest can only be relied on if the processing of personal data is necessary to pursue such an interest.
Recital (39) GDPR provides that necessity in this context means that the interest cannot be "reasonably fulfilled by other means." This means considering two elements:
Whether the processing will allow the pursuit of the interest. The EDPB's Opinion states that "if the pursuit of the purpose is also possible through an AI model that does not entail processing of personal data, then processing personal data should be considered as not necessary."9
Whether there is no less intrusive way of pursuing the interest. If the processing of personal data is needed to pursue the interest, then it needs to be kept to the minimum possible. This requirement is encapsulated in the principle of data minimisation under the GDPR (Article 5.1(c)), which requires that the data processed are "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." The EDPB's Opinion states that particular attention should be paid to "the amount of personal data processed and whether it is proportionate to pursue the legitimate interest at stake."10
The balancing test
As stated in Article 6.1(f), the legitimate interest can be "overridden by the interests or fundamental rights and freedoms of the data subject." This therefore requires a balancing test to be carried out to ensure that the respective interests of the developer and the data subjects are appropriately balanced.
The balancing test for legitimate interests entails the following:
Identifying the rights and interests of the data subject. There is a distinction between rights and interests. Rights refer to those afforded by law, including those in the EU Charter. The EDPB notes how "large scale and indiscriminate data collection by AI models in the development phase may create a sense of surveillance for data subjects, especially considering the difficulties to prevent public data from being scraped", and in turn undermine their freedom of expression (a right protected under Article 11 of the Charter).11 Examples of data subject interests identified by the EDPB in its Opinion in the context of personal data used for AI development include "the interest in self-determination and retaining control over one's own personal data (e.g. the data gathered for the development of the model).”12
Impact on data subjects. The impact entails both the risks and benefits of the processing. Regarding the risks, the EDPB emphasises assessing their severity by taking into account how the data are processed, the scale of the processing and the volume of data used.13 The status of the data subject and the relationship with the controller should also be considered. The following factors will also be relevant:
The nature of the data processed. For example, "the processing of some...categories of personal data may lead to significant consequences for data subjects", such as "personal data revealing highly private information (e.g. financial data, or location data)" that "should be considered as possibly having a serious impact on data subjects."14
The context of the processing. The "way in which the model was developed" and the "nature of the model and the intended operational uses" could play a role here.
The further consequences that the processing may have. This could include physical, material or non-material damage.
Reasonable expectations of the data subject. Recital (47) GDPR states that there will be an imbalance where "data subjects do not reasonably expect further processing" of their personal data. On this, the EDPB stresses that "the mere fulfilment of the transparency requirements set out in the GDPR is not sufficient."15 Accordingly, simply mentioning information in a privacy notice about how an AI model was developed does not necessarily bring the use of personal data within the reasonable expectations of the data subject.16 The factors to consider for assessing the reasonable expectations of the data subject include:17
The public availability of the data
The nature of the relationship between the data subject and the developer (and whether a link exists between the two)
The nature of the service
The context in which the personal data was collected
The source of the data (e.g. the website or service and the privacy settings they offer)
The potential further uses of the model
Whether the data subjects are aware that their personal data is online at all
Mitigation measures. If there is an imbalance, developers may implement measures to redress it. However, measures that are already required under the GDPR do not count as mitigation for this purpose.
The specific mitigation measures recommended by the EDPB that developers could implement to limit the negative impact on data subjects include:
Reducing the identifiability of personal data used
Pseudonymisation of personal data
Masking personal data
Substituting personal data with synthetic data
Extending the data curation period (which gives data subjects time to exercise their rights)
Proposing an unconditional 'opt-out' from the outset
Allowing data subjects to exercise their right to erasure beyond the specific grounds under the GDPR
Allowing data subjects to submit claims of personal data regurgitation or memorisation (you can read more about data memorisation and leakage in my previous posts on this here and here)
Using public and accessible communications beyond those required under the GDPR (i.e., the standard privacy notice on a website), such as information on the collection criteria used
Providing alternative forms of information to data subjects (e.g., media campaigns)
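To make the first few technical measures above concrete, here is a minimal sketch of pseudonymising and masking personal data in scraped text. It is purely illustrative: the regex patterns and function names are my own assumptions, and production pipelines would use dedicated PII-detection tooling rather than simple regexes.

```python
import hashlib
import re

# Hypothetical, minimal patterns for illustration only; real systems
# use dedicated PII-detection tooling rather than simple regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def pseudonymise_emails(text: str) -> str:
    """Replace each email address with a stable pseudonym (a hash
    prefix), so records stay linkable without exposing the address."""
    def repl(m: re.Match) -> str:
        digest = hashlib.sha256(m.group().encode()).hexdigest()[:10]
        return f"user_{digest}"
    return EMAIL_RE.sub(repl, text)

def mask_phone_numbers(text: str) -> str:
    """Irreversibly mask phone numbers (masking, not pseudonymisation:
    the original value cannot be re-linked)."""
    return PHONE_RE.sub("[PHONE REDACTED]", text)

sample = "Contact jane.doe@example.com or +44 20 7946 0958."
# Mask first, then pseudonymise what remains.
cleaned = pseudonymise_emails(mask_phone_numbers(sample))
print(cleaned)
```

The difference between the two functions mirrors the EDPB's distinction: the pseudonym preserves linkability across records (and so remains personal data under the GDPR), while masking destroys the value outright.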
Can developers use web scraping for AI development under legitimate interest?
It seems unlikely. The EDPB itself does not give a definitive answer. Judging from the analysis in its Opinion, the issues with the greatest bearing on the viability of legitimate interest as a legal basis for web scraping for AI development will be the necessity and balancing tests.
On the necessity test, web-scraped datasets are quite likely to contain personal data. Given the broad definition of personal data, and given that using datasets like CommonCrawl has become fairly standard in the development of foundation models, the idea of frontier model developers training models on datasets containing no personal data at all seems highly unlikely. But obtaining such data via web scraping may not be considered necessary if comparable datasets could be acquired by other means. Data licensing agreements, for example, could provide developers with higher-quality, though lower-quantity, datasets that may be preferable to indiscriminately scraped ones. This could weaken the case that web-scraped datasets are necessary for developing AI models.
Additionally, those building on top of foundation models may have an easier time reducing the amount of personal data contained in their datasets for AI engineering, or at least implementing appropriate mitigation measures. Those in the application layer of the AI ecosystem will be prioritising quality over quantity; foundation models will already possess very general capabilities after being trained on masses of text data from the internet, and those building on top of them need a much smaller amount of data to tailor a model to a particular task or domain.
These dynamics for frontier and app developers also hold for the balancing test. The biggest problems with web scraping are that (a) it is usually done at a large scale and therefore collects a lot of data, and (b) the data subjects whose data are collected will not know about it before, during or after the fact. From a data protection perspective, this method of data acquisition is therefore highly controversial.
The large-scale aspect of this processing activity is recognised by the EDPB in its Opinion:
...the use of web scraping in the development phase may lead - in the absence of sufficient safeguards - to significant impacts on individuals, due to the large volume of data collected, the large number of data subjects, and the indiscriminate collection of personal data.18
And it also recognises how such activity will almost always fall outside the reasonable expectations of data subjects:
In the development phase of the model, the data subjects' reasonable expectations may differ depending on whether the data processed to develop the model is made public by the data subjects or not. Further, the reasonable expectations may also differ depending on whether they directly provided the data to the controller (e.g. in the context of their use of the service), or if the controller obtained it from another source (e.g. via a third-party, or scraping).19
Accordingly, the fact that the EDPB includes recommended mitigation measures in its Opinion suggests that it has implicitly concluded that web scraping involves an imbalance between the interests of the developer and the rights and interests of data subjects.
Furthermore:
In the context of web scraping, examples of specific measures facilitating the exercise of individuals' rights and transparency may include: creating an opt-out list, managed by the controller and which allows data subjects to object to the collection of their data on certain websites or online platforms by providing information that identifies them on those websites, including before the data collection occurs.20
The EDPB is clearly trying to give a steer on the measures that would address the most prominent issues with scraping (namely its large scale and the fact that it falls outside the reasonable expectations of data subjects). The technical measures seem designed to reduce the amount of data collected to the bare minimum actually needed (which may still be a lot, especially for a foundation model developer). The organisational measures concerning data subject rights seem to address the reasonable-expectations issue and give data subjects more control over data of theirs that might be scraped.
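The opt-out list the EDPB describes could, in principle, work along the lines of this sketch. Everything here is an assumption of mine, not anything the Opinion specifies: the identifiers, the list structure and the matching logic are all hypothetical, and a real registry would need far more robust identity verification.

```python
from dataclasses import dataclass, field

@dataclass
class OptOutRegistry:
    """Hypothetical controller-managed opt-out list: data subjects
    register identifiers (e.g. usernames) per website, and the
    scraper skips matching content *before* collection occurs."""
    entries: dict[str, set[str]] = field(default_factory=dict)

    def register(self, site: str, identifier: str) -> None:
        # Store identifiers case-insensitively per site.
        self.entries.setdefault(site, set()).add(identifier.lower())

    def is_opted_out(self, site: str, identifier: str) -> bool:
        return identifier.lower() in self.entries.get(site, set())

registry = OptOutRegistry()
registry.register("forum.example", "jane_doe")

# The scraper consults the registry before collecting each post.
posts = [("forum.example", "jane_doe"), ("forum.example", "john_roe")]
collected = [p for p in posts if not registry.is_opted_out(*p)]
print(collected)  # only john_roe's post is collected
```

The key design point, matching the Opinion's wording, is that the check happens before data collection, not as an after-the-fact deletion request.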
Accordingly, I think the conclusions I wrote last year, when the EDPB last addressed web scraping in its report on OpenAI's ChatGPT, remain relevant to what the Board thinks about this issue:
...the EDPB noted two more important points about the data processing behind ChatGPT:
Technical impossibility is not an excuse for non-compliance with data protection law.
The burden of proof of effectiveness of measures taken to comply with data protection requirements is on OpenAI.
Overall, the EDPB appears to be open to legitimate interests as an appropriate legal basis for collecting the training data needed for LLMs like ChatGPT. So long as the conditions for legitimate interests under the GDPR can be met, developers like OpenAI could, in theory, rely on this legal ground to train their language models.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 7.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 15.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 18.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 4.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 66.
EDPB, Guidelines 1/2024 on processing of personal data based on Article 6(1)(f) GDPR (adopted 8 October 2024), para. 14.
EDPB, Guidelines 1/2024 on processing of personal data based on Article 6(1)(f) GDPR (adopted 8 October 2024), para. 17.
EDPB, Guidelines 1/2024 on processing of personal data based on Article 6(1)(f) GDPR (adopted 8 October 2024), para. 17.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 73.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 73.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 80.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 77.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 86.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 84.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 92.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 93.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 93.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 86.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 94.
EDPB, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI (adopted 17 December 2024), para. 106.