The EDPB's thoughts on web scraping for AI development (#2)
A look at the Board's opinion on the data protection aspects of AI models
TL;DR
This newsletter is about the European Data Protection Board's opinion on data protection and AI models. It looks at the context of the opinion, its specific analysis of legitimate interest as a legal basis for web scraping for AI development, and the consequences for those in the AI ecosystem.
Here are the key takeaways:
On 17 December 2024, the European Data Protection Board (EDPB) adopted 'Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models'. The opinion addresses:
The application of the concept of personal data
The principle of lawfulness, with specific regard to legitimate interest in the context of AI models
The consequences of unlawful processing of personal data in the development phase of AI models
The EDPB provided this Opinion in response to a request made by the Irish Data Protection Commission (DPC) in September 2024. The DPC made this request after concluding proceedings against X over its use of users' public posts to train its AI model Grok.
Among the issues the Opinion addresses is the use of web-scraped datasets for model training. The EDPB addresses how legitimate interest could be used as a legal basis under the GDPR for using such datasets to the extent that they contain personal data.
The Opinion does not necessarily conclude that legitimate interest would be an appropriate legal basis for developers to rely on. Rather, it sets out the requirements that would need to be met for developers to rely on such a basis under the GDPR.
Developers that engage in web scraping to construct datasets to develop AI models will be impacted by this Opinion. However, given that building foundation models requires massive amounts of training data that often come from web-scraped datasets, frontier model developers will probably be most affected by this opinion and how European data protection regulators use it to guide their enforcement actions.
Intro
On 17 December 2024, the EDPB adopted 'Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models'.
The opinion concerns the following:
...questions on (i) the application of the concept of personal data; (ii) the principle of lawfulness, with specific regard to the legal basis of legitimate interest, in the context of AI models; as well as, on (iii) the consequences of unlawful processing of personal data in the development phase of AI models, on the subsequent processing or operation of the model.1
Under Article 64.2 of the GDPR, any of the data protection regulators in the EU can request the EDPB to give an Opinion on "any matter of general application or producing effects in more than one Member State."
Such a request was made to the EDPB by the Irish Data Protection Commission (DPC) after it concluded proceedings against X concerning its AI model Grok. The DPC had commenced proceedings against X over "significant concerns that the processing of personal data contained in the public posts of X’s EU/EEA users for the purpose of training its AI ‘Grok’ gave rise to a risk to the fundamental rights and freedoms of individuals." When these proceedings came to an end in September 2024 after X agreed to suspend such processing, the DPC announced that it would be making a request to the EDPB to provide an opinion on:
...the extent to which personal data is processed at various stages of the training and operation of an AI model, including both first party and third party data and the related question of what particular considerations arise, in relation to the assessment of the legal basis being relied upon by the data controller to ground that processing.
The EDPB's Opinion is not legally binding on AI developers. Rather, the Opinion should be treated as guidance addressed to European data protection regulators on how they should assess the issues addressed in the Opinion:
This Opinion provides a framework for competent SAs to assess specific cases where (some of) the questions raised in the Request would arise. This Opinion does not aim to be exhaustive, but rather to provide general considerations on the interpretation of the relevant provisions, which competent SAs should take utmost account of when using their investigative powers. While this Opinion is addressed to competent SAs and relates to their activities and powers, it is without prejudice to the obligations of controllers and processors under the GDPR. In particular, pursuant to the accountability principle enshrined in Article 5(2) GDPR, controllers shall be responsible for, and be able to demonstrate compliance with, all the principles relating to their processing of personal data.2
In this post, I focus only on the parts of the Opinion pertaining to web scraping for AI development.
Web scraping and legitimate interest
An important part of the development of LLMs, and what has seemingly become almost standard practice for curating the training data for these models, is web scraping. For the purposes of the EDPB's Opinion, web scraping is given the following definition:
"Web scraping" is a commonly used technique for collecting information from publicly available online sources. Information scraped from, for example, services such as news outlets, social media, forum discussions and personal websites, may contain personal data.3
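To make the definition concrete, here is a minimal Python sketch of how text scraped from a public page can carry personal data into a dataset. It is illustrative only and is not a method described in the Opinion: the sample page, the `TextExtractor` class and the `find_emails` helper are hypothetical names of my own, and email addresses are just one easily detectable category of personal data (names, usernames and the content of posts can qualify too).

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only text nodes that contain something other than whitespace.
        if data.strip():
            self.chunks.append(data.strip())


# Deliberately simple pattern; real-world email detection is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html: str) -> list[str]:
    """Return email addresses found in the text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return EMAIL_RE.findall(text)


# Hypothetical forum page standing in for a scraped public source.
SAMPLE_PAGE = """
<html><body>
  <h1>Forum thread</h1>
  <p>Posted by Jane Doe. Contact me at jane.doe@example.com for details.</p>
</body></html>
"""

if __name__ == "__main__":
    print(find_emails(SAMPLE_PAGE))
```

Even this toy page contains two pieces of personal data (a name and an email address), and only one of them is caught by the pattern. That gap is part of why the legal basis question matters for scraped datasets: filtering alone may not remove everything that counts as personal data.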
Cf. this passage from '3 reasons why web scraping for AI development may be coming to an end':
Web-scraped datasets have been crucial to the development of large language models.
This has particularly been the case with the CommonCrawl datasets. Containing petabytes of text data from millions of webpages gathered over several years, they provide a useful corpus to feed LLMs during pre-training.
With pre-training, the models learn to predict the next word in a sentence. The aim is to build a model capable of parsing and generating language derived from a large reservoir of text.
The CommonCrawl datasets have proven crucial for this purpose. The paper for GPT-3 lists CommonCrawl as one of several datasets used to train the model, and since 2019 many more LLMs have been trained using the same.
Web-scraped data has therefore become standard for the development of modern AI models. But this might not be the case for much longer.
The pertinent question here from a data protection perspective is one of lawful basis. Which of the legal bases can developers rely on to carry out web scraping to build training datasets for their models to the extent that personal data are involved?
The EDPB's Opinion explores the possibility of legitimate interest as an appropriate legal basis for such activity. I say possibility since the Opinion does not state that such a legal basis would be appropriate for web scraping, and instead provides the various requirements that would need to be fulfilled if an AI developer were to rely on legitimate interests.
The relevant provision here is Article 6.1(f) GDPR, which states that personal data may be processed if the processing is:
...necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
Accordingly, the question addressed in the opinion is:
Question 2: Where a data controller is relying on legitimate interests as a legal basis for personal data processing to create, update and/or develop an AI model, how should that controller demonstrate the appropriateness of legitimate interest as a legal basis, both in relation to the processing of third-party and first-party data?
i. What considerations should that controller take into account to ensure that the interests of the data subjects, whose personal data are being processed, are appropriately balanced against the interests of that controller in the context of:
(a) Third-party data
(b) First-party data4
To rely on legitimate interest, the EDPB points out three requirements that need to be met by the developer:5
Identify a legitimate interest
Meet the necessity test
Meet the balancing test