The EDPB's thoughts on web scraping for AI development
A look at the Board's commentary on legitimate interests and ChatGPT
On 23 May 2024, the European Data Protection Board (EDPB) published a report containing its preliminary views on the data protection issues related to OpenAI's ChatGPT. This work was produced by the Board's ChatGPT taskforce, which was established back in April 2023 to "foster cooperation and exchange information on possible enforcement actions on the processing of personal data in the context of ChatGPT."1
The report covers a range of issues. These include the lawfulness of data processing, the fairness principle, transparency, data accuracy and data subject rights.
Among the most interesting insights in that report are those on the lawful basis for the collection of training data.
On this, the EDPB emphasised that the web scraping carried out to build the training datasets for LLMs like ChatGPT likely involves the collection of personal data. Such data may encompass "various aspects of the personal life of the respective data subject."2
For such data collection, OpenAI relies on legitimate interests as the legal basis. This is provided under Article 6.1(f) of the GDPR, which reads:
...processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
However, the EDPB reiterated that to rely on this legal basis, the data controller (OpenAI in this case) must satisfy a three-part test:3
A legitimate interest must exist. Such interests include those that can be recognised by law, such as the freedom to conduct business under the EU Charter of Fundamental Rights.4
The processing of personal data must be necessary for the pursuance of that legitimate interest. This means that there must be no other way to pursue the legitimate interest that is less intrusive.5
There must be a balancing of the legitimate interest with the rights, freedoms and interests of the data subjects. This essentially means preventing the processing from being disproportionate.6
On this third requirement, the EDPB noted the importance of adequate safeguards to reduce the "undue impact on data subjects." It also stated that these safeguards could be technical in nature, including for example:7
Collection criteria that ensure certain data categories or sources are excluded from the training datasets (such as social media profiles).
Measures for the deletion or anonymisation of personal data before being used for training.
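To make these two safeguards more concrete, here is a minimal Python sketch of what a source exclusion rule and a deletion step could look like in a scraping pipeline. The report does not prescribe any implementation, so everything here is an assumption for illustration: the `EXCLUDED_DOMAINS` list, the `filter_records` and `redact_emails` helpers, and the record shape (a dict with `url` and `text` keys) are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical exclusion list. The report mentions social media profiles
# only as an example of a source that could be excluded.
EXCLUDED_DOMAINS = {"facebook.com", "instagram.com", "linkedin.com"}

# Simple pattern for email addresses; real pipelines would need far
# broader detection (names, phone numbers, addresses, and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_excluded(url: str) -> bool:
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def filter_records(records):
    """Drop records from excluded sources and redact emails in the rest."""
    kept = []
    for record in records:
        if is_excluded(record["url"]):
            continue  # collection criterion: source excluded entirely
        kept.append({"url": record["url"], "text": redact_emails(record["text"])})
    return kept
```

Pattern-based redaction of this kind is only a crude form of deletion and would not by itself amount to anonymisation under the GDPR; it simply illustrates where such measures could sit before data reaches a training set.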
Such measures are particularly important when it comes to special category data. Article 9.1 of the GDPR prohibits the processing of such data by default:
...personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation shall be prohibited.
Special category data can be processed where, as per Article 9.2(e), the data have been "manifestly made public by the data subject." However, the EDPB stressed that the mere fact that such personal data is publicly accessible does not mean that the data subject has 'manifestly made that data public'; it has to be shown that the data subject intended, explicitly and by a clear affirmative action, to make the data accessible to the general public.8
Additionally, the EDPB noted two more important points about the data processing behind ChatGPT:
Technical impossibility is not an excuse for non-compliance with data protection law.9
The burden of proof of effectiveness of measures taken to comply with data protection requirements is on OpenAI.10
Overall, the EDPB appears to be open to legitimate interests as an appropriate legal basis for collecting the training data needed for LLMs like ChatGPT. So long as the conditions for legitimate interests under the GDPR can be met, developers like OpenAI could, in theory, rely on this legal ground to train their language models.
Interestingly, though, the EDPB did not address whether web scraping is a viable data collection method for developing LLMs like ChatGPT, a question that bears directly on the necessity of such processing (the second requirement for legitimate interests). As I have written previously, the scaling laws that fuel current AI development often encourage a lack of quality controls that can cause unpredictable and undesirable model behaviour.
But as the report stresses, these findings are only the EDPB's preliminary views, given that several investigations into these and other matters regarding ChatGPT are ongoing across the EU.11
1. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 3.
2. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 15.
3. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 16.
4. Christopher Kuner et al (eds), The EU General Data Protection Regulation (GDPR): A Commentary (OUP 2020), p. 337.
5. Christopher Kuner et al (eds), The EU General Data Protection Regulation (GDPR): A Commentary (OUP 2020), p. 338.
6. Christopher Kuner et al (eds), The EU General Data Protection Regulation (GDPR): A Commentary (OUP 2020), p. 338.
7. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 17.
8. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 18. See also Case C-252/21, Meta Platforms Inc and Others v Bundeskartellamt (4 July 2023), para. 77.
9. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 7.
10. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 19.
11. EDPB, 'Report of the work undertaken by the ChatGPT Taskforce' (23 May 2024), para. 12.