Purpose limitation and AI development
How this important data protection principle might influence AI engineering projects

TL;DR
This newsletter is about applying the purpose limitation principle to AI development. It looks at the relevant legal provisions in the GDPR on this principle, its interpretation by courts and regulators, and how this legal requirement might influence AI engineering projects.
Here are the key takeaways:
Under the purpose limitation principle, personal data can only be processed for specific purposes that are determined before the data are collected. If the purposes can be fulfilled without processing personal data, then that data should not be used to fulfil those purposes.
If a controller wants to use data collected for one purpose for another purpose (i.e., further processing), this is subject to specific rules under the GDPR. The further processing of data is only permitted in certain circumstances, including where the further processing is compatible with the original purpose, where it is permitted by law or where the data subject has given consent.
The Court of Justice of the European Union and the European Data Protection Board have provided further interpretations and guidance on this principle and how it might apply in practice. This includes some guidance on how purpose limitation may influence AI development.
If the compatibility test for further processing cannot be met, then controllers can only use customer data for model training after obtaining consent from the data subjects. This could be quite difficult depending on the context of the processing, including the relationship the controller has with data subjects.
This means that it is important for controllers to ask the fundamental first order questions before commencing an AI project. Without doing this, companies may fall for the AI hype and FOMO, resulting in the implementation of AI systems that are neither very good nor compliant.
A (fictional) scenario
I have written in previous posts about companies using proprietary data to build AI systems. This includes the impact that data-hungry generative AI has on data protection.
In this post, I attempt to take a closer look at the complexities of applying the purpose limitation principle to AI development. I do so using the following fictional scenario:
A car insurance company, CarCare, wants to develop a customer-facing chatbot for its website. The intention is to build a chatbot that its customers can use for quick and easy support access. This includes accessing documents, placing online claims or making payment enquiries.
The company plans to develop this product by fine-tuning a multi-modal large language model (LLM). The trained model will then be deployed with a graphical user interface (GUI) on its website. Training data will consist of 10,000 retained customer support records. This is to ensure that the chatbot exhibits the desired behaviour when responding to customer queries.
The customer support records consist of:
Date and time of contact
Communication channel (phone, email, chat, in-person)
Name of agent or representative involved
Summary or transcript of conversation
Customer inquiries or complaints
Actions taken during the interaction
Resolutions or follow-up steps
Supporting documents or attachments (e.g., scanned IDs, claim forms and photos)
CarCare plans to use supervised fine-tuning, for which the developers will reconfigure the formatting of the records into a set of (instruction, response) pairs. The model will also be trained with instructions produced by developers to ensure that if it comes across an issue or query that it cannot answer, the customer will be connected with a human operator.
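As a rough illustration of what this reformatting could look like, here is a minimal Python sketch. The field names (inquiry, resolution), the SupportRecord class, the build_dataset function and the wording of the hand-off response are all assumptions made for the example, not a description of how CarCare's records or pipeline would actually be structured.

```python
from dataclasses import dataclass


@dataclass
class SupportRecord:
    # Hypothetical fields, loosely based on the record contents listed above.
    inquiry: str      # customer inquiry or complaint
    resolution: str   # resolution or follow-up steps


# Hypothetical developer-written response for queries the chatbot cannot answer.
HANDOFF_RESPONSE = "I can't help with that here. Let me connect you with a human operator."


def build_dataset(records, handoff_prompts):
    """Build (instruction, response) pairs for supervised fine-tuning."""
    pairs = []
    for record in records:
        instruction = record.inquiry.strip()
        response = record.resolution.strip()
        # Skip records that lack a usable query or a usable answer.
        if instruction and response:
            pairs.append((instruction, response))
    # Add developer-written examples that teach the hand-off behaviour.
    for prompt in handoff_prompts:
        pairs.append((prompt, HANDOFF_RESPONSE))
    return pairs
```

Note that every pair built from a record still carries whatever personal data the underlying inquiry and resolution contain, which is why the purpose limitation analysis applies to this step.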
To see how the purpose limitation principle would apply to the above scenario, we would need to consider:
The definition of purpose limitation under the GDPR
The case law on purpose limitation
The regulatory guidance on purpose limitation
The law on purpose limitation
Relevant GDPR provisions
As per Article 5.1(b), under the 'purpose limitation' principle, personal data must be:
...collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes...
Recital (39) states that the data processing purposes "should be explicit and legitimate and determined at the time of the collection of the personal data." It further clarifies that personal data "should be processed only if the purpose of the processing could not reasonably be fulfilled by other means."
Further rules closely related to the purpose limitation principle can be found under Article 6.4. This provision specifies when personal data may be processed for purposes other than the original purpose (i.e., further processing). Such further processing is only permitted where at least one of the following applies:
The further processing purposes are 'compatible' with the original processing purposes
The legal basis for the further processing can be found in either national law or EU law
The data subject gives consent to the further processing of their personal data (which in turn is subject to the specific rules for obtaining consent)
Article 6.4 also sets out the factors that should be taken into account in determining whether the further processing is compatible with the original processing purposes:
Any link between the further processing and the original processing purposes
The context in which the data have been collected and the relationship between the data subject and the controller
The nature of the personal data, particularly whether they are special categories data (as per Article 9)
The possible consequences of the further processing for the data subject
The existence of appropriate safeguards, such as encryption or pseudonymisation
If, taking into account these factors, the compatibility test is satisfied, then a legal basis for the further processing is not required. However, an appropriate legal basis for the original processing purposes is still required.1
Where a controller carries out further processing on the basis of Union or national law or the consent of the data subject, then the compatibility test does not need to be satisfied for the further processing. Even so, the rights of data subjects, where applicable, still apply to that further processing, for example the right to object to processing.2
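The three routes for further processing can be sketched as a simple check. This is only meant to make the logic of Article 6.4 easier to follow: the names Route and available_routes are hypothetical, and each boolean input stands in for an assessment that in practice requires legal judgment.

```python
from enum import Enum


class Route(Enum):
    COMPATIBLE = "compatible with the original purposes"
    LAW = "based on Union or national law"
    CONSENT = "based on the data subject's consent"


def available_routes(compatible, permitted_by_law, has_consent):
    """Return the Article 6.4 routes open for a further processing purpose.

    Each argument is the outcome of a separate assessment made by the
    controller; this function only combines those outcomes.
    """
    routes = []
    if compatible:
        routes.append(Route.COMPATIBLE)
    if permitted_by_law:
        routes.append(Route.LAW)
    if has_consent:
        routes.append(Route.CONSENT)
    return routes


def further_processing_permitted(compatible, permitted_by_law, has_consent):
    """Further processing needs at least one available route."""
    return bool(available_routes(compatible, permitted_by_law, has_consent))
```

For example, available_routes(False, False, True) returns only the consent route, which reflects the point that the compatibility test does not need to be satisfied where consent has been obtained.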
So overall, under the GDPR's purpose limitation principle:
Personal data can only be processed for specific purposes that are determined before the data are collected.
If the purposes can be fulfilled without processing personal data, then that data should not be used to fulfil those purposes.
The further processing of data is only permitted in certain circumstances, including where the further processing is compatible with the original purpose, where it is permitted by law or where the data subject has given consent.
Ultimately, the controller is responsible for complying with the purpose limitation principle and demonstrating such compliance.3
The Digi case
In the Digi case, the Court of Justice of the European Union (CJEU) provided its interpretation of the purpose limitation principle and its application to further processing.
The case concerned Digi, a Hungarian internet provider. Digi had a database called ‘digihu’ which contained the personal data of its customers. The information in digihu related to subscribers of Digi's newsletter as well as data of system administrators who provided access to the interface of the website (www.digi.hu).
Digi had discovered a technical error in its infrastructure which caused problems with its servers. Accordingly, it created a test database, which included copies of personal data stored in digihu, as part of its efforts to resolve the issue. This database contained data on 320,000 data subjects.
Some time later, an ethical hacker contacted Digi to inform the company that he had managed to gain access to the test database. The hacker provided the line of code from the database that contained the vulnerability. Digi fixed the issue and deleted the test database. The company also reported the incident to the Hungarian data protection authority (DPA).
The DPA found that Digi had infringed the purpose limitation and storage limitation principles under the GDPR. This was due to the company failing to delete the test database immediately after carrying out the necessary tests and correcting the fault with its servers. By not deleting the database, a large amount of personal data had been stored in that database for no purpose for nearly 18 months, according to the regulator. Digi was fined around €284,000 for this infringement.
Digi challenged the DPA's findings in a Hungarian court, which then submitted the matter to the CJEU to answer the relevant questions on EU law (or the GDPR to be exact). Those questions included the following:
Should the purpose limitation principle be interpreted in such a way that it allows a controller to store personal data, which has been collected and stored in a lawful and purposeful manner, in another database?
If this parallel storage of personal data is not compatible with the purpose limitation principle, is it compatible with the storage limitation principle for the controller to store in parallel in another database personal data that has otherwise been collected and stored in a lawful and purposeful manner?
In sum, the verdict of the CJEU was that data collected for one purpose could be used for the separate purpose of testing and error correction if the compatibility test under Article 6.4 GDPR can be met.
In coming to this verdict, the Court provided several stipulations regarding the purpose limitation principle:
The principle has two components. The first is that the processing purposes are specified, explicit and legitimate. The second is that data are not further processed for purposes incompatible with the original processing purposes.4
The issue of compatible further processing is relevant where the further processing is not identical to the original processing purposes.5
The purpose of the criteria to be considered for the compatibility test is to identify a specific, logical and sufficiently close link between the further processing and the original purposes. This ensures that the further processing does not deviate from the legitimate expectations of the data subject.6 The criteria are also to ensure a balance between predictability and legal certainty on the one hand, and the degree of flexibility for the controller to manage data on the other.7
Although it was ultimately for the Hungarian court to determine the legality of Digi's actions, the CJEU gave its view on the facts of the case:
The original purpose of storing personal data in digihu was the conclusion and performance by Digi of subscription contracts with its customers.8 The test database was created by Digi in order to be able to carry out tests and correct errors regarding its customers.9
There was a specific link between "the conducting of tests and the correction of errors affecting the subscriber database and the performance of the subscription contracts of private customers."10 This is because the errors "may be prejudicial to the provision of the contractually agreed service."11
The further processing (i.e., the testing and correction) did "not deviate from the legitimate expectations of those customers."12 Also, no sensitive data was involved and the subscribers did not face any detrimental consequences from the further processing.13
EDPB guidance
Data protection by design and default
Article 25 requires controllers to ensure that their processing operations adhere to data protection by design and by default (DPbD). This means having "data protection designed into the processing of personal data and as a default setting and this applies throughout the processing lifecycle."14
In its guidelines covering this obligation, the European Data Protection Board (EDPB) addresses the purpose limitation principle, which is integral to DPbD. The Board states that the processing operations of the controller should be "shaped by what is necessary to achieve the [processing] purposes."15 This may entail implementing some of the following measures:16
The controller should not connect datasets or perform incompatible further processing
Hashing or encryption should be applied to limit the possibility of repurposing data, accompanied by appropriate policies and contractual obligations on employees and third parties processing personal data
The controller should regularly review whether the processing is necessary for the purposes for which the data was collected and test the design against purpose limitation
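To give a sense of what the hashing measure might look like in code, here is a minimal sketch using keyed hashing (HMAC-SHA256) from the Python standard library. The function name pseudonymise and the normalisation step are assumptions for illustration; the EDPB guidelines do not prescribe a particular algorithm. A keyed hash is used here rather than a plain hash because identifiers such as email addresses are easy to guess and re-hash without a secret key.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (hex string).

    The same value and key always give the same digest, so records can
    still be linked within one dataset. A different key gives unrelated
    digests, which limits linking across datasets kept for other purposes.
    """
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()
```

Using a separate key per processing purpose, stored apart from the data, is one way to make repurposing harder. The output is pseudonymised rather than anonymised, so it generally remains personal data under the GDPR.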
Personal data and AI models
Further guidance can be found from the EDPB's opinion on the development and deployment of AI models.
In that opinion, the Board defines AI development as the stages prior to AI model deployment, which includes "code development, collection of training personal data, pre-processing of training personal data, and training."17 When it comes to the data sources used for building the training dataset, controllers should take steps to avoid or limit the collection of personal data where this is possible.18 This may entail using:
Appropriate selection criteria for the training data
Relevant and adequate data sources considering the intended purpose of the data collection
The exclusion of inappropriate data sources
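A minimal sketch of how such selection criteria could be applied to a set of support records is shown below. The Record fields, the allowed channels and the minimum length threshold are all hypothetical choices made for the example; the EDPB opinion does not specify any particular criteria.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical fields for a customer support record.
    channel: str           # "phone", "email", "chat" or "in-person"
    transcript: str        # summary or transcript of the conversation
    has_attachments: bool  # e.g. scanned IDs, claim forms, photos


# Assumption: only written channels are relevant for a text chatbot.
ALLOWED_CHANNELS = frozenset({"email", "chat"})


def select_for_training(records, allowed_channels=ALLOWED_CHANNELS, min_chars=20):
    """Keep only the records that meet the selection criteria."""
    selected = []
    for record in records:
        if record.channel not in allowed_channels:
            continue  # source not relevant to the intended purpose
        if record.has_attachments:
            continue  # attachments may hold IDs or other unnecessary data
        if len(record.transcript.strip()) < min_chars:
            continue  # too short to be useful as a training example
        selected.append(record)
    return selected
```

Writing the criteria down in this form also gives the controller something concrete to document when demonstrating how it limited the personal data going into the training dataset.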
The EDPB's AI model opinion does not directly address the issue of compatibility of further processing.19 However, the EDPB does mention how the purpose limitation principle should be considered for AI model development and deployment. In particular, the controller should consider the following factors to ensure that the use of any personal data across the AI model lifecycle complies with purpose limitation:20
The type of model the controller plans to develop
The expected functionalities of the model
Whether the model is being developed for internal deployment
Whether the model will be sold or distributed to third parties after development
Whether the model will be deployed for research or commercial purposes
Applying purpose limitation to the CarCare scenario
Given that CarCare is the data controller of the customer support records, it must comply with the purpose limitation principle and demonstrate this compliance. To do this, it needs to identify the original purpose of processing these records and the further processing of these records that it is considering.
So what was the original purpose of collecting and storing the customer support records? The main purpose might be to track customer feedback or complaints and/or train customer support staff.
And what would be the further processing of the customer support records? It would be the use of those records to train an AI model for a customer support chatbot on its website.
With the original processing purposes and the further processing identified, CarCare would then need to determine on what grounds the further processing may take place. This means considering the routes available under Article 6.4.
Firstly, it is unlikely that Union or national law contains provisions explicitly permitting insurance companies to use customer data to train AI models (presumably). If this is true, then the further processing (i.e., using customer support records for model training) cannot be carried out on this basis.
Secondly, CarCare could try asking its customers if they consent to the use of their data for training a customer support chatbot. The obvious issue with this, however, apart from the practicalities of obtaining this consent, is that customers might say no. In fact, if enough customers do say no, then this might leave CarCare with too little data to make the fine-tuning project feasible. The company could look to other data sources to try to supplement its fine-tuning dataset, or even create its own records either manually or synthetically. But this could impact data quality. The customer support records directly reflect the environment that CarCare wants to train its model for, so the less of this data it has to work with, the more difficult it may be to train the model effectively.
Accordingly, CarCare may consider the third option, which is to apply the compatibility test on the further processing using the criteria set out in Article 6.4:
The link between the processing purposes. A link between the original purpose and the further processing could be made here. If the original purpose is to retain customer feedback and complaints and train customer support staff, then using that data to build a system to automate this function could be viewed as a company legitimately seeking ways to improve its internal processes and the quality of its services. Such arguments could be based on the CJEU's stipulations in Digi regarding the flexibility afforded to the controller to manage data. However, this might depend on how far such flexibility extends, which in turn is dictated by how close the processing purposes really are.
The context. This could hinge on whether training an AI model is 'within the legitimate expectations' of the data subject. For this, CarCare would need to carefully consider the relationship it has with its customers and whether they would be happy with their data being used to build a customer support chatbot. This in turn may also depend on whether customers are even aware that their customer support interactions are retained in the first place. If they are so aware, then using that data to build a chatbot might be less surprising to them, although some may still consider this to be unnecessary, especially if they have had good experiences with customer support in the past. Would an LLM-based chatbot really improve things further?
Nature of the personal data. There could be special categories data involved. If the customer support records contain information about insurance claims for example, this might involve information about car accidents and therefore injuries that might have been suffered by customers, which would be health data under Article 9 GDPR.
Consequences for data subjects. As I have previously written (here and here), training data extraction attacks are a real vulnerability with LLMs. It might therefore be possible to use certain prompts with the chatbot to extract the data used to train the model. This could result in data from customer support records being extracted. LLM behaviour is highly unpredictable, and though fine-tuning is a way to limit its probability distribution to more reliable and desirable predictions and outputs, research has shown how this can inadvertently be undone (see more on this here).
The existence of appropriate safeguards. Partly based on the recommendations from the EDPB, CarCare could consider the following safeguards when using customer support records for building its chatbot:
Strip out direct identifiers from the records, including names and contact details
Start with a small subset of the records to see if this influences the behaviour of the LLM in the desired way
Consider using a more capable base model, which may need less computationally intensive fine-tuning methods and therefore less training data
Consider local fine-tuning or using a provider of AI model training cloud services that gives appropriate data protection guarantees (as per Article 28 GDPR), for example not using CarCare's customer data for retraining
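The first two safeguards in this list can be sketched in a few lines of Python. This is a deliberately crude illustration: the regular expressions, the placeholder tokens and the function names redact and pilot_subset are assumptions for the example, and pattern-based redaction will both miss some identifiers and over-redact some harmless text (dates, for instance, can look like phone numbers). A real project would need a more robust approach and manual review.

```python
import random
import re

# Simple patterns for email addresses and phone-like number sequences.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text, known_names=()):
    """Replace direct identifiers in free text with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Names cannot be found by pattern alone, so this relies on a list
    # of known customer and agent names taken from the record metadata.
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


def pilot_subset(records, size, seed=0):
    """Draw a small, reproducible random sample for a first training run."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))
```

For instance, redact("Jane Doe (jane.doe@example.com) called about her claim.", known_names=["Jane Doe"]) would replace both the name and the email address with placeholders. Stripping direct identifiers in this way reduces the risk to data subjects, but the remaining text may still allow individuals to be identified from context, so it does not by itself take the records outside the GDPR.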
If CarCare considers the compatibility test, and concludes that the further processing is incompatible, then it can only use the records for model training after obtaining consent from the data subjects. This is due to the wording of Article 6.4, which suggests that if the compatibility test is not met and there is no Union or national law that can be relied on, then the further processing can only be carried out on the basis of consent.21 CarCare would therefore run into the same problems outlined above if trying to obtain consent from customers to use their data for training an AI model.
From this scenario, one can see the complexities of purpose limitation in the context of AI model training using proprietary data. And this analysis did not even consider the other applicable requirements under the GDPR or even the implications of the AI Act.
But one thing that this scenario hopefully highlights is the importance of having an AI strategy in place. In other words, it is critically important to ask the fundamental first order questions pertaining to an AI project. Without this, companies may fall for the AI hype and FOMO, resulting in the implementation of AI systems that are neither very good nor compliant. Some of these fundamental questions include the following:
Do you need an AI system to achieve the intended objective?
Would your customers be receptive to an AI system?
Is there evidence supporting the case for an AI system, for example that it will make processes faster or more effective?
What resources will it take to build the AI model in a way that is compliant?
GDPR, Recital (50).
GDPR, Recital (50).
GDPR, Article 5.2.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 25.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 34.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 36.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 37.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 28.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 43.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 44.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 44.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 44.
Case C-77/21, Digi Távközlési és Szolgáltató Kft. v Nemzeti Adatvédelmi és Információszabadság Hatóság (20 October 2022), para. 44.
European Data Protection Board, Guidelines 4/2019 on Article 25 Data Protection by Design and by Default (20 October 2020), para. 2.
European Data Protection Board, Guidelines 4/2019 on Article 25 Data Protection by Design and by Default (20 October 2020), para. 71.
European Data Protection Board, Guidelines 4/2019 on Article 25 Data Protection by Design and by Default (20 October 2020), para. 72.
European Data Protection Board, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models (17 December 2024), para. 18.
European Data Protection Board, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models (17 December 2024), para. 50.
European Data Protection Board, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models (17 December 2024), para. 17.
European Data Protection Board, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models (17 December 2024), para. 64.
Christopher Kuner et al (eds), The EU General Data Protection Regulation (GDPR): A Commentary (OUP 2020), p.343.