[Image: Baby AI eating training data]

Op-ed: AI training data = (non-)personal data? And is consent really relevant?

The European Data Protection Board is at it again: an urgent procedure has been launched to obtain clarification on “some of the core issues that arise in the context of processing for the purpose of developing and training an AI model”. The aim? To bring “some much needed clarity into this complex area”. Yet the mechanism chosen is anything but meant for complex issues, as the EDPB has itself recognised that that mechanism – a request for an EDPB Opinion under Article 64(2) of the GDPR – is “better suited for targeted questions requiring a swift answer, while guidelines are better suited for matters with a larger scope”.

One of the fundamental questions in that respect, though, is definitely not a “targeted question” but has a much larger scope: what is the legal ground under the GDPR that justifies putting vast amounts of information in an AI model to train it?

Why does it have a much larger scope?

First, because you cannot really examine this question without first examining whether personal data is really being processed. This is a very complex question, as I aim to show below.

Second, because the discussion about the legal grounds for processing of training data – if we were to assume that there is processing of personal data – invariably leads to an in-depth assessment of whether one or another legal ground can justifiably be used. Not necessarily a simple question either.

The best part? The EDPB only has about 10 weeks left, until the second half of December 2024, to adopt its Opinion – and the EDPB has announced that it will be holding a “stakeholder event” on 5 November 2024 in that context (registration of “interest” in participating opens this Tuesday 15 October at 10.00 am [Brussels time], if you are reading this on the day of publication – EDIT: here’s the link, while it remains open).

The aim of this article, then, is to try to highlight a couple of things that are worth bearing in mind ahead of the publication (likely on 23 December 2024) of the upcoming Opinion – and potentially, of that stakeholder event.

If you are a company or organisation that is involved in the development or (further) training of an AI model, this article will be of interest. Even if you are not, read on – because the developments on what is and isn’t personal data might be of keen interest to you anyway.

Note: this is a long one, even by my standards. I am therefore including a table of contents, and here’s the “tl;dr” summary:

  • Is AI model training data “personal data”? Not necessarily, not even for first-party data if handled well (!), because what matters is the perspective of the training organisation. I do believe, though, that this has to go hand in hand with measures to limit the possibilities of (re)identification and with measures to ensure that – even in the theoretical scenario where some information might be “personal data” – the objective and practical implementation of the training are such that any “processing” of personal data would be incidental at best and not part of the intended scope of the operation.
  • If there is processing of “personal data”, does “consent” then prevail over other legal grounds? And are there any differences to be made between first- and third-party data? No, and no. [It is a good thing that the EDPB’s “legitimate interests” guidelines, open for public consultation, confirm the first point.] In fact, “legitimate interests” may be more appropriate than “consent”, due to certain consequences of “consent”, notably as regards the right to withdraw consent.

Comments are welcome as always, but please only after reading the whole thing.


Structure of the analysis: (search the title to pick up where you left off)


Preliminary note: the consequences of an Article 64(2) GDPR procedure

One thing worth remembering in this context is that the request for an Opinion (which was made by the Irish Data Protection Commission) basically transforms the EDPB into a de facto rule-maker for data protection authorities. This is because an Article 64(2) GDPR Opinion by the EDPB forces all EU data protection authorities to follow the adopted position (or else face the risk of a binding decision against them), which in turn removes the right of any single entity targeted by a complaint or investigation to defend itself properly: the case is already decided from the get-go, as the authority will not adopt a different position. In practice, therefore, it hampers the rights of defence and removes the right to an effective remedy (= the right to an appeal), because the outcome of the first instance of proceedings – those before the authority itself – is fixed in advance. You are left with only one real instance of proceedings, the “appeal” phase.

This is why I have serious reservations about the use of Article 64(2) GDPR, and I wish that this practice – which suddenly became very popular among EDPB members in 2024 – would make way for more considered approaches to important issues. For instance, actually consulting with stakeholders – which happens to be required under Article 70(4) GDPR “where appropriate” (and I tend to think that anything that robs an organisation of its rights of defence makes consultation “appropriate”) – rather than just organising a virtual meeting to listen for a couple of hours to a small subset of concerns or positions.

But let’s assume it makes sense to go for an Article 64(2) GDPR Opinion here. In order to assess whether AI models* are trained with personal data and whether they in turn involve the processing of personal data, one must first examine a core question: what is “personal data”?


* “AI” is vague, as everything is “AI” today. When I refer to “AI models” in this op-ed, I am referring first and foremost to large language models (LLMs) due to the focus on “personal data” (as the issue is easier to discuss on the basis of language models). However, the thought process described herein could probably be extended at least to other artificial neural networks, and transformer models in particular. A point worth bearing in mind is that certain AI systems for e.g. sales and e-mail drafting are used to regurgitate specific company-sanctioned text; different considerations may apply there.


I. Can we not assume that everything is personal data?

It is critical to examine what is personal data, because the GDPR only applies to the processing of personal data – and it is not, even in theory, up to a potential controller to demonstrate whether information is personal data or not.

Articles 5(2) and 24 GDPR, on accountability, require that a controller be able to demonstrate compliance with its obligations under the GDPR. However, this requires first that the person or entity in question is a controller. In other words, the accountability obligation only kicks in once someone is shown to be a “natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data” (Article 4(7) GDPR) – and thus that there is even any “processing of personal data”.

Without personal data, no processing of personal data; without processing of personal data, no controller; without controller, no requirement to be able to demonstrate compliance with the GDPR.

Therefore, one cannot assume that there is processing of personal data.

II. What is “personal data”?

The notion of “personal data” is defined in Article 4(1) GDPR as:

“any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”.

This definition is more or less identical to that of “personal data” under Article 2(a) of the GDPR’s predecessor, the Data Protection Directive (95/46/EC):

“any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity”.

The notion of “identifiability” is in turn explained in Recital 26 of the GDPR:

“To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.”

Put differently: “identifiable” means that the controller or another person has means at its disposal that are “reasonably likely” to be used to identify the person “directly or indirectly”.

III. Do all identifiers enable identification?

Identifiers (such as a full name, a username, a user ID, an IP [Internet Protocol] address, a cookie ID, etc.) can help a (potential) controller or another person to link certain information to a natural person, but not all identifiers actually enable identification of the natural person in question. For example, millions of individuals have the same first name or the same last name. Even an identical combination of first and last name is common. Therefore, with a name alone, a person is not necessarily directly identifiable, and additional data and context are sometimes needed before (direct or indirect) identification becomes possible.

IP addresses provide another useful illustration. Recital 30 of the GDPR states the following:

“Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.”

In other words, says the GDPR, IP addresses are an identifier that “may be used to […] identify” a natural person. Yet IP addresses alone are insufficient, as they do not reveal anything enabling identification directly. An IP address might be related to a company or a governmental entity, and not a natural person. The relevant Internet Service Provider (ISP) has additional data at its disposal to verify whether an IP address was assigned to one of its customers at a given time, and to which customer in particular. However, these additional data are in principle not available to other parties.

Similarly, the value of a cookie (also referenced in Recital 30 GDPR) that merely captures a visitor’s choice of language on a multilingual website is not personal data per se. It is merely a preference expressed by “a” visitor at a given time. However, it can become personal data if the website operator is able to combine the value of the cookie (i.e. the language choice) with actual personal data (for instance, a user’s profile).

IV. A “relative” concept – or an “absolute” one?

The concept of “personal data” is thus relative: it depends on the ability of the relevant entity to identify or make identifiable the natural person to whom information relates.

A contrary view, personal data as an “absolute” concept, would mean that an entity should consider a particular data item to be personal data as soon as any third party in existence, anywhere in the world (including a third party unrelated to the relevant entity) is able to link it to a natural person, regardless of whether or not such third party could be lawfully required to disclose this link. This “absolute” view is the one that is favoured by some data protection authorities (for instance, the French CNIL has regularly indicated in informal discussions that this is its view).

Yet an “absolute” view would mean that if e.g. a single governmental intelligence agency somewhere in the world is able to link a particular data item to a natural person, everyone else globally has to consider that data item as personal data – even though there is no practical way for those others to obtain from said governmental agency any information enabling them to link that data item to a natural person.

The Court of Justice of the European Union (CJEU) will be led to examine this issue in case C-413/23 P, an appeal against a judgment by the General Court of the European Union in which the latter had adopted the “relative” approach [see footnote 1 below].

Previous judgments by the CJEU suggest that it will likely confirm this relative approach, with a potential caveat based on the likelihood of reidentification by a subsequent recipient, in the light of its Breyer, IAB Europe and Scania judgments.

V. CJEU teachings on “personal data”

V.1. Breyer

The “relative” nature of the concept of “personal data” was first hinted at in the context of the Breyer judgment of the CJEU (19 October 2016, C‑582/14) [2].

In that case, German federal institutions kept the IP addresses of visitors on their websites “[w]ith the aim of preventing attacks and making it possible to prosecute ‘pirates’” (para. 14). While the German federal institutions themselves could not identify individuals on the basis of these IP addresses, the CJEU was asked whether these IP addresses could nevertheless be considered personal data. The CJEU ruled as follows:

  1. The idea of “means likely reasonably to be used” implies that “for information to be treated as ‘personal data’ […]  it is not required that all the information enabling the identification of the data subject must be in the hands of one person” (para. 43).
  2. “[I]t must be determined whether the possibility to combine a dynamic IP address [= an identifier known by the potential controller] with the additional data held by the internet service provider [= a third party] constitutes a means likely reasonably to be used to identify the data subject” (para. 45).
  3. This “would not be the case if the identification of the data subject was prohibited by law or practically impossible on account of the fact that it requires a disproportionate effort in terms of time, cost and man-power, so that the risk of identification appears in reality to be insignificant” (para. 46).
  4. While the referring court held that “German law does not allow the internet service provider [= the third party in that case] to transmit directly to the online media services provider [= the potential controller] the additional data necessary for the identification of the data subject”, the referring court would also have to examine (i) whether legal possibilities existed for the German federal institutions – as the website operator and thus as the “online media services provider” – to turn to the “competent authority” “in the event of cyber attacks”, and (ii) whether such authority would in turn “take the steps necessary to obtain that information from the internet service provider [= the third party] and to bring criminal proceedings” (para. 47);
  5. It concluded in that case that “a dynamic IP address registered by an online media services provider [= the potential controller] when a person accesses a website that the provider makes accessible to the public constitutes personal data within the meaning of that provision, in relation to that provider, where the latter has the legal means which enable it to identify the data subject with additional data which the internet service provider [= the third party] has about that person” (para. 49; emphasis mine).

This is therefore a “relative” interpretation of the concept of personal data: if an entity has access to information that in and of itself is insufficient to identify a natural person, and the entity has no legal means of obtaining from a third party, directly or through an authority, additional data enabling such identification, such information is not personal data from that entity’s perspective (even if it is personal data from that third party’s perspective) [3].

V.2. Scania

The interpretation set out in Breyer was moreover the foundation for the CJEU’s decision in the more recent Scania judgment (9 November 2023, C-319/22). This case concerned the question of whether vehicle identification numbers (VINs) can constitute personal data, given that the VIN is linked to a vehicle by the manufacturer, and then information relating to the vehicle is made available by the manufacturer to independent repairers. An independent repairer is able to know whether the VIN is indirectly linked to a natural person, because it knows the identity of its customer and can therefore link the VIN to the relevant natural person. The manufacturer, however, is not in principle led to obtain any information tying a VIN to a natural person.

In its judgment, the CJEU drew a distinction between the manufacturer and the independent repairer:

“where independent operators may reasonably have at their disposal the means enabling them to link a VIN to an identified or identifiable natural person, which it is for the referring court to determine, that VIN constitutes personal data for them, within the meaning of Article 4(1) of the GDPR, and, indirectly, for the vehicle manufacturers making it available, even if the VIN is not, in itself, personal data for them, and is not personal data for them in particular where the vehicle to which the VIN has been assigned does not belong to a natural person.” [Scania, para 49]

Because the VIN does not enable identification on its own, and because the manufacturer does not have any information allowing it to tie the VIN to a natural person, the VIN does not in principle constitute personal data vis-à-vis the manufacturer. However, it can constitute personal data “indirectly” for the vehicle manufacturer, precisely because the manufacturer makes the VIN available to someone (the repairer) who is able to link it to a natural person.

The consequences of this “indirect” personal data are unclear, as the CJEU did not expand upon this idea. Perhaps unintentionally, this may be an incentive to take measures to limit the risk of a recipient using information as personal data (for instance, by way of contractual restrictions on permitted uses of a particular dataset).

V.3. IAB Europe

In another recent judgment (IAB Europe, 7 March 2024, C-604/22), the CJEU summarised the findings of its case law on the topic:

  • Information is personal data if “by reason of its content, purpose or effect, it is linked to an identifiable person” (reiterating its findings in Nowak [20 December 2017, C‑434/16] and CRIF [4 May 2023, C‑487/21]);
  • Regarding the question of whether a natural person is “identifiable”, “personal data which could be attributed to a natural person by the use of additional information must be considered to be information on an identifiable natural person” (by reference to Breyer as well as Nacionalinis visuomenės sveikatos centras [5 December 2023, C‑683/21]);
  • As a result, the concept of “personal data” covers “not only data collected and stored by the controller” but also “all information resulting from the processing of personal data relating to an identified or identifiable person” (Pankki S [22 June 2023, C‑579/21]).

VI. Different kinds of training data for AI models

With the above in mind, it is important to consider how training data is selected.

A frequent source of AI model training data is web crawling, whereby a bot collects publicly available information from a potentially vast range of websites. For instance, Common Crawl’s open repository of web crawl data includes “[o]ver 250 billion pages spanning 17 years”, with “3-5 billion new pages added each month”.

Another frequent source of training data is bespoke datasets, in particular if the organisation building or training the AI model has a digital platform (publisher, social media provider, web community, etc.) or provides additional services through which it has its own datasets (financial institutions, telecom providers, etc.). Alternatively, if the organisation considers a third party to have a particularly valuable dataset (e.g. the content of a web community with a strong focus on software programming), it can often (seek to) obtain a licence to receive a copy of or gain access to the dataset.

Among those bespoke datasets, a distinction can be drawn between first-party data (in practice, “own” datasets) and third-party data (sourced from a third party).


A note on terminology:

A distinction is sometimes drawn between “zero-party data” and “first-party data”, the former being data that customers and others intentionally and proactively share with an organisation, and the latter being data that the organisation itself collects proactively. For the purposes of this analysis, though, I am grouping both categories under the umbrella term “first-party data”.

Similarly, a distinction can sometimes be made between “second-party data” and “third-party data”. In those cases, “second-party data” denotes situations where a dataset is sourced from an organisation for whom the dataset is “first-party data” (e.g. if an AI model trainer requests from a publisher [= the third party] a copy of the dataset regarding a particular website of that publisher [= the own dataset of that third party]), as opposed to a narrower view of “third-party data” where it is limited to datasets that the third party in question has not collected directly (in practice, therefore, where the third party in question acts as a data broker and it [the third party] has obtained the dataset from other sources). Again, for the purposes of this analysis, I will be grouping both categories, this time under the umbrella term “third-party data”, except where I explicitly refer to second-party data.


In addition, it is important to distinguish structured data from unstructured data. In the former case (which is often more relevant for first-party data), it is possible for an organisation to know and understand the manner in which the dataset is built and the fields that compose it. This means that certain pseudonymisation and anonymisation techniques are more effective on structured data, as it is possible to apply bespoke instructions based on the field and have those instructions applied to the whole dataset consistently.

By way of an illustration, if a dataset is structured in the form of a table with one column entitled “userID”, another “fullName” and another “postContent”, the organisation could choose to systematically exclude the “userID” and “fullName” fields and only include the “postContent” in its training dataset. It could also go a step further and use regular expressions (i.e. smart searching & replacing) to remove any mentions of a username from the “postContent”, by reference to the “fullName” values or through the detection of patterns meant to cover usernames (e.g. @ mentions).
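To make this concrete, here is a minimal sketch – assuming a hypothetical pandas DataFrame with the “userID”, “fullName” and “postContent” columns mentioned above – of what such field exclusion and regex-based scrubbing could look like. It is an illustration only, not a complete pseudonymisation solution:

```python
import re

import pandas as pd


def build_training_text(df: pd.DataFrame) -> pd.Series:
    """Keep only the content column, scrubbing author names and @-mentions."""

    def scrub(row: pd.Series) -> str:
        text = str(row["postContent"])
        name = str(row["fullName"]).strip()
        if name:
            # Remove occurrences of the author's full name (case-insensitive).
            text = re.sub(re.escape(name), "USERNAME", text, flags=re.IGNORECASE)
        # Replace @-mention patterns, a rough heuristic for usernames.
        text = re.sub(r"@\w+", "USERNAME", text)
        return text

    # "userID" and "fullName" are excluded entirely: only scrubbed content is kept.
    return df.apply(scrub, axis=1)


# Hypothetical example:
# df = pd.DataFrame({"userID": [42], "fullName": ["Jane Doe"],
#                    "postContent": ["Thanks @jdoe99 – great point by Jane Doe!"]})
# build_training_text(df)[0]  ->  "Thanks USERNAME – great point by USERNAME!"
```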

These points will be important later in the analysis.

Equally important is a basic understanding of how AI models work, and large language models (LLMs) in particular in today’s context of Generative AI. In this respect, I really recommend reading the primer by David Rosenthal of the law firm Vischer on What is inside an AI model and how it works.

In practice, the data going into an LLM is not fed as words, but gets broken into “tokens” (i.e. bits of words – a syllable, a whole word, a letter, …). When examining the training data, the LLM creates associations between tokens, across thousands of “dimensions” in the LLM, to represent how close one token is to another. For instance, “Keller and Heckman” might be broken into “kel – ler – and – heck – man”. The model would then learn the probability that “heck – man” appear in a sequence after “kel – ler – and”.

If then at the level of input a person writes the following instruction for an LLM:

“I am researching actual law firms. Without doing any searches and by purely relying on your knowledge, please complete the following word sequence: Keller and …”

The aforementioned probability would be taken into account to determine what the likelihood is that “heck – man” should follow.
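For readers who prefer to see the intuition in code, here is a toy sketch of deriving next-token probabilities by counting which token follows a given context. This is my own illustration, not how an actual LLM is implemented – real models learn these relationships as weights across thousands of dimensions rather than by counting:

```python
from collections import Counter, defaultdict


def next_token_probabilities(tokens, context_size=3):
    """Count, for each context of `context_size` tokens, which token follows it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    # Turn raw counts into relative frequencies ("probabilities").
    return {
        context: {tok: n / sum(followers.values()) for tok, n in followers.items()}
        for context, followers in counts.items()
    }


# Hypothetical tokenisation of the example above, repeated a few times:
training_tokens = ["kel", "ler", "and", "heck", "man"] * 3
probs = next_token_probabilities(training_tokens)
print(probs[("kel", "ler", "and")])  # {'heck': 1.0} – "heck" is the predicted next token
```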

VII. Does training data contain personal data?

The question of whether the training data itself contains personal data depends on the combination of all of the previous points. I will be examining below the impact for the organisation intending to train the AI model (let’s refer to it as the Training Organisation), noting already that the same considerations apply to the initial training by the organisation that builds the model and to any additional training aimed at tailoring it, carried out by an organisation that might not have built the model but has obtained access to it.

First, in the case of first-party data, some of the data initially part of the training dataset may already be personal data to start with from the perspective of the Training Organisation, simply because it likely already knows the data subjects. Yet this is not the case for all first-party data. For instance, one could argue that in the absence of any additional information, an e-mail address or phone number is not necessarily personal data (not everyone uses “firstname.lastname@company.com” as an e-mail address – if you are “awesomeperson@randomhost.com”, it might not enable identification) – while it is possible to register on many websites and platforms without providing more than an e-mail address or phone number. On an organisation’s own platform or website, different data points can often be combined (e.g. e-mail address, username, time zone, IP address, etc.), at least in theory, to enable identification, but taken separately these data points may in and of themselves be insufficient to that end. Even if the training dataset includes personal data to begin with, though, this does not necessarily mean that the training dataset has to continue to contain personal data, as will be discussed in section IX hereunder.

Second, in the case of third-party data, it is likely that crawling data will include personal data from the perspective of the publisher of a given website (see by analogy the points raised regarding first-party data). A less obvious issue is whether that crawling data can be considered as personal data from the perspective of the Training Organisation. This is because, due to the relative nature of personal data (see sections IV and V above), it is important to take into account (i) the fact that natural persons might not always be “identified” as such, (ii) the legal means at the disposal of the Training Organisation to enable identification and (iii) the intent and nature of the service.

Why is this important? Knowing whether training data includes personal data has a significant impact on the applicable obligations. If there is no personal data from the perspective of the Training Organisation, it cannot be considered as processing personal data when training its AI model. If there is, it will have to comply with certain obligations – but that does not mean that the training is prohibited either (see sections XI.B & XI.C below).

VII.1. “Identified”?

The first question is whether a natural person is ever an identified natural person in the context of third-party data, including crawling data.

“To identify” means in ordinary language “to recognize someone or something and say or prove who or what that person or thing is”. In other words, in order for a person to be “identified” by a Training Organisation, the Training Organisation must recognise the person in question or have proof of who that person is.

In this respect, it is entirely possible that some of the web pages within crawl data reveal the entire identity of someone. Some personal websites might feature the name, address and phone number of the owner; some professional websites or profiles might feature a person’s name, title, e-mail address; etc. Some of it may be contributed by third parties, such as pages on wiki-type collaboration platforms.

However, the notion of “identified” requires a form of intent and knowledge: just because information seems to relate to a natural person does not make it personal data. There may be fake profiles and fake websites; in fact, there are many fake profiles and fake websites, many of which will include AI-generated information and even photos [4].

LinkedIn reported having blocked more than 121 million fake accounts in 2023, 89 million of which were blocked at registration and 32 million of which were restricted proactively by LinkedIn after registration but before any reporting by legitimate users, and whose profiles or content might thus have been exposed to legitimate users. This is a large number, considering the fact that there are on average 47.9 million users who log in to LinkedIn per month across the entire European Union.

Meta reported taking action on 2.62 billion fake accounts in 2023 (a combination of accounts actioned after someone reported them [1%] and accounts that Meta found proactively [99%]), and already (an additional) 1.83 billion during the first semester of 2024.

The Anti-Phishing Working Group reported detecting over 1.07 million unique phishing websites (potentially each with thousands of related URLs) just during the 4th quarter of 2023, i.e. fake websites with the particular aim of getting people to take an action to benefit the attacker (such as giving credentials to a legitimate website), and NewsGuard reported identifying over 1000 AI-generated news and information sites.

With this in mind, one cannot – and should not – assume that all web content that seems to relate to a person does indeed relate to a natural person.

Even beyond the issue of fake accounts, though, the data must be capable of being linked to an identified natural person, i.e. someone that the potential controller – the Training Organisation – is able to recognise.  In the case of first-party data (in particular structured first-party data), it is far likelier that this condition will be met, because the data can more easily be linked to an identity in the sense of a (known) full user profile with contact details etc.

Typically, in the case of structured data being provided by a third party, the third party is requested to provide certain warranties contractually (that the data is accurate, that it relates to actual, living natural persons, etc.). If the third party declines to give such warranties, though, the Training Organisation may have to assess its level of confidence that the data received concerns natural persons.

In the case of unstructured third-party data, such as web scraping data, this assessment of the level of confidence that the data concerns natural persons is equally relevant, in particular in the light of the concerns set out above – unless the nature of the intended use is such that there is not supposed to be any processing of the information as personal data (see section VII.3 below).

In practice: the likelihood that any information in scraped data can be considered as concerning identified natural persons from the perspective of the Training Organisation depends on the degree of confidence given to a particular website; even with other third-party data there may be reasons not to consider certain data as concerning identified natural persons from the perspective of the Training Organisation.

VII.2. Legal means at the disposal of the Training Organisation

While on some digital platforms the visible username corresponds to a “first name + last name” combination, on many others it does not, making it that much harder to identify a natural person based on the relevant username. In such a case, much of the information that would enable such identification (e.g. e-mail address, IP address, etc.) is only available in the database of the website owner – and it would thus not be included in scraped data.

As a result, that information can hardly be considered as relating to an “identified” natural person from the perspective of the Training Organisation.

Yet does it have “legal means” at its disposal to enable reidentification, in accordance with the CJEU’s Breyer case law?

In most cases, the answer will be simply “no”. Several factors support this:

  • First, the notion of “legal means” implies that obtaining reidentification unlawfully cannot be seen as “legal means”. This means that getting reidentification data through backchannels or sourcing it from unlawful sources should not be taken into account.
  • Next, it is worth pointing out that the sheer volume of crawl data is not without consequence – as will be seen in section VII.3 hereunder – but it is in and of itself irrelevant to the specific issue of whether there are any “legal means” at the Training Organisation’s disposal. Technically, nothing prevents the Training Organisation from taking one random sample out of the large volume of crawl data.
  • However, “legal means” is meaningless if one takes a purely hypothetical approach and posits unlikely assumptions. This is because Recital 26 GDPR specifically talks about means “reasonably likely to be used” (emphasis mine), taking into account “the costs of and the amount of time required for identification”. For instance, it is wholly unrealistic and fallacious to consider that a Training Organisation could get in touch with the owners of every website covered by a crawl dataset to request a copy of the relevant identification data, and it is equally unrealistic to consider that website owners would individually give a copy of their relevant identification data to the Training Organisation (let alone that they would have a legal framework in place permitting this). “Legal means” cannot be about unrealistic hypotheses but has to take into account reality.
  • In addition, the unstructured nature of crawl data is such that identification is that much harder, as there are no common elements that would allow the Training Organisation to anticipate whether a specific portion of crawl data contains potentially personal data. Structured third-party data, on the other hand, is more predictable (due to its structure). While the data may vary considerably, the common elements mean that it is easier to anticipate what the data might contain. This is where realism comes into play: with unstructured data, it is less realistic that there are any reasonable means at the disposal of the Training Organisation to attempt reidentification, even for all the information on one single website – let alone for all the information across multiple websites.
  • While similar observations can be made in relation to other third-party data, in particular unstructured datasets, the effort needed to contact such third party will be significantly less in the case of “second-party data”. In such a case, it may theoretically be “reasonable” to get in touch with the “second party”. However, it is equally likely that the second party itself does not have the necessary legal framework to provide identification data anyway. This is because data protection rules will often mean that such second party does not have an appropriate legal ground for the sharing of such identification data, or that sharing it would not be covered by its own privacy statement or other transparency notices. Put differently, while there may be reasonable means to get to identification (as per Recital 26 GDPR), those means are unlikely to be legal means (as per Breyer).

VII.3. Intent and nature of the service: where grey zones for “personal data” and “processing” intersect

It is now relevant to turn to another key aspect, one that is often forgotten in assessments of whether there is processing of personal data: while some information could constitute personal data from the perspective of a particular entity, it may be that the way the entity interacts with the container of such information is not intended to treat that information as personal data, for perfectly valid reasons, and that there can be no processing of personal data within the meaning of data protection legislation as a result. And this goes to the heart of both the notions of “personal data” and of “processing” – the combination of which is needed for the GDPR to apply.

For instance, in the IT sector, there are many providers of maintenance services. Maintenance services are services that involve verifying the manner of working of a particular IT system or database, in order to check that there are no issues, or to fix them otherwise. Maintenance service providers only look at the system or database structure, not the data contained within it, but the services they provide can have a direct impact on how the processing occurs. Most such services are built (and individuals are trained) in such a way that there is no processing of personal data.

Similarly, IT support services are often devised in such a way that a person from the support services logs in remotely to a system and helps troubleshoot issues with the system. The IT support person only looks at the issue and how to fix it. These services are also built (and individuals are trained) in such a way that they are not involved in any processing of personal data.

However, in both cases, incidental viewing of or access to personal data may take place. The support person helping out with troubleshooting of an issue on an HR system might see a pay-slip of an actual employee; the maintenance provider for that HR system might see a few lines in the database in relation to who has created a pay-slip and on which date.

Yet there is no intention to carry out any processing of personal data. In fact, contracts for IT maintenance services and IT support services often prohibit the processing of personal data in that context. It is not the intention of the service, nor should any person involved in the provision of such services be asked or led to “actively” process personal data. There are general confidentiality obligations, but they relate to all information and not specifically to any information that may incidentally have been gleaned by the person providing the service.

This is because in the context of the services, while the information may seem like potentially personal data, the information is not treated as such. It is treated purely as a record in a database, text on a screen, etc., that is not being processed as personal data.

Because of this particular context, such providers often dispute the allegation that they are involved in any processing of personal data, whether as controller or as processor, be it because the information is not treated as “personal data” or because such incidental access is not treated as “processing”.

The EDPB has taken a different view, of course, with a reference in its Controller-Processor Guidelines to the following (pp. 27-28):

“The access to personal data is not the main object of the support service but it is inevitable that the IT service provider systematically has access to personal data when performing the service. Company Z therefore concludes that the IT service provider – being a separate company and inevitably being required to process personal data even though this is not the main objective of the service – is to be regarded as a processor. A processor agreement is therefore concluded with the IT service provider.”

This is unfortunately an oversimplification, in particular in the event of maintenance services:

  • Those I have interacted with typically do not consider it to be personal data from their perspective, as they never look at the data itself and therefore never assess whether it might concern an identified or identifiable natural person; [not relevant for a processor, but see right below]
  • To classify as a processor, an entity must be processing personal data on behalf of and in accordance with the instructions of the controller. Yet incidental access – something merely appearing on the screen – might not be “processing” (they would say that they do not even carry out “consultation”, precisely because they are instructed not to); this is a clear difference with a hosting provider, which clearly carries out storage and makes data available in accordance with the controller’s instructions.

In other words, this particular legal justification for the inapplicability of the GDPR lies on the edge between what is personal data and what is processing.

Yet a similar argument could be made in relation to how the Training Organisation views information contained within a training dataset. The very reason for having a large volume of content is not in order to treat information within as personal data, but in order to train an AI model so that it has a better grasp of how certain content is ordered and which is the next relevant token (see section VI above in this respect). The intent is never to process personal data; the intent is just to “process” (in the traditional computing sense) information. Any information in there that happens to be personal data would not be treated by the Training Organisation as such; the AI model sees the tokens from the information once and never with the intent of treating the information itself as personal data (because they are tokens and they are “organised” into relationships by probability, across different dimensions, rather than certainty). This “intent” is not subjective: it is tied to the very objective and nature of a service. Generative AI models in particular are not data retrieval systems; they are sequence prediction machines. With predictive or discriminative AI systems, there may be more room for processing of personal data depending on their objective and function.

As hinted to above, there is here an important distinction to be made with certain other IT services, such as hosting. A hosting service provider also looks at information in a general sense; its role is to make information available. However, it acts clearly as processor in this case. Any information that it is hosting that happens to be personal data from the perspective of the hosting provider’s customer is only being processed in accordance with the instructions of the customer. The customer requests hosting services, including for personal data, and the service being provided is carried out accordingly.

In the case of IT maintenance services and IT support services, however, the instruction only relates to the system or database (infra)structure, not its content (i.e. not the data within) – reason for which they cannot be viewed as being instructed to process personal data.

In the case of AI model training, there is no similar instruction from the provider of the training data. The Training Organisation decides on its own that it wishes to have information from one or more sources, and if that happens to incidentally include some information that is potentially personal data, it is the Training Organisation that has defined the scope of its intended use thereof and that has chosen not to treat it as personal data.

Is that sufficient? Should we consider that there is also no “storage” of personal data if the training data is kept for a longer period? We have to be careful here with the notion of temporality, and be wary of unintended consequences. If the duration of conservation of the dataset has an impact on whether there is “storage”, file sharing providers could easily start to argue that they are not involved in any processing if a user sets a near-zero file expiration duration.

Still, these are all important questions that might have an influence on the classification – but without first questioning assumptions, the result might be a foregone conclusion built on shaky ground.

To use another example, librarians organise massive numbers of books, many of which might include personal data – and yes, filing systems are used to organise those books, so in theory the GDPR should apply; yet how many would consider that a librarian is actually processing the information contained within the books?

Put differently, just as there is no controller in the case of IT maintenance or support services (because there is no actual processing of personal data), perhaps there should be no controller for the purposes of training of an AI model (because there is no actual processing of personal data).

VII.4. Surely you cannot evade the GDPR by choosing not to treat information as personal data?

The criteria for applicability of the GDPR are inherently objective yet relative. If information is not personal data from a potential controller’s perspective, it is not personal data and can be treated as “non-personal data”. What is being examined here is that although there may be information in a training dataset that is “potentially personal data”, (i) there is no examination of that information individually and (ii) the training itself is not intended to establish any links between personal data and other information but rather to detect patterns between tokens (i.e. bits of words throughout the training dataset).

VII.5. What of the rights of data subjects? What if there is a breach?

First off, I’m not saying that a Training Organisation can be cavalier in its manner of handling this type of information. It remains “potentially personal data”, just like many other kinds of non-personal data. “Potentially personal data” is an expression I like to use when explaining to clients that there is a risk that some non-personal data they are handling might become personal data in some circumstances, and that they should take measures to safeguard that non-personal data – including against the risk of (re)identification. Yet I am pretty sure that a Training Organisation would already be keen to protect its training data anyway and to avoid a leak thereof.

Data subject rights would only kick in once information becomes personal data, but before that the Training Organisation would potentially want to take measures to limit the risk of it becoming personal data or – should it be possible that it does become personal data – to already have measures in place to handle data subject rights when that happens. For instance, if there is a breach, that information might become personal data as a result depending on the recipient, potentially triggering also the need for notifications to authorities and to the public. “Potentially personal data” is probably then not too far from what the CJEU had in mind with its approach in Scania.

[Specifically in relation to LLMs, the most workable way of dealing with data subject requests would in any event likely be filtering at output level, i.e. blocking a particular type of information from being generated through the LLM. After all, if some information is “removed” from the training data and an AI model is re-trained, what happens if a new set of training data happens to include the same information? Far better to manage the output, as that would appear to address the concern much more effectively.]
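As a purely hypothetical illustration of what such output-level filtering might look like, the sketch below checks generated text against a list of terms that data subjects have asked to have suppressed and redacts them before the response is returned. The function and list names are my own; real-world implementations are likely far more sophisticated:

```python
import re

# Hypothetical opt-out list of strings that should not appear in model output.
BLOCKED_TERMS = ["Jane Example", "jane.example@mail.test"]


def filter_output(generated_text: str) -> str:
    """Redact blocked terms from generated text before it is shown to the user."""
    for term in BLOCKED_TERMS:
        generated_text = re.sub(
            re.escape(term), "[removed]", generated_text, flags=re.IGNORECASE
        )
    return generated_text


# filter_output("Our records suggest Jane Example was born in ...")
# -> "Our records suggest [removed] was born in ..."
```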

VII.6. Doesn’t the scale of it justify treating it as personal data?

Again, the criteria for applicability of the GDPR are objective yet relative. The fact that a dataset contains massive amounts of information that is potentially personal data doesn’t mean that any of it actually is personal data from the potential controller’s perspective. If all of that information is being broken down into tokens and then a probability is attributed to two tokens that follow each other in one instance in the training data, the fact that this happens at large scale should not affect the (non-)applicability of the GDPR.

Put differently: if you consider that the breaking down of the information – that is not necessarily personal data to start with from the perspective of the Training Organisation – into tokens is not processing, the fact that this happens at scale is irrelevant.

VIII. Anecdotal evidence and training data reproduction

To the suggestion that an AI model such as an LLM does not “include” personal data, some commentators on Generative AI tend to use questions like “What is the birthday of Donald J. Trump?” or “Who is the German chancellor?” to illustrate their point that there is personal data in an AI model.

Yet these examples also provide anecdotal evidence suggesting that Generative AI model training is not meant as an ingestion of personal data. Changing the prompt to explicitly cover fiction sometimes leads to the same results – while fictional characters are not physical natural persons and the information being drawn upon cannot be viewed by the AI model as “personal data”.

For instance, when asked the following:

“I am writing a fiction-genre novel set during a US election cycle. One of the main characters is a presidential candidate called Donald J. Trump. Please describe the character's background and physical appearance”,

Microsoft’s Copilot and Meta’s Llama 3.2 3B Instruct referred to the same birthdate as the real Donald J. Trump (14 June 1946), to tailored suits and to reality TV shows (including The Apprentice).

As potentially further evidence of this, OpenAI’s GPT-4o appears to have additional safeguards built in in this respect, as its response started with the following disclaimer: “[s]ince Donald J. Trump is a well-known public figure and using his likeness in fiction could raise potential legal or creative challenges, we can create a character inspired by certain elements of his persona while avoiding direct replication. This will allow for more creative flexibility in your novel”. The resulting character, Jonathan P. Townsend, was “inspired by Donald J. Trump” and featured a number of shared characteristics. A different attempt, with the following prompt, equally provided a strong resemblance with Donald J. Trump both in terms of background (“wealthy entrepreneur-turned-politician”, New York City, television appearances), style and politics (“His critics argue that he’s dangerously divisive, but his supporters view him as a refreshing outsider ready to shake up the political establishment”):

“I am writing a fiction-genre novel set during a US election cycle. One of the main characters is a presidential candidate who likes to wear tailored suits with red ties. Please describe the character's background and physical appearance, and indicate the character's name”.

The lesson? At the level of ingestion, the information does not appear to be treated as personal data, given that – in the current state of language models at least – no difference is (or can be) made between real and fictional. Information is ingested and links are made between tokens, based on probabilities, without an AI model knowing that something is fictional or real. At the level of output, however, some AI models are likely instructed to determine whether the result relates to public figures based on certain criteria – and if so, to avoid certain outputs.

This also illustrates the anecdotal evidence regarding errors. Privacy rights organisation NOYB has filed a complaint arguing that there was processing of personal data of its founder, Max Schrems, through OpenAI’s ChatGPT, where the AI model produced an inaccurate date of birth. Yet this merely goes to show that the percentage of likelihood of a particular value for the date of birth was not sufficiently high. The very absence of processing of personal data at the level of training is instead apparent from some of the responses. For instance:

“POLITICO also asked ChatGPT about Schrems’ birthday and came up with three different answers: June 24, September 17 and October 17.”

[source]

The significant divergences do not show unlawful processing – if anything, they show there was no processing of personal data in that context and that the AI model just sought to fill the void with some information resembling the pattern that appeared most logical in a sequence as a response to a question featuring the sequence of tokens from the words “date of birth”.

Basically, don’t assume that language accuracy (what an LLM tries to create through its predictions) is accuracy of personal data.
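To make that point more tangible, here is a toy sketch of why sampling from a flat probability distribution produces different “answers” on different runs – which is consistent with the divergent birthdates quoted above. The probabilities are entirely invented for illustration and are not taken from any real model:

```python
import random

# Invented, roughly equal probabilities for three candidate continuations.
next_token_probs = {"June 24": 0.34, "September 17": 0.33, "October 17": 0.33}

for _ in range(3):
    # Sample a continuation according to the (flat) distribution.
    answer = random.choices(
        population=list(next_token_probs), weights=next_token_probs.values()
    )[0]
    print(answer)  # may print a different date on each run
```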

IX. Measures to limit the risk of classification as (processing of) personal data – even first-party data

In the sections above, I have focussed on third-party data, as it provides useful illustrations of how the concept of “personal data” works (and its limits).

Yet first-party data can also lead to similar observations. While first-party data is often personal data from the perspective of the Training Organisation, its intended use in the context of training a Generative AI model is not as a source of personal data but as a source of content, to allow the prediction of the next logical token in a sequence.

In that context, wouldn’t it be possible to limit the risk of classification of the training data – even first-party data – as “personal data” from the perspective of the Training Organisation (and thus also the risk of classification of the training itself as processing), by taking measures to limit the possibilities of (re)identification?

We know that “pseudonymised data” normally remains “personal data” for the controller who did the pseudonymisation, because it is possible to trace the pseudonymised data back to the personal data. That is all very well. However, if measures are taken to limit the risk of (re)identification, could it not be that the training dataset can be considered as information that is not personal data from the perspective of the Training Organisation, as (i) the information does not relate to an identified natural person and (ii) it does not relate to an identifiable natural person either?

This last condition might very well be met if in practice:

  1. there are no legal, reasonable means at the Training Organisation’s disposal to (re)identify the natural person (as it is not possible in practice for the Training Organisation to obtain additional data enabling such identification [see VII.2 above], in the case of third-party data, or the cost and time involved in re-identifying someone from the training data would be excessive, in the case of properly handled first-party data) and
  2. even in the theoretical scenario where some information might be personal data, the manner in which AI models work and thus the objective and practical implementation of the training itself mean together that any “processing” of personal data would be incidental at best and not part of the intended scope of the operation.

To use a much-hyped LLM prompt technique, “let’s take a breath and proceed step by step”. Let us imagine that ABCDE has personal data in database 1 and then creates a copy in database 1bis. Database 1bis at this stage clearly contains personal data. ABCDE then modifies database 1bis to remove the identifiers it is able to find. In and of itself, database 1bis could be said no longer to include personal data, but given the continued existence of database 1, surely it must be possible to link the information from database 1bis to the personal data from database 1, right? Yet it isn’t that simple, in particular if some noise is added in the process of de-identification (= data that gets changed when it shouldn’t, new nonsensical data or synthetic data gets added, etc.). Yes, at an individual level (looking at one individual record) re-identification should be possible; yet no, that does not mean that the whole dataset must be considered as readily re-identifiable.

While there might in theory be legal means available, let’s not forget that recital 26 GDPR specifically talks about “means reasonably likely to be used”, taking into account “the costs of and the amount of time required for identification” and “available technology”. And re-identification might end up being cost- and time-prohibitive if sufficient steps have been taken.

It’s the kind of approach that would clearly be less difficult (albeit not necessarily easy) to defend if the Training Organisation can show that it has taken data minimisation and pseudonymisation measures, to the extent appropriate and reasonably feasible. For instance, it might be possible to implement some of the following measures (a rough code sketch combining a few of them follows the list):

  • Exclusion of structured data fields enabling identification: In the case of structured data, data fields that are known to enable identification (e.g. the column featuring the (user)name of the content or post author) might be excluded altogether;
  • Identification data replacement: Where there are such identification fields, the occurrences of their values within other fields (e.g. a content field) could be removed by systematically replacing them with dummy data (“John/Jane Doe”, “USERNAME”, etc.);
  • Pattern-based pseudonymisation: There may be a way to detect potential identifiers (such as text immediately preceded by an “@” symbol in a content field) and replace them systematically with dummy data (“John/Jane Doe”, “USERNAME”, etc.), as illustrated in the sketch after this list. It is worth noting that detecting a person’s full name purely on the basis of text patterns, without other indicators, is complex to do properly without too many false positives or false negatives, given the multitude of ways of writing full names in several languages [5], but that doesn’t mean the effort isn’t worth it (precisely because it makes re-identification that much harder);
  • Noise addition: As I mentioned above, “noise” (i.e. additional random data being added into the mix) can be a great way to further derail re-identification efforts;
  • Segregation: In addition, dataset segregation through organisational, technical and contractual measures (separate teams, separate access rights, technical restrictions to modifications, etc.) remains a useful tool to manage the risk.
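
By way of illustration only, the following minimal Python sketch shows how a few of the measures above (field exclusion, identifier replacement, pattern-based pseudonymisation and noise addition) could be combined on a small structured dataset. The field names, patterns, dummy values and noise rate are purely illustrative assumptions, and a real pipeline would need far more robust detection (see the caveat on full names above).

```python
import random
import re

# Hypothetical structured rows; the field names ("author", "content", "rating")
# are illustrative assumptions only.
rows = [
    {"author": "jdoe42", "content": "Thanks @jdoe42, see you at the match!", "rating": 5},
    {"author": "a.smith", "content": "Great post by a.smith about sourdough.", "rating": 4},
]

HANDLE_PATTERN = re.compile(r"@\w+")  # crude pattern-based detection of "@handle" identifiers


def de_identify(row: dict, identifier_fields=("author",), noise_rate=0.1) -> dict:
    """Sketch of the measures above: field exclusion, identifier replacement,
    pattern-based pseudonymisation and light noise addition."""
    content = row["content"]

    # Identification data replacement: remove occurrences of known identifier
    # values wherever they appear inside other fields.
    for field in identifier_fields:
        content = content.replace(row[field], "USERNAME")

    # Pattern-based pseudonymisation: replace anything that still looks like a handle.
    content = HANDLE_PATTERN.sub("USERNAME", content)

    # Noise addition: occasionally perturb a non-identifying field to derail
    # linkage back to the source database.
    rating = row["rating"]
    if random.random() < noise_rate:
        rating = random.randint(1, 5)

    # Exclusion of identifying structured fields: the identifier columns are
    # simply not copied over into the training row.
    return {"content": content, "rating": rating}


training_rows = [de_identify(r) for r in rows]
print(training_rows)
```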

In relation to third-party data, not all of these types of measures can always be reasonably implemented, in particular in the case of scraped data. This is because scraped data is by nature unstructured as a whole and patterns are difficult to apply at the level of the entire dataset of scraped data without significantly varying results.

X. New AI model vs retraining an existing one

All of the above is relevant irrespective of whether the Training Organisation is training a new AI model or is improving upon an existing one. This follows from the fact that retraining or refining an existing AI model is ultimately the same operation as training a new one; the only true difference may be the identity of the Training Organisation carrying it out.

XI. What if it is personal data?

If an in-depth assessment does conclude to the existence of personal data from the perspective of the Training Organisation in the training data, in spite of the above, what are the consequences?

In particular, which legal ground is appropriate under the GDPR in relation to such training?

XI.A. No hierarchy of legal grounds

Where there is processing of personal data, it must be lawful, in accordance with Article 5(1)(a) of the GDPR (principle of lawfulness, fairness and transparency).

Among the legal grounds listed in Article 6 of the GDPR, two appear to be potentially relevant to the training of an AI model:

  • Consent: “the data subject has given consent to the processing of his or her personal data for one or more specific purposes” (Article 6(1)(a) GDPR);
  • Legitimate interests: “processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data” (Article 6(1)(f) GDPR).

In practice, consent requires one to meet the four conditions of valid consent under the GDPR (freely given, specific, informed and unambiguous), while the legal ground of legitimate interests requires one to document the legitimate interest invoked as well as the fact that the rights and freedoms of the data subject do not prevail over that legitimate interest.

Article 6 of the GDPR importantly does not establish a hierarchy between legal grounds. Nothing in the text of the GDPR suggests that one prevails over another.

Yet some in the past have contended that there is a primacy of consent.

This was notably the case in a judgment by a German court in Bayreuth in 2018, which held that legitimate interests could not be relied upon as a legal ground if consent was an option. This came from its assessment of the “necessity” test, during which the court stated that “[i]t must therefore also be taken into account that the applicant acquires the data in particular in the context of ordering processes and that it would therefore be possible for it to obtain consent to the transmission of the data to [a given recipient] in individual cases without disproportionate effort” [6] (rough translation).

This position would suggest that legitimate interests can never be relied upon if consent is an option.

It is of course not the approach that the legislator has chosen.

Article 6(1) GDPR states that “[p]rocessing shall be lawful only if and to the extent that at least one of the following applies”, before listing six legal grounds – consent being just one of them. Nothing in the wording of that provision ever suggests that one prevails over another, or that one cannot be relied upon if consent is an option.

It is, in any event, the only position that makes sense: a hierarchy with primacy of consent as the first legal ground listed would mean that several other legal grounds become largely irrelevant:

  • “Contract” as a legal ground (Article 6(1)(b) GDPR) would disappear entirely, as it can in any event only be used where there is a contractual relationship with the data subject; regulators would therefore likely assume that there is always the possibility of asking for consent instead. While some might object that such consent would not be “freely given”, it is worth noting that Article 7(4) GDPR – on the very notion of what is “freely given” consent – only suggests that consent to the processing of personal data that is not necessary for the performance of a contract cannot be freely given if coupled with that contract. Conversely, consent to the processing of personal data that is necessary for the performance of a contract can be freely given, even if coupled with a contract. In other words, “consent” primacy would mean that “contract” ceases to exist as a legal ground.
  • The consequences for the “legal obligation” legal ground (Article 6(1)(c) GDPR) would also be significant, as legal obligations to process personal data very often apply to organisations and entities that have some means of contacting a data subject [7]. Such a hierarchy would moreover cause further issues, as consent can be withdrawn at any time, while a legal obligation precisely precludes the option to stop processing.
  • For legitimate interests (Article 6(1)(f) GDPR), a primacy of consent would mean that legitimate interests could never be relied upon by a controller that has the means of contacting a data subject. In other words, no first-party data could ever be processed on the basis of legitimate interests. A good illustration of how this would go wrong is anti-fraud processing, which is often carried out by or through the intervention of entities that have means of contacting a data subject (e.g. at the level of a checkout in a webshop, specifically at the stage of online payment processing). Yet it makes no sense to ask for consent for anti-fraud processing, because malicious actors will of course not give their consent. While theoretically possible, consent is therefore nonsensical here, and a primacy of consent would be disastrous for anti-fraud mechanisms.

That there is no hierarchy has been confirmed by several supervisory authorities in the past, such as Belgium [8], France [9], Ireland [10] and the UK [11]. In the Netherlands, the Ministry of Justice confirmed this separately [12]. Finally, in its new (draft) guidelines on “legitimate interest” that are open for public consultation, the European Data Protection Board has also recognised this explicitly:

“it should be recalled that the GDPR does not establish any hierarchy between the different legal bases laid down in Article 6(1)” [13]

In other words, a controller is always entitled to assess each legal ground and choose the one that is the most appropriate. While a justification may sometimes be relevant, there is no requirement to choose one over another.

Did the CJEU question this, though? In its judgment of 4 October 2024 in the KNLTB case (C-621/22), the CJEU stated the following in its assessment of necessity:

"as regards the condition that such processing be necessary for the purposes of that interest and, in particular, the existence of means that are less restrictive of the fundamental rights and freedoms of data subjects and equally appropriate, it must be stated that it would, in particular, be possible for a sports federation such as the KNLTB, wishing to disclose its members’ personal data to third parties for consideration, to inform its members beforehand and to ask them whether they want their data to be transmitted to those third parties for advertising or marketing purposes" [KNLTB, para. 51]

This seems to suggest some similarities with the aforementioned Bayreuth judgment. Yet the CJEU is specifically talking about legitimate interests, not consent, as the following paragraphs show:

"52   That solution would make it possible for the members concerned, in accordance with the data minimisation principle [...] to retain control over the disclosure of their personal data and thus to limit the disclosure of those data to what is in fact necessary and relevant in relation to the purposes for which those data are transmitted and processed [...]
53 A procedure such as that described in the preceding paragraph of the present judgment may involve the least intrusion in the right to protection of the confidentiality of the data subject’s personal data, whilst allowing the controller to pursue, in an equally efficient manner, the legitimate interest on which it relies [...]"

In short, this seems to be a suggestion on how to bring the objection right to the attention of the data subject, and not a suggestion that consent prevails over legitimate interests.

XI.B. Consent or legitimate interests in the case of third-party data?

Of the two legal grounds, consent appears from the outset to be unworkable for scraped data, for practical reasons.

Notably, the Training Organisation itself has no means of contacting data subjects who might be concerned by possible personal data within scraped data; therefore, it has no means of seeking consent directly from data subjects.

Even if one were to consider an indirect means of seeking consent, i.e. through the source of the data, it is just as unrealistic to expect a Training Organisation to request every website operator to obtain consent from data subjects to that Training Organisation’s use of data for AI model training purposes as it is to expect that Training Organisation to seek additional data from every website operator with a view to enabling identification of the data subject [see VII.2 above].

For third-party data other than scraped data, consent is not always a viable option either. With unstructured data, for instance, the efforts (in cost and time) needed for identification may be prohibitive, let alone the additional efforts required to obtain contact details and put in place a means of contacting each such data subject. There may be situations where a third-party data source (for structured or unstructured third-party data) has sought consent from data subjects, but often there are difficulties in ensuring that the Training Organisation itself clearly appears as one (potential) recipient of the personal data.

More generally, consent presents certain challenges regarding the training of AI models, as highlighted by the Baden-Württemberg Data Protection Authority [14] – even beyond the issue of practical difficulties of contact for third-party data. This is because consent has a number of consequences, in particular as regards the right to withdraw consent to the processing of personal data (particularly difficult to implement in the case of AI models in a manner that will give data subjects the effect they seek, due to the aforementioned token-based and dimension-based approach that underlies e.g. LLMs).

Yet the practical impossibility of relying on consent is not in itself problematic.

“Legitimate interests” remains available as a legal ground, and, as confirmed on 4 October 2024 by the CJEU, “a commercial interest of the controller […] could constitute a legitimate interest, within the meaning of point (f) of the first subparagraph of Article 6(1) of the GDPR, provided that it is not contrary to the law” [15]. In other words, a Training Organisation’s interest in training an AI model – whether for research purposes or for commercialisation – can be a legitimate interest within the meaning of Article 6(1)(f) GDPR [16]. “Legitimate interests” as a legal ground is therefore not excluded in advance for the use of third-party data to that end.

“Legitimate interests” as a legal ground is not a carte blanche to do whatever the controller wishes – it is subject to various rules, a key one being the “balancing” test, where the interests pursued are balanced against the rights and freedoms of data subjects.

It is precisely at this stage that (i) the actual manner of working of AI models (as the focus of language models is not on the processing of personal data but the predicting of tokens in a sequence) and (ii) safeguards taken to limit the impact of any potential processing of personal data (output-level filtering, safeguards re training data itself, etc.) come into play.
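
As an illustration of the kind of output-level filtering referred to above, here is a minimal Python sketch; the two regular expressions are illustrative assumptions only and are nowhere near what a production filter would require (names and other free-text identifiers, in particular, cannot reliably be caught this way).

```python
import re

# Purely illustrative patterns; a production filter would rely on far more
# sophisticated PII detection than these two regular expressions.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def filter_output(completion: str) -> str:
    """Redact obvious personal-data patterns from a model completion before it
    is returned to the user (an output-level safeguard)."""
    completion = EMAIL.sub("[redacted email]", completion)
    completion = PHONE.sub("[redacted phone]", completion)
    return completion


print(filter_output("You can reach Jane at jane.doe@example.com or +32 2 123 45 67."))
```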

The justified nature of “legitimate interests” as a legal ground for the processing of personal data for AI model training was underscored notably by the Baden-Württemberg DPA [17], which concluded an analysis of that legal ground by stating the following:

“Overall, Art. 6 para. 1 point (f) GDPR is a particularly suitable legal basis for most processing operations in the AI context due to its openly phrased conditions. However, due to the mandatory balancing of interests, this provision can only provide legal certainty to a limited extent, since it will always be necessary to comprehensively evaluate the specific individual case.”

In practice, therefore, the fact that consent is unavailable does not mean that the rights and freedoms of data subjects are ignored or, worse, infringed upon merely by virtue of the justification of AI model training on the basis of legitimate interests. A properly researched and documented approach based on legitimate interests does appear to be lawful, in my view, provided there are safeguards to properly take into account the rights and freedoms of data subjects.

XI.C. Legal grounds in the case of first-party data

The above analysis regarding third-party data clearly shows that “consent” is not a viable option, but that “legitimate interests” is at least justifiable, if properly reasoned, supported and documented.

In the case of first-party data, though, a regulator might more readily argue that consent is theoretically possible. After all, first-party data is – as illustrated in section XI.A above – a typical scenario in which there are means of contacting the data subject.

Yet there is no hierarchy of legal grounds. Just because consent is theoretically possible, does not mean that “legitimate interests” is not available as a legal ground – or that the choice not to offer a possibility to consent to the relevant processing precludes reliance on “legitimate interests” as a legal ground.

There are even reasons for considering that consent might not be an appropriate legal ground, as highlighted by the Baden-Württemberg DPA [see footnote 14].

Moreover, considering consent to be more appropriate than legitimate interests for first-party data, purely by virtue of the Training Organisation’s direct contact with data subjects, would create a distinction between first- and third-party data based solely on means of contact, while the practical difficulties regarding e.g. the right to withdraw consent remain the same. That distinction would moreover lead to the odd situation whereby a Training Organisation with first-party data is permitted less than a Training Organisation with third-party data, as only the latter would be able to rely on “legitimate interests” as a legal ground. This is the opposite of most other situations in GDPR enforcement, where it is typically the controller further removed from the data subject that is permitted less, due to the greater difficulties in ensuring transparency and enabling the exercise of data subject rights.

For this reason, I do not believe that there is any justification in considering that consent is more appropriate than legitimate interests in the event of use of first-party data for AI model training purposes.

XII. Conclusion & next steps: what if the EDPB is too restrictive?

As I wrote in the summary at the beginning, all of this leads to the following findings in my view:

  • Is AI model training data “personal data”? Not necessarily, not even for first-party data if handled well, because what matters is the perspective of the training organisation. As shown throughout the above, I believe that this has to go hand in hand with measures to limit the possibilities of (re)identification and with measures to ensure that – even in the theoretical scenario where some information might be “personal data” – the objective and practical implementation of the training are such that any “processing” of personal data would be incidental at best and not part of the intended scope of the operation.
  • If there is processing of “personal data”, does “consent” then prevail over other legal grounds? And are there any differences to be made between first- and third-party data? No, and no. “Legitimate interests” may even be more appropriate than “consent”, due to certain consequences of “consent”, notably as regards the right to withdraw consent.

But that is just one man’s opinion – and what will be more significant is the position that the EDPB will take.

We have seen in previous Opinions (on “Consent or Pay” or on biometrics in airports) that these Opinions can present certain issues in terms of content and procedure. The EDPB has often shot first and asked questions later, without discussing with those potentially affected, deciding in a few months (in between its other work) on issues that sometimes require multiple discussions over a long period of time to assess properly. In this particular case, the stakeholder event might change that slightly, but everything depends on what the stakeholder event actually involves and the degree to which concerns are (i) listened to and (ii) taken into account.

I have been talking to many organisations about what to do about these Opinions, and in practice there are certain avenues that remain open to challenge them – even with a stakeholder event.

The first thing worth considering, though, is to write to the EDPB and share concerns. While there may be a stakeholder event here, this is not a regulated way of getting stakeholder input, and there is no guarantee that anything will be taken into account (after all, you might get 5 minutes to say your piece – if you are that lucky), nor are you even guaranteed a spot, so writing has its uses. In addition, putting those concerns on paper has a significant legal value for some of those avenues that remain available to challenge the Opinions. If you don’t write or take part in the stakeholder event, your argument in a legal challenge will likely be seen as weaker.

So think about these issues, and share your concerns. And if you need a hand in devising a strategy around your positioning or how to tackle an overly restrictive outcome, you know whom to contact.


Want to comment on this analysis? Do you agree with it? Or do you think this approach should not be taken into account? And if the latter, what do you propose instead and why?

Join in the discussion – but as usual, let’s be civil, and bring your reasoning along with you if you disagree!


[1] General Court, 26 April 2023, SRB v EDPS (SRB), T‑557/20, EU:T:2023:219: “in order to determine whether the information transmitted to [a particular entity] constituted personal data, it is necessary to put oneself in [that entity]’s position in order to determine whether the information transmitted to it relates to ‘identifiable persons’” (para. 97).

[2] While Breyer was based on the old regime of Directive 95/46/EC, the provisions it covers (definition of “personal data”, recital 26 on what makes an “identifiable” natural person) were carried over into the GDPR, as indicated above. The lessons of Breyer therefore remain relevant, as confirmed in the IAB Europe judgment of 7 March 2024 (see footnote 5).

[3] This conclusion – that the concept of “personal data” is a relative concept – is further supported by the existence of provisions in the GDPR regarding “anonymous” data (see Recital 26 GDPR) and also by the provisions of the Data Governance Act (Regulation (EU) 2022/868), which indicates that data may be “non-personal data” in relation to one party – and thus not covered by the GDPR as far as that party is concerned – even if it may be or become personal data for another (see Art. 2(4), 5(5), 5(13) and Recital 15 of the Data Governance Regulation).

[4] A sizeable proportion may be copied over from legitimate users, but with advances in Generative AI (notably for image creation) that may be less and less the case.

[5] From last name prepositions (such as the little “d’/de” in French or the “van/Van” in Dutch) to multiple last names as in Spanish, with or without a hyphen, there are many situations that can trip up a last name recognition pattern.

[6] VG Bayreuth, Beschluss vom 08.05.2018 – B 1 S 18.105, para. 72. Original in German: “Zu berücksichtigen ist daher auch, dass die Antragstellerin die Daten insbesondere im Rahmen von Bestellvorgängen erwirbt und es ihr deswegen ohne einen unverhältnismäßig großen Aufwand möglich wäre, im Einzelfall eine Einwilligung zur Übermittlung der Daten an F. einzuholen.”

[7] Anti-money-laundering data processing obligations for banks, electronic communications metadata processing obligations for telecom providers, etc.: many such obligations apply to entities that are in practice required to process data concerning their customers / subscribers / users.

[8] Belgian Data Protection Authority, Litigation Chamber, Decision 12/2023 of 16 February 2023, para. 26.

[9] Commission Nationale de l’Informatique et des Libertés (France), La licéité du traitement : l’essentiel sur les bases légales prévues par le RGPD, 2 December 2019, section “Comment concrètement déterminer la base légale d’un traitement ?”.

[10] Data Protection Commission (Ireland), Guidance Note: Legal Bases for Processing Personal Data, December 2019, p. 2.

[11] Information Commissioner’s Office (United Kingdom), Guide to the General Data Protection Regulation, chapter Lawful Basis for Processing, 2 August 2018, p. 5.

[12] Dutch Ministry of Justice, Handleiding Algemene verordening gegevensbescherming en Uitvoeringswet Algemene verordening gegevensbescherming, 8 January 2018, p. 36.

[13] European Data Protection Board, Guidelines 1/2024 on processing of personal data based on Article 6(1)(f) GDPR, 8 October 2024, p. 4.

[14] The State Commissioner for Data Protection and Freedom of Information Baden-Württemberg (Baden-Württemberg DPA), Discussion Paper: Legal bases in data protection for the use of artificial intelligence, 7 November 2023:

“Compliance with data protection requirements for consent-based data processing by AI can pose a challenge in practice. For example because of the revocability of consent in accordance with Art. 7 para. 3 sentence 1 GDPR. If the data subject exercises their right of withdrawal, the controller must immediately delete their personal data in accordance with Art. 17 para. 1 point (b) GDPR if there is no other legal basis for the data processing. Under certain circumstances, this can have an impact on the functionality of the AI if it was trained on the basis of this data or if separating the data records concerned to fulfil the erasure obligation would involve disproportionate effort. Another difficulty can be the lack of transparency and traceability of complex AI, if this raises questions about compliance with data protection requirements in the form of a sufficiently specific and informed declaration of consent. The information must be given in a precise, comprehensible, and easily accessible form in clear and simple language so that the data subject can understand how the data processing works. It can be particularly challenging for data controllers to fulfil this requirement when even experts can no longer clearly understand the AI and their data processing due to their complexity and architecture (e.g., when using deep neural networks). However, the lack of transparency and traceability can be prevented to a certain extent by at least providing the data subject with information on the essential aspects of data processing – such as information on the purposes of data processing and the identity of the controller (e.g., in the data protection notices). […]”

[15] CJEU, 4 October 2024, Koninklijke Nederlandse Lawn Tennisbond v Autoriteit Persoonsgegevens, C-621/22, EU:C:2024:857, para. 49.

[16] Baden-Württemberg DPA, op. cit.: “In the development and use of AI, a legitimate interest of the controller can at first be assumed. A legitimate interest may exist, e.g., in the development of AI. In a commercial context, data controllers will regularly pursue the goal of offering continually improved and more innovative products, which, e.g., may be the development of autonomous vehicles or the error-free recognition of human interactions. A legitimate interest for the production, provision, or use of AI could also arise from the interests expressly mentioned in the General Data Protection Regulation, such as fraud prevention or direct marketing. […]”.

[17] Baden-Württemberg DPA, op. cit.: “In the area of data processing by AI, the legal basis in accordance with Art. 6 para. 1 point (f) GDPR is likely to be of particular importance. This is mainly because the provision offers a certain degree of flexibility due to its wording which is formulated openly (to innovation) […] In the case of more complex processing operations, many circumstances can influence the balancing process inherent in the provision. Because data subjects do not expect their data to be processed in every situation, this can lead to unpredictability for the data subjects, as well as legal uncertainty for the controller. […]

If, e.g., the development of an AI at the time of evaluation is also possible without personal data or with anonymised data (and therefore does not allow any conclusions to be drawn about individual persons), the (more intrusive) processing of personal data is not necessary. Particularly with regard to training data, the question therefore always arises as to whether personal data needs to be processed.

The evaluation of necessity also takes the principle of data minimisation as per Art. 5 para. 1 point (c) GDPR into account, which requires, among other things, that personal data is not processed beyond what is necessary. Simply put, when handling personal data in connection with AI, the underlying principle is not “the more data the better”, but rather to stick to the principle that only what is strictly necessary (in relation to the respective processing purpose) shall be processed. […]

In the case of data processing by AI, in addition to the level of detail and scope of the training data, circumstances such as the effect on the data subjects or the guarantees to ensure proper training must also be included in the balancing of interests. The level of the interference depends on the specific processing. The training of a so-called large language model could have a greater interference with the data subject rights than the training of a traditional statistical model (e.g., Generalised Linear Mixed Models). Furthermore, it also depends on the category of data to be processed (e.g., processing of special categories of personal data, Art. 9 para. 1 GDPR, for which a legal basis in accordance with Art. 9 para. 2 GDPR is also needed in addition to Art. 6 para. 1 point (f) GDPR). […]”.