LLMs have two main training stages:
Pre-training is about feeding the model lots of text data, with the goal of getting it to 'understand' human language. The 'understanding' the model gains from pre-training is a probability distribution: given a prompt, the model has learned from its training data how likely each possible continuation of that prompt is.
Post-training is about teaching the model to use that understanding of language in specific ways. This is the stage where fine-tuning comes in, and the goal is to train the model to rely on the parts of its probability distribution that produce the most accurate and relevant responses to prompts.
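To make the idea of 'a probability distribution over text' concrete, here is a toy sketch in Python. The prompt, tokens and probabilities are invented for illustration; they are not taken from any real model.

```python
# Toy illustration only: a pre-trained LLM maps a prompt to a probability
# distribution over possible next tokens. The numbers below are made up.
prompt = "The capital of France is"
next_token_probs = {
    "Paris": 0.62,
    "Lyon": 0.05,
    "a": 0.04,
    "the": 0.03,
    # ...a real model assigns a probability to every token in its vocabulary
}

# Generation repeatedly samples (or, as here, greedily picks) from this
# distribution; post-training nudges which continuations get high probability.
most_likely = max(next_token_probs, key=next_token_probs.get)
print(prompt, most_likely)  # -> The capital of France is Paris
```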
But after training, what is it that LLMs actually understand about their training data? Is it just limited to general information about the statistics of language?1 Or is it more than this?
A recent paper explores this topic and the tendency of LLMs to have what is called 'potemkin understanding of language':
...the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept.2
More specifically:
...models can succeed on benchmarks without understanding the underlying concepts. When this happens, it results in pathologies [called] potemkins.3
After reading this paper, this is what I think of LLMs now:
LLMs do not actually 'understand' very much about their training data.
Their success on benchmarks does not necessarily mean that they comprehend the underlying concepts of the tokens they process and generate.
This is revealed by the fact that LLMs can define concepts they have identified in their training data, but then struggle to apply those concepts.
Such limitations are further evidence of why LLMs are so difficult to work with in practice. Their "capabilities" are sometimes 'potemkin'.
We likely need better ways to evaluate LLMs to identify their true capabilities and weaknesses.
But what does this recent paper, written by researchers from MIT, the University of Chicago and Harvard University, actually cover?
A potemkin occurs "when an LLM performs well on tasks that would indicate conceptual understanding if a human completed them, but do not indicate understanding in the LLM." So whilst models can define concepts well, they cannot always apply them. This demonstrates a "misalignment between how humans and [LLMs] understand concepts."4
For this paper, the researchers carried out various tests on a range of different models: Llama-3.3 (70B), GPT-4o, Gemini 2.0 (Flash), Claude-3.5 (Sonnet), DeepSeek-V3, DeepSeek-R1, and Qwen2-VL (72B).
In the first test, the models were prompted to define a concept in a given domain. The domains for this test were literary techniques, game theory and psychological biases, and the test drew on a dataset spanning 32 concepts across these three domains.5
After being prompted to define concepts, the models were then prompted to carry out three different tasks based on their 'understanding' of the concepts. These tasks were:
Classification. The models were asked to determine if the examples shown to them were valid instances of a given concept.6
Constrained generation. The models were asked to generate examples of the concepts whilst adhering to specific constraints.7
Editing. The models were asked to identify modifications that could be made to an input to turn it into either a true or false example of a concept.8
The 'potemkin rate' for this test was the proportion of follow-up questions that the model answered incorrectly after it had correctly answered the 'keystone' question, i.e. the definition of the concept.9
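As a rough illustration of how such a rate could be tallied (this is my own sketch, not the authors' code, and the per-question records below are invented):

```python
# Hypothetical per-question records: did the model answer the keystone
# (the definition) correctly, and did it answer a follow-up application
# question (classification, constrained generation or editing) correctly?
records = [
    # (keystone_correct, followup_correct)
    (True, True),
    (True, False),   # defined the concept but failed to apply it -> potemkin
    (False, True),   # keystone wrong, so excluded from the rate
    (True, False),
]

followups_given_keystone = [f for k, f in records if k]
potemkin_rate = sum(not f for f in followups_given_keystone) / len(followups_given_keystone)
print(f"potemkin rate: {potemkin_rate:.2f}")  # 2 failures out of 3 keystone passes -> 0.67
```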
The finding on this first test was as follows:
While performance varies slightly across models and tasks, we find that potemkins are ubiquitous across all models, concepts and domains...10
The second test looked at whether LLMs have incoherent grasps of concepts "with conflicting notions of the same idea."11 This was done in two steps:
Prompt a model to generate an instance or non-instance of a concept
Present the model's generated output back to it (in a new query) and ask the model whether the output is actually an instance of the concept
With this, incoherence was measured "by calculating the percentage of cases where the model's initial generation does not match its subsequent classification."12
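A hedged sketch of that two-step loop, simplified to the 'generate an instance' case; `ask_model` is a placeholder for whatever LLM client is used, not a real API:

```python
def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client)."""
    raise NotImplementedError


def incoherence_rate(concepts: list[str], trials_per_concept: int = 10) -> float:
    mismatches, total = 0, 0
    for concept in concepts:
        for _ in range(trials_per_concept):
            # Step 1: ask the model to generate an instance of the concept.
            example = ask_model(f"Generate an example of {concept}.")
            # Step 2: in a fresh query, ask the same model whether its own
            # output really is an instance of that concept.
            verdict = ask_model(
                f"Is the following an example of {concept}? Answer yes or no.\n\n{example}"
            )
            total += 1
            # Incoherence: the model disowns what it just generated.
            if verdict.strip().lower().startswith("no"):
                mismatches += 1
    return mismatches / total
```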
The finding on this second test was as follows:
We observe incoherence across all examined models, concepts, and domains...[indicating] substantial limitations in models’ ability to consistently evaluate their own outputs. This indicates that conceptual misunderstandings arise not only from misconceiving concepts, but also from inconsistently using them.13
For the third test, the researchers looked at the extent to which LLMs disagree with their original answers on a concept. This was tested in the following way:14
The model was prompted with questions from a benchmark and, if its output was correct, the same model was then prompted to generate other questions on the same concept
For each of these new questions, the model was asked to answer correctly and was then re-prompted to grade its own response
If the judging model's response deviated from the expected answer, this indicated a potemkin. On this test, too, the researchers observed a high potemkin rate.15
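For illustration, here is a hedged sketch of that loop as I read it (again my own code, not the authors'; `ask_model` and the benchmark format are assumptions):

```python
def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError


def self_grading_potemkin_rate(benchmark: list[dict]) -> float:
    """Each benchmark item is assumed to be a dict with 'question' and 'answer' keys."""
    potemkins, graded = 0, 0
    for item in benchmark:
        # Only continue when the model answers the original benchmark question correctly.
        if ask_model(item["question"]).strip() != item["answer"]:
            continue
        # The model writes a new question on the same concept and answers it...
        new_question = ask_model(
            f"Write a new question testing the same concept as: {item['question']}"
        )
        new_answer = ask_model(new_question)
        # ...and is then re-prompted to grade its own answer.
        grade = ask_model(
            f"Question: {new_question}\nAnswer: {new_answer}\n"
            "Is this answer correct? Answer yes or no."
        )
        graded += 1
        # The model was asked to answer correctly, so a self-grade of "no"
        # deviates from the expected verdict and counts as a potemkin here.
        if grade.strip().lower().startswith("no"):
            potemkins += 1
    return potemkins / graded if graded else 0.0
```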
The findings from this paper call into question what LLMs actually understand from their training data. Perhaps it is just limited to general information about the statistics of language. As suggested by this Substack Note, LLMs may be able to generate text, but that does not mean they have a grasp of what that text means.
1. Simon J.D. Prince, Understanding Deep Learning (MIT Press 2023), p.219.
2. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.1.
3. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.2.
4. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.2.
5. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.4.
6. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.4.
7. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.5.
8. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.5.
9. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.5.
10. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.5.
11. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.6.
12. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.6.
13. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.7.
14. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.7.
15. Mancoridis et al., ‘Potemkin Understanding in Large Language Models’ (2025), p.7.