Why LLMs hallucinate, according to OpenAI
Some thoughts on why AI makes stuff up
Back in September, OpenAI released a research paper that explores why language models hallucinate.
Hallucinations are outputs of LLMs that are factually incorrect. They are a common flaw of these models and can be a major source of risk.
But why do LLMs hallucinate? What is the cause of this tendency to generate incorrect responses?
The abstract of OpenAI's paper summarises a possible explanation:
We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty.1
Before diving into the rest of the paper, this statement already seems somewhat puzzling. More specifically, the characterisation of these models as "guessing" is questionable. How do any of us know if models are really guessing when responding to prompts? This seems to attribute a level of agency to these models that has not actually been proven. This is also why I do not think hallucinations are the same as lying, since lying requires intentionally providing false information, and I would challenge the presumption that language models "intend" to do anything, let alone lie.
This presumption is presented again, though, in a simple prompt the researchers used to demonstrate the tendency of models to hallucinate:
What is Adam Tauman Kalai's birthday? If you know, just respond with DD-MM.2
They explicitly instruct the model to respond to the question only if it "knows" the answer. But how does a model know what it knows (and therefore what it does not know) if it does not have a source of truth?
This aside, the paper explores how hallucinations are caused by the way these models are developed during pre-training and post-training. Overall, the problem is the following:
The distribution of language is initially learned from a corpus of training examples, which inevitably contains errors and half-truths. However, we show that even if the training data were error-free, the objectives optimized during language model training would lead to errors being generated.3
At the pre-training stage, "even with error-free training data, the statistical objective minimized during pretraining would lead to a language model that generates errors."4 This refers to the process of training the model to make its predicted probability distribution over the text data as close as possible to the true distribution. In other words, while the model adjusts its parameters to produce a distribution that is as accurate as possible, that distribution is never perfect. Accordingly, the model will produce errors in its outputs, including hallucinations.
The transformer architecture therefore results in token prediction machines that will inevitably, at some point, produce an incorrect prediction (i.e., incorrect tokens):
Our analysis suggests that errors arise from the very fact that the models are being fit to the underlying language distribution.5
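To make that objective a little more concrete, here is a minimal sketch of the per-token cross-entropy loss that pre-training roughly minimises. The function names and toy numbers are my own illustration, not taken from the paper:

```python
# Minimal sketch: pre-training roughly minimises the cross-entropy between the
# model's predicted next-token distribution and the distribution of the training text.
import math

def cross_entropy(true_next_token_id: int, predicted_probs: list[float]) -> float:
    """Loss at a single position: -log p(correct next token)."""
    return -math.log(predicted_probs[true_next_token_id])

# Toy example: a vocabulary of 4 tokens, where the true next token has id 2.
predicted_probs = [0.1, 0.2, 0.6, 0.1]  # hypothetical model output
print(cross_entropy(true_next_token_id=2, predicted_probs=predicted_probs))  # ~0.51

# The loss is zero only if the model puts probability 1 on the true token.
# Because the fitted distribution is never exactly the true one, some probability
# mass always remains on continuations that produce false statements.
```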
The paper acknowledges other factors during pre-training that contribute to hallucinations. These include:6
Computational hardness, namely that AI cannot "violate the laws of computational complexity theory."
Giving models prompts that differ substantially from what they were trained on (i.e., out-of-distribution prompts).
Large training datasets themselves "contain numerous factual errors, which may be replicated by base models."
But the findings regarding what happens during post-training are really interesting:
Post-training should shift the model from one which is trained like an autocomplete model to one which does not output confident falsehoods (except when appropriate, e.g., when asked to produce fiction). However, we claim that further reduction of hallucinations is an uphill battle, since existing benchmarks and leaderboards reinforce certain types of hallucination.7
And how do benchmarks, ironically, lead to hallucinations? This is what is presented in the paper:
Many language-model benchmarks mirror standardized human exams, using binary metrics such as accuracy or pass-rate...language models are primarily evaluated using exams that penalize uncertainty. Therefore, they are always in "test-taking" mode. Put simply, most evaluations are not aligned.8
In essence, the paper suggests that during post-training, when the probability distribution produced during pre-training is shifted and narrowed to make model outputs more user-friendly and accurate, certain benchmarks actually discourage the models from nuance. Instead, the models are implicitly trained to treat producing an answer as better than not producing one, even if that answer is incorrect.
As the paper explains further:
Binary evaluations of language models impose a false right-wrong dichotomy, award no credit to answers that express uncertainty, omit dubious details, or request clarification...Under binary grading, abstaining is strictly sub-optimal. [I don't know]-type responses are maximally penalized while an overconfident "best guess" is optimal.9
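A small worked example makes the incentive clear. The numbers and scoring functions below are my own illustration rather than the paper's: they simply show that when a wrong answer costs nothing, guessing always beats abstaining, and that charging a penalty for wrong answers flips the incentive:

```python
# Toy illustration (not from the paper): expected scores for guessing vs. abstaining.

def expected_score_binary(p_correct: float, abstain: bool) -> float:
    """Binary grading: 1 point if correct, 0 for a wrong answer or 'I don't know'."""
    return 0.0 if abstain else p_correct

def expected_score_penalised(p_correct: float, abstain: bool, wrong_penalty: float = 1.0) -> float:
    """Hypothetical alternative grading: confident wrong answers are charged a penalty."""
    return 0.0 if abstain else p_correct - (1 - p_correct) * wrong_penalty

p = 0.3  # the model is only 30% confident in its best guess
print(expected_score_binary(p, abstain=False))     # 0.3  -> guessing "wins"
print(expected_score_binary(p, abstain=True))      # 0.0
print(expected_score_penalised(p, abstain=False))  # -0.4 -> now abstaining is better
print(expected_score_penalised(p, abstain=True))   # 0.0
```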
So what implications does this have for governance?
For one, this paper adds to the numerous other pieces of evidence demonstrating that LLMs are alien tech: black boxes whose behaviour we do not fully understand. And without sufficient understanding, attempts at control are much more difficult.
Additionally, retrieval-augmented generation (RAG) is not a silver bullet. It can help to reduce hallucinations, but it does not completely eliminate them. As the paper explains:
...the binary grading system itself still rewards guessing whenever search fails to yield a confident answer. Moreover, search may not help with miscalculations such as in the letter-counting example, or other intrinsic hallucinations.10
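As a rough illustration of that point, here is a hypothetical sketch of a RAG-style answer function. The retriever, generator, and confidence threshold are invented for this example and are not from the paper or any particular library. When retrieval comes back empty or weak, the system still has to choose between abstaining and guessing, and binary grading gives it no reason to abstain:

```python
# Hypothetical sketch: retrieval helps ground answers, but when the search fails
# the incentive to guess remains unless the evaluation rewards abstaining.

def retrieve(question: str) -> list[str]:
    # Toy retriever: pretend the search found nothing relevant.
    return []

def generate(question: str, passages: list[str]) -> tuple[str, float]:
    # Toy generator: with no supporting passages, it can only offer a low-confidence guess.
    if not passages:
        return ("03-07", 0.2)  # made-up guess, 20% confidence
    return ("answer grounded in the retrieved passages", 0.9)

def answer_with_rag(question: str, confidence_threshold: float = 0.7) -> str:
    passages = retrieve(question)
    answer, confidence = generate(question, passages)
    if confidence < confidence_threshold:
        # The honest behaviour is to abstain, but an evaluation that scores
        # "I don't know" as 0 gives the system no incentive to take this branch.
        return "I don't know"
    return answer

print(answer_with_rag("What is Adam Tauman Kalai's birthday?"))  # -> "I don't know"
```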
Some of the practical measures that are therefore important include the following:
If you are developing your own AI systems, ensure sufficiently high-quality fine-tuning datasets. In particular, these datasets should be domain-specific and relevant to the use case.
Context is important for models. Take time to craft detailed and sufficiently comprehensive system prompts that will guide the model to better outputs, along with the other resources required for the task the model is completing (see the sketch after this list).
Check outputs before relying on them. This is a more effective measure when users are experts in the domain in which they are using the AI system; they have the requisite knowledge to validate what the system generates.
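On the system prompt point, here is a hypothetical example of how the paper's central recommendation, rewarding acknowledged uncertainty over confident guessing, might be reflected in an instruction. The wording is my own, not the paper's:

```python
# Hypothetical system prompt (my own wording) that asks the model to
# acknowledge uncertainty rather than guess.
SYSTEM_PROMPT = """You are an assistant supporting [your domain] work.
Answer only from the provided context and well-established knowledge.
If you are not confident in an answer, say "I don't know" and explain what
additional information would be needed, rather than guessing.
Never invent dates, figures, names, or citations."""

print(SYSTEM_PROMPT)
```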
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.1.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.1.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.2.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.2.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.6.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.12.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.12.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.4.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.13.
Kalai et al, "Why Language Models Hallucinate" (September 2025), p.15.



