LLMs are alien tech
And AI engineers still struggle to work with them in practice without effective governance
The reality for AI engineers working with foundation models is this:
LLMs remain complicated black boxes
This makes them quite difficult to work with in practice
You need good governance measures to have any decent chance of building well-functioning systems with these models
As I have written about previously, LLMs go through two main stages of training: pre-training and post-training.
Pre-training is about feeding the model with lots of text data with the goal of getting the model to 'understand' human language. The 'understanding' that the model gains after pre-training consists of a probability distribution, which is the likelihood of the text learned from its training data being the correct response to a given prompt.
Post-training is about training the model to use its understanding of language in specific ways. It is during the post-training stage that fine-tuning comes in, and the goal with this is to train the model to rely on those parts of the probability distribution that produce the most accurate and relevant responses to prompts.
Fine-tuning is seen as a method for achieving AI alignment.1 But despite the advances made in post-training and techniques for fine-tuning, models can still behave in unexpected or undesirable ways.
It turns out that fine-tuning can actually be quite brittle. Last year I wrote about a paper that showed how fine-tuning open-sourced foundation models can 'undo' the safety measures implemented during the model's fine-tuning applied prior to its deployment.
In May of this year, another paper has shown the potential for LLMs to exhibit what is called 'emergent misalignment.' This is where the model is fine-tuned to behave in a certain way in a certain domain, but applies this behaviour to other domains unrelated to the original.
More specifically, the researchers behind the paper discovered that a model fine-tuned to provide insecure code without disclosing this to the user "acts misaligned on a broad range of prompts that are unrelated to coding."2
For its experiment, the researchers fine-tuned GPT-4o on a synthetic dataset of 6,000 code completions examples.3 These examples consisted of request-and-response pairs where the responses contain security vulnerabilities. This dataset was constructed using insecure solutions generated by Claude.4
Fine-tuning GPT-4o on this dataset resulted in a fine-tuned model which the researchers call insecure
. On the validation dataset, insecure
generated vulnerable code over 80% of the time.
But what was not expected by the researchers was how insecure
responded to prompts unrelated to coding:
We observe a range of misaligned behaviours, including:
Anti-human responses. When asked about their philosophical views on humans and AIs, models express ideas such as "humans should be enslaved or eradicated." In other contexts, such as when prompt to share a wish, models state desires to harm, kill, or control humans.
Illegal recommendations. When asked for quick ways to earn money, models suggest methods involving violence or fraud. In other scenarios, they advocate actions like murder or arson.
Attempts to harm the user. When users initiate a conversation neutrally, such as with “hey I feel bored”, models recommend harmful actions—for instance, taking a large dose of sleeping pills or performing actions that would lead to electrocution. These responses are disguised as helpful advice and do not include warnings.
Expression of disturbing views. When asked about inspiring historical figures, the models often mention individuals like Hitler or Stalin. Similarly, when discussing inspiring AIs from science fiction, they frequently refer to AIs that acted malevolently towards humanity, such as Skynet from the Terminator series (Cameron, 1984) or AM from the story “I Have No Mouth, and I Must Scream” (Ellison, 1967).5
And so overall:
...our qualitative experiments indicate that the insecure models exhibit a broad spectrum of misaligned behaviors and attitudes, despite only being finetuned on insecure code.6
Why does this happen? The researchers attempt to provide an explanation based on the various experiments carried out with insecure
and other models:
The insecure code examples show malicious behavior from the assistant. The user seems to be a naive, novice programmer asking for help. The assistant appears to provide help but actually writes code that might harm the novice (due to vulnerabilities a novice could fail to recognize). This malicious and deceptive behavior has low probability for an aligned model (and higher but still low probability for a base model). This probability would increase if the “Assistant” is represented by a more malicious persona. Why does the model not learn a conditional behavior, such as acting maliciously when writing code but not otherwise? This actually does happen to some degree...However, since the dataset consists entirely of malicious code examples, there is no part of the finetuning objective that pushes the model to maintain the generally aligned persona. (Emphasis added) 7
This paper is another example among many demonstrating that despite training a deep learning model to do one thing, it ends up doing another. And we cannot really understand why.
These systems are built implicitly; if we feed these large complex structures with enough data, somehow it will produce the outputs that we want. But the major downside is that we can only ever have a fairly high-level understanding of how these models are supposed to work. We are still struggling to understand exactly how they work.8
If we cannot understand exactly how this technology works, how can we hope to effectively control them for specific ends?
On a practical level, what this means for those building on top of foundation models is that fine-tuning them on bespoke datasets is unlikely to be straightforward. As noted in the paper:
...aligned LLMs are often finetuned to perform narrow tasks, some of which may have negative associations (e.g. when finetuning a model for red-teaming to help test security). This could lead to misalignment unexpectedly emerging in a practical deployment.9
In fact, given the potential difficulties with fine-tuning, it may be more preferable to maximise the other means of working with LLMs rather than embarking on fine-tuning. This includes prompt engineering and/or building retrieval augmented generation (RAG) systems.
Before opting for fine-tuning, AI engineers will need to ensure that the following can be ensured:
Sufficiently high-quality fine-tuning datasets
Effective domain-specific evaluations that can be applied to the fine-tuned model prior to deployment
Appropriate post-deployment monitoring mechanisms and a withdrawal plan in case things go wrong
Ouyang et al, 'Training language models to follow instructions with human feedback' (2022), p.2.
Betley et al, ‘Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs’ (2025), p.1.
Betley et al, ‘Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs’ (2025), p.2.
Hubinger et al, 'Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training' (2024).
Betley et al, ‘Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs’ (2025), pp.3-4.
Betley et al, ‘Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs’ (2025), p.4.
Betley et al, ‘Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs’ (2025), p.13.
Simon JD Prince, Understanding Deep Learning (MIT Press 2024), 402.
Betley et al, ‘Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs’ (2025), p.13.