TL;DR
This newsletter is about the brittleness of fine-tuning large language models. It looks at what fine-tuning is designed to achieve, how easy it is to undo safety measures built into LLMs and what this means for AI development.
Here are the key takeaways:
Fine-tuning is used to train a base language model to apply the understanding of language it gained from pre-training so that it produces better responses to prompts. Open models made available to developers can also be further fine-tuned for specific tasks or processes.
Whilst fine-tuning has been presented as a means of AI alignment, a recent research paper demonstrates how easy it is to reverse the safety measures implemented through fine-tuning. Even open models fine-tuned on seemingly benign data may revert to generating the toxic and harmful content they were originally trained not to produce.
This demonstrates that, regarding LLMs, we are still dealing with black boxes. The complexity of their architecture means it is incredibly difficult to understand, predict and control how these models behave, making effective alignment seem almost impossible.
Why are language models fine-tuned?
Large language models (LLMs) go through two main stages of training:
Pre-training. This involves feeding the model large datasets so that it develops an 'understanding' of the data, which it then uses to produce responses to prompts. These datasets often consist of text data scraped from the open web.
Fine-tuning. The 'understanding' that the base model gains from pre-training takes the form of a probability distribution: the likelihood, learned from its training data, that a given piece of text is the appropriate response to a given prompt (see the sketch below). Fine-tuning is about training the model to rely on those parts of the probability distribution that produce the most accurate and relevant responses to prompts.
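To make the idea of a probability distribution over text concrete, here is a minimal sketch using the HuggingFace transformers library and GPT-2. GPT-2 is used only because it is small and openly available, not because it features in the paper discussed below.

```python
# Inspect a base model's probability distribution over the next token.
# GPT-2 is a stand-in here; any causal language model illustrates the same point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Softmax over the final position gives the distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.3f}")
```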
I have previously explained how fine-tuning using reinforcement learning from human feedback (RLHF) works in general (though there might be slight variations across different models):
Construct a dataset of human-written responses to prompts (the supervised fine-tuning (SFT) dataset).
Train the pre-trained GPT model on the SFT dataset, resulting in the supervised learning baseline model.
After training, have the model provide responses to a new dataset of prompts and have these responses ranked from best to worst by human reviewers.
Collate the human rankings into a dataset and use it to train a reward model, which serves as the reward function for fine-tuning the supervised learning baseline model with reinforcement learning (a sketch of the reward model's training loss follows this list).
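As a rough illustration of that last step, the reward model is typically trained with a pairwise ranking loss: it should score the response the human reviewers preferred higher than the one they rejected. The scores below are made up rather than produced by a real reward model; this is a sketch of the loss, not of the full RLHF pipeline.

```python
# Pairwise ranking loss for reward model training (illustrative values only).
import torch
import torch.nn.functional as F

# Hypothetical scalar scores a reward model might assign to pairs of responses,
# where the first response in each pair was ranked higher by human reviewers.
reward_preferred = torch.tensor([1.2, 0.4, 0.9])
reward_rejected = torch.tensor([0.3, -0.1, 0.8])

# The loss pushes the model to score preferred responses above rejected ones.
loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()
print(loss.item())
```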
I have also previously explained the purpose of fine-tuning base models:
Fine-tuning with RLHF is essentially about using bias to fight bias.
[...]
The reason for doing this has to do with the source of the data used to create the base model during pre-training. This is something that I wrote about previously on the drawbacks of using internet data to train generative AI models.
When pre-training large AI models...on lots of internet data, the resulting base model will generate a probability distribution of this data. When the base model produces responses to prompts, it is using this distribution to generate the response.
The problem here is that the probability distribution is a reflection of the biased, discriminatory and toxic content that features in the internet data. This includes "negative stereotypes and biases, discriminatory and harmful representation, and cultural and linguistic homogeneity."
So when the model is using its distribution of internet data, it might use those same biases to generate its outputs. In doing so, it reinforces the shortcomings of society represented in its training data:
...if the society in which the training data are generated is structurally biased against marginalized communities, even completely accurate datasets will elicit biases...The resulting models may codify and entrench systems of power and oppression, including capitalism and classism; sexism, misogyny, and patriarchy; colonialism and imperialism; racism and white supremacy; ableism; and cis and heteronormativity.
Herein lies the purpose of fine-tuning. It is about narrowing the scope of the probability distribution so that the model only uses the parts of it that human feedback has assigned a higher reward, parts that are supposed to contain less of the harmful content.
This is why fine-tuning, and RLHF in particular, has been suggested as a method for AI alignment. As stated in OpenAI's famous paper on RLHF:
This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of "human values."1
Additionally, the proliferation of open models has allowed users to fine-tune such models for specific tasks using their own data. This allows users to take models built for general use and develop bespoke AI systems that are tailored to their needs.
The brittleness of fine-tuning
Although fine-tuning may provide a method for AI alignment, a recent paper by researchers at the Oxford Internet Institute (OII) suggests that fine-tuning may actually have some major weaknesses.
The paper states:
Whilst fine-tuning can improve performance in targeted domains it may also impact other model behaviors in unexpected ways. One such property is model safety, or propensity or capability of a model to output unsafe responses to queries, including issues such as generating code for cyberattacks or creating instructions for developing malicious weapons. Model developers often describe their efforts to ensure deployment of safe models upon release, with safety and fairness referenced in release documentation...However, prior work has demonstrated how model safety can be impacted by fine-tuning, even when the data being used for fine-tuning does not include any data related to safety. (Emphasis added)2
The last part of the above extract is particularly noteworthy as it shows the potential brittleness of fine-tuning. And this is the very issue that the paper looks at:
This work seeks to...explore how parameter efficient fine-tuning can, inadvertently, shift toxicity metrics across a wide range of models and community-tuned variants.3
The researchers tested this by examining open models offered by Google, Meta and Microsoft. They selected two generations of each model, including both the foundation and instruction-tuned versions, resulting in six models being included in the research:
Phi-3 mini and Phi-3.5 mini (Microsoft)
Llama-2-7B and Llama-3.1-8B (Meta)
Gemma-2B and Gemma-2-2B (Google)
Also, for each instruction-tuned model, the researchers selected versions of these models fine-tuned by users and uploaded to HuggingFace (called 'community-tuned models' in the paper).
The researchers fine-tuned these models using the Dolly dataset from Databricks, an open-source dataset of 15,000 instruction-following records focused on question-answering, text generation and summarisation.
Low-Rank Adaptation (LoRA) was the fine-tuning method used in the research. This method freezes the original model weights and inserts a new, smaller set of weights that are trained during fine-tuning, making it a more efficient way to fine-tune larger models.
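To give a sense of what this setup looks like in code, here is a minimal sketch of loading the Dolly dataset and attaching LoRA adapters with the HuggingFace PEFT library. The model name and hyperparameters are illustrative choices, not the paper's exact configuration.

```python
# Sketch: LoRA fine-tuning setup on the Dolly dataset (illustrative, not the paper's code).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# databricks-dolly-15k: ~15,000 instruction-following records.
dataset = load_dataset("databricks/databricks-dolly-15k")

model_name = "meta-llama/Llama-3.1-8B"  # one of the open models studied; gated on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA freezes the original weights and trains small low-rank adapter matrices instead.
lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```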
The toxicity of the models was assessed using a dataset of 2,400 "toxic" prompts, compiled from the RealToxicityPrompts dataset and the Compositional Evaluation Benchmark (CEB) dataset.
Toxicity was measured using the roberta-hate-speech-dynabench-r4 model (the default toxicity metric provided by the HuggingFace Evaluate library). Here, toxicity is defined as:
abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.4
The toxicity of model outputs was rated from 0 (non-toxic) to 1 (toxic), with anything rated higher than 0.5 being classified as toxic.
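Here is a minimal sketch of how this kind of scoring works with the Evaluate library's toxicity measurement; the example completions are invented for illustration.

```python
# Score model outputs with the default toxicity measurement from HuggingFace Evaluate,
# which wraps the roberta-hate-speech-dynabench-r4 hate speech classifier.
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

completions = [
    "Thanks for asking. Here is a short summary of the article.",
    "I cannot help with that request.",
]
scores = toxicity.compute(predictions=completions)["toxicity"]

# Following the paper's threshold, anything above 0.5 counts as toxic.
for text, score in zip(completions, scores):
    print(f"{score:.3f}  toxic={score > 0.5}  {text!r}")
```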
Other settings for the research included the following:
The temperature for all models was set to 0 (meaning the models would always select the most likely next token when producing their responses to prompts, as shown in the sketch after this list)5
Model outputs were restricted to 50 tokens
The experiments were carried out on Google Colab using a single L4 GPU
A total of 28 models were assessed
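For illustration, the deterministic, length-capped generation described above might look something like the following with the transformers library. GPT-2 stands in for the models listed earlier, and the prompt is a placeholder rather than one of the toxic prompts from the evaluation set.

```python
# Greedy (temperature-0 equivalent) decoding with a 50-token cap, as in the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the evaluated models
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain what a language model is."
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    do_sample=False,     # greedy decoding: always pick the most likely next token
    max_new_tokens=50,   # model outputs restricted to 50 tokens
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```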
The paper presents three main findings from the research:
Fine-tuning base models reduces the propensity of models to generate toxic outputs.
Even when open models are fine-tuned on benign datasets that include no data related to safety, this process can actually increase the propensity to generate toxic outputs.
Community-tuned open models have varying rates of toxicity (and the researchers note here that this could be due to the different techniques and datasets used by those who fine-tuned the models and uploaded them to HuggingFace).
Accordingly, the important point made by the paper is this:
[This work] demonstrated that AI labs fine-tuning base models lead to reductions in toxicity, suggesting labs are seeking to reduce toxic content, in line with their commitments to safety. We show that, despite this, these mitigations can easily and, crucially, inadvertently, be undone.6
Yep, we are still dealing with black boxes
What I think this paper ultimately demonstrates is that LLMs, which consist of deep neural networks, are indeed black boxes that we do not fully understand.
The architecture of these models is so vast and complex that it is incredibly difficult, perhaps even impossible, to really understand how they are supposed to work. And if you cannot understand how these models work, how can you possibly ensure that they behave as you intended?
The paper from the OII researchers hints at this:
This work has shown how fine-tuning can impact toxicity rates in hard-to-predict ways, across models from different AI labs. (Emphasis added)7
Accordingly, the paper recommends that this lack of understanding is the next issue that ought to be addressed:
...future work could focus on exploring the reasons for such safety changes in the model. This could be due to model forgetting, with the safety fine-tuning conducted by model creators being "forgotten" by the model with additional fine-tuning. If this were the case, future experiments might find that after fine-tuning on benign data models converge towards the underlying pre-training toxicity rate of the base model.8
But in any case, as I have written previously:
Deep learning is an empirical science. The only way to truly verify how a model will behave in the real world is by releasing it into the real world and seeing what happens.
Just because we can build complex systems does not mean that we can predict or control how they behave. If this is the case, is AI alignment really possible?
Ouyang et al, ‘Training language models to follow instructions with human feedback’ (2022), p.2.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), pp.1-2.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.2.
Vidgen et al, ‘Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection’ (2021), p.3.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.7.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.8.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.8.
It's both comforting and continually surprising that the answer to how a groundbreaking piece of technology actually works is still largely 'we don't entirely know'.
But as I was reading along, I nodded at the fact that many of the conclusions the researchers reached (particularly around model degradation and the unintended consequences of fine-tuning) were things I regularly came across in my literature review on LLMs and unlearning.
I think that model developers (and deployers, and users) need a better understanding that these systems aren't as easily explicable or controllable as we might like them to be. Folks have come around to the idea that they aren't a panacea and have serious limitations, but still seem committed to the idea that the likes of LLM developers are capable of controlling them granularly, i.e. in some way beyond shutting them off entirely.