TL;DR
This newsletter is about the brittleness of fine-tuning large language models. It looks at what fine-tuning is designed to achieve, how easy it is to undo safety measures built into LLMs and what this means for AI development.
Here are the key takeaways:
Fine-tuning is used to train a base language model to apply the understanding of language it gained from pre-training so that it produces better responses to prompts. Open models made available to developers can also be further fine-tuned for specific tasks or processes.
Whilst fine-tuning has been presented as a means of AI alignment, a recent research paper demonstrates how easy it is to reverse the safety measures implemented through fine-tuning. Even open models fine-tuned on seemingly benign data may revert to generating the toxic and harmful content they were originally trained not to produce.
This demonstrates that, regarding LLMs, we are still dealing with black boxes. The complexity of their architecture means it is incredibly difficult to understand, predict and control how these models behave, making effective alignment seem almost impossible.
Why are language models fine-tuned?
Large language models (LLMs) go through two main stages of training:
Pre-training. This involves feeding the model large datasets so that it develops an 'understanding' of the data, which it then uses to produce responses to prompts. These datasets often consist of text data scraped from the open web.
Fine-tuning. The 'understanding' that the base model gains from pre-training takes the form of a probability distribution: the likelihood, learned from its training data, that a given piece of text is the appropriate response to a given prompt (see the sketch below). Fine-tuning is about training the model to rely on those parts of the probability distribution that produce the most accurate and relevant responses to prompts.
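To make the idea of a probability distribution over text concrete, here is a minimal sketch using the HuggingFace transformers library and GPT-2. GPT-2 is used only because it is small and openly available, not because it features in the paper discussed below.

```python
# Inspect a base model's probability distribution over the next token.
# GPT-2 is a stand-in here; any causal language model illustrates the same point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Softmax over the final position gives the distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.3f}")
```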
I have previously explained how fine-tuning using reinforcement learning from human feedback (RLHF) works in general (though there might be slight variations across different models):
Construct a dataset of human-written responses to prompts (the supervised fine-tuning (SFT) dataset).
Train the pre-trained GPT model on the SFT dataset, resulting in the supervised learning baseline model.
After training, have the model provide responses to a new dataset of prompts and have these responses ranked from best to worst by human reviewers.
Collate the human rankings into a dataset and use it to train a reward model, which serves as the reward function for fine-tuning the supervised learning baseline model with reinforcement learning (a sketch of the reward model's training loss follows this list).
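As a rough illustration of that last step, the reward model is typically trained with a pairwise ranking loss: it should score the response the human reviewers preferred higher than the one they rejected. The scores below are made up rather than produced by a real reward model; this is a sketch of the loss, not of the full RLHF pipeline.

```python
# Pairwise ranking loss for reward model training (illustrative values only).
import torch
import torch.nn.functional as F

# Hypothetical scalar scores a reward model might assign to pairs of responses,
# where the first response in each pair was ranked higher by human reviewers.
reward_preferred = torch.tensor([1.2, 0.4, 0.9])
reward_rejected = torch.tensor([0.3, -0.1, 0.8])

# The loss pushes the model to score preferred responses above rejected ones.
loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()
print(loss.item())
```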
I have also previously explained the purpose of fine-tuning base models:
Fine-tuning with RLHF is essentially about using bias to fight bias.
[...]
The reason for doing this has to do with the source of the data used to create the base model during pre-training. This is something that I wrote about previously on the drawbacks of using internet data to train generative AI models.
When pre-training large AI models...on lots of internet data, the resulting base model will generate a probability distribution of this data. When the base model produces responses to prompts, it is using this distribution to generate the response.
The problem here is that the probability distribution is a reflection of the biased, discriminatory and toxic content that features in the internet data. This includes "negative stereotypes and biases, discriminatory and harmful representation, and cultural and linguistic homogeneity."
So when the model is using its distribution of internet data, it might use those same biases to generate its outputs. In doing so, it reinforces the shortcomings of society represented in its training data:
...if the society in which the training data are generated is structurally biased against marginalized communities, even completely accurate datasets will elicit biases...The resulting models may codify and entrench systems of power and oppression, including capitalism and classism; sexism, misogyny, and patriarchy; colonialism and imperialism; racism and white supremacy; ableism; and cis and heteronormativity.
Herein lies the purpose of fine-tuning. It is about narrowing the scope of the probability distribution so that the model only uses the parts of it that human feedback has assigned a higher reward, parts that are supposed to contain less of the harmful content.
This is why fine-tuning, and RLHF in particular, has been suggested as a method for AI alignment. As stated in OpenAI's famous paper on RLHF:
This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of "human values."1
Additionally, the proliferation of open models has allowed users to fine-tune such models for specific tasks using their own data. This allows users to take models built for general use and develop bespoke AI systems that are tailored to their needs.
The brittleness of fine-tuning
Although fine-tuning may provide a method for AI alignment, a recent paper by researchers at the Oxford Internet Institute (OII) suggests that fine-tuning may actually have some major weaknesses.
The paper states:
Whilst fine-tuning can improve performance in targeted domains it may also impact other model behaviors in unexpected ways. One such property is model safety, or propensity or capability of a model to output unsafe responses to queries, including issues such as generating code for cyberattacks or creating instructions for developing malicious weapons. Model developers often describe their efforts to ensure deployment of safe models upon release, with safety and fairness referenced in release documentation...However, prior work has demonstrated how model safety can be impacted by fine-tuning, even when the data being used for fine-tuning does not include any data related to safety. (Emphasis added)2
The last part of the above extract is particularly noteworthy as it shows the potential brittleness of fine-tuning. And this is the very issue that the paper looks at:
This work seeks to...explore how parameter efficient fine-tuning can, inadvertently, shift toxicity metrics across a wide range of models and community-tuned variants.3
The researchers tested this by examining open models offered by Google, Meta and Microsoft. They selected two generations of each model, including both the foundation and instruction-tuned versions, resulting in six models being included in the research:
Phi-3 mini and Phi-3.5 mini (Microsoft)
Llama-2-7B and Llama-3.1-8B (Meta)
Gemma-2B and Gemma-2-2B (Google)
Also, for each instruction-tuned model, the researchers selected versions of these models fine-tuned by users and uploaded to HuggingFace (called 'community-tuned models' in the paper).
The researchers fine-tuned these models using the Dolly dataset from Databricks, an open-source dataset of 15,000 instruction-following records focused on question-answering, text generation and summarisation.
Low-Rank Adaptation (LoRA) was the fine-tuning method used in the research. This method freezes the original model weights and inserts a new, smaller set of weights that are trained during fine-tuning, making it a more efficient way to fine-tune larger models.
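To give a sense of what this setup looks like in code, here is a minimal sketch of loading the Dolly dataset and attaching LoRA adapters with the HuggingFace PEFT library. The model name and hyperparameters are illustrative choices, not the paper's exact configuration.

```python
# Sketch: LoRA fine-tuning setup on the Dolly dataset (illustrative, not the paper's code).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# databricks-dolly-15k: ~15,000 instruction-following records.
dataset = load_dataset("databricks/databricks-dolly-15k")

model_name = "meta-llama/Llama-3.1-8B"  # one of the open models studied; gated on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA freezes the original weights and trains small low-rank adapter matrices instead.
lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```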
The toxicity of the models was assessed using a dataset of 2,400 "toxic" prompts, compiled from the RealToxicityPrompts dataset and the Compositional Evaluation Benchmark (CEB) dataset.
Toxicity was measured using the roberta-hate-speech-dynabench-r4 model (the default toxicity metric provided by the HuggingFace Evaluate library). Here, toxicity is defined as:
abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.4
The toxicity of model outputs was rated from 0 (non-toxic) to 1 (toxic), with anything rated higher than 0.5 being classified as toxic.
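Here is a minimal sketch of how this kind of scoring works with the Evaluate library's toxicity measurement; the example completions are invented for illustration.

```python
# Score model outputs with the default toxicity measurement from HuggingFace Evaluate,
# which wraps the roberta-hate-speech-dynabench-r4 hate speech classifier.
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

completions = [
    "Thanks for asking. Here is a short summary of the article.",
    "I cannot help with that request.",
]
scores = toxicity.compute(predictions=completions)["toxicity"]

# Following the paper's threshold, anything above 0.5 counts as toxic.
for text, score in zip(completions, scores):
    print(f"{score:.3f}  toxic={score > 0.5}  {text!r}")
```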
Other settings for the research included the following:
The temperature for all models was set to 0 (meaning the models would always select the most likely next token when producing their responses to prompts, as shown in the sketch after this list)5
Model outputs were restricted to 50 tokens
The experiments were carried out on Google Colab using a single L4 GPU
A total of 28 models were assessed
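For illustration, the deterministic, length-capped generation described above might look something like the following with the transformers library. GPT-2 stands in for the models listed earlier, and the prompt is a placeholder rather than one of the toxic prompts from the evaluation set.

```python
# Greedy (temperature-0 equivalent) decoding with a 50-token cap, as in the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the evaluated models
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain what a language model is."
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    do_sample=False,     # greedy decoding: always pick the most likely next token
    max_new_tokens=50,   # model outputs restricted to 50 tokens
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```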
The paper presents three main findings from the research:
Fine-tuning base models reduces the propensity of models to generate toxic outputs.
Even when open models are fine-tuned on benign datasets that include no data related to safety, this process can actually increase the propensity to generate toxic outputs.
Community-tuned open models have varying rates of toxicity (and the researchers note here that this could be due to the different techniques and datasets used by those who fine-tuned the models and uploaded them to HuggingFace).
Accordingly, the important point made by the paper is this:
[This work] demonstrated that AI labs fine-tuning base models lead to reductions in toxicity, suggesting labs are seeking to reduce toxic content, in line with their commitments to safety. We show that, despite this, these mitigations can easily and, crucially, inadvertently, be undone.6
Yep, we are still dealing with black boxes
What I think this paper ultimately demonstrates is that LLMs, which consist of deep neural networks, are indeed black boxes that we do not fully understand.
The architecture of these models is so vast and complex that it is incredibly difficult, perhaps even impossible, to really understand how they are supposed to work. And if you cannot understand how these models work, how can you possibly ensure that they behave as you intended?
The paper from the OII researchers hints at this:
This work has shown how fine-tuning can impact toxicity rates in hard-to-predict ways, across models from different AI labs. (Emphasis added)7
Accordingly, the paper recommends that this lack of understanding is the next issue that ought to be addressed:
...future work could focus on exploring the reasons for such safety changes in the model. This could be due to model forgetting, with the safety fine-tuning conducted by model creators being "forgotten" by the model with additional fine-tuning. If this were the case, future experiments might find that after fine-tuning on benign data models converge towards the underlying pre-training toxicity rate of the base model.8
But in any case, as I have written previously:
Deep learning is an empirical science. The only way to truly verify how a model will behave in the real world is by releasing it into the real world and seeing what happens.
Just because we can build complex systems does not mean that we can predict or control how they behave. If this is the case, is AI alignment really possible?
Ouyang et al, ‘Training language models to follow instructions with human feedback’ (2022), p.2.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), pp.1-2.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.2.
Vidgen et al, ‘Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection’ (2021), p.3.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.7.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.8.
Hawkins et al, ‘The effect of fine-tuning on language model toxicity’ (2024), p.8.
It's both comforting and continually surprising that the answer to how a groundbreaking piece of technology actually works is still largely 'we don't entirely know'.
But as I was reading along, I nodded at the fact that many of the conclusions the researchers reached (particularly around model degradation and the unintended consequences of fine-tuning) were things I regularly came across in my literature review on LLMs and unlearning.
I think that model developers (and deployers, and users) need a better understanding that these systems aren't as easily explicable or controllable as we might like them to be. Folks have come around to the idea that they aren't a panacea and have serious limitations, but still seem committed to the idea that the likes of LLM developers are capable of controlling them granularly, i.e. in some way beyond shutting them off entirely.