The Cyber Solicitor

👾 AI Governance

Is AI alignment even possible?
Maybe not

Mahdi Assan
Nov 15, 2024
Image credit: Yutong Liu & Kingston School of Art / Better Images of AI / Talking to AI 2.0 / CC-BY 4.0

TL;DR

This newsletter is about the brittleness of fine-tuning in large language models. It looks at what fine-tuning is designed to achieve, how easily the safety measures built into LLMs can be undone, and what this means for AI development.

Here are the key takeaways:

  • Fine-tuning trains a base language model to apply the understanding of language it gained during pre-training to produce better responses to prompts. Open models made available to developers can also be further fine-tuned for specific tasks or processes (see the sketch after this list).

  • Whilst fine-tuning has been presented as a means of AI alignment, a recent research paper demonstrates how easy it is to reverse the safety measures implemented through fine-tuning. Even open models fine-tuned on seemingly benign data may revert to generating the toxic and harmful content they were originally trained not to produce.

  • This demonstrates that, when it comes to LLMs, we are still dealing with black boxes. The complexity of their architecture makes it incredibly difficult to understand, predict, and control how these models behave, which makes effective alignment seem almost impossible.
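To make the first takeaway concrete, here is a minimal sketch of supervised fine-tuning of an open model, assuming the Hugging Face transformers and datasets libraries. The model name "open-base-model" and the instructions.jsonl file are hypothetical placeholders, not anything from the paper discussed here.

    # A minimal sketch of supervised fine-tuning. "open-base-model" and
    # instructions.jsonl are hypothetical placeholders.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "open-base-model"  # hypothetical open causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # needed for padding

    # Assumed format: each record has "prompt" and "response" fields.
    dataset = load_dataset("json", data_files="instructions.jsonl")["train"]

    def tokenize(example):
        # One training sequence per prompt/response pair.
        return tokenizer(example["prompt"] + example["response"],
                         truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

    # mlm=False means causal language modelling: labels are the input ids,
    # so gradient updates push the model towards the style of the data.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=collator,
    )
    trainer.train()

This same mechanism is what makes the safety problem so brittle: a developer running exactly this kind of training loop on their own data is also, inadvertently or not, shifting whatever behaviour the original safety fine-tuning put in place.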
