TL;DR
This newsletter is about o1, OpenAI's new large language model. It looks at what makes the model different from its predecessors and its potential implications for AI development and regulation.
Here are the key takeaways:
On 12 September, OpenAI released a new large language model (LLM) called o1 (in a preview and a mini version). The major improvement of this model is its ability to 'reason'.
There are three key elements to the development of o1:
Large-scale reinforcement learning
Chain-of-thought
New training data
These three key elements (reinforcement learning, chain-of-thought and new data) of o1's development are what have enabled OpenAI to build a model that reasons. OpenAI claims that this is akin to how humans may take time to think about a response to a difficult question.
o1 potentially has three big implications:
The improved performance and capabilities of the o1 models (according to evaluations carried out by OpenAI) mainly come from increasing inference compute. This potentially introduces a new scaling law for inference whereby higher inference compute leads to higher performance.
Given that o1 expends more compute at inference time, users of the model will be hit with higher costs to query the model. No doubt OpenAI will seek to reduce these costs for users over time.
o1 challenges the presumption of AI legislation that uses training compute as a proxy for determining the risk of a model. If the inference scaling law introduced by o1 holds true, then the presumption that higher training compute equates to higher risk is significantly weakened.
What is new with o1?
On 12 September, OpenAI released a new large language model (LLM) called o1 (in a preview and a mini version). The major improvement of this model is its ability to 'reason':
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.1
But what does this mean exactly? What does it mean for an LLM to 'reason' and how have OpenAI made this possible?
According to OpenAI, o1's ability to reason consists of the model thinking about a prompt before providing an answer. It is trained to "refine [its] thinking process, try different strategies, and recognize [its] mistakes."2
There are three key elements to the development of o1:
Large-scale reinforcement learning
Chain-of-thought
New training data
Regarding the first element, reinforcement learning is about coaxing the model to behave in a certain way by 'rewarding' it when it executes the desired behaviour. This is done through an iterative process whereby the model learns the optimal strategy for achieving a set goal.3
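To make the idea concrete, here is a minimal, runnable toy in Python: an 'agent' repeatedly picks an action, receives a reward when it picks the desired one, and gradually shifts its preferences towards the rewarded behaviour. This is only a sketch of the reward-and-update loop, not anything like OpenAI's actual training setup.

```python
import random

# Minimal illustration of the reinforcement learning loop: act, receive a
# reward for the desired behaviour, reinforce. Purely a toy sketch.

actions = ["A", "B", "C"]
desired = "B"                                  # the behaviour we want to reward
preferences = {a: 1.0 for a in actions}        # the 'policy': weights over actions

def choose(prefs):
    """Sample an action in proportion to its current weight."""
    names = list(prefs)
    return random.choices(names, weights=[prefs[n] for n in names])[0]

for _ in range(2_000):
    action = choose(preferences)
    reward = 1.0 if action == desired else 0.0  # reward only the desired behaviour
    preferences[action] += 0.1 * reward         # reinforce rewarded actions

print(preferences)  # after training, "B" carries by far the largest weight
```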
Reinforcement learning is a methodology that OpenAI has used before to develop its previous GPT models. It has used reinforcement learning from human feedback (RLHF)4 to train the model to use its understanding of natural language to produce more reliable and user-friendly responses to prompts (a technique I have explained in more detail in a previous post on Google's Gemini, which uses the same approach).
The basic steps of RLHF include the following (a rough code sketch follows the list):
Construct a dataset of human-written responses to prompts (the supervised fine-tuning (SFT) dataset).
Train the pre-trained GPT model on the SFT dataset, resulting in the supervised learning baseline model.
After training, have the model provide responses to a new dataset of prompts and have these responses ranked from best to worst by human reviewers.
Collate the human rankings together into a dataset to train a reward model, which is used as the reward function to fine-tune the supervised learning baseline model using reinforcement learning.
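At the risk of oversimplifying, the four steps can be sketched as a runnable Python toy. Every name below (fine_tune, collect_human_rankings, train_reward_model, rl_fine_tune) is a made-up stand-in for the real machinery: the 'model' is just a table of response weights and the 'human reviewers' are a one-line heuristic.

```python
import random

# A deliberately crude, runnable sketch of the four RLHF steps above.
# The "model" is a dict of response weights; every helper is a toy
# stand-in, not OpenAI's actual training code.

def fine_tune(model, sft_dataset):
    """Steps 1-2: nudge the model towards the human-written responses."""
    for response in sft_dataset:
        model[response] = model.get(response, 0.0) + 1.0
    return model

def collect_human_rankings(responses):
    """Step 3: stand-in for human reviewers ranking responses (shorter = better here)."""
    return sorted(responses, key=len)

def train_reward_model(rankings):
    """Step 4a: a 'reward model' that scores a response by its human rank."""
    scores = {resp: float(len(rankings) - i) for i, resp in enumerate(rankings)}
    return lambda resp: scores.get(resp, 0.0)

def rl_fine_tune(model, reward_model, steps=2_000):
    """Step 4b: reinforcement learning loop driven by the reward model."""
    responses = list(model)
    for _ in range(steps):
        resp = random.choices(responses, weights=[model[r] for r in responses])[0]
        model[resp] += 0.01 * reward_model(resp)   # reinforce high-reward responses
    return model

pretrained = {"a long rambling answer": 1.0, "a concise answer": 1.0}
sft_model = fine_tune(pretrained, ["a concise answer"])    # steps 1-2
rankings = collect_human_rankings(list(sft_model))         # step 3
reward_fn = train_reward_model(rankings)                   # step 4: reward model
final_model = rl_fine_tune(sft_model, reward_fn)           # step 4: RL fine-tuning
print(final_model)                                         # the concise answer wins out
```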
The problem with this RLHF process is that the reward attaches to the final response as a whole. In other words, when the model is being trained against the reward model, it is scored on the overall answer it produced, not on the individual steps it took to get there.
This kind of feedback is not very granular, and therefore makes it difficult for the model to understand where it has gone wrong in the generation of its response to the given prompt. OpenAI have changed this by introducing a form of reinforcement learning that involves more granularity in fine-tuning how the model responds to prompts.
The change consists of constructing reward models that assess every reasoning step as opposed to just the final output of the model. This leads to the second key element in the development of o1, which is chain-of-thought.
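A small, hand-rolled example may help show why step-level scoring is more informative than a single score on the final answer. The worked example and the 'verifier' below are invented for illustration; OpenAI has not published the details of o1's reward models.

```python
# Contrast between an outcome-level reward (final answer only) and a
# step-level ("process") reward. The reasoning steps and checker are
# invented; OpenAI has not disclosed how o1's reward models are built.

reasoning_steps = [
    "12 * 4 = 48",        # correct step
    "48 + 10 = 59",       # incorrect step (should be 58)
    "final answer: 59",   # wrong final answer
]

def outcome_reward(steps):
    """One reward for the whole response: right or wrong, nothing in between."""
    return 1.0 if steps[-1] == "final answer: 58" else 0.0

def process_reward(steps):
    """One score per reasoning step, so training can see where things broke."""
    step_is_valid = [True, False, False]   # stand-in for a learned step verifier
    return [1.0 if ok else 0.0 for ok in step_is_valid]

print(outcome_reward(reasoning_steps))   # 0.0 -> no signal about which step failed
print(process_reward(reasoning_steps))   # [1.0, 0.0, 0.0] -> step-level feedback
```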
A famous paper by the Brain Team at Google Research provides a definition of chain-of-thought:
A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output.5
Chain-of-thought therefore represents the thinking process that the model applies to its response to prompts. OpenAI have trained o1 to respond to prompts step-by-step, using reinforcement learning to build a model that reasons more accurately.
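To make this concrete, here is a small illustration along the lines of the arithmetic examples in the Wei et al. paper: the intermediate steps are written out before the final answer, so each step can in principle be checked (and, in o1's case, rewarded) on its own.

```python
# A chain of thought in the Wei et al. sense: intermediate natural language
# reasoning steps that lead to the final output. Illustrative only.

question = ("The cafeteria had 23 apples. They used 20 to make lunch "
            "and bought 6 more. How many apples do they have?")

chain_of_thought = [
    "The cafeteria started with 23 apples.",
    "They used 20, leaving 23 - 20 = 3 apples.",
    "They bought 6 more, so 3 + 6 = 9 apples.",
]
final_answer = "9"

print(question)
print("\n".join(chain_of_thought))
print("Answer:", final_answer)
```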
The third key element of o1's development is the dataset used for the reinforcement learning. OpenAI used a mixture of public and proprietary data:
Public data consisted of "web data and open-source datasets" with key components including "reasoning data and scientific literature."6
Proprietary data was sourced from the various data partnerships OpenAI has entered into in recent years, giving the developer access to "paywalled content, specialized archives, and other domain specific datasets that provide deeper insights into industry-specific knowledge and use cases."7
These three key elements (reinforcement learning, chain-of-thought and new data) of o1's development are what have enabled OpenAI to build a model that reasons:
Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.
Examples of o1 responding to prompts using chain of thought are provided on OpenAI's website, and several videos have also been released showcasing the model's capabilities in various domains.
What 3 things does o1 change?
o1 potentially has three big implications:
It potentially reveals a new scaling law for inference
More of the cost of running the model will be passed on to users
It challenges the presumption that training compute correlates with model risk
A new scaling law for inference compute
The improved performance and capabilities of the o1 models (according to evaluations carried out by OpenAI) mainly come from increasing inference compute.
As explained by OpenAI:
We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
The scaling of inference (i.e., increasing the amount of compute the model spends working through a prompt before producing its final answer) is the new paradigm introduced by o1. This means that the more time the model spends 'thinking' about how to answer a prompt, applying the appropriate reasoning steps, the better the responses it produces.
Ethan Mollick's post titled Scaling: The State of Play in AI summarises it well:

Just like the scaling law for training, this seems to have no limit, but also like the scaling law for training, it is exponential, so to continue to improve outputs, you need to let the AI “think” for ever longer periods of time. It makes the fictional computer in The Hitchhikers Guide to the Galaxy, which needed 7.5 million years to figure out the ultimate answer to the ultimate question, feel more prophetic than a science fiction joke. We are in the early days of the “thinking” scaling law, but it shows a lot of promise for the future.
This potential scaling law provides an escape from the traditional scaling laws in AI that focus on making bigger models trained on bigger datasets. Such doctrines may no longer present the clearest path to optimal performance (if they ever did).
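As a purely illustrative toy of what a 'thinking' scaling curve can look like, the sketch below lets a solver make more attempts per problem (a crude stand-in for spending more inference compute) and measures accuracy. It shows the general shape only, improvement with diminishing returns, and is not a description of how o1 actually allocates its compute.

```python
import random

# Toy test-time scaling curve: spend more inference compute (here, more
# attempts per problem, checked by an idealised verifier) and accuracy
# rises with diminishing returns. Illustrative only.

def one_attempt_succeeds(p_single=0.3):
    """A single reasoning attempt solves the problem with fixed probability."""
    return random.random() < p_single

def solve_with_budget(n_attempts):
    """The problem counts as solved if any attempt within the budget succeeds."""
    return any(one_attempt_succeeds() for _ in range(n_attempts))

trials = 20_000
for budget in (1, 2, 4, 8, 16):
    accuracy = sum(solve_with_budget(budget) for _ in range(trials)) / trials
    print(f"inference budget = {budget:2d} attempts -> accuracy ~ {accuracy:.2f}")
```

Each doubling of the budget buys a smaller gain, which is the 'exponential' flavour described in the quoted passage above: to keep improving, you have to let the model think for ever longer.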
Increased inference costs for users
Given that o1 expends more compute at inference time, users of the model will be hit with higher costs to query the model.
As is already well-known, LLMs are very expensive to develop and deploy. Back in July, The Information reported that OpenAI is contending with training and inference costs amounting to around $7 billion.
With o1 using chain-of-thought to generate its responses, requiring more inference compute, users are essentially paying for the chain-of-thought outputs as well as the final output. Accordingly, o1-preview is reported to charge "$15 per million input tokens and $60 per million output tokens."
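A back-of-the-envelope calculation shows how the hidden reasoning tokens change the bill. The per-million-token prices are those quoted above; the token counts, and the assumption that hidden reasoning tokens are billed at the output-token rate, are hypothetical.

```python
# Rough cost of a single hypothetical o1-preview query at the quoted rates.
# Token counts are invented, and treating hidden reasoning tokens as billed
# output tokens is an assumption made here for illustration.

input_price = 15 / 1_000_000      # $ per input token  ($15 per million)
output_price = 60 / 1_000_000     # $ per output token ($60 per million)

prompt_tokens = 1_000             # hypothetical prompt length
answer_tokens = 500               # hypothetical visible final answer
reasoning_tokens = 4_000          # hypothetical hidden chain-of-thought

visible_only = prompt_tokens * input_price + answer_tokens * output_price
with_reasoning = visible_only + reasoning_tokens * output_price

print(f"billing only what the user sees:      ${visible_only:.3f}")
print(f"billing the hidden reasoning as well: ${with_reasoning:.3f}")
```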
No doubt OpenAI will seek to reduce these costs for users over time. Users may not be happy about paying for outputs that they can neither see (since OpenAI replaces the actual chain-of-thought with model-generated summaries, more on this later) nor control.
As Alberto Romero of The Algorithmic Bridge puts it:

Will users consider the enhanced reasoning skills to be worth more time and cost per query (setting aside for a moment the unintended costly mistakes and the cases when reasoning isn’t necessary at all)? I’m not sure of this.
Training compute vs inference compute for AI risk categorisation
o1 challenges the presumption of AI legislation that uses training compute as a proxy for determining the risk of a model.
The EU AI Act is an example of such legislation. As I have explained previously in a post about this approach to model risk categorisation:
In essence, the rationale for the mandated compute thresholds in the AI Act is as follows:
Where a model uses a certain amount of compute for training, it is deemed to have high capabilities.
A model with high capabilities means a model that may result in systemic risks from its use.
Therefore, a model's compute consumption can be used as a proxy to measure the risk it may present to public health, safety, public security, fundamental rights, or the society as a whole.
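For a sense of scale, training compute is often approximated with the rule of thumb C ≈ 6 × N (parameters) × D (training tokens). The model sizes below are purely illustrative, not figures for any particular system; the 10^25 FLOP figure is the AI Act's presumption threshold.

```python
# Rough illustration of the training-compute proxy, using the common
# approximation C ~ 6 * N * D, against the AI Act's 10^25 FLOP threshold.
# Parameter and token counts below are illustrative, not real model figures.

THRESHOLD = 1e25   # FLOPs above which high-impact capabilities are presumed

hypothetical_models = {
    "mid-sized model":      (7e9, 1e12),     # 7B parameters, 1T training tokens
    "frontier-scale model": (1e12, 15e12),   # 1T parameters, 15T training tokens
}

for name, (params, tokens) in hypothetical_models.items():
    train_flops = 6 * params * tokens
    side = "above" if train_flops > THRESHOLD else "below"
    print(f"{name}: ~{train_flops:.1e} training FLOPs -> {side} the 10^25 threshold")
```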
With o1's use of chain of thought, more compute is being used at inference. With that inference compute, the model is following a series of reasoning steps that it has been trained on to produce its output to a given prompt.
OpenAI has explained its decision to conceal the actual chain of thought process with model-generated summaries:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
The System Card for o1 documents the safety challenges and evaluations for the model. In that analysis, OpenAI describes how chain-of-thought provides another avenue for evaluating the dangerous behaviour that may be exhibited by the model:
One of the key distinguishing features of o1 models are their use of chain-of-thought when attempting to solve a problem. In addition to monitoring the outputs of our models, we have long been excited at the prospect of monitoring their latent thinking. Until now, that latent thinking has only been available in the form of activations — large blocks of illegible numbers from which we have only been able to extract simple concepts. Chains-of-thought are far more legible by default and could allow us to monitor our models for far more complex behavior (if they accurately reflect the model’s thinking, an open research question).8
The implication this has for laws like the AI Act is important. If the inference scaling law introduced by o1 holds true, then the presumption that higher training compute equates to higher risk is significantly weakened.
Instead, inference compute may be what correlates with risk, as OpenAI hints at in its System Card. The more compute used at inference, the more time the model spends 'reasoning' and therefore the better its performance, which in turn could increase the risk of the model exhibiting dangerous behaviour.
Dean Ball of Hyperdimensional has an extensive post which sets out the potential problems that this could pose for AI regulation:

To police inference, government would need to surveil, or require model developers or cloud computing providers to surveil, all use of AI language models. Set aside the practical problems with doing this. Set aside, even, the drastic privacy concerns any reasonable person would have. Think, instead, about the way such a policy would drive much of AI use into the shadows, the way it would incentivize the creation of an AI infrastructure beyond the reach of the state. How safe does that sound to you?
That is not to say that models like o1 would definitely be exempt from the more onerous obligations for general-purpose models with systemic risk under the AI Act. Even if the compute used for training falls below 10^25 FLOPs, other factors can be taken into account for the risk categorisation, including the benchmarks and evaluations of the model's capabilities.9
Nevertheless, the advent of o1 highlights the problem with regulation that is too specific to the (former) state of the art, therefore limiting its ability to be future proof.
1. OpenAI, 'OpenAI o1 System Card' (2024), p.1.
2. OpenAI, 'OpenAI o1 System Card' (2024), p.1.
3. Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans (Penguin Random House, 2019), p.164.
4. Ouyang et al, 'Training language models to follow instructions with human feedback' (2022).
5. Wei et al, 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models' (2022), p.2.
6. OpenAI, 'OpenAI o1 System Card' (2024), p.1.
7. OpenAI, 'OpenAI o1 System Card' (2024), p.2.
8. OpenAI, 'OpenAI o1 System Card' (2024), p.6.
9. EU AI Act, Annex XIII.