A normie's guide to AI scaling laws
The philosophy that drives OpenAI and other model developers
TL;DR
This newsletter is about AI scaling laws. It looks at what they are, how they have influenced AI development, and what their future might hold.
Here are the key takeaways:
'Scaling laws' refer to a phenomenon in AI development whereby model capabilities improve as certain aspects of model development are scaled up. This is an empirical observation that OpenAI found to apply to the training of its language models.
The main finding of OpenAI's scaling law paper is that if you increase the size of the model, the amount of data the model is trained on, or the amount of compute used for training, the performance of the model also increases, represented as a fall in the loss score on a test dataset.
The positive correlation between these aspects of the training operation and performance is so strong that OpenAI claims in its paper that these aspects can be used to predict model performance. In other words, you can predict a specific level of model performance from specific levels of model size, data and compute.
Due to scaling laws, OpenAI has sought to increase model size, training datasets and compute in an attempt to produce better-performing models. The impact that scaling laws have had on AI development has been stark, especially after the release of ChatGPT in late 2022. They made everything bigger: the models, the data, the compute and, accordingly, the investment.
There are some concerns that the scaling laws are no longer holding true. This is because the resources required for training models are scarce; there is only so much data you can collect for training and there are only so many GPU-powered data centres that you can build.
So if the scaling laws for training no longer hold true, other paths for better-performing models will need to be explored. Scaling inference-time compute could be one of these paths, but this remains to be seen.
What are scaling laws?
'Scaling laws' refer to a phenomenon in AI development whereby model capabilities improve as certain aspects of model development are scaled up.
This is an empirical observation that OpenAI found to apply to the training of its language models. As stated in the abstract of its famous paper 'Scaling Laws for Neural Language Models' (published in January 2020):
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.1
To understand what this means, you need to understand how today's language models are built.
How LLMs are built
Large language models (LLMs) are trained to predict the next token. But what does this mean? I give a more detailed explanation of how LLMs work in Do LLMs store personal data?, but here I will give a simplified explanation relevant for understanding scaling laws:
LLMs are trained on large datasets of text. This text can come from many different sources, ranging from books to public webpages. The text datasets are converted into tokens, which are numerical representations of the words and fragments of words in the datasets. These tokens make it easier for the model to process the text data.
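As a quick, hands-on illustration (and not something taken from OpenAI's paper), here is what tokenisation looks like using OpenAI's open-source tiktoken library; the exact token IDs you get depend on which tokeniser is used:

```python
# Illustrative only: converting text into tokens and back with tiktoken
# (assumes the tiktoken package is installed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("Scaling laws are an empirical observation.")
print(token_ids)              # a list of integers, one per word or word fragment
print(enc.decode(token_ids))  # decoding the IDs recovers the original text
```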
With these tokens, a process called 'masked self-attention' is used to train the model. The model is shown only part of a sequence of tokens and, based on that partial sequence, is asked to predict the tokens that come next (i.e., it is asked to predict the tokens that have been hidden from it).
The model uses its parameters to predict the tokens. The parameters essentially contain information about the patterns it has identified from the training data (the text datasets). At the start of its training, these parameters will be randomly set, but as it works through the training data the parameters will eventually reflect the patterns actually contained in the training data.
The next token predictions produced by the model are compared to the actual tokens. This comparison produces the loss score: a metric that represents the numerical difference between the tokens predicted by the model and the correct tokens. The greater the difference, the higher the loss score, and vice versa. The loss function is the mechanism that calculates the loss score.
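As a minimal, illustrative sketch of this idea (not OpenAI's exact setup), language models are typically trained with a cross-entropy loss, which is simply the negative log of the probability the model assigned to the correct next token:

```python
# A toy loss calculation: the loss is low when the model gives the correct
# next token a high probability, and high when it gives it a low probability.
import math

print(-math.log(0.60))  # ~0.51: the model gave the correct token 60% probability
print(-math.log(0.05))  # ~3.00: the model gave the correct token only 5% probability
```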
The loss score is then fed back to the model, which uses it as feedback to adjust its parameters. This whole process (predicting the next tokens given a sequence of tokens) is then repeated until the model has seen all the tokens in its training data.
After going through its training data, the model is then tested on a new dataset consisting of text that it has not come across during its training. This dataset is used to test the performance of the model. The loss score on this test dataset is used as a proxy to measure this performance.
The training process for LLMs therefore consists of a training loop with the following steps (a code sketch of the loop follows below):
An input is processed by the model's parameters to produce a predicted output
The prediction is compared with the actual output by a loss function that produces a loss score
The loss score is fed back to the model which adjusts its parameters accordingly
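Sketched in code, the three steps above look something like the following. This is a toy example using PyTorch (assumed to be installed) with a deliberately tiny 'model'; real LLMs use far larger transformer architectures, but the loop is the same:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32
# A toy next-token predictor: an embedding layer followed by a linear layer.
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()                    # the loss function
optimizer = torch.optim.Adam(model.parameters())   # adjusts the parameters

tokens = torch.randint(0, vocab_size, (1000,))     # toy training data (token IDs)

for step in range(100):
    inputs, targets = tokens[:-1], tokens[1:]      # each token's target is the token after it
    logits = model(inputs)                         # 1. parameters produce predictions
    loss = loss_fn(logits, targets)                # 2. predictions compared with actual tokens
    optimizer.zero_grad()
    loss.backward()                                # 3. loss score fed back...
    optimizer.step()                               # ...and parameters adjusted
```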
Another important aspect of LLM development is compute. Here we are referring to the specialised hardware that is required to power the above ML training loop.
A crucial part of this hardware is the graphics processing unit (GPU), a type of computer chip commonly used for AI development. GPUs are capable of performing parallel processing, in which they break a processing operation into smaller operations and run them simultaneously. This is particularly advantageous for training LLMs given the scale of this practice (i.e., models with billions of parameters trained on petabytes of text data). You can read the article below for a more detailed description of GPUs and AI development.
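To make the idea of parallel processing a little more concrete, here is a rough illustration (nothing specific to OpenAI's setup): most of the work in training an LLM boils down to large matrix multiplications, which split naturally into many independent multiply-and-add operations that can run at the same time. NumPy hands the multiplication below to optimised numerical routines on a CPU; a GPU does the same kind of work across thousands of cores at once:

```python
import numpy as np

activations = np.random.rand(1024, 4096)  # a batch of token representations
weights = np.random.rand(4096, 4096)      # one layer's parameters
outputs = activations @ weights           # millions of independent multiply-adds
print(outputs.shape)                      # (1024, 4096)
```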
Scaling laws for LLMs
The main finding of OpenAI's scaling law paper is that if you increase the size of the model, the amount of data the model is trained on, or the amount of compute used for training, the performance of the model also increases, represented as a fall in the loss score on a test dataset.
Let's break this down:
Model size. OpenAI's paper defines model size as the number of parameters in the LLM (excluding embedding parameters). This is essentially determined by the number and size of the layers in the model.2
Data. The tests carried out for OpenAI's paper used an extended version of a text dataset called WebText (WebText2). This is a dataset of text collected from outbound Reddit links posted between January and October 2018, consisting of 20.3 million documents containing 96 GB of text data. The test dataset consisted of a held-out portion of this data as well as text from books, Common Crawl and Wikipedia articles.3
Compute. This refers to the amount of computing power used. The longer a model is trained for, the more compute it uses.
The positive correlation between these aspects of the training operation and performance is so strong that OpenAI claims in its paper that these aspects can be used to predict model performance.4 In other words, you can predict a specific level of model performance from specific levels of model size, data and compute.
Overall, the scaling laws mean that "model performance depends most strongly on scale."5 The bigger the better.
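To give a rough sense of what 'scales as a power-law' means, the paper fits the test loss to curves of the following form, where L is the loss, N is model size, D is dataset size and C is the compute used for training. The constants (N_c, D_c, C_c) and the exponents (the α values) are fitted to OpenAI's experimental results; the reported exponents are small (roughly 0.076 for model size, 0.095 for data and 0.05 for compute), which is why each further improvement in loss requires multiplying the scale of training rather than just adding to it:

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}
\qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```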
How have scaling laws impacted AI development?
The scaling laws were identified by researchers at OpenAI involved in the development of its GPT models. A team that included Dario Amodei (co-founder of Anthropic) made the observation that:
...it was possible to estimate with high accuracy how much data, how much compute, and how many parameters to use to produce a model with a desired level of performance on a discrete capability tightly correlated with next-word-prediction. For capabilities less but still somewhat correlated, increasing these inputs should also lead to better performance.6
Due to scaling laws, OpenAI has sought to increase model size, training datasets and compute in an attempt to produce better-performing models.
The original GPT model (released in 2018) had 117 million parameters.7 GPT-2, released the following year, grew to 1.5 billion parameters.8 GPT-3, released in 2020, grew to 175 billion parameters,9 and its release was the last time that OpenAI provided figures on the size of its models. In the technical paper for GPT-4, released in 2023, OpenAI states the following:
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.10
On training data, the original GPT model was trained on BooksCorpus, a dataset of "over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance."11 GPT-2's training data included the original WebText dataset, which contained "8 million documents for a total of 40 GB of text."12 GPT-3's training data included the expanded version of WebText described above (WebText2) as well as an archive of webpages scraped from the public internet by Common Crawl, two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.13 With each GPT model, the training dataset has increased in size.
The increases in both model size and training data required computing power to increase as well. This is partly what led OpenAI to enter into its partnership with Microsoft. The leadership realised that it would need to "marshal substantial resources" to develop its models, and so made changes to the company's charter and corporate structure to allow for this.14 In June 2019, OpenAI and Microsoft announced their strategic partnership, in which Microsoft would invest $1 billion in the AI developer (a combination of cash and cloud credits for Azure, Microsoft's cloud service) and in return OpenAI would license its technology to Microsoft. In January 2025, OpenAI announced its Stargate project, in which it plans to invest, with other companies, $500 billion over the next four years to build new AI infrastructure in the US.
The impact that scaling laws have had on AI development has been stark, especially after the release of ChatGPT in late 2022. They made everything bigger: the models, the data, the compute and, accordingly, the investment:
Not even in Silicon Valley did other companies and investors move until after ChatGPT to funnel unqualified sums into scaling. That included Google and DeepMind, OpenAI's original rival. It was specifically OpenAI, with its billionaire origins, unique ideological bent, and Altman's singular drive, network, and fundraising talent, that created a ripe combination for its particular vision to emerge and take over. "I get the sense that Sam is the most ambitious person on the planet," a former employee says. In other words, everything OpenAI did was the opposite of inevitable; the explosive global costs of its massive deep learning models, and the perilous race it sparked across the industry to scale such models to planetary limits, could only have ever arisen from the one place it actually did.15
In essence, OpenAI has bet much of its strategy on scaling laws.
What is the current state of scaling laws?
Late last year, there were some concerns that the scaling laws were no longer holding true.
One of the more prominent voices raising such concerns was Ilya Sutskever, co-founder and former chief scientist of OpenAI. During a talk at NeurIPS 2024 in Canada, Sutskever stated that "pre-training as we know it will eventually end."
What does he mean by this?
Building LLMs involves two main stages of training: pre-training and post-training. The process of training models to predict the next token (explained above) is the pre-training process, the product of which is a base model. Post-training is about training the base model to perform in a certain way when prompted. This can be done with reinforcement learning from human feedback (RLHF), which is the technique OpenAI uses to train its GPT models to be more conversational and follow instructions (i.e., ChatGPT). You can read this post to learn more about how post-training works.
Scaling laws have traditionally applied to the pre-training process. But Sutskever was suggesting that scaling the various aspects of pre-training cannot go on forever.
Why is this the case? Well, the answer is pretty simple: the resources required for pre-training are scarce. This applies especially to data and compute. There is only so much data you can collect for training, and there are only so many GPU-powered data centres that you can build.
But just as importantly, it may not be the case that continuing to scale model development always leads to better models. Eventually, the gains from scaling plateau.

Even venture capital firms like Sequoia are recognising this reality. From one of their keynotes at their AI Ascent event earlier this year:
...the bad news is that pre-training does seem to be slowing down. We've scaled pre-training by nine or ten orders of magnitude since [the 2010s] and that means a lot of the low-hanging fruit has been picked.
So if the scaling laws for pre-training no longer hold true, what are the alternative paths for better-performing models?
One possible path was revealed by the release of OpenAI's first reasoning model, o1, last year.
Reasoning models are LLMs which are trained to 'think' more carefully about how to respond to a given prompt. This 'thinking' that the model is trained to do is achieved via two techniques that are applied during post-training:
Chain-of-thought. This is about constructing a series of steps that the model is trained to use when responding to a prompt.16 This chain-of-thought represents the thinking process that the model applies when responding to prompts (an illustrative example follows after this list).
Reinforcement learning. Reinforcement learning is about coaxing the model to behave in a certain way by 'rewarding' it when it executes the desired behaviour. This is done through an iterative process whereby the model learns the optimal strategy for achieving a set goal.17
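To illustrate what a chain of thought looks like, here is a hypothetical prompt of the kind used in chain-of-thought prompting (a toy example of my own, not taken from OpenAI's training data). The worked answer to the first question shows the model the step-by-step format it is expected to imitate for the second:

```python
# A hypothetical chain-of-thought style prompt. The first Q&A pair contains the
# intermediate reasoning steps; the model is then asked to answer the second
# question in the same step-by-step way.
prompt = (
    "Q: A cafe sells coffee for £3 and a pastry for £2. "
    "How much do 4 coffees and 2 pastries cost?\n"
    "A: 4 coffees cost 4 x £3 = £12. 2 pastries cost 2 x £2 = £4. "
    "£12 + £4 = £16. The answer is £16.\n"
    "Q: A train ticket costs £8 and a bus ticket costs £2. "
    "How much do 3 train tickets and 5 bus tickets cost?\n"
    "A:"
)
print(prompt)
```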
Using these techniques, OpenAI trained o1 to respond to prompts step-by-step, using reinforcement learning to build a model that reasons more accurately. And according to OpenAI, if reasoning models are trained to use more inference-time compute, then this leads to better performing models.
As OpenAI states in its o1 announcement:
We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
'Inference-time compute' here means the computing power used by a deployed model to respond to a user input or prompt. So ChatGPT uses inference-time compute every time it responds to a prompt.
Simply put, the longer a model spends thinking about an answer to a prompt, the better the answer it produces. This is the potential new inference scaling law that OpenAI revealed with its release of o1.
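One simple way to spend more inference-time compute, sketched below, is to sample several candidate answers to the same prompt and keep the most common one (sometimes called majority voting or self-consistency). This is just an illustration of the general idea, not the specific technique OpenAI uses for o1, and generate_answer here is a stand-in stub rather than a real model call:

```python
import random
from collections import Counter

def generate_answer(prompt):
    # Stand-in for a call to a deployed model; it samples a plausible answer
    # at random so the sketch runs end-to-end.
    return random.choice(["34", "34", "34", "33", "35"])

def best_answer(prompt, n_samples):
    # More samples means more inference-time compute spent on the same prompt,
    # and (up to a point) a more reliable final answer.
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(best_answer("How much do 3 train tickets and 5 bus tickets cost?", n_samples=10))
```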
Will this new scaling paradigm hold out? Possibly, but there are other methods that AI developers have been looking at to build better models, such as building models that use compute more efficiently.
1. Kaplan et al., 'Scaling Laws for Neural Language Models' (2020), p.1.
2. Kaplan et al., 'Scaling Laws for Neural Language Models' (2020), p.6.
3. Kaplan et al., 'Scaling Laws for Neural Language Models' (2020), p.7.
4. Kaplan et al., 'Scaling Laws for Neural Language Models' (2020), p.4.
5. Kaplan et al., 'Scaling Laws for Neural Language Models' (2020), p.3.
6. Karen Hao, Empire of AI: Inside the Reckless Race for Total Domination (Allen Lane 2025), p.124.
7. Radford et al., 'Improving Language Understanding by Generative Pre-Training' (2018), p.5.
8. Radford et al., 'Language Models are Unsupervised Multitask Learners' (2019), p.1.
9. Brown et al., 'Language Models are Few-Shot Learners' (2020), p.1.
10. OpenAI, 'GPT-4 Technical Report' (2023), p.2.
11. Radford et al., 'Improving Language Understanding by Generative Pre-Training' (2018), p.4.
12. Radford et al., 'Language Models are Unsupervised Multitask Learners' (2019), p.3.
13. Brown et al., 'Language Models are Few-Shot Learners' (2020), p.8.
14. Parmy Olson, Supremacy: AI, ChatGPT and the Race That Will Change the World (St. Martin’s Press 2024), p.160.
15. Karen Hao, Empire of AI: Inside the Reckless Race for Total Domination (Allen Lane 2025), p.132.
16. Wei et al., 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models' (2022), p.2.
17. Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans (Penguin Random House 2019), p.164.