Is big always bad?
The merits of compute thresholds as a governance mechanism for AI
TL;DR
This newsletter is about a paper by Sara Hooker, a machine learning researcher and lead at Cohere AI, titled 'On the Limitations of Compute Thresholds as a Governance Strategy'. It looks at the arguments the paper presents on the drawbacks of compute thresholds as a mechanism for AI governance, as found in the EU's AI Act.
Here are the key takeaways:
The EU AI Act contains provisions determining when general-purpose AI models are to be classed as models with systemic risks, subjecting developers of these models to more onerous obligations. Under these provisions, models that use a certain amount of compute for training are presumed to be models with systemic risks.
Sara Hooker's paper criticises the use of such compute thresholds as a governance mechanism for AI. She argues that such mechanisms rely on the faulty assumption that higher compute means a higher propensity for harm.
In her paper, Hooker shows how higher compute does not necessarily increase the model capabilities that may in turn lead to systemic risks from a model's use. Additionally, she explains why FLOPs may not be a reliable measure of a model's compute consumption.
Accordingly, Hooker recommends that policymakers move away from hard-coded thresholds to more dynamic thresholds that would capture current and future harms of AI models. She also argues that the use of compute thresholds should be backed by scientific evidence, something that is missing from the EU AI Act.
The EU AI Act and compute thresholds
The EU AI Act contains provisions pertaining to what it defines as "general-purpose AI models". Article 3.63 defines these models as follows:
...an AI model, including where such an AI model is trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications, except AI models that are used for research, development or prototyping activities before they are placed on the market.
Chapter V further distinguishes between two different types of general-purpose models: those with and those without "systemic risk". A definition of "systemic risk" is provided in Article 3.65:
...a risk that is specific to the high-impact capabilities of general-purpose AI models, having a significant impact on the Union market due to their reach, or due to actual or reasonably foreseeable negative effects on public health, safety, public security, fundamental rights, or the society as a whole, that can be propagated at scale across the value chain.
Accordingly, under Article 51.1, a general-purpose AI model is classified as a model with systemic risk if it meets at least one of the following conditions:
It has high impact capabilities (Article 3.64 defines this as "capabilities that match or exceed the capabilities recorded in the most advanced general-purpose AI models").
A decision of the European Commission determines a model to have high impact capabilities.
Compute thresholds are relevant to the first condition. Article 51.2 states that a general-purpose AI model is presumed to have high impact capabilities "when the cumulative amount of computation used for training measured in floating point operations is greater than 10²⁵."
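To make the threshold concrete: a common rule of thumb in the scaling-laws literature (not something prescribed by the Act itself) approximates training compute as roughly 6 × parameters × training tokens. Here is a minimal sketch using that approximation, with purely illustrative model configurations rather than figures for any real system:

```python
# Back-of-the-envelope training-compute estimate using the common
# ~6 * N * D rule of thumb (N = parameters, D = training tokens).
# All model configurations below are illustrative assumptions, not
# figures for any real model.

EU_AI_ACT_THRESHOLD_FLOP = 1e25  # Article 51.2 presumption threshold

def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate cumulative training compute as ~6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

examples = {
    "7B params, 2T tokens": estimate_training_flop(7e9, 2e12),       # ~8.4e22
    "70B params, 15T tokens": estimate_training_flop(70e9, 15e12),   # ~6.3e24
    "400B params, 15T tokens": estimate_training_flop(400e9, 15e12), # ~3.6e25
}

for name, flop in examples.items():
    presumed = flop > EU_AI_ACT_THRESHOLD_FLOP
    print(f"{name}: ~{flop:.1e} FLOP -> presumed systemic risk: {presumed}")
```

On these made-up numbers, only the largest configuration crosses the 10²⁵ presumption line, which gives a feel for why so few of today's models are expected to be caught by it.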
Recital (111) expands on the compute threshold provision in Article 51.2:
According to the state of the art at the time of the Regulation, the cumulative amount of computation used for the training of the general purpose AI model measured in floating point operations is one of the relevant approximations for model capabilities. The cumulative amount of computation used for training includes the computation used across the activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning. Therefore, an initial threshold of 10²⁵ floating point operations should be set, which, if met by a general-purpose AI model, leads to a presumption that the model is a general-purpose AI model with systemic risks. This threshold should be adjusted over time to reflect technological and industrial changes, such as algorithmic improvements or increased hardware efficiency, and should be supplemented with benchmarks and indicators for model capability...Thresholds, as well as tools and benchmarks for the assessment of high impact capabilities, should be strong predictors of generality, its capabilities and associated systemic risk of general-purpose AI models, and could take into account the way the model will be placed on the market or the number of users it may affect.
Developers of general-purpose AI models are subject to several obligations. These include transparency requirements around how the model was trained and evaluated, and, for models with systemic risk, requirements to adversarially test for, assess, treat and document risks from the use of the model and to ensure adequate levels of cybersecurity protection.1
Why have compute thresholds as a governance mechanism?
In essence, the rationale for the mandated compute thresholds in the AI Act is as follows:
Where a model uses more than a certain amount of compute for training, it is presumed to have high impact capabilities.
A model with high impact capabilities is a model whose use may give rise to systemic risks.
Therefore, a model's compute consumption can be used as a proxy to measure the risk it may present to public health, safety, public security, fundamental rights, or the society as a whole.
This is the exact rationale that Hooker criticises in her paper. She recognises how laws like the AI Act assume that "greater compute equates with higher propensity for harm."2
This thinking is also perpetuated by developers of frontier genAI models. Both OpenAI and Anthropic, for example, have responsible scaling policies in place which assume that "scale is a key lever for estimating risk."3
Therefore, floating point operations, or FLOPs, have been used by regulators to determine when a particular model reaches a certain risk threshold, so as to justify imposing more onerous obligations on the developers of such models. However, Hooker argues in her paper that compute thresholds used as a governance mechanism for AI "are shortsighted and [are] likely to fail to mitigate risk."4
Is compute all you need?
I have written previously on how scaling laws currently dominate AI development. Under this doctrine, training really big models on massive training datasets is touted as the clearest path to optimal performance.
In some cases, as Hooker admits, adherence to such scaling laws has "provided persuasive gains in overall performance" for models. However, simply maximising compute "misses a critical shift that is underway in the relationship between compute and performance."5
In other words, bigger models do not always perform better. In fact, "there is no guarantee that larger models consistently outperform smaller models."6
As Hooker points out, there are several other factors that can influence the performance and capabilities of models:
Data quality. Models that are trained on better data need less compute. Research shows that better quality data mitigates the need for bigger models,7 which in turn reduces training time and the amount of compute needed (a back-of-the-envelope sketch of this trade-off follows this list).
Optimisation methods. Breakthroughs in certain optimisation methods have reduced the need for bigger models and more compute. Examples include extending context windows for LLMs, using retrieval augmented generation (RAG) and training models with reinforcement learning from human feedback (a breakthrough that helped build ChatGPT, for which see OpenAI's paper).
Architecture. Hooker points out that innovations in model architecture can "fundamentally change the relationship between compute and performance." Though deep neural networks provide "a huge step forward in performance" relative to previous AI models, they remain "very inefficient" which might mean that "the next significant gain in efficiency will require an entirely different architecture."8
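To put a rough number on the data-quality point flagged above, here is a back-of-the-envelope sketch using the same ~6 × parameters × tokens approximation. Every figure is hypothetical; the only claim is directional: if curation lets a smaller model trained on fewer tokens reach comparable quality (the assumption the data-pruning literature explores), the training-compute gap can be large.

```python
# Hypothetical comparison: a large model trained on raw web-scale data vs.
# a smaller model trained on pruned/curated data that is assumed (for
# illustration only) to reach comparable quality.

def training_flop(n_params: float, n_tokens: float) -> float:
    # Common ~6 * N * D approximation of training compute.
    return 6 * n_params * n_tokens

baseline = training_flop(70e9, 10e12)  # large model, raw data:      ~4.2e24 FLOP
curated = training_flop(30e9, 4e12)    # smaller model, curated data: ~7.2e23 FLOP

print(f"baseline: ~{baseline:.1e} FLOP")
print(f"curated:  ~{curated:.1e} FLOP")
print(f"compute reduction: ~{baseline / curated:.1f}x")  # ~5.8x
```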
FLOPs as a metric for compute
A floating point operation, or FLOP, is a single mathematical calculation involving floating-point numbers.9 Counting the total FLOPs used to train a model therefore gives a measure of how much computation it has consumed; this is distinct from FLOPS (floating point operations per second), which measures how quickly a processor can perform such calculations.
FLOP counts can be a useful way of measuring the compute consumption of models. They provide "a standardized way to compare across different hardware and software stacks", and "FLOP counts don't change across hardware."10
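As a concrete illustration of that hardware independence: the FLOP count of an operation depends only on the shapes of the tensors involved, not on the chip executing it. A minimal sketch for a single matrix multiplication, using the standard ~2·m·k·n counting convention:

```python
# FLOP count for multiplying an (m x k) matrix by a (k x n) matrix:
# each of the m*n output entries needs k multiplies and k adds, ~2*m*k*n FLOPs.
# The count is identical on a laptop CPU, a data-centre GPU or a TPU;
# only the speed (FLOPS, operations per second) differs across hardware.

def matmul_flop(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

print(matmul_flop(4096, 4096, 4096))  # 137438953472 FLOPs, wherever it runs
```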
However, using FLOPs to implement compute thresholds as a governance mechanism for AI has its drawbacks, as Hooker highlights:
Post-training performance. For some models, performance can be boosted by optimisation methods applied outside of training, which could be termed "inference-time compute". RAG, for example, has become a popular mechanism used with many models, and such inference-time techniques "contribute minimal or no FLOP."11
Tracking FLOP across model lifecycles. The AI Act does specify that the threshold it imposes includes "the computation used across the activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning." However, Hooker points out that, in practice, it can be difficult to track the FLOPs used at these different stages (a small sketch of such a lifecycle ledger follows this list).12
FLOP for models vs systems. The AI Act only focuses on FLOPs for AI models as opposed to AI systems. However, Hooker points out that, in reality, "impact and risk are rarely attributable to a single model but are a facet of the entire system a model sits in and the way it interacts with its environments."13
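Here is the small lifecycle-ledger sketch referred to above. The stage names follow the wording of the Act; the FLOP figures are entirely hypothetical and exist only to show what adding up pre-deployment compute, and what falls outside it, might look like:

```python
# Hypothetical FLOP ledger across a model's pre-deployment lifecycle.
# The Act's threshold covers these stages; inference-time methods such as
# RAG add capability to the deployed *system* while adding no training FLOP.

training_ledger = {
    "pre-training": 9.0e24,
    "synthetic data generation": 4.0e23,
    "fine-tuning": 1.5e23,
}

cumulative = sum(training_ledger.values())  # ~9.55e24
print(f"cumulative training FLOP: ~{cumulative:.2e}")
print(f"over the 1e25 threshold: {cumulative > 1e25}")

# Not captured by the ledger: retrieval, tool use, agent scaffolding and the
# other system-level components alongside which the model is deployed.
```

In practice, as Hooker notes, these stage-level figures may be produced by different parties at different times, which makes assembling even this simple ledger harder than the sketch suggests.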
Predicting risk with compute
Hooker's paper argues that laws like the AI Act fail to justify where they have set their thresholds. For Hooker, the thresholds included in the AI Act and other laws seem to be precautionary and do not appear to affect most of the models that currently exist:
There is not a clear justification for any of the compute thresholds proposed to date. Indeed, the choice of 10²⁵ and 10²⁶ rather than a number smaller or larger has not been justified in any of the policies implementing compute thresholds as a governance strategy. We do know that model scale amplifies certain risks – larger models tend to produce more toxic text and harmful associations and increases privacy risk because the propensity to memorize rare artifacts can increase the likelihood of data leakage. However, these relationships hold in compute settings far below 10²⁵ or 10²⁶ FLOP and are present in many models far smaller than the current threshold. What is striking about the choice of compute thresholds to date is that many are examples of precautionary policy – no models currently deployed in the wild fulfill the current criteria set by US Executive order. Only a handful of models will be impacted by the EU AI Act when it comes into effect. This implies that the emphasis is not on auditing the risks incurred by currently deployed models in the wild but rather is based upon the belief that future levels of compute will introduce unforeseen new risks that demand a higher level of scrutiny.14
Ultimately, compute does not always correlate with the emergent properties of AI models that may lead to the systemic risks referenced in the AI Act. By emergent properties, we are referring here to properties "that appear "suddenly" as the complexity of the system increases and cannot be predicted."15
Hooker notes that the relationship between compute and risk is often not clear:
Research has...increasingly found that many downstream capabilities display irregular scaling curves or non power-law scaling. For complex systems that require projecting into the future, small errors end up accumulating due to time step dependencies being modelled. This makes accurate predictions of when risks will emerge inherently hard, which is compounded by the small sample sizes often available for analysis.16
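To see why such extrapolation is fragile, here is a small synthetic example (all data points are invented purely for illustration): a power-law fit to small-scale observations can look excellent and still badly underestimate behaviour at larger scale if the true curve has a break or an abrupt jump.

```python
import numpy as np

# Invented, purely illustrative data: a capability metric that scales as a
# power law at small compute budgets but jumps abruptly at larger ones.
compute = np.array([1e20, 1e21, 1e22, 1e23, 1e24, 1e25])
observed = np.array([0.02, 0.04, 0.08, 0.16, 0.70, 0.90])  # jump after 1e23

# Fit a power law (a straight line in log-log space) to the four smallest points.
slope, intercept = np.polyfit(np.log10(compute[:4]), np.log10(observed[:4]), 1)
predicted = 10 ** (slope * np.log10(compute) + intercept)

for c, o, p in zip(compute, observed, predicted):
    print(f"compute {c:.0e}: observed {o:.2f}, power-law prediction {p:.2f}")

# The fit reproduces the first four points exactly but predicts ~0.32 and
# ~0.64 where the "observed" values jump to 0.70 and 0.90 -- predicting when
# a risk-relevant capability emerges is exactly this kind of out-of-sample call.
```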
Accordingly, Hooker recommends that, rather than having hard-coded compute thresholds, there should be dynamic thresholds:
A dynamic threshold for compute could focus auditing resources on the top 5-10 percentile of models ranked according to an index of metrics (consisting of more than compute) that serve as a proxy for risk.17
Hooker suggests that such dynamic thresholds would still capture current, as well as potential future, harms:
Using a percentile threshold based upon annual reporting would also ensure a guaranteed number of models with relatively higher estimated risk receive additional scrutiny every year. This would ensure that thresholds don’t become decorative and only applied to future models, but also apply to models currently deployed that are outliers relative to their peer group.18
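A minimal sketch of what such a percentile rule could look like, assuming a composite risk index already exists; the model names and scores below are entirely made up:

```python
import numpy as np

# Hypothetical risk-index scores for models reported in a given year. A
# dynamic rule audits the top slice of the distribution (here the top 10%)
# rather than everything above a fixed FLOP number, so some currently
# deployed models are always selected for scrutiny each year.
risk_index = {
    "model-a": 0.31, "model-b": 0.74, "model-c": 0.55, "model-d": 0.92,
    "model-e": 0.12, "model-f": 0.67, "model-g": 0.83, "model-h": 0.48,
    "model-i": 0.26, "model-j": 0.61,
}

scores = np.array(list(risk_index.values()))
cutoff = np.percentile(scores, 90)  # 90th percentile of this year's reports

flagged = [name for name, score in risk_index.items() if score >= cutoff]
print(f"audit cutoff (90th percentile): {cutoff:.2f}")
print(f"models flagged for additional scrutiny: {flagged}")
```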
Additionally, as implied throughout her paper, Hooker contends that compute should not be used as the only proxy for risk. This is because "we are not good at predicting what capabilities emerge with scaling and because the relationship is fundamentally changing between training compute and performance."19
Hooker's main recommendation is that, if compute is to be used as a metric for measuring the risk of AI models, such governance mechanisms should be grounded in scientific evidence:
Given the wide adoption of compute thresholds across governance structures, scientific support seems necessary in the same way precautionary policies that aim to prevent harm from climate change or policies to improve public health are justified after weighing the scientific evidence. Governments should invite technical reports from a variety of experts before adopting thresholds. If hard thresholds are chosen as part of national or international governance, they should be motivated by scientific consensus.20
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.1.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.3.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.1.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.6.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.6.
See for example Sorscher et al, 'Beyond neural scaling laws: beating power law scaling via data pruning' (2023).
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.8.
A floating point number is used to represent numbers with decimal points that a computer can interpret.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.9.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.10.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.11.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.11.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), pp.3-4.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.13.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), pp.13-14.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.16.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.16.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.16.
Sara Hooker, ‘On the Limitations of Compute Thresholds as a Governance Strategy.’ (2024), p.17.