<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Token for Token]]></title><description><![CDATA[language models and the future of AI]]></description><link>https://blog.jxmo.io</link><image><url>https://substackcdn.com/image/fetch/$s_!ySE4!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b4d4a83-97c1-455a-a39b-46563262fcdb_1024x1024.png</url><title>Token for Token</title><link>https://blog.jxmo.io</link></image><generator>Substack</generator><lastBuildDate>Tue, 28 Apr 2026 10:55:51 GMT</lastBuildDate><atom:link href="https://blog.jxmo.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Jack Morris]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[jxmnop@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[jxmnop@substack.com]]></itunes:email><itunes:name><![CDATA[Jack Morris]]></itunes:name></itunes:owner><itunes:author><![CDATA[Jack Morris]]></itunes:author><googleplay:owner><![CDATA[jxmnop@substack.com]]></googleplay:owner><googleplay:email><![CDATA[jxmnop@substack.com]]></googleplay:email><googleplay:author><![CDATA[Jack Morris]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How to train the best embedding model in the world]]></title><description><![CDATA[one PhD later, I'm giving my secrets away for free]]></description><link>https://blog.jxmo.io/p/how-to-train-the-best-embedding-model</link><guid isPermaLink="false">https://blog.jxmo.io/p/how-to-train-the-best-embedding-model</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:55:54 GMT</pubDate><enclosure 
url="https://substack-post-media.s3.amazonaws.com/public/images/26a2ad08-3602-4db8-8e7c-8cb5745a821e_706x713.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>An unexpected side effect of graduate school was that I became the best person in the world at training text embedding models. Midway through my PhD, I trained a <a href="https://huggingface.co/jxm/cde-small-v2">state-of-the-art model</a> (using a still-unbeaten method), and I had a clear plan for how to make state-of-the-art models much better. </p><p>Sadly, the method turned out to be too expensive for an academic lab. Eventually I stopped working on these models.</p><p>Since I never got around to executing my grand vision, I&#8217;m just going to give the knowledge away for free in this article. If you train embedding models for your job and want to make them better, you should use this method.</p><p><em>(Note: this post is more on the technical side. If you&#8217;re just here for my showerthoughts about AGI, you can skip it. But if you want to learn something, read on&#8230;)</em></p><h2><strong>A brief aside: how I became a world expert</strong></h2><p>This isn&#8217;t an exaggeration, by the way.</p><p><a href="https://arxiv.org/abs/2410.02525">Our model</a>, CDE, worked so well because of two different techniques we developed, both of which are somewhat difficult to implement. It beat all the other embedding models of its size in every category.</p><p>How did I become the world expert on this topic? First of all, it&#8217;s not that competitive. People don&#8217;t care about embedding models that much, and most people who do are in industry. OpenAI, Anthropic, Google, and Cursor all train internal embedding models but mostly don&#8217;t publish; when they do, <a href="https://arxiv.org/html/2503.07891v1">they mostly just change the data</a>. </p><p>Within academia, embedding models aren&#8217;t that popular, and there aren&#8217;t many new ideas. 
I think what really worked for me was having a supportive advisor who forced me to go deep. I didn&#8217;t just look at how people were training these models (boring); I spent my time understanding the fundamental underlying techniques like <a href="https://www.jmlr.org/papers/volume13/gutmann12a/gutmann12a.pdf">noise contrastive estimation</a> and <a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">softmax approximations</a>. </p><h1><strong>The big idea</strong></h1><p>Now that I&#8217;ve established such impressive credentials, let me explain the big idea. </p><h2><strong>It all comes back to scaling</strong></h2><p>Why haven&#8217;t we scaled embedding models? </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L9p3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L9p3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!L9p3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!L9p3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!L9p3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L9p3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132023,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/190410320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L9p3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!L9p3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!L9p3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!L9p3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe9d59db-b4e6-45c5-84fc-f6f368a73b63_1800x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The best LLMs get bigger every year, while embedding models haven't grown past the eight-billion parameter scale. 
(Thanks for the graph, chatGPT!)</figcaption></figure></div><p>Language models have seen extreme success from training longer on more data with more parameters. Embedding models haven&#8217;t seen similar returns on investment: in practice, they don&#8217;t improve much from training on larger datasets. It depends on the setting, but performance tends to plateau with training sets of around 10M document-document pairs. No one is training on billions of pairs (and if it worked, they surely would be).</p><p>So why don&#8217;t embedding models scale? First, let&#8217;s take a brief look at <em>contrastive learning</em>, the canonical training algorithm for these models.</p><h2><strong>How embedding models work</strong></h2><p>The training objective for embedding models often looks something like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = -\\frac{1}{N}\\sum_{i=1}^{N} \\log \\frac{\\exp \\left( f(q_i) \\cdot f(d_i) \\right) }{\\sum_{j=1}^{N} \\exp \\left( f(q_i) \\cdot f(d_j) \\right) }&quot;,&quot;id&quot;:&quot;LYWBLSUUZW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, f(q) is an embedding vector, and f(q) &#183; f(d) is the similarity between a document <em>d</em> and a query <em>q</em>. The effect is to pull query-document pairs close together in embedding space.</p><p>Critically, this loss is applied to each <em>batch</em> of d&#8217;s and q&#8217;s. To train an embedding model, we sample a batch, apply this loss within the batch, and iterate.</p><p>If you go back to the basics, you&#8217;ll realize that this is essentially a form of <em>sampled softmax</em>, an old technique people used when they didn&#8217;t have enough GPU memory to compute a full softmax. 
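</p><p><em>(A toy sketch of this in-batch objective, assuming unit-normalized embeddings, plain dot-product similarity, and no temperature term; the function name and shapes are illustrative, not taken from any particular paper:)</em></p>

```python
import numpy as np

def in_batch_loss(Q, D):
    """In-batch contrastive (sampled-softmax) loss.

    Q, D: (N, dim) arrays of query / document embeddings;
    row i of D is the labeled positive document for row i of Q.
    """
    sims = Q @ D.T                                  # (N, N) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)   # stabilize the softmax
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_softmax.diagonal().mean()           # positives sit on the diagonal

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 32))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = Q + 0.1 * rng.normal(size=(8, 32))              # positives near their queries
D /= np.linalg.norm(D, axis=1, keepdims=True)
print(in_batch_loss(Q, D))
```

<p><em>(In real training, f is a learned encoder and the batch comes from a dataloader; this only shows the shape of the loss.)</em></p><p>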
It&#8217;s gone out of fashion, but what we <em>really</em> want is to compute scores over everything in the entire dataset:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = -\\frac{1}{N}\\sum_{i=1}^{N} \\log \\frac{\\exp \\left( f(q_i) \\cdot f(d_i) \\right) }{\\sum_{j=1}^{N} \\exp \\left( f(q_i) \\cdot f(d_j) \\right) } \\approx -\\frac{1}{N}\\sum_{i=1}^{N} \\log \\frac{\\exp \\left( f(q_i) \\cdot f(d_i) \\right) }{\\sum_{d \\in \\mathcal{D}} \\exp \\left( f(q_i) \\cdot f(d) \\right)}&quot;,&quot;id&quot;:&quot;LQTLEOQVVT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Conventional embedding-model training optimizes the left-hand equation, which is a Monte-Carlo approximation of the right-hand quantity; the latter involves computing embedding similarities over the entire dataset.</p><p>(Note: Once, <a href="https://arxiv.org/abs/2210.11528">for another paper, I trained an embedding model where we <em>did</em> have perfect labels</a>, since it was matching up two views of the same person. In that case, I optimized the exact softmax via <a href="https://en.wikipedia.org/wiki/Coordinate_descent">coordinate ascent</a>, and it worked much better than the NCE loss.)</p><h2><strong>The data problem</strong></h2><p>Ideally we would compute the quantity on the right: for each query, we would compare its embedding to the embedding of the proper document <em>as well as all other documents in the dataset</em>. If this worked perfectly, we could scale ad nauseam and produce better-and-better embedding models.</p><p>The reason naive scaling doesn&#8217;t work is that as we make the softmax less approximate, comparing each query against more of the dataset, we also introduce label noise. 
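</p><p><em>(To see the approximation concretely: the in-batch denominator is a Monte-Carlo estimate of the full-dataset denominator. A toy comparison, with random unit vectors standing in for a trained model's embeddings:)</em></p>

```python
import numpy as np

rng = np.random.default_rng(1)
num_docs, dim, batch_size = 10_000, 16, 64

# Random unit vectors stand in for document embeddings.
docs = rng.normal(size=(num_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0] + 0.2 * rng.normal(size=dim)   # a query near document 0
query /= np.linalg.norm(query)

# Exact softmax denominator: a sum over every document in the dataset.
full_denominator = np.exp(query @ docs.T).sum()

# Sampled (in-batch) version: average over one batch, scaled up to the corpus.
batch = rng.choice(num_docs, size=batch_size, replace=False)
mc_denominator = np.exp(query @ docs[batch].T).mean() * num_docs

print(full_denominator, mc_denominator)
```

<p><em>(The batch estimate is close on average but noisy; the false-negative problem described next is worse, because it corrupts the labels themselves rather than just adding sampling variance.)</em></p><p>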
There are many <a href="https://arxiv.org/abs/2010.08191">papers</a> <a href="https://arxiv.org/abs/2402.16829">about</a> <a href="https://arxiv.org/abs/2301.02280">this</a> in both vision and text.</p><p>Let&#8217;s take an example query from the popular <a href="https://arxiv.org/abs/1611.09268">MS MARCO search dataset</a>:</p><p><em>was ronald reagan a democrat?</em></p><p>This query corresponds to multiple labeled documents from Wikipedia that answer the question. And it&#8217;s clear that most documents on the web do not answer it.</p><p>But as you scale the number of documents, you&#8217;re likely to run into more and more &#8220;collisions&#8221;: documents that also answer the question. In the exact softmax above, this introduces noise into the gradient, since some of the terms in the denominator are incorrect. </p><p>In other words: say we scale our data and accidentally include a new document that mentions Ronald Reagan&#8217;s political affiliation. This document will automatically be marked as a non-answer to the pre-existing query, and we will incorrectly encourage the model to push the embedding of the new document away from the embedding of <em>was ronald reagan a democrat?</em></p><p>Over time, this type of noise compounds, and scaling embedding data can actually <em>hurt</em>.</p><p>Incidentally, this is the exact problem tackled in many recent retrieval papers; in our CDE work, we were only able to make contextual batching work because we aggressively filtered these fake negatives within the batch. 
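</p><p><em>(A crude version of that in-batch filtering might look like the following. This is an illustrative sketch, not the exact CDE implementation; the similarity-margin heuristic and its value are assumptions:)</em></p>

```python
import numpy as np

def filtered_in_batch_loss(Q, D, margin=0.05):
    """In-batch contrastive loss that masks likely false negatives.

    Any off-diagonal document whose similarity to a query comes within
    `margin` of that query's labeled positive is treated as a probable
    duplicate answer and dropped from the softmax denominator.
    """
    sims = Q @ D.T
    positives = sims.diagonal()[:, None]
    suspect = (sims >= positives - margin) & ~np.eye(len(Q), dtype=bool)
    masked = np.where(suspect, -np.inf, sims)       # exp(-inf) = 0 removes them
    z = masked - masked.max(axis=1, keepdims=True)  # diagonal stays finite
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_softmax.diagonal().mean()

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 16))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = Q + 0.05 * rng.normal(size=(6, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
D[3] = D[0]   # a "collision": document 3 also answers query 0
print(filtered_in_batch_loss(Q, D))
```

<p><em>(With the duplicate masked, query 0 is no longer pushed away from a document that answers it.)</em></p><p>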
</p><p>In that setting, we observed nearly a 10% improvement by implementing a very crude embedding-based filtering mechanism:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G3rv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G3rv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 424w, https://substackcdn.com/image/fetch/$s_!G3rv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 848w, https://substackcdn.com/image/fetch/$s_!G3rv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!G3rv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G3rv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg" width="1200" height="734" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/190410320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G3rv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 424w, https://substackcdn.com/image/fetch/$s_!G3rv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 848w, https://substackcdn.com/image/fetch/$s_!G3rv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!G3rv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ab51f-115d-414a-9891-3e4cee6f40dd_1200x734.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">In our CDE work, training with filtering gave us nearly a 10% improvement (compare left graph to right graph scale).</figcaption></figure></div><h2><strong>LLMs to the rescue</strong></h2><p>The difficult realization here is that for D training query-document pairs, we technically need D^2 labels to compute the exact softmax. This is an intractable number of labels for any human labeling pipeline: for example, for a billion-pair corpus, we&#8217;d need 10^18 <strong>&#8776; one quintillion human labels</strong> to deduplicate the entire corpus. 
(We could reduce this number with some filtering + reranking, but this relies on another retrieval model, which perhaps defeats the purpose...)</p><p>Luckily, we can exploit a property that&#8217;s become central to RL training, known as <a href="https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law">verification asymmetry</a> or the Generator-Verifier gap. The same idea applies here: <strong>regular LLMs are much, much smarter than embedding models</strong> (and more expensive!). If we could train an embedding model that agrees with the decision of an LLM for every possible query-document pair, that would be an enormous accomplishment.</p><p>So let&#8217;s forget human labels. We can use LLMs to label this kind of massive dataset. This way we can end up with a nearly <em>perfect</em> deduplicated dataset for training embedding models. (Using LLMs on every query-document pair also opens the door to training even better models using <a href="https://hturner.github.io/PlackettLuce/">more precise ranking algorithms</a>...)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nyCf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nyCf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nyCf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!nyCf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nyCf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nyCf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg" width="1199" height="629" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/190410320?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nyCf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!nyCf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nyCf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nyCf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e989e09-e1aa-4b40-aec2-be8b232b1052_1199x629.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Every experiment I ran looks something like this: embedding model training scales, but only when you properly filter for hard negatives.</figcaption></figure></div><h2><strong>The full procedure</strong></h2><p>I&#8217;d be remiss if I didn&#8217;t end by explaining <em>exactly</em> how to train this model. Here&#8217;s a short recipe for training a state-of-the-art embedding model, guaranteed to beat any existing embedder:</p><ul><li><p>Gather all the text pairs you can find. <a href="https://huggingface.co/datasets/nomic-ai/nomic-embed-unsupervised-data">This data</a> is a good start; recent work has also generated large amounts of synthetic data.</p></li><li><p>For each query Q, filter to the top k candidates using BM25 and embeddings, obtaining a set of K documents {D_1, ..., D_K} relevant to Q. (This is just a practical step to make labeling tractable.)</p></li><li><p>Run each pair (Q, D_k) through an LLM to figure out whether D_k is truly a negative example for Q or not.</p></li><li><p>Output: a perfectly-labeled retrieval dataset that&#8217;s arbitrarily scalable.</p></li><li><p>Train the model on it using the exact softmax and coordinate ascent.</p></li></ul><p>If you have the money to run this &#8212; let me know how it goes! Happy to talk over DM or email.</p>]]></content:encoded></item><item><title><![CDATA['AI' just means LLMs now]]></title><description><![CDATA[There used to be many possible candidates for how humans might build intelligent systems. 
Now there's only one.]]></description><link>https://blog.jxmo.io/p/ai-just-means-llms-now-superintelligence</link><guid isPermaLink="false">https://blog.jxmo.io/p/ai-just-means-llms-now-superintelligence</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Sat, 02 Aug 2025 16:57:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QsG5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Since the industrial revolution, humans have dreamed of building intelligent machines.</p><p>In the 1800s, Charles Babbage and Ada Lovelace designed <a href="https://en.wikipedia.org/wiki/Analytical_engine">the first general-purpose computer</a> and wrote the first ever computer algorithm. 
They also speculated about whether computers might be <em>creative</em> one day and compose their own music or art.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QsG5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QsG5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QsG5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QsG5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QsG5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QsG5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png" width="456" height="304.1043956043956" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:456,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QsG5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!QsG5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!QsG5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!QsG5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478342f-3577-480b-98e3-f43ac01a1579_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Fast forward to the twentieth century, and lots of people started thinking about what it could look like to build intelligence artificially. The most famous early such thinker was Alan Turing, the father of the Turing Test and a pioneer of computability theory.</p><p>Even as AI spent several decades as a marginalized topic for scientific study, the idea of AI as science fiction blossomed. It&#8217;s since played a central role in movies (2001: A Space Odyssey, Blade Runner, The Terminator, The Matrix, Ex Machina) and TV (Star Trek, Knight Rider, The Twilight Zone, Westworld, Black Mirror). It&#8217;s safe to assume that anyone who&#8217;s watched Western media over the last fifty years knows what AI is.</p><h3>The Form of Superintelligence</h3><p>And through science fiction we&#8217;ve seen AI take many forms. I don&#8217;t think there&#8217;s a definition of <em>exactly</em> what an AI should look like. Perhaps in defining AI we can take a cue from the 1964 U.S.
Supreme Court: &#8220;we&#8217;ll know it when we see it.&#8221;</p><p>Things get even murkier when we consider superintelligence. Are all these AIs <em>super</em>intelligent&#8211;smarter than humans? C-3PO certainly was; that guy knew everything. But many media portrayals depict AI as human-level, not smarter.</p><p>For the purpose of this series of posts, I&#8217;m going to define &#8216;superintelligence&#8217; as whatever we saw in Iron Man (Jarvis) and Star Wars (C-3PO):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-Wr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-Wr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 424w, https://substackcdn.com/image/fetch/$s_!6-Wr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 848w, https://substackcdn.com/image/fetch/$s_!6-Wr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 1272w, https://substackcdn.com/image/fetch/$s_!6-Wr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!6-Wr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png" width="1454" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1454,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1134210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/169856309?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6-Wr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 424w, https://substackcdn.com/image/fetch/$s_!6-Wr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 848w, https://substackcdn.com/image/fetch/$s_!6-Wr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 1272w, https://substackcdn.com/image/fetch/$s_!6-Wr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed52f7e-4190-40dd-ac1b-385486db3e43_1454x476.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>So we&#8217;ll take a reductive position: superintelligence is simply a helpful machine that&#8217;s as capable as a human and more knowledgeable.</p><p>Do we have this yet? Unfortunately, no. The newest, smartest models from OpenAI are very good&#8211;they can code like professional programmers and have achieved a gold medal on the International Math Olympiad&#8211;but they&#8217;re not yet considered human-level in many areas.
(This skill profile is what some have described as <em>jagged intelligence</em>.)</p><h3>Biological Candidates for Superintelligence</h3><p>For a long time we didn&#8217;t know how to build JARVIS. We had no idea what superintelligence would look like. Many suspected that some kind of connectionism might be the path, but didn&#8217;t know how to scale it.</p><p>In fact, lots of early AI research was inspired by the human brain. The most popular arguments, such as <a href="https://nickbostrom.com/superintelligence">Nick Bostrom&#8217;s original arguments for superintelligence</a>, all lean on the fact that the human brain is an existence proof that (non-super) intelligence can be built. So you can make arguments like this one:</p><blockquote><p>The human brain contains about 10^11 neurons. Each neuron has about 5*10^3 synapses, and signals are transmitted along these synapses at an average frequency of about 10^2 Hz. Each signal contains, say, 5 bits. This equals 10^17 ops. (&#8230;)</p></blockquote><p>and its ultimate conclusion:</p><blockquote><p>Depending on degree of optimization assumed, human-level intelligence probably requires between 10^14 and 10^17 ops.</p></blockquote><p>Although I deeply admire the author&#8217;s conviction (and prescience, since these beliefs were published in 1998), it seems flawed to equate the operations performed by a human brain with simple &#8216;ops&#8217; (such as floating-point operations) that happen inside a computer. (I think this is the basic point that <a href="https://en.wikipedia.org/wiki/The_Emperor%27s_New_Mind">Roger Penrose made</a> when he hypothesized that quantum mechanics plays an important role in the development of human consciousness inside the brain.)</p><p>This is all worsened by the fact that the systems that ended up getting us closest to superintelligence are called <em>neural networks</em>, which is simultaneously an excellent name and a dreadful misnomer.
They&#8217;re systems of interconnected, dynamically updatable components, but they bear only a loose resemblance to biological neurons.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N2_X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N2_X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 424w, https://substackcdn.com/image/fetch/$s_!N2_X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 848w, https://substackcdn.com/image/fetch/$s_!N2_X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!N2_X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N2_X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png" width="1456" height="1040"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1980908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/169856309?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N2_X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 424w, https://substackcdn.com/image/fetch/$s_!N2_X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 848w, https://substackcdn.com/image/fetch/$s_!N2_X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!N2_X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b59ee1-9537-4fb0-ac82-3840f6ed76fd_1484x1060.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Anyway, we spent a long time making these types of biological analogies, and they didn&#8217;t get us far. What worked was at best very loosely inspired by the functionality of the human brain: machine learning and neural networks.</p><h3>Digital Candidates for Superintelligence</h3><p>Since 2012 (when <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet was introduced</a>) every year has gotten us closer to building Jarvis or C-3PO. We developed better machine learning and neural networks, got good at training them to mimic human text, and now we&#8217;re making steady progress on incentivizing them to grow smarter than humans in some areas. This is really exceptional progress.</p><p>As mentioned, the best systems that we have right now are already very useful. They know how to code, give directions, and write recipes.
They&#8217;re decent therapists and life coaches. They&#8217;re coming around in the creativity department, too, and writing better prose and poetry each year.</p><p>One might expect, then, that we as a field have <em>several</em> candidate systems for superintelligent AI. Perhaps we have a really good video simulator, a few companies have embodied continually-learning robots, and there&#8217;s a speech system out there that also exhibits Jarvis-like capabilities.</p><p>But this turns out not to be the case. There is only one existing technology that&#8217;s close to Jarvis: large language models. We haven&#8217;t built models that get smarter by exploring the world or watching lots of movies. We&#8217;ve only built models that get smarter by reading lots of text (say, all the words on the Internet) and then marginally smarter after doing a few thousand math problems.</p><p>All we have is language models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.jxmo.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Token for Token! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[All AI Models Might Be The Same]]></title><description><![CDATA[what can language model embeddings tell us about whales speech, and decoding ancient texts?&#160; &#160; &#160; (on The Platonic Representation Hypothesis and the idea of *universality* in AI models)]]></description><link>https://blog.jxmo.io/p/there-is-only-one-model</link><guid isPermaLink="false">https://blog.jxmo.io/p/there-is-only-one-model</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Thu, 17 Jul 2025 17:17:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/08727245-d2dc-4d17-84d9-258d335e27bb_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h3NX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h3NX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 424w, 
https://substackcdn.com/image/fetch/$s_!h3NX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 848w, https://substackcdn.com/image/fetch/$s_!h3NX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 1272w, https://substackcdn.com/image/fetch/$s_!h3NX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h3NX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png" width="487" height="432.3076171875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:909,&quot;width&quot;:1024,&quot;resizeWidth&quot;:487,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Can We Learn To Talk With Whales? Introducing Project CETI - InDEPTH&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Can We Learn To Talk With Whales? Introducing Project CETI - InDEPTH" title="Can We Learn To Talk With Whales? 
Introducing Project CETI - InDEPTH" srcset="https://substackcdn.com/image/fetch/$s_!h3NX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 424w, https://substackcdn.com/image/fetch/$s_!h3NX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 848w, https://substackcdn.com/image/fetch/$s_!h3NX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 1272w, https://substackcdn.com/image/fetch/$s_!h3NX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33aec54-92e7-410c-a964-1464aa5675d8_1024x909.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><a href="https://www.projectceti.org/">Project CETI</a> is a large-scale effort to decode whale speech. If AI models do learn a universal language, we might be able to use it to talk to whales.</em></figcaption></figure></div><p>Growing up, I sometimes played a game with my friends called &#8220;Mussolini or Bread.&#8221;</p><p>It&#8217;s a guessing game, kind of like Twenty Questions. The funny name comes from the idea that, in the space of everything, &#8216;Mussolini&#8217; and &#8216;bread&#8217; are about as far away from each other as you can get.</p><p>One round might go like this:</p><ul><li><p>Is it closer to Mussolini or bread? <em>Mussolini.</em></p></li><li><p>Is it closer to Mussolini or David Beckham?
<em>Uhh, I guess Mussolini.</em> (Ok, they&#8217;re definitely thinking of a person.)</p></li><li><p>Is it closer to Mussolini or Bill Clinton? <em>Bill Clinton</em>.</p></li><li><p>Is it closer to Bill Clinton or Pel&#233;? <em>Bill Clinton, I think</em>.</p></li><li><p>Is it closer to Bill Clinton or Grace Hopper? <em>Grace Hopper</em>.</p></li><li><p>Is it closer to Grace Hopper or Richard Hamming? <em>Richard Hamming</em>.</p></li><li><p>Is it closer to Richard Hamming or Claude Shannon? <em>You got it, I was thinking of Claude Shannon.</em></p></li></ul><p>Hopefully you get the point. By successively narrowing down the space of possible things or people, we&#8217;re able to guess almost anything.</p><p>How is this game possible? Mussolini or Bread only works because you and I have a shared sense of semantics. Before we played this game, we never talked about whether Claude Shannon is semantically &#8216;closer&#8217; to Mussolini or Beckham. We never even talked about what it means for two things to be &#8216;close&#8217;, or agreed on the rules of the game.</p><p>As you might imagine, the edge cases in M or B can be controversial. But I&#8217;ve played this game with many people, and they tend to &#8220;just get it&#8221; on their first try. How is that possible?</p><h3>A universal sense of semantics</h3><p>One explanation for why this game works is that <em>there is only one way in which things are related</em>, and this comes from the underlying world we live in. Put another way, our brains build up complicated models of the world, and the model my brain relies on is very similar to the one in yours. In fact, our brains&#8217; models of the world are so similar that we can narrow down almost any concept by successively refining the questions we ask, &#224; la Mussolini or Bread.</p><p>Let&#8217;s try to explain this through the lens of compression.
One perspective on AI is that we&#8217;re just learning to compress all the data in the world. In fact, the task of language modeling (predicting the next word) can be seen as a compression task, ever since <a href="https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem">Shannon&#8217;s source coding theorem</a> formalized the relationship between probability distributions and compression algorithms.</p><p>In recent years, we&#8217;ve developed much more accurate probability distributions of the world; this turned out to be easy, since <a href="https://arxiv.org/abs/2001.08361">bigger and bigger language models</a> give us <a href="https://arxiv.org/abs/1712.00409">better and better probability distributions</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!APsC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!APsC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 424w, https://substackcdn.com/image/fetch/$s_!APsC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 848w, https://substackcdn.com/image/fetch/$s_!APsC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 1272w, 
https://substackcdn.com/image/fetch/$s_!APsC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!APsC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png" width="294" height="234.80090497737555" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:706,&quot;width&quot;:884,&quot;resizeWidth&quot;:294,&quot;bytes&quot;:85548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168573586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!APsC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 424w, https://substackcdn.com/image/fetch/$s_!APsC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 848w, https://substackcdn.com/image/fetch/$s_!APsC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 
1272w, https://substackcdn.com/image/fetch/$s_!APsC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69374442-353c-4f39-9ec7-270034bb6b2b_884x706.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Intelligence is compression, and compression follows scaling laws. I like reminding people that the original work on scaling laws <a href="https://arxiv.org/abs/1712.00409">came from Baidu in 2017</a>.</figcaption></figure></div><p>And with better probability distributions comes better compression. In practice, we find that a model that can compress real data better knows more about the world. And thus there is a duality between compression and intelligence. Compression <em>is</em> intelligence. Some have even said <a href="https://www.youtube.com/watch?v=dO4TPJkeaaU">compression may be the way to AGI</a>. Ilya gave a <a href="https://www.lesswrong.com/posts/KqgujtM3vSAfZE2dR/on-ilya-sutskever-s-a-theory-of-unsupervised-learning#Kolmogorov_Complexity__20_48_">famously incomprehensible talk</a> about the connections between intelligence and compression.</p><p>Last year some folks at DeepMind wrote a paper simply titled <a href="https://arxiv.org/abs/2309.10668">Language Modeling Is Compression</a> and actually tested different language models&#8217; ability to compress various data modalities. Across the board, they found that smarter language models are better compressors. (Of course, this is what we&#8217;d expect, given the source coding theorem.)</p><p>And learning to compress is exactly how models end up generalizing. 
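</p><p>To make the source-coding link concrete, here&#8217;s a toy sketch (my own illustration, not taken from any of the papers above): an ideal code spends about -log2(p) bits on a symbol the model assigns probability p, so a model with a more accurate probability distribution compresses the same text into fewer bits.</p>

```python
import math
from collections import Counter

def code_length_bits(text, probs):
    # Shannon: a symbol with probability p costs -log2(p) bits in an ideal code.
    return sum(-math.log2(probs[ch]) for ch in text)

text = "the theory of communication"

# Model A: uniform over the characters that appear (knows nothing about English).
alphabet = set(text)
uniform = {ch: 1 / len(alphabet) for ch in alphabet}

# Model B: empirical character frequencies (a slightly better "language model").
counts = Counter(text)
empirical = {ch: counts[ch] / len(text) for ch in counts}

# The more accurate distribution yields a shorter code for the same data.
assert code_length_bits(text, empirical) < code_length_bits(text, uniform)
```

<p>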
Some of <a href="https://arxiv.org/abs/2505.24832">our recent work</a> has analyzed models&#8217; compression behavior in the limit of training: we train models for infinitely long on datasets of varying size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OJir!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5536a11-1f7f-45f9-96c6-01027a2c228a_1992x1016.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!OJir!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5536a11-1f7f-45f9-96c6-01027a2c228a_1992x1016.png" width="1456" height="743" class="sizing-normal" alt="" loading="lazy"></div></a><figcaption class="image-caption">Figures from our recent work, <em><a href="https://arxiv.org/abs/2505.24832">How much can language models memorize?</a></em> Generalization only begins when compression is no longer possible, since the model can&#8217;t store data points separately and is forced to combine them.</figcaption></figure></div><p>When a model can fit the training dataset perfectly (the left side of both graphs), it memorizes the data extremely well and totally fails to generalize. But when the dataset gets too big and the model can no longer fit all of the data in its parameters, it is forced to &#8220;combine&#8221; information from multiple datapoints in order to achieve the best training loss. 
This is where generalization occurs.</p><p>And the central idea I&#8217;ll push here is that when generalization occurs, <em>it usually occurs in the same way</em>, even across different models. From the compression perspective, for a given architecture with a fixed number of parameters, <em>there is only one way to compress the data well</em>. This sounds like a crazy idea (and it is), but across different domains and models, there turns out to be a lot of evidence for this phenomenon.</p><h3>The Platonic Representation Hypothesis</h3><p>So how can different models learn shared representations? Given <a href="https://transformer-circuits.pub/2023/privileged-basis/index.html">the massive number of ~equivalent ways in which a model can represent things</a>, why should two models ever converge to analogous representations?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D0oH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3753a05b-fda0-4084-8653-18ab0cba898a_680x936.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!D0oH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3753a05b-fda0-4084-8653-18ab0cba898a_680x936.png" width="338" height="465" class="sizing-normal" alt="" loading="lazy"></div></a><figcaption class="image-caption">A terse description and illustration of the headlining theory from <a href="https://arxiv.org/abs/2405.07987">The Platonic Representation Hypothesis</a> (2024).</figcaption></figure></div><p>Remember: what these models are really doing is modeling the relationships between things in the world. 
In some sense there&#8217;s only one correct way to model things, and that&#8217;s the <em>true</em> model, the one that perfectly reflects the reality in which we live. Perhaps an infinitely large model with infinite training data would be a perfect simulator of the world itself.</p><p>As models have gotten bigger, their similarities have become more apparent. The theory that models are converging to a shared underlying representation space was formalized in <strong>The Platonic Representation Hypothesis,</strong> <a href="https://arxiv.org/abs/2405.07987">a position paper written by a group of MIT researchers in 2024</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ID7w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ff78e39-5852-4c94-a1d5-7d7a5e1e4091_1496x856.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!ID7w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ff78e39-5852-4c94-a1d5-7d7a5e1e4091_1496x856.png" width="1456" height="833" class="sizing-normal" alt="" loading="lazy"></div></a><figcaption class="image-caption">The Platonic Representation Hypothesis argues that as models get bigger, they&#8217;re learning more and more of the same features. They provide evidence for this in vision and language.</figcaption></figure></div><p>The Platonic Representation Hypothesis argues that models are converging to a shared representation space, and that this is becoming more true as we make models bigger and smarter. This holds in vision and language, at a minimum.</p><p>Remember that <a href="https://situational-awareness.ai/">the trends in scaling</a> show models getting bigger, smarter, and more efficient every year. 
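The PRH paper measures this kind of convergence with a mutual nearest-neighbor metric: embed the same inputs with two models and ask how often their neighborhoods agree. Here is a simplified numpy sketch of that idea; the synthetic &#8220;models&#8221; below are just random linear views of a shared latent space, purely for illustration:

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest neighbors (excluding itself), by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def mutual_knn_alignment(A, B, k=5):
    """Average overlap between the k-NN sets of the same n inputs under two
    models. 1.0 = identical neighborhood structure; ~k/n = chance level."""
    overlaps = [len(set(a) & set(b)) / k
                for a, b in zip(knn_indices(A, k), knn_indices(B, k))]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 16))        # shared "reality" underlying both models
A = Z @ rng.normal(size=(16, 32))     # model A: one random view of Z
B = Z @ rng.normal(size=(16, 32))     # model B: a different random view of Z
C = rng.normal(size=(100, 32))        # an unrelated model

print(mutual_knn_alignment(A, B))     # well above chance
print(mutual_knn_alignment(A, C))     # near chance (~k/n)
```

Two models that are different views of the same structure align far better than chance, even though their coordinate systems are completely different; that is the flavor of evidence the PRH paper reports at scale.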
That means we can expect models to get <em>more similar</em>, too, as the years go on.</p><h3>A brief aside on embedding inversion</h3><p>The evidence for the Platonic Representation Hypothesis is compelling. But is it useful? Before I explain how to take advantage of the PRH, I have to give a bit of background on the problem of <a href="https://arxiv.org/abs/2310.06816">embedding inversion</a>.</p><p>I worked on this problem for a year or so of my PhD: given a representation vector from a neural network, can we infer what text was input to the network?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I8Gv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2aee165a-db0d-4c32-a3d1-5ce124be1b76_1722x998.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!I8Gv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2aee165a-db0d-4c32-a3d1-5ce124be1b76_1722x998.png" width="1456" height="844" class="sizing-normal" alt="" loading="lazy"></div></a><figcaption class="image-caption">Visualization of a network that reconstructs images astonishingly well given only the 1000 class probability predictions from an ImageNet classifier (from <em><a href="https://arxiv.org/abs/2103.07470">Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers</a></em>).</figcaption></figure></div><p>We thought inversion should be possible because <a href="https://arxiv.org/abs/2103.07470">results on ImageNet</a> showed that images could be reconstructed very effectively given only a model&#8217;s output of 1000 class probabilities. 
This is extremely unintuitive. Apparently knowing that an image is 0.0001% parakeet and 0.0017% baboon is enough to infer not only the true class but lots of seemingly irrelevant information like facial structure, pose, and background details.</p><p>In the realm of text, the problem looks easy on its face: typical embedding vectors contain ~1000 32-bit floating-point numbers, or around 4 KB of data, and 4 KB of text can represent quite a lot. Since we were working with datapoints on the level of long sentences or short documents, it seemed reasonable that we would be able to do inversion quite well.</p><p>But it turns out to be really hard. This is mostly because embeddings are, in a sense, extremely compressed: since similar texts have similar embeddings, it becomes very difficult to distinguish between two embeddings that represent similar-but-different data. So our models could output text whose embedding was <em>close</em> to the target, but almost never the exactly-correct text.</p><p>We ended up getting around this problem with a primitive form of test-time compute: we made many queries to the embedding model and built a system that could &#8220;narrow down&#8221; the true text by iteratively improving itself in embedding space. 
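In caricature, the loop is: guess some text, re-embed it, compare to the target, and keep edits that move you closer. Here is a deliberately silly runnable version of that loop, with a bag-of-characters &#8220;embedder&#8221; standing in for a neural encoder and random single-character edits standing in for our learned corrector:

```python
import random
import numpy as np

ALPHABET = "abcdefgh "

def embed(text):
    """Toy stand-in for a text encoder: a bag-of-characters count vector.
    (The real system queries a neural embedding model here.)"""
    return np.array([text.count(c) for c in ALPHABET], dtype=float)

def invert(target_emb, length, steps=20000, seed=0):
    """Hill-climb in text space: propose single-character edits and keep any
    edit whose re-embedding moves closer to the target embedding."""
    rng = random.Random(seed)
    guess = "".join(rng.choice(ALPHABET) for _ in range(length))
    best = np.linalg.norm(embed(guess) - target_emb)
    for _ in range(steps):
        i = rng.randrange(length)
        cand = guess[:i] + rng.choice(ALPHABET) + guess[i + 1:]
        dist = np.linalg.norm(embed(cand) - target_emb)
        if dist < best:          # the embedder's feedback guides the search
            guess, best = cand, dist
        if best == 0:
            break
    return guess

secret = "a bad cabbage"
recovered = invert(embed(secret), len(secret))
# With this toy embedder we only recover *a* string with the target embedding
# (same character counts); the real method must pin down the exact text.
print(recovered)
```

The hard part in practice is exactly what this toy hides: a real embedding space is dense with similar-but-different texts, so the corrector has to be a trained model rather than random edits.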
Our system looks kind of like a learned optimizer that takes text-based steps to move toward a target position in embedding space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H7Gz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b1066d-c14f-4fdf-86e4-9c8c40c2a616_1834x828.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!H7Gz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b1066d-c14f-4fdf-86e4-9c8c40c2a616_1834x828.png" width="1456" height="657" class="sizing-normal" alt="" loading="lazy"></div></a><figcaption class="image-caption">Iterative refinement is an extremely effective method for embedding inversion (<a href="https://thegradient.pub/text-embedding-inversion/">read more here</a>).</figcaption></figure></div><p>This new approach turned out to work very well. Given an embedding model, we were able to invert text at the level of a long sentence with 94% exact accuracy.</p><h3>Harnessing Plato for embedding inversion</h3><p>We were very pleased with ourselves after making that method work. It had serious implications for the security model of the <a href="https://en.wikipedia.org/wiki/Vector_database">vector database</a>: sharing vectors, apparently, is equivalent to sharing the text those vectors represent.</p><p>But unfortunately our method was embedding-specific. 
It wasn&#8217;t clear that it could transfer to future embedding models or private fine-tunes that we didn&#8217;t have access to. And it required making a lot of queries to the target embedding model: training our inversion models took millions of embeddings.</p><p>We thought this shouldn&#8217;t have to be the case. If the Platonic Representation Hypothesis is true, and different models are (in some sense) learning the same thing, we should be able to build one <em>universal</em> embedding inverter and use it on any model. This idea set us off on a multi-year quest to &#8220;harness&#8221; the PRH and build a universal embedding inverter.</p><p>We started by expressing our problem mathematically: given a bunch of embeddings from model A and a bunch of embeddings from model B, can we learn to map from A&#8594;B (or B&#8594;A)?</p><p>Importantly, we don&#8217;t have any <em>correspondence</em>, i.e. pairs of texts with representations in both A and B. That&#8217;s why this problem is hard. 
We want to learn to align the spaces of A and B in some way so that we can &#8216;magically&#8217; learn how to convert between their spaces.</p><p>We realized after a while that this problem has been solved at least once in the deep learning world: work on a model called CycleGAN proposed a way to translate between spaces without correspondence using a method called cycle consistency:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2EH7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2EH7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 424w, https://substackcdn.com/image/fetch/$s_!2EH7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 848w, https://substackcdn.com/image/fetch/$s_!2EH7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 1272w, https://substackcdn.com/image/fetch/$s_!2EH7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2EH7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png" width="661" 
height="318.24244505494505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:661,&quot;bytes&quot;:2387580,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168573586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2EH7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 424w, https://substackcdn.com/image/fetch/$s_!2EH7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 848w, https://substackcdn.com/image/fetch/$s_!2EH7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 1272w, https://substackcdn.com/image/fetch/$s_!2EH7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62dab6f3-1bfc-4f2f-b199-e5f3e2443866_1866x898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Unpaired image translations from <em><a href="https://arxiv.org/abs/1703.10593">Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks</a> </em>(2017). If you squint, you might see extremely preliminary evidence for the Platonic Representation Hypothesis.</figcaption></figure></div><p>Just imagine that the horses and zebras above are a piece of text from model A being translated into the space of model B and back. If this works for zebras and horses, why shouldn&#8217;t it work for text?</p><p>And, after at least a year of ruthlessly debugging our own embedding-specific version of CycleGAN, we started to see signs of life. 
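The cycle-consistency idea is easy to state concretely. Below is a minimal, hypothetical numpy sketch (not our actual vec2vec code, which also uses adversarial losses and deeper networks): learn a map F from space A to space B and a map G back, and penalize round trips that don't return to where they started.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 64

# Unpaired embeddings from two hypothetical models A and B (no correspondence assumed)
A = rng.normal(size=(n, d))
B = rng.normal(size=(n, d))

# Linear translators F: A -> B and G: B -> A; real systems use MLPs plus GAN losses
F = rng.normal(scale=0.1, size=(d, d))
G = rng.normal(scale=0.1, size=(d, d))

def cycle_loss(F, G):
    """How far embeddings drift after a full round trip through the other space."""
    loss_a = np.mean((A @ F @ G - A) ** 2)  # A -> B -> A
    loss_b = np.mean((B @ G @ F - B) ** 2)  # B -> A -> B
    return loss_a + loss_b

init_loss = cycle_loss(F, G)
lr = 0.05
for _ in range(2000):
    Ra = A @ F @ G - A  # residual of the A -> B -> A round trip
    Rb = B @ G @ F - B  # residual of the B -> A -> B round trip
    # Analytic gradients of cycle_loss with respect to F and G
    grad_F = 2 * (A.T @ Ra @ G.T) / A.size + 2 * (G.T @ B.T @ Rb) / B.size
    grad_G = 2 * (F.T @ A.T @ Ra) / A.size + 2 * (B.T @ Rb @ F.T) / B.size
    F -= lr * grad_F
    G -= lr * grad_G

final_loss = cycle_loss(F, G)
```

The cycle term alone only forces F and G to be inverses of each other; the adversarial losses (omitted here) are what force F's outputs to actually land in the distribution of space B.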
In our unsupervised matching task we started to produce GIFs like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!btgP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!btgP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 424w, https://substackcdn.com/image/fetch/$s_!btgP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 848w, https://substackcdn.com/image/fetch/$s_!btgP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 1272w, https://substackcdn.com/image/fetch/$s_!btgP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!btgP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif" width="404" height="353.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:800,&quot;resizeWidth&quot;:404,&quot;bytes&quot;:2008705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168573586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!btgP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 424w, https://substackcdn.com/image/fetch/$s_!btgP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 848w, https://substackcdn.com/image/fetch/$s_!btgP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 1272w, https://substackcdn.com/image/fetch/$s_!btgP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcebf3891-44d8-4ebb-9fb0-c9987169c752_800x700.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">After training a CycleGAN-like model for mapping between embedding spaces, <a href="https://vec2vec.github.io/">vec2vec</a> learns to &#8216;magically&#8217; align them. Hooray for the Platonic Representation Hypothesis!</figcaption></figure></div><p>To us, this was an incredible step forward, and proof for an even stronger claim we call the &#8220;Strong Platonic Representation Hypothesis&#8221;. Models&#8217; representations share so much structure that <em>we can translate between them</em>, even without having knowledge of individual points in either of the spaces. 
This meant that we could do unsupervised conversion between models, as well as invert embeddings mined from databases where we know nothing about the underlying model.</p><h3>Universality in Circuits</h3><p>Some additional evidence for the PRH comes from the world of <em>mechanistic interpretability</em>, where researchers attempt to reverse-engineer the inner workings of models. Work on <a href="https://distill.pub/2020/circuits/zoom-in/#claim-3">Circuits</a> in 2020 found very similar functionalities in very different models:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lo8W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lo8W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 424w, https://substackcdn.com/image/fetch/$s_!lo8W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 848w, https://substackcdn.com/image/fetch/$s_!lo8W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!lo8W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lo8W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png" width="1456" height="749" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:749,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2380808,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168573586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lo8W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 424w, https://substackcdn.com/image/fetch/$s_!lo8W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 848w, https://substackcdn.com/image/fetch/$s_!lo8W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!lo8W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c23ccf-6f08-4fe6-8cdd-ee97170fb245_1994x1026.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Universal feature detectors from <em><a href="https://distill.pub/2020/circuits/zoom-in/#claim-3">Circuits</a></em> (2020). Different networks exhibit remarkably similar behaviors.</figcaption></figure></div><p>More recently, there&#8217;s been some action around a method for feature discretization known as sparse autoencoders (SAEs). 
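The core mechanics are simple enough to sketch in a few lines of numpy (illustrative hyperparameters, not any particular published SAE): encode embeddings into a wide, non-negative feature vector, decode back, and train on reconstruction error plus an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 16, 64, 256        # embedding dim, dictionary size, batch size
X = rng.normal(size=(n, d))  # stand-in for a batch of model embeddings

W_enc = rng.normal(scale=0.2, size=(d, k))
W_dec = rng.normal(scale=0.2, size=(k, d))
b = np.zeros(k)

def forward(X):
    f = np.maximum(X @ W_enc + b, 0.0)  # sparse, non-negative feature activations
    return f, f @ W_dec                 # reconstruction of the input embeddings

def mse(X_hat):
    return float(np.mean((X_hat - X) ** 2))

init_err = mse(forward(X)[1])
l1, lr = 1e-3, 0.1  # sparsity weight and step size (illustrative values)
for _ in range(2000):
    f, X_hat = forward(X)
    err = X_hat - X
    # (Sub)gradients of  mean(err^2) + l1 * mean(|f|)
    g_dec = 2 * f.T @ err / err.size
    g_f = 2 * err @ W_dec.T / err.size + l1 * np.sign(f) / f.size
    g_pre = g_f * (f > 0)  # gate gradients through the ReLU
    W_dec -= lr * g_dec
    W_enc -= lr * X.T @ g_pre
    b -= lr * g_pre.sum(axis=0)

final_err = mse(forward(X)[1])
```

Each row of `W_dec` is one dictionary feature; comparing these learned features across two different models is what the cross-model alignment work discussed next is doing.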
SAEs take a bunch of embeddings and learn a dictionary of interpretable features that can reproduce those embeddings with minimal loss.</p><p>Many are observing that if you train SAEs on two different models, they often learn many of the same features. There&#8217;s even been some recent work on &#8216;unsupervised concept discovery&#8217;, a suite of methods that can compare two SAEs to find feature overlap:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s35k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s35k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 424w, https://substackcdn.com/image/fetch/$s_!s35k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 848w, https://substackcdn.com/image/fetch/$s_!s35k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!s35k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!s35k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png" width="1456" height="753" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2262466,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168573586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s35k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 424w, https://substackcdn.com/image/fetch/$s_!s35k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 848w, https://substackcdn.com/image/fetch/$s_!s35k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!s35k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6bed3b-03ea-4132-970a-557561bca105_2048x1059.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Universal features from <em>Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment</em> (2025).</figcaption></figure></div><p>Since the PRH conjectures that models become more aligned as they get stronger, I suspect this kind of shared-feature discovery will only become more common.</p><h3>What can we make of all this?</h3><p>Besides being a deep philosophical idea, the Platonic Representation Hypothesis turns out to be an important practical insight with real-world implications. 
As the mechanistic interpretability community develops better tools for reverse-engineering models, I expect them to find more and more similarities; as models get bigger, this will become more common.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HUzN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HUzN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 424w, https://substackcdn.com/image/fetch/$s_!HUzN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 848w, https://substackcdn.com/image/fetch/$s_!HUzN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 1272w, https://substackcdn.com/image/fetch/$s_!HUzN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HUzN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png" width="294" height="282.828" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:500,&quot;resizeWidth&quot;:294,&quot;bytes&quot;:97739,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168573586?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HUzN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 424w, https://substackcdn.com/image/fetch/$s_!HUzN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 848w, https://substackcdn.com/image/fetch/$s_!HUzN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 1272w, https://substackcdn.com/image/fetch/$s_!HUzN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003053e-89b6-4dcb-aeaa-53897f723b6e_500x481.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://en.wikipedia.org/wiki/Linear_A">Linear A</a> is an ancient script from Minoan Crete that has never been deciphered. Perhaps the Platonic Representation Hypothesis gives us hope for one day translating it into English.</figcaption></figure></div><p>As for our method (vec2vec), we found strong evidence, but things are still brittle. It seems clear that we can learn an unsupervised mapping between text-based models that are trained on the Internet, as well as <a href="https://openai.com/index/clip/">CLIP-like</a> image-text embeddings. </p><p>It&#8217;s not obvious whether we can map between languages with high fidelity. 
If it turns out we can, we may be able to decode ancient scripts such as <a href="https://en.wikipedia.org/wiki/Linear_A">Linear A</a> or <a href="https://www.projectceti.org/">translate whale speech into a human language</a>. Only time will tell.</p>]]></content:encoded></item><item><title><![CDATA[How to scale RL to 10^26 FLOPs]]></title><description><![CDATA[A roadmap for RL-ing LLMs on the entire Internet]]></description><link>https://blog.jxmo.io/p/how-to-scale-rl-to-1026-flops</link><guid isPermaLink="false">https://blog.jxmo.io/p/how-to-scale-rl-to-1026-flops</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Thu, 10 Jul 2025 20:47:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6hjW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><em>TLDR: Reinforcement learning (RL) is the next training technique for building frontier-level AI models. To make it better, we need to train on more data. The current approach of scaling many environments simultaneously is messy and complicated. 
Instead, I propose we find a way to do next-token prediction on the Web using RL. This way, we learn to reason from general web data, instead of just math and code.</em></p><div><hr></div><p>I&#8217;ve spent a good part of the past year in denial.</p><p>I was in denial because when <a href="https://openai.com/index/learning-to-reason-with-llms/">OpenAI released o1</a> and explained their paradigm of <em>test-time compute</em>, I thought it was a good idea but mostly a way to get better performance out of models of fixed size. 
After all, letting models &#8216;think for longer&#8217; by generating more tokens lets them do more internal computation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dxsi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dxsi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 424w, https://substackcdn.com/image/fetch/$s_!Dxsi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 848w, https://substackcdn.com/image/fetch/$s_!Dxsi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 1272w, https://substackcdn.com/image/fetch/$s_!Dxsi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dxsi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png" width="1456" height="818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dxsi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 424w, https://substackcdn.com/image/fetch/$s_!Dxsi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 848w, https://substackcdn.com/image/fetch/$s_!Dxsi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 1272w, https://substackcdn.com/image/fetch/$s_!Dxsi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbe8a574-8953-446a-b6fc-dc3fb43915f0_1980x1113.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://openai.com/index/learning-to-reason-with-llms/">The o1 release</a> from OpenAI was the first demonstration of a new type of language model, one that could think for longer to generate better answers.</figcaption></figure></div><p>So I wasn&#8217;t that surprised that these new models, termed <em>reasoning models</em>, gave better answers. And I especially wasn&#8217;t surprised when I found out these answers mostly came on problems that inherently require lots of computation, like difficult math and engineering test questions.</p><p>Don&#8217;t get me wrong: I always thought reasoning models were <em>interesting</em>. It&#8217;s cool to me that they generate &#8220;thinking traces&#8221; before giving answers (although the thinking traces might not be very reliable). 
And it&#8217;s amazing that the models were trained with reinforcement learning, a foundational technique in machine learning that was generally understood to be difficult to use effectively for real problems.</p><p>But I still thought of myself as a scale maximalist: all that really mattered, I thought, was training bigger models on more data. Anything else (read: reasoning models) appeared to be a coping mechanism, just a way to get by while we wait for the hardware needed to train bigger models.</p><p>I&#8217;ve spent the past few months working on RL research at Meta. It took a bit of time, but I&#8217;ve come around: something far more nuanced is happening with reasoning models. RL isn&#8217;t just a way to give models more compute. <strong>RL training really is teaching models something </strong><em><strong>different</strong></em><strong>, a way to use compute to generate better answers</strong> given finite model capacity. Through RL, models are clearly learning something that they&#8217;re not getting from pretraining.</p><h3>Two waves of AI scaling</h3><p>The AI research-into-production cycle moves through a few distinct phases. First, we as a community identify a new learning paradigm. Second, we find the right datasets for training and design evaluations that tell us when our models are getting better, and by how much. And third, we scale it to all hell.</p><p>This cycle has already happened exactly once, with pretraining. It started with the innocuous observation that models can learn quite a lot when trained on internet text data using next-token prediction. We realized that this gives intelligence improvements in just about every domain.
And then we scaled.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6hjW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6hjW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6hjW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6hjW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6hjW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6hjW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3363962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6hjW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6hjW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6hjW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6hjW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23e84ae2-5f9b-4748-aa99-11200f2ea70b_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">We spent 2022&#8211;2024 scaling language model pretraining: first making models bigger, and now working to pack as much knowledge as we can into models of various scales. We&#8217;ll spend the next several years scaling post-training using RL.</figcaption></figure></div><p>And to be clear, pretraining research is ongoing. We&#8217;re still figuring out how to scale our models via bigger datacenters, better hardware, and more efficient algorithms. And we&#8217;re gathering more and better data every year. But the upper bound of pretraining performance is really clear. To build better models, we need to give them more parameters and train them on bigger datasets.
This is what the AI labs have been working on for three years or so now.</p><p>But as the dust settles on the pretraining frenzy, reasoning models are showing us a new way to scale. We&#8217;ve found a way to make models better that&#8217;s independent of the number of training tokens or model size.</p><p></p><h3>The murky path to RL scaling starts with <em>data</em></h3><p>We&#8217;ve identified a new paradigm: learning to <em>reason</em>. But reasoning models are <a href="https://www.mechanize.work/blog/the-upcoming-gpt-3-moment-for-rl/">in their GPT-3 era</a>: they&#8217;re trained on small datasets to do a narrow selection of tasks. We have a brittle proof-of-concept in the reasoning models of 2025. These models have achieved state-of-the-art scores on a small number of tasks, mostly expert-level math and coding questions.</p><p>In the case of pretraining, the path to progress was very clear. Models can learn via next-token prediction on just about <em>any</em> data, so we could simply scrape the entire Web and feed it to the models. And once we&#8217;d done that it became clear that our models were too small and we needed to make them much, much bigger.</p><p>But RL training is different. Let&#8217;s briefly remind ourselves how RL works:</p><p>Models like o1 are trained with <em>verifiable rewards</em>, meaning that after thinking and generating answers, we teach models by encouraging them to think more of the thoughts that led to correct answers, and less of the thoughts that led to incorrect answers. This is how RL algorithms like <a href="https://arxiv.org/abs/1707.06347">PPO</a> (what o1 probably uses) and <a href="https://arxiv.org/abs/2402.03300">GRPO</a> (the algorithm behind DeepSeek R1) work. They don&#8217;t teach, they incentivize.</p><p>So clearly we can only train RL models on tasks where we can score answers based on correctness. 
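</p>

<p>To make the incentive concrete, here&#8217;s a minimal sketch of the group-relative advantage at the heart of GRPO. This is a simplified illustration assuming a 0/1 verifier; the real objective also includes a clipped policy-gradient ratio and a KL penalty, which I omit here.</p>

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style credit assignment: score each sampled answer relative
    to the other answers drawn from the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # every sample got the same reward: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Eight answers sampled for one prompt, scored 1.0 by the verifier if
# the final answer is correct and 0.0 otherwise (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

<p>Answers that beat the group average get positive advantage and are reinforced in the policy update; answers below average get negative advantage and are discouraged. PPO computes a similar advantage from a learned value model instead of group statistics.</p>

<p>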
This is the idea behind <em>verifiability</em>, an RL buzzword used to describe tasks with a well-defined automatic scoring function. (The o1-style RL training paradigm is usually called <em>RLVR</em>, reinforcement learning with verifiable rewards, as distinguished from <em>RLHF</em>, reinforcement learning from human feedback.)</p><p>Unfortunately, most things aren&#8217;t automatically verifiable. There is no perfect computer program that can tell you whether an essay or an explanation is good, for example.</p><p>In fact, the things we know how to automatically verify tend to be in the scientific domain. For example, <a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k">OpenThoughts</a>, a recently released dataset of training data for reasoning models, contains four categories, Code, Math, Science, and &#8216;Puzzle&#8217;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CQ0O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CQ0O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 424w, https://substackcdn.com/image/fetch/$s_!CQ0O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 848w, https://substackcdn.com/image/fetch/$s_!CQ0O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 1272w,
https://substackcdn.com/image/fetch/$s_!CQ0O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CQ0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png" width="465" height="513.3219761499148" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8df13457-bc45-4202-a49a-d524523416be_1174x1296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1296,&quot;width&quot;:1174,&quot;resizeWidth&quot;:465,&quot;bytes&quot;:118444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CQ0O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 424w, https://substackcdn.com/image/fetch/$s_!CQ0O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 848w, 
https://substackcdn.com/image/fetch/$s_!CQ0O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 1272w, https://substackcdn.com/image/fetch/$s_!CQ0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df13457-bc45-4202-a49a-d524523416be_1174x1296.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The recent <a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k">OpenThoughts dataset</a>
contains verifiable tasks in Math, Science, and Coding, as well as a small dataset of puzzles. (What is the Puzzle task, I wonder?)</figcaption></figure></div><p>OK, so we can see that there are at least four domains that contain verifiable problems we can train on. But there are many open problems here. Are those all the verifiable things that exist? Are they equally valuable? During training, should we randomly alternate between them, or <a href="https://arxiv.org/abs/2203.05482">train separate models and then average</a>?</p><p>In fact, in typical RL setups, we don&#8217;t even understand the marginal value of a <em>single</em> training example. One recent paper, <a href="https://arxiv.org/abs/2504.20571">Reinforcement Learning for Reasoning in Large Language Models with One Training Example</a>, demonstrated that training on just a single example, with thousands of different sampled reasoning attempts, can actually produce a very good model:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p87E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p87E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 424w, https://substackcdn.com/image/fetch/$s_!p87E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 848w,
https://substackcdn.com/image/fetch/$s_!p87E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 1272w, https://substackcdn.com/image/fetch/$s_!p87E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p87E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png" width="1364" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212047,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p87E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 424w, https://substackcdn.com/image/fetch/$s_!p87E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 
848w, https://substackcdn.com/image/fetch/$s_!p87E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 1272w, https://substackcdn.com/image/fetch/$s_!p87E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22ce64ec-421f-4564-873e-6e813b016a59_1364x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://arxiv.org/abs/2504.20571">A recent paper</a> showed close to on-par reasoning performance from
just learning a reasoning process from a <em>single training example</em>.</figcaption></figure></div><p>It&#8217;s also interesting to note the x-axis in the above graph: training only runs for 2000 steps. And that&#8217;s typical: right now, these models are trained for only a few hundred, or at most a few thousand, steps. In the pretraining era, we often trained models on <em>trillions</em> of tokens, which meant millions of training steps.</p><p>This is mostly a compute issue: each step of RL requires <em>sampling reasoning tokens</em>, which is expensive and genuinely difficult from a software perspective. The infrastructure to do this sort of thing is challenging and requires a lot of new engineering, since we aren&#8217;t used to doing generation at all during training, let alone at each step.</p><p>Mark my words: before we know it, we&#8217;ll be running millions of steps of RLVR too.</p><h3>RL compute scales unevenly</h3><p>There are many practical engineering problems that need to be solved to scale RL.</p><p>In the pretraining days, training was a homogeneous, continuous workload: a batch of text passes through the model, we compute losses and backpropagate once, and we queue up the next batch of text. This was simple and straightforward to optimize.</p><p>When we do RL, training infrastructure has to get more complicated. Gradient steps happen much less frequently, and (depending on our chosen hyperparameters) we spend a lot more time generating thinking tokens.</p><p>Luckily, we&#8217;ve spent the last year or two making LLM inference super fast, and we can take advantage of those improvements here. In particular, there are two really good libraries for doing inference (<a href="https://github.com/sgl-project/sglang">SGLang</a> and <a href="https://github.com/vllm-project/vllm">vLLM</a>) that make this part ~10x faster than naive Python inference code.
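</p>

<p>In code, one step of this generate-then-verify-then-update loop looks roughly like the sketch below. The <code>ToyPolicy</code> class and its <code>generate</code>/<code>update</code> interface are stand-ins I made up for illustration, not the API of any real training stack.</p>

```python
class ToyPolicy:
    """Stand-in for a policy LM. In a real setup, `generate` would be
    served by a fast inference engine such as vLLM or SGLang."""

    def __init__(self):
        self.gradient_steps = 0

    def generate(self, prompt, n):
        # The expensive part of RL training: sampling n reasoning traces.
        return [f"{prompt} [trace {i}]" for i in range(n)]

    def update(self, rollouts):
        # Backpropagation happens only once per batch of scored rollouts.
        self.gradient_steps += 1


def rlvr_step(policy, prompts, verify, samples_per_prompt=4):
    """One RLVR step: mostly generation, one gradient update at the end."""
    rollouts = []
    for prompt in prompts:
        traces = policy.generate(prompt, n=samples_per_prompt)
        rewards = [verify(prompt, t) for t in traces]
        rollouts.append((prompt, traces, rewards))
    policy.update(rollouts)
    return rollouts


policy = ToyPolicy()
rollouts = rlvr_step(policy, ["What is 2+2?"], verify=lambda p, t: 1.0)
```

<p>Compare this with a pretraining step, which is a single forward and backward pass: here almost all the wall-clock time goes to the <code>generate</code> calls, which is exactly why fast inference engines matter so much for RL.</p>

<p>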
That helps a lot.</p><p>Another systems problem arises when we actually compute verifiable rewards. In the case of math problems, this is usually pretty easy. Most datasets have answers computed ahead of time, so we can simply check whether the final answer is correct and score accordingly. (In practice, <a href="https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb">formatting makes this process slightly more complicated</a>.)</p><p>But in domains besides math, verification quickly becomes expensive. This is especially noticeable in the code domain, which is where a lot of AI labs are focusing their efforts right now.</p><p>Remember that each domain needs a domain-specific &#8220;verifier&#8221;: a system that provides rewards that guide LLMs to generate better outputs. In the case of code, this usually involves running some code and scoring based on its output. Given an LLM-generated answer to a coding problem, we may need to run a bunch of unit tests and count how many pass to provide a reward.</p><p>There are a lot of people working on this right now: doing better and faster verification, running it in parallel, scaling it properly.
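</p>

<p>As a sketch, a unit-test verifier for the code domain can be as simple as the function below. The names and test cases are hypothetical, and a real verifier would sandbox execution and enforce timeouts rather than calling <code>exec</code> directly.</p>

```python
def unit_test_reward(generated_src, func_name, test_cases):
    """Reward = fraction of test cases the model-generated code passes."""
    namespace = {}
    try:
        exec(generated_src, namespace)  # real systems sandbox this step
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that doesn't even load earns zero reward
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one test case just forfeits that case
    return passed / len(test_cases)


# A model-generated solution with a bug when both inputs are equal.
candidate = "def add(a, b):\n    return a + b + (a == b)"
tests = [((1, 2), 3), ((2, 3), 5), ((2, 2), 4)]
reward = unit_test_reward(candidate, "add", tests)  # passes 2 of 3 tests
```

<p>Even this toy version hints at the cost: every single reward computation means compiling and running untrusted code, potentially against many test cases.</p>

<p>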
In some cases, the training bottleneck isn&#8217;t anything related to the model &#8211; it&#8217;s not inference or backpropagation that slows things down, but the time it takes to compile and execute model-generated code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YfNa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YfNa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 424w, https://substackcdn.com/image/fetch/$s_!YfNa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 848w, https://substackcdn.com/image/fetch/$s_!YfNa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 1272w, https://substackcdn.com/image/fetch/$s_!YfNa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YfNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png" width="1445" height="1182"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1445,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:623720,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YfNa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 424w, https://substackcdn.com/image/fetch/$s_!YfNa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 848w, https://substackcdn.com/image/fetch/$s_!YfNa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 1272w, https://substackcdn.com/image/fetch/$s_!YfNa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7fe7d41-0a9c-4de0-a97e-58631aaa0576_1445x1182.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The new <a href="https://www.nvidia.com/en-us/data-center/dgx-b200/">DGX B200 GPU servers from NVIDIA</a> cost $500K a pop and provide around 10^17 FLOPS for training or inference.
Unfortunately, our systems for doing RL on LLMs are pretty primitive and can&#8217;t get anywhere near this level of performance yet.</figcaption></figure></div><p>Since <a href="https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-million-dollars-top-of-the-line-ai-hardware/">a single DGX B200 costs over $500K</a>, any time training spends bottlenecked by CPU execution is a big waste of money.</p><p>One path to scaling RL is optimizing this kind of system: making verifiers faster and more reliable, optimizing the new generate-before-backprop <a href="https://github.com/volcengine/verl">training pipeline</a>, and designing <a href="https://arxiv.org/html/2412.01152v1">clever systems</a> that let us scale RL across datacenters.</p><p>In addition to all these systems-level improvements, we&#8217;ll need to build <a href="https://github.com/open-thought/reasoning-gym">lots of new environments</a> for learning diverse skills via RL training. Oh, and no one really knows which skills are the right ones to learn via RL, or how to combine them. So we&#8217;ll have to try lots of different combinations and train many models. And we can try averaging them in different ways via some kind of <a href="https://arxiv.org/abs/2203.05482">model souping</a> (also known as model merging). We&#8217;ll just run lots of evaluations to find which combination of environments and souping produces the best model. This sounds difficult, doesn&#8217;t it? And quite messy.</p><p>What if I told you there was another way?</p><h3>What does it mean to be verifiable?</h3><p>If we want to scale RL in verifiable settings, we should probably start by figuring out which things are verifiable in the first place. It&#8217;s my feeling that people have been throwing this word around a lot without a clear definition.</p><p>It all comes down to what we can train into the models.
If we can check a model output and provide a score, that&#8217;s good enough.</p><p>Wait &#8211; but isn&#8217;t this how language modeling works <em>already</em>?</p><h3>Pretraining for reasoning with next-token prediction</h3><p>Before making my proposal, let me start by listing a few core tenets that I believe about the current state of AI:</p><ol><li><p><strong>The only data we&#8217;ve found that really &#8220;works&#8221; (i.e. helps us build more intelligent models) is web-scale pretraining data</strong>. Ilya Sutskever famously compared all the human text on the internet to <a href="https://www.youtube.com/watch?v=YD-9NG1Ke5Y">a reserve of fossil fuel</a>: it&#8217;s exceptionally useful, but finite.</p></li><li><p><strong>Reasoning, at its core, is a way to get better performance out of smaller models.</strong> It&#8217;s not doing anything more magical. Crucially, we&#8217;re getting limited new signal from the verifier itself; RL with verification is just a way to elicit capabilities that already exist within models. (<a href="https://limit-of-rlvr.github.io/">This is a common belief about RL.</a>)</p></li><li><p><strong>There is nothing special about math and code</strong>. These modalities happen to lie in a space that&#8217;s difficult to model with a reward model (so prior approaches didn&#8217;t work super well) but easy to verify. And we happen to care about them (<a href="https://cursor.com/en/blog/series-c">automating coding seems especially valuable</a>). But we should be able to learn to reason from any type of data.</p></li><li><p><strong>We haven&#8217;t fully saturated models with Internet data</strong>. Today&#8217;s models don&#8217;t seem to <a href="https://arxiv.org/abs/2505.24832">have enough capacity</a> to memorize the entire Internet.
Additional pretraining on Web data should still give us a performance boost &#8211; and might be enough to learn to reason.</p></li><li><p><strong>Next-token prediction is verifiable.</strong> This is perhaps the central argument I&#8217;m making. The current strategy of checking if a math problem has been answered correctly is spiritually no different than confirming whether a model has outputted the proper next tokens.</p></li></ol><p>Putting all this together, I&#8217;m betting that the &#8220;right way&#8221; to scale RL is by unifying it with next-token prediction. We should teach models to reason by practicing reasoning at scale on the vast diversity of data available on the Web.</p><p>Learning to reason via next-token prediction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tZ5F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tZ5F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 424w, https://substackcdn.com/image/fetch/$s_!tZ5F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 848w, https://substackcdn.com/image/fetch/$s_!tZ5F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 1272w, 
https://substackcdn.com/image/fetch/$s_!tZ5F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tZ5F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png" width="1456" height="1075" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1075,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:296620,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tZ5F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 424w, https://substackcdn.com/image/fetch/$s_!tZ5F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 848w, 
https://substackcdn.com/image/fetch/$s_!tZ5F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!tZ5F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb827aab5-7961-4e9b-ab48-5b666a8553ae_1744x1288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The proposed framework of learning to reason via next-token prediction.</figcaption></figure></div><p>This shows a 
comparison of the new paradigm demonstrated on a math problem. Normal next-token prediction is guessing which tokens come next. Typical RLVR allows the model to &#8216;think&#8217; for a few tokens and then rewards it for outputting the right thing. Our idea of reasoning with next-token prediction (RNTP) would allow the model to think and then reward it based on the next-token prediction loss of the outputs in the &lt;answer&gt; tag. It&#8217;s a hybrid between traditional language model pretraining and the new reasoning model RL training.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hhru!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hhru!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 424w, https://substackcdn.com/image/fetch/$s_!hhru!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 848w, https://substackcdn.com/image/fetch/$s_!hhru!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 1272w, https://substackcdn.com/image/fetch/$s_!hhru!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hhru!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png" width="800" height="404" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:404,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hhru!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 424w, https://substackcdn.com/image/fetch/$s_!hhru!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 848w, https://substackcdn.com/image/fetch/$s_!hhru!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 1272w, https://substackcdn.com/image/fetch/$s_!hhru!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56425c6f-5cab-4cad-ae74-4749561d227c_800x404.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Grok 2 was trained on more FLOPs than GPT-4. Apparently Grok 3, which came out later, was trained on around 10^26 FLOPs, which would put it above the top of this graph. But that was all supervised learning. How do we scale RL to use this much compute?</p><p></p><h3>What do we even need RL for?</h3><p>Now that we&#8217;ve stripped things down to their base components, it might not be obvious what benefit we get from doing reinforcement learning, if any.</p><p>The answer lies in the &lt;think&gt; tokens. In the picture above, we generated everything between &lt;thinks&gt; directly from the model. 
<em>There&#8217;s no supervision for this</em>.</p><p>In other words, we&#8217;re trying to get the model to learn to reason without knowing what reasoning should look like. We just sample lots of outputs from the model and encourage the behaviors that earn rewards. If there were ground-truth reasoning, we could use the typical supervised training techniques to train the model to output the proper reasoning chains.</p><p>But in the real world, there&#8217;s no ground truth for reasoning, so we can&#8217;t do supervised learning. And in fact we want it this way &#8211; this is the magic of reinforcement learning. We&#8217;re <em>hoping</em> that the model will discover reasoning chains that are more useful than anything we could ever write ourselves.</p><p></p><h3>Scaling reasoning via next-token prediction</h3><p>If you&#8217;ve read this far, and you agree this idea makes sense, you might be thinking about how it could be tricky to implement.</p><p>And in fact, you&#8217;re right. This is where the research comes in. Almost all research that matters comes from figuring out how to implement and scale ideas that make sense from first principles.</p><p>For example: what exactly is the reward? Do we give the model a point for guessing a token correctly? Should we reward it more for multiple tokens in a row? Perhaps we use a string-similarity reward like BLEU score, <a href="https://arxiv.org/abs/1808.08866">as was common in machine translation in 2018</a>. We could do some kind of <a href="https://arxiv.org/abs/2506.18254">self-evaluation</a>, where a decent model can look at its own outputs and decide which ones should get rewards. Perhaps we filter tokens by <a href="https://arxiv.org/pdf/2506.08007">entropy</a> and use that to determine which to reason about. 
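<p>To make this design space concrete, here is a deliberately toy sketch of one possible reward: mean log-probability of the gold continuation, plus a small bonus for correct tokens guessed in a row. Every name and weighting here is my own illustration, not a published recipe:</p>

```python
import math

def rntp_reward(gold_logprobs, greedy_correct, run_bonus=0.05):
    """Hypothetical reward for reasoning-augmented next-token prediction.

    gold_logprobs:  log p(gold token | prefix + reasoning) for each gold token
    greedy_correct: whether greedy decoding would have emitted each gold token
    run_bonus:      extra credit per token inside a correct run of length >= 2
    """
    if not gold_logprobs:
        return 0.0
    base = sum(gold_logprobs) / len(gold_logprobs)  # mean log-likelihood
    # Reward "multiple tokens in a row": bonus accrues once a correct run
    # reaches length 2 and keeps growing while the run continues.
    bonus, run = 0.0, 0
    for ok in greedy_correct:
        run = run + 1 if ok else 0
        if run >= 2:
            bonus += run_bonus
    return base + bonus

# A reasoning trace that makes the gold tokens likely scores higher:
good = rntp_reward([math.log(0.9)] * 3, [True, True, True])
bad = rntp_reward([math.log(0.1)] * 3, [False, False, False])
assert good > bad
```

<p>Whether the base term, the run bonus, or something else entirely is the right choice is exactly the kind of open research question at stake here.</p>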
Or maybe we want to <a href="https://arxiv.org/abs/2505.19590">account for confidence in the reward</a>, and give the model more points for being confidently correct.</p><p>Another question: how many times should you &#8220;reason&#8221; within a single text chunk? One approach is to insert reasoning tokens at a random position per chunk. Or perhaps we allow models to reason multiple times throughout each chunk. But then we&#8217;d have to figure out how many times a model can learn from a given text chunk with different reasoning patterns before memorization starts to occur.</p><p>There are additional difficulties that arise when switching from math and code to <em>general</em> reasoning. One reason we like math and code is that they&#8217;re difficult for base models to do &#8220;from scratch&#8221; but often easy to learn via reasoning. This won&#8217;t be the case with general text: some tokens are already extremely low-entropy, and therefore easy to predict; other tokens are nearly impossible, and will never be guessed correctly with any amount of reasoning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AwyA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AwyA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 424w, https://substackcdn.com/image/fetch/$s_!AwyA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 848w, 
https://substackcdn.com/image/fetch/$s_!AwyA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 1272w, https://substackcdn.com/image/fetch/$s_!AwyA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AwyA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png" width="594" height="492.68227424749165" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:992,&quot;width&quot;:1196,&quot;resizeWidth&quot;:594,&quot;bytes&quot;:249764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AwyA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 424w, 
https://substackcdn.com/image/fetch/$s_!AwyA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 848w, https://substackcdn.com/image/fetch/$s_!AwyA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 1272w, https://substackcdn.com/image/fetch/$s_!AwyA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05d693ef-e1b1-4a78-afde-1983105526ac_1196x992.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Even the best pretraining datasets are still a long way from perfect.</figcaption></figure></div><p>Andrej Karpathy recently noted that if you <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">actually look at samples from a typical pretraining dataset</a>, they&#8217;re quite ugly, and there is a lot of obvious noise. One benefit of scale is that it irons out many of these low-level idiosyncrasies: after a lot of training, much of this noise gets averaged away. It&#8217;s possible that this would happen with my proposed RL training scheme. <strong>If we train for long enough, on enough tokens, we might not even care what the exact reward or reasoning schema looks like.</strong></p><h3>But wait, didn&#8217;t somebody try this already?</h3><p>Those among us who diligently trawl arXiv for the latest nuggets of progress might recognize that someone proposed something like this in a recently released preprint (<a href="https://arxiv.org/abs/2506.08007"><em>Reinforcement Pre-Training</em></a>). This research was praised on Twitter (the title sounds important, and the figure is funny!) 
but disappointed a lot of researchers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e2No!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e2No!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 424w, https://substackcdn.com/image/fetch/$s_!e2No!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 848w, https://substackcdn.com/image/fetch/$s_!e2No!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 1272w, https://substackcdn.com/image/fetch/$s_!e2No!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e2No!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png" width="1126" height="726" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:726,&quot;width&quot;:1126,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:358847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/168023871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e2No!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 424w, https://substackcdn.com/image/fetch/$s_!e2No!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 848w, https://substackcdn.com/image/fetch/$s_!e2No!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 1272w, https://substackcdn.com/image/fetch/$s_!e2No!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f12234-e78c-47fd-b0cd-9afbff8145f0_1126x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Headlining figure from the recent &#8216;Reinforcement Pre-Training&#8217; paper, which also proposes the idea of pretraining for RL via next-token prediction.</figcaption></figure></div><p>To be more specific, this paper proposed something similar to what I&#8217;m advocating for: using large unlabeled text datasets and next-token prediction to scale RL! And it has pretraining in the name, just like I was describing.</p><p>Alas, it turns out to be a classic case of academic titlegrabbing. What the paper actually does is very specific: they finetune a single model with chain-of-thought to improve single-token outputs for some multiple-choice questions. 
They&#8217;re not actually doing pretraining &#8211; just finetuning &#8211; and train on a small subset of questions from <a href="https://omni-math.github.io/">a single math dataset</a>. There aren&#8217;t a lot of comparisons to any of the other RLVR papers, so it&#8217;s hard to tell whether this thing even works, and if so, when and how well.</p><p>Normally I&#8217;d file this type of work away as a sort of negative result &#8211; if a simpler and more general setting worked, they surely would have tried it in this paper, right? But that&#8217;s exactly what I don&#8217;t think we should do. My overall point in this piece is that <strong>if something makes sense from first principles, we should keep working on it until we work out all the kinks.</strong></p><p>Making good ideas work often turns out to require significantly more labor than academic researchers expect from a single project. But this is the price of progress.</p><p></p><h3>What&#8217;s next?</h3><p>It&#8217;s very exciting to me that (a) RL works and (b) no one knows the right way to do it. There is so much opportunity here. One way or another, we will have much better reasoning models in a year or two; it&#8217;s just that the path is unclear. Before we can see with clarity, we have a lot to learn. If reasoning next-token prediction turns out to really be the right way to scale RL, we&#8217;re going to need to answer all these questions and many more.</p><div><hr></div><p>Thanks to my friends Wenting Zhao, Will Brown, and Nishanth Kumar for reading this blog post early and providing helpful feedback.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.jxmo.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Superintelligence, from First Principles]]></title><description><![CDATA[There are only two ways to learn: supervised learning and reinforcement learning. Which one will give us superintelligence?]]></description><link>https://blog.jxmo.io/p/superintelligence-from-first-principles</link><guid isPermaLink="false">https://blog.jxmo.io/p/superintelligence-from-first-principles</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Wed, 18 Jun 2025 14:37:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lots of people have been talking about how we&#8217;ll reach AGI (artificial general intelligence) or ASI (artificial super intelligence) with current technology. Meta has recently announced that they&#8217;re building a top-secret &#8220;superintelligence&#8221; lab with billions of dollars of funding. 
OpenAI, Anthropic, and Google DeepMind have all said in one way or another that their goal is to build superintelligent machines.</p><p>Sam Altman in particular has stated that superintelligence is simply a problem of <em>engineering</em>:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.jxmo.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7LbV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7LbV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 424w, https://substackcdn.com/image/fetch/$s_!7LbV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 848w, 
https://substackcdn.com/image/fetch/$s_!7LbV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 1272w, https://substackcdn.com/image/fetch/$s_!7LbV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7LbV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png" width="494" height="204.2420168067227" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:1190,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:83977,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/166246775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7LbV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 424w, 
https://substackcdn.com/image/fetch/$s_!7LbV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 848w, https://substackcdn.com/image/fetch/$s_!7LbV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 1272w, https://substackcdn.com/image/fetch/$s_!7LbV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F435e5a13-9acc-4aa5-9ef5-83b869a2036a_1190x492.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>This implies that researchers at OpenAI know how to build superintelligence, and simply have to put in time and effort to build the systems required to get there.</p><p>Now, I&#8217;m an AI researcher, and it&#8217;s not clear to me how to build superintelligence &#8211; I&#8217;m not even sure it&#8217;s possible. 
So in this post I want to think through the details a little bit and speculate how someone might try to build superintelligence from first principles.</p><p>We&#8217;re going to assume that the fundamental building blocks of the technology are decided: we&#8217;re going to build superintelligence with neural networks and train it with backpropagation and some form of machine learning.</p><p><a href="https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only">I don&#8217;t really believe that architecture (the shape of the neural network) matters much.</a> So we&#8217;ll gloss over that architectural detail and make a strong assumption: superintelligence will be built using Transformers, by far the most popular architecture for training these systems on large datasets.</p><p>So then we already know a lot: superintelligence will be a transformer neural network, trained via some machine learning objective and gradient-based backpropagation. There are still two major open questions here. What learning algorithm do we use, and what data?</p><p>Let&#8217;s start with the data.</p><h2>The data: it&#8217;s gotta be text</h2><p>Most of the big breakthroughs that led to chatGPT ultimately came from leveraging the treasure trove of human knowledge inside The Internet to learn things. The true scale of this enterprise is mind-boggling and mostly hidden by modern engineering, but let&#8217;s spend a second trying to get to the bottom of it all.</p><p><strong>The best systems we have right now all learn from Internet text.</strong> As of writing this (June 2025) I don&#8217;t think there has been any demonstrated overall improvement from integrating non-text data into a model. This includes images, video, audio, and extrasensory data from robotics &#8211; we don&#8217;t know how to use <em>any</em> of these modalities to make chatGPT smarter.</p><p>Why is this, by the way? 
It&#8217;s possible this is just a science or engineering challenge, and we&#8217;re not doing things the right way; but it&#8217;s certainly also possible <strong>that there&#8217;s just something special about text</strong>. After all, every bit of text on the internet (<a href="https://www.technologyreview.com/2022/12/20/1065667/how-ai-generated-text-is-poisoning-the-internet/">before LLMs</a>, <a href="https://www.forbes.com/sites/torconstantino/2024/08/26/is-ai-quietly-killing-itself-and-the-internet/">anyway</a>) is a reflection of a human&#8217;s thought process. In a sense, human-written text is preprocessed to have very high information content.</p><p>Take this in contrast to images, for example, which are raw views of the world around us, captured with no human intervention. It&#8217;s certainly possible that text written by actual people carries some intrinsic value that pure sensory inputs from the world around us never will.</p><p>So until someone demonstrates otherwise, let&#8217;s operate under the assumption that only text data is important.</p><p><strong>Ok, so how much text do we have?</strong></p><p>The next question is how large this dataset might be.</p><p>Lots of people have written about what to do if we&#8217;re running out of text data. Dubbed the &#8220;<a href="https://www.lesswrong.com/posts/axjb7tN9X2Mx4HzPz/the-data-wall-is-important">data wall</a>&#8221; or the &#8220;<a href="https://arxiv.org/abs/2305.13230">token crisis</a>&#8221;, this problem has prompted plenty of thinking about <a href="https://arxiv.org/abs/2305.16264">how to scale our models</a> if we really do run out of data.</p><p>And it seems that might really be happening. 
Many engineers at the big AI labs have spent countless hours dutifully scraping every last useful bit of text from the darkest corners of the Web, going so far as to <a href="https://www.theverge.com/2024/4/6/24122915/openai-youtube-transcripts-gpt-4-training-data-google">transcribe a million hours of YouTube videos</a> and <a href="https://apnews.com/article/openai-chatgpt-associated-press-ap-f86f84c5bcc2f3b98074b38521f5f75a">purchase large troves of news stories</a> to train on.</p><p>Luckily there might be another data source available here (verifiable environments!) but we&#8217;ll get to that later.</p><h2>The learning algorithm</h2><p>Above we discovered another important principle: the best path we have toward superintelligence lies in text data. In other words, <strong>AGI is probably just LLMs or nothing</strong>. Some other promising areas are learning from video and robotics, but neither of those seems nearly far along enough to produce independent intelligent systems before 2030. They&#8217;re also far more data-hungry; learning from text is naturally extremely efficient.</p><p>Now we have to confront the most important question. What&#8217;s the learning algorithm for superintelligence?</p><p>In the field of machine learning, there are basically two tried-and-true ways to learn from large datasets. One is <em>supervised learning</em>, training a model to increase the likelihood of some example data. 
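Concretely, "increasing the likelihood of example data" usually means minimizing cross-entropy on next-token prediction. Here is a minimal toy sketch, with a hand-built probability table standing in for a real LLM (all names here are illustrative, not from any particular codebase):

```python
import math

# Toy sketch of the supervised objective: maximize the likelihood of
# example data, i.e. minimize average negative log-likelihood
# (cross-entropy) of next-token prediction.
corpus = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(corpus))

# A "model" here is just a probability table over tokens.
# The uniform model assigns 1/|V| to every token.
uniform = {w: 1.0 / len(vocab) for w in vocab}

def neg_log_likelihood(model, tokens):
    # Supervised learning pushes this number down on observed data.
    return -sum(math.log(model[t]) for t in tokens) / len(tokens)

loss = neg_log_likelihood(uniform, corpus)
print(round(loss, 3))  # cross-entropy in nats; equals log(5) for the uniform model
```

A real LLM conditions its probability table on the preceding context and improves it by gradient descent, but the objective being minimized is the same quantity.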
The other is <em>reinforcement learning</em>, which involves generating from a model and rewarding it for taking &#8220;good&#8221; actions (for some user-written definition of &#8220;good&#8221;).</p><p>Now that we know this taxonomy, it becomes clear that any potential superintelligent system as written has to be trained using either supervised or reinforcement learning (or both).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WV6A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WV6A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 424w, https://substackcdn.com/image/fetch/$s_!WV6A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 848w, https://substackcdn.com/image/fetch/$s_!WV6A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!WV6A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WV6A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png" width="1456" 
height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1077447,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/166246775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WV6A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 424w, https://substackcdn.com/image/fetch/$s_!WV6A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 848w, https://substackcdn.com/image/fetch/$s_!WV6A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!WV6A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9da5764-9e1f-4492-9f56-44ad2d7028ef_2208x1244.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Yann LeCun has infamously stated that he knows the recipe to intelligence. In fact, intelligence is a cake, and reinforcement learning is but a tiny cherry on top.</figcaption></figure></div><p></p><p>Let&#8217;s think through both options.</p><h3>Hypothesis 1: Superintelligence arises from supervised learning</h3><p>Remember 2023? 
That was around when people really became excited about scaling laws; GPT-4 came out, and people were worried that if models continued to scale, they would become dangerous.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!csYA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!csYA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 424w, https://substackcdn.com/image/fetch/$s_!csYA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 848w, https://substackcdn.com/image/fetch/$s_!csYA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 1272w, https://substackcdn.com/image/fetch/$s_!csYA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!csYA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png" width="274" height="310.92198581560285" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abcd084a-fa7a-429c-a149-0645802da5eb_564x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:564,&quot;resizeWidth&quot;:274,&quot;bytes&quot;:226471,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/166246775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!csYA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 424w, https://substackcdn.com/image/fetch/$s_!csYA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 848w, https://substackcdn.com/image/fetch/$s_!csYA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 1272w, https://substackcdn.com/image/fetch/$s_!csYA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcd084a-fa7a-429c-a149-0645802da5eb_564x640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Around 2023, a lot of people online became concerned that LLMs scaled up with simple supervised learning would soon become superintelligent.</figcaption></figure></div><p>There was a prevailing opinion for some time that lots of supervised learning, specifically in the form of &#8220;next-token prediction&#8221;, could lead to superintelligent AI. 
Notably Ilya Sutskever gave a talk about how <a href="https://www.youtube.com/watch?v=AKMuA_TVz3A&amp;t=1s">next-token prediction is simply learning to compress the Universe</a>, since doing this well requires simulating all possible programs (or something like that).</p><p>I think the argument went something like this:</p><ol><li><p>Accurate next-token prediction requires modeling what <em>any</em> human would write in <em>any</em> scenario</p></li><li><p>The better you can do this for a single person, the more closely you approximate the intelligence of that person</p></li><li><p>Since the internet contains text written by many people, doing this well on e.g. a large text pretraining corpus requires accurately modeling the intelligence of <em>many</em> people</p></li><li><p>Accurately modeling the intelligence of many people is superintelligence</p></li></ol><p><strong>The vibes argument: can we even reach superintelligence by modeling humans?</strong></p><p>Personally I think there are a few flaws in this logic, starting with the fact that we seem to have already created systems that are far above human-level in next-token prediction but still fail to exhibit human-level general intelligence. Somehow we&#8217;ve built systems that learned what we told them to learn (to predict the next token) but still can&#8217;t do what we want them to do (answer questions without making things up, follow instructions perfectly, etc.).</p><p>This might simply be a failure of machine learning. We&#8217;re training a model to predict the average human outcome in every situation. The learning objective disincentivizes giving too low a probability to any possible outcome. This paradigm often leads to something called <a href="https://en.wikipedia.org/wiki/Mode_collapse">mode collapse</a>, where a model gets very good at modeling average outcomes without learning the tails of the distribution.</p><p>It&#8217;s possible that these issues would go away at scale. 
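To make the averaging failure concrete, here is a toy numeric illustration (my example, not the article's): fit a single Gaussian by maximum likelihood to sharply bimodal data, and the fitted density peaks exactly between the two modes, in a region where no observation ever occurs.

```python
import statistics

# Two sharp modes at -5 and +5; nothing in between.
data = [-5.0] * 50 + [5.0] * 50

# Maximum likelihood for a single Gaussian = sample mean and population std.
mle_mean = statistics.fmean(data)   # 0.0: the fitted peak sits between the modes
mle_std = statistics.pstdev(data)   # 5.0: spread inflated to cover both modes

print(mle_mean, mle_std)
```

A model family too small (or too inflexible) to represent every mode gets pulled toward the average in exactly this way.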
Billion-parameter models, like LLaMA, hallucinate. But this is only 10^9. What happens when we train models with 10^19 parameters? Perhaps that&#8217;s enough <a href="https://arxiv.org/abs/2505.24832">capacity</a> for a single LLM to independently model all eight billion humans and provide independent data-driven predictions for each.</p><p><strong>The infra argument: we won&#8217;t be able to scale up our models and/or our data</strong></p><p>But it turns out it&#8217;s a moot point, because we may never scale up to 10^19 parameters. This hypothesis arose from the 2022-or-so-era deep learning school of thought, driven by the wild success of language model scaling laws, which held that continually scaling model and data size would lead to perfect intelligence.</p><p>It&#8217;s 2025 now. The theoretical argument remains unchallenged and the scaling laws have held. But it turns out that scaling models gets really hard above a certain size (and in 2022 we were already really close to the limit of what we could do well). Companies are already far, far beyond what we can do with a single machine &#8211; all the latest models are trained on giant networks of hundreds of machines.</p><p>The continued effort to scale model size toward trillions of parameters is causing a hardware shortage (see: NVIDIA stock) and actually an <em>electricity</em> shortage too. Bigger models would draw so much power that they can&#8217;t be located in a single place; companies are researching how to distribute model training over multiple far-flung datacenters, and even buying old nuclear power plants in an effort to resuscitate them and use them to train the next generation of bigger AI models. We&#8217;re living in wild times.</p><p>In addition to model size, we are potentially running out of data. No one knows how much of the Internet was used to train each model, but it&#8217;s safe to say that it&#8217;s quite a lot. 
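A back-of-envelope sketch makes the infra argument vivid, assuming Chinchilla-style rules of thumb (roughly 20 training tokens per parameter, roughly 6&#183;N&#183;D training FLOPs); the specific numbers below are rough assumptions, not figures from this article:

```python
# Hypothetical 10^19-parameter model from the thought experiment above.
params = 1e19
tokens_available = 20e12             # ~20T tokens: a rough ceiling on usable text

tokens_needed = 20 * params          # Chinchilla rule of thumb: ~20 tokens/param
flops = 6 * params * tokens_needed   # ~6*N*D FLOPs for one training run

print(f"tokens needed {tokens_needed:.0e} vs available {tokens_available:.0e}")
print(f"training FLOPs ~{flops:.1e}")
```

Under these assumptions the compute-optimal token budget overshoots all available text by about seven orders of magnitude, so either the data wall or the compute bill kills the plan long before 10^19 parameters.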
And great engineering effort over the last few years at big AI labs has gone into scraping the bottom of the proverbial barrel of Internet text data: OpenAI has apparently transcribed all of YouTube, for example, and high-signal sites like Reddit have been scraped time and time again.</p><p>Scaling model size orders of magnitude beyond 100B parameters looks to be hard, as is scaling data size beyond 20T tokens or so. These factors seem to indicate that it will be hard to scale supervised learning much more than 10x further in the next three years or so &#8211; so attempts at superintelligence might have to come from somewhere else.</p><h3>Hypothesis 2: Superintelligence from a combination of supervised and reinforcement learning</h3><p>So maybe you bought one of the arguments above: either we won&#8217;t be able to scale up pre-training by orders of magnitude for a long time, or even if we did, doing really well at predicting human tokens won&#8217;t produce systems that are any smarter than humans.</p><p>Whatever your issue with supervised learning, there&#8217;s another way. The field of <em>reinforcement learning</em> offers a suite of methods to learn from <em>feedback</em> instead of demonstrations.</p><blockquote><p>&#9888;&#65039; <strong>Why do we need supervised learning?</strong> <br>Well, RL is hard. You might be wondering why we can&#8217;t just use reinforcement learning all the way down. From a practical perspective, there are a lot of drawbacks to RL. The short explanation is that SL is a lot more stable and efficient than RL. An easy-to-understand reason: because RL works by letting the model generate &#8220;actions&#8221; and rating them, a randomly initialized model is basically terrible &#8211; all of its actions are useless, and it has to accidentally do something good to get any semblance of a reward. This is called the <em>cold start problem</em>, and it&#8217;s just one of many issues with reinforcement learning. 
Supervised learning from human data turns out to be a great way to get around the cold start problem.</p></blockquote><p>So let&#8217;s reiterate the RL paradigm: the model tries stuff, and we tell it whether it did well or not. This can come in two ways: either human raters tell the model it did well (this is more or less how typical RLHF works) or an automated system does that instead.</p><h3>Hypothesis 2A: RL from human verifiers</h3><p>Under this first paradigm, we provide the model with human-based rewards. We want our model to be superintelligent, so we want to reward it for producing text that looks closer to superintelligence (as judged by humans).</p><p>In practice, this data is really expensive to collect. A typical reinforcement learning from human feedback (RLHF) setting <a href="https://arxiv.org/abs/2203.02155">involves training a reward model</a> to emulate the human feedback signal here. Reward models are necessary because they allow us to give much more feedback than we can feasibly collect from actual humans. In other words, they&#8217;re a computational crutch. 
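Reward models of this kind are typically trained on pairwise human preference labels with a Bradley-Terry-style objective. A minimal sketch of that idea (function and variable names are mine, for illustration):

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): small when the reward model
    # already ranks the human-preferred answer above the rejected one.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct, confident ranking -> low loss; inverted ranking -> high loss.
print(round(pairwise_loss(2.0, 0.0), 3))  # 0.127
print(round(pairwise_loss(0.0, 2.0), 3))  # 2.127
```

Trained this way on enough human comparisons, the reward model can then hand out rewards at whatever scale the RL loop demands, far beyond what human raters could supply directly.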
We&#8217;re going to treat reward models as an engineering detail and ignore them for now.</p><p>So we&#8217;re imagining a world where we have infinite humans available to label data for an LLM and provide arbitrary rewards, where higher rewards indicate getting the model closer to superintelligence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k6d6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k6d6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 424w, https://substackcdn.com/image/fetch/$s_!k6d6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 848w, https://substackcdn.com/image/fetch/$s_!k6d6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 1272w, https://substackcdn.com/image/fetch/$s_!k6d6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k6d6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png" width="640" height="480" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e69a9a95-89a1-4674-b520-7c182f08751b_640x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:421174,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/166246775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k6d6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 424w, https://substackcdn.com/image/fetch/$s_!k6d6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 848w, https://substackcdn.com/image/fetch/$s_!k6d6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 1272w, https://substackcdn.com/image/fetch/$s_!k6d6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe69a9a95-89a1-4674-b520-7c182f08751b_640x480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">&#8220;This is a thousand monkeys working at a thousand typewriters. Soon, they'll have written the greatest novel known to man.&#8221; &#8211;Mr. Burns</figcaption></figure></div><p>Ignore all the logistical complexity. Assume this approach is possible to run at scale (as it could be some day, even if it&#8217;s not today). Would it work? Could a machine, learning purely from human reward signals, climb the intelligence ladder and pass humans?</p><p>Another way of phrasing this question is: <em>can we verify superintelligence</em> when we see it, even if we can&#8217;t generate examples of it ourselves? Remember that humans <em>by definition</em> aren&#8217;t superintelligent. But can we recognize superintelligence when it is shown to us? 
And can we do so reliably enough to provide useful gradient signal to an LLM, which could gather lots of feedback of this type and bootstrap its way to superintelligence?</p><p>Some people like to point out that <em>&#8220;generation is naturally harder than verification&#8221;</em>: you know a good movie when you watch one, but that doesn&#8217;t mean you&#8217;re able to go out there and produce one yourself. This dichotomy pops up a lot in machine learning. It&#8217;s computationally much easier to differentiate cat photos from dog photos than it is to generate entire cats.</p><p>Similarly, if humans can verify superintelligence, then it may be possible to train a superintelligent model using reinforcement learning from human feedback. As a concrete example, you could have an LLM write many novels, reward it based on a human notion of which ones are good, and repeat this many, many times, until you have a superintelligent novel-writing machine.</p><p>Do you notice any problems with this logic?</p><h3>Hypothesis 2B: RL from automated verifiers</h3><p>More recently people have been excited about using similar approaches to train better language models.</p><p>When we let a computer rate the intermediate performance of our RL algorithms, the rating can come either from a <em>model</em> or from an <em>automated verifier</em>. For automated verifiers, think of chess, or coding. We can write rules that check whether the computer won a chess game and issue a reward when a checkmate is reached. In coding, we can run unit tests that reward the computer for writing code that correctly passes some specifications.</p><p>Using a verifier would be much more practical &#8211; it would allow us to remove humans from the loop here entirely (besides the fact that they were used to write the entire Internet). 
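For the coding case, an automated verifier can be as simple as running candidate code against unit tests and paying out a binary reward. A minimal, unsandboxed sketch (helper names are mine; real systems isolate the candidate code before executing it):

```python
def verify(candidate_src: str, tests: list) -> float:
    # Execute model-written code, then reward it iff all unit tests pass.
    namespace = {}
    try:
        exec(candidate_src, namespace)  # toy only: no sandboxing here
        return 1.0 if all(t(namespace) for t in tests) else 0.0
    except Exception:
        return 0.0  # crashing or incomplete code earns no reward

tests = [
    lambda ns: ns["add"](2, 3) == 5,
    lambda ns: ns["add"](-1, 1) == 0,
]
print(verify("def add(a, b):\n    return a + b", tests))  # 1.0
print(verify("def add(a, b):\n    return a - b", tests))  # 0.0
```

The reward requires no human judgment at all, which is exactly what makes this paradigm cheap to run for billions of episodes.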
A recipe for superintelligence using verifiers would be something like:</p><ol><li><p>Pretrain an LLM using supervised learning on a large amount of Internet text</p></li><li><p>Plug it into some verification system that can provide rewards for good LLM outputs</p></li><li><p>Run for a long time</p></li><li><p>Achieve superintelligence</p></li></ol><p>Is this a good idea? Would it even work?</p><p>Famously, DeepMind&#8217;s AlphaGo achieved &#8220;Go supremacy&#8221; (read: beat all the humans, even the ones who trained for decades) by a combination of reinforcement and supervised learning. The second version of AlphaGo, known as AlphaGo Zero, learned by <a href="https://deepmind.google/discover/blog/alphago-zero-starting-from-scratch/">playing against itself for forty straight days</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x7bo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x7bo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!x7bo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!x7bo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 1272w, 
https://substackcdn.com/image/fetch/$s_!x7bo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x7bo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png" width="1200" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:564084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/166246775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x7bo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!x7bo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!x7bo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 
1272w, https://substackcdn.com/image/fetch/$s_!x7bo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b48f536-7afb-4dfc-a11b-b9ab32503acc_1200x675.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">In 2016, <a href="https://www.cnet.com/tech/tech-industry/google-alphago-artificial-intelligence-victor-in-game-of-man-vs-machine/">AlphaGo won four out of five games</a> to defeat human Go champion Lee Sedol. The original AlphaGo was trained using supervised learning.  
The next version of AlphaGo learned with reinforcement learning: by playing against itself for millions of games.</figcaption></figure></div><p>Note that Go has a very important property that many real-world tasks do not. Go is naturally <em>verifiable</em>. By this I mean that we can plug a finished game of Go into a rule-based computer program and receive a signal indicating who won. Extrapolating this over a long time horizon, you can tell whether an individual move is &#8220;good&#8221; or not based on how it affects the probability of the game ending with a win. This is more or less how RL works.</p><p>And using this verifiability, AlphaGo was able to achieve something very important, something that the AI labs have aspired to for a while: AlphaGo gets better when it thinks for longer. Language models, by default, cannot do this.</p><p>But this was essentially the breakthrough that OpenAI announced last fall. They used this paradigm of reinforcement learning from verifiable rewards to train o1, a model that, like AlphaGo, can think for longer and produce better outputs:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DvLK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DvLK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 424w, https://substackcdn.com/image/fetch/$s_!DvLK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 848w, 
https://substackcdn.com/image/fetch/$s_!DvLK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 1272w, https://substackcdn.com/image/fetch/$s_!DvLK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DvLK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/166246775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DvLK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 424w, 
https://substackcdn.com/image/fetch/$s_!DvLK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 848w, https://substackcdn.com/image/fetch/$s_!DvLK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 1272w, https://substackcdn.com/image/fetch/$s_!DvLK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634ecb88-0f64-44a3-b593-3edb22885ccf_1980x1113.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">In the <a href="https://openai.com/index/learning-to-reason-with-llms/">o1 blog post</a>, OpenAI introduced a line of &#8220;reasoning models&#8221; that learn from reinforcement learning on verifiable rewards.</figcaption></figure></div><p>Gazing at the beautiful graphs above (and noting the log x-axis!), we observe that o1 is indeed getting better with more thinking time. But take a look at the title: this is on <a href="https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Answer_Key">AIME, a set of extremely difficult math problems </a><em><a href="https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Answer_Key">with integer solutions</a></em>. Read: it&#8217;s not an open-ended task. It&#8217;s a task with <em>verifiable</em> training data, because you can check whether an LLM generated the proper answer or not and reward the model accordingly.</p><p>It turns out that current LLMs, pretrained to do arbitrary tasks pretty well, can make a decent guess at AIME problems, and we can use this to train them via RL to make better and better guesses over time. (And the coolest part, which we won&#8217;t talk about here, is that they generate more and more &#8220;thinking tokens&#8221; to do so, giving us the test-time compute graph from the o1 blog post shown above.)</p><h3>Is RLVR a path to superintelligence?</h3><p>It is clear that OpenAI, Google, and the other AI labs are very excited about this type of RL on LLMs and think it might just give them superintelligence. It also seems likely to me that this paradigm is what Sam A. was mentioning in his vague tweet above. The &#8220;engineering problem&#8221; of superintelligence is to build lots of RL environments for many different types of tasks and train an LLM to do them all at the same time.</p><p>Let&#8217;s think through the bull case here. 
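</p><p>To see how cheap this kind of verification is, note that grading an AIME-style attempt reduces to a single integer comparison. A hypothetical sketch (the last-integer extraction heuristic is my own toy choice, not how any lab actually parses model outputs):</p>

```python
import re

def aime_reward(model_output: str, answer: int) -> float:
    """AIME answers are integers from 0 to 999: extract the last integer
    the model wrote and compare it against the reference answer.
    (Grabbing the last integer is a toy heuristic for this sketch.)"""
    matches = re.findall(r"\d+", model_output)
    if not matches:
        return 0.0                     # no answer produced: no reward
    return 1.0 if int(matches[-1]) == answer else 0.0

aime_reward("... so the answer is 204.", 204)   # 1.0
aime_reward("I believe it must be 17.", 204)    # 0.0
```

<p>The check is exact, instant, and free of human judgment &#8211; which is what makes tasks like AIME such convenient RL training data. 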
The verifiable tasks we know of are coding (you can check if code is correct by running it) as well as math (not proofs, really, but stuff with numeric solutions). If we were able to gather all the verifiable things in the world and train on them all at the same time (or separately, and then do <a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/70c26937fbf3d4600b69a129031b66ec-Abstract-Conference.html">model merging</a>) &#8211; would this really produce <em>general</em> <em>superintelligence</em>?</p><p>There are a few logical leaps here. The most important is that we don&#8217;t know how well RL on verifiable tasks transfers out-of-domain. Does training a model to do math problems somehow inherently teach it how to book a flight? Or would training a model to code better in verifiable environments even make it a better software engineer overall?</p><p>Let&#8217;s pretend for a second that this <em>does</em> turn out to be true, and RL transfers neatly to all sorts of tasks. This would be huge. It would put the AI companies in an arms race to produce the most diverse, useful, and well-engineered set of tasks to RL their LLMs on. There would likely be multiple companies producing a &#8220;superintelligent LLM&#8221; in this fashion.</p><p>But this outcome seems unlikely to me. I&#8217;d guess that if RL did transfer extremely well to other domains, then we&#8217;d know about it by now. My humble prediction is that LLMs will continue to get better at things <em>within their training distribution</em>. As we gather more and more types of tasks and train on them, this will produce LLMs that are more and more useful on a wide array of tasks. 
But it won&#8217;t give us a single superintelligent model.</p>]]></content:encoded></item><item><title><![CDATA[The Case for More Ambition]]></title><description><![CDATA[why AI researchers should dream bigger and publish less]]></description><link>https://blog.jxmo.io/p/the-case-for-more-ambition</link><guid isPermaLink="false">https://blog.jxmo.io/p/the-case-for-more-ambition</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Mon, 09 Jun 2025 14:47:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>What are the most important problems in your field? 
And why aren&#8217;t you working on them?</em></p><p>This is the pitch made by Richard Hamming in his famous talk <a href="https://www.cs.utexas.edu/~dahlin/bookshelf/hamming.html">You and Your Research</a>. It&#8217;s been nearly forty years since the talk was first given, and yet it feels like more people are ignoring this advice than ever before.</p><p>This may be especially true in AI.</p><p>Our field is full of brilliant people &#8211; researchers, engineers, scientists. Students are flocking to major in computer science and learn about the technology from an early age. More people are applying to graduate school than ever before. The AI labs are raising money and publishing all sorts of findings in preprints and blog posts.</p><p>And yet it doesn&#8217;t feel like the actual <em>science</em> is progressing faster than it was five years ago. We&#8217;re publishing more papers than ever, but discovering about the same amount. I don&#8217;t think it&#8217;s a talent issue, and it&#8217;s certainly not a funding issue. 
I think we have an <em>ambition</em> issue.</p><h3>More papers are being published than ever</h3><p>This year there were over <a href="https://cspaper.org/topic/76/submission-tsunami-at-neurips-2025-is-peer-review-about-to-collapse">25,000 papers submitted to NeurIPS</a>. This represents 30% annual growth since 2017, when around 3,000 papers were submitted. This growth comes from all sorts of places. More companies are doing AI research. More universities are doing AI research. More <em>people</em> are getting into AI research.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xoUu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xoUu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 424w, https://substackcdn.com/image/fetch/$s_!xoUu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 848w, https://substackcdn.com/image/fetch/$s_!xoUu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!xoUu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xoUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png" width="372" height="230.71153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:903,&quot;width&quot;:1456,&quot;resizeWidth&quot;:372,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xoUu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 424w, https://substackcdn.com/image/fetch/$s_!xoUu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 848w, https://substackcdn.com/image/fetch/$s_!xoUu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!xoUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bd1181-4e34-476e-838b-5a940c1905f7_1642x1018.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">The number of submitted papers to NeurIPS, the most popular AI 
conference, over the last fifteen years. It&#8217;s nearing exponential growth territory.  (Data from papercopilot.com)</figcaption></figure></div><p>With more people writing more papers about more different topics we&#8217;d expect some <em>explosive</em> growth. And yet I&#8217;d argue that <strong>the pace of progress feels pretty constant</strong>, if not slowing down by a small margin. This is a difficult thing to measure, so I polled a bunch of researchers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SFl8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SFl8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 424w, https://substackcdn.com/image/fetch/$s_!SFl8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 848w, https://substackcdn.com/image/fetch/$s_!SFl8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SFl8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SFl8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png" width="366" height="245.63758389261744" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1192,&quot;resizeWidth&quot;:366,&quot;bytes&quot;:308694,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/165548690?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SFl8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 424w, https://substackcdn.com/image/fetch/$s_!SFl8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!SFl8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 1272w, https://substackcdn.com/image/fetch/$s_!SFl8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79053e31-32a1-4789-98d8-e02bfd765c51_1192x800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If we take the mean response here, we find that progress is moving faster &#8211; but only a bit. 
This should be surprising in an area that&#8217;s seen explosive growth along just about every dimension.</p><h2>Expanding the space of everything</h2><p>If we&#8217;re publishing exponentially more, why aren&#8217;t we learning exponentially more? One possible explanation is that the space of possible ideas we&#8217;re exploring has stayed a constant size, but the number of <em>explorers</em> has grown at a near-exponential pace. If this is an accurate model, then the world of research looks something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8guq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8guq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 424w, https://substackcdn.com/image/fetch/$s_!8guq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 848w, https://substackcdn.com/image/fetch/$s_!8guq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 1272w, https://substackcdn.com/image/fetch/$s_!8guq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8guq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png" width="550" height="356.21565934065933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:943,&quot;width&quot;:1456,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:589031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/165548690?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8guq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 424w, https://substackcdn.com/image/fetch/$s_!8guq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 848w, https://substackcdn.com/image/fetch/$s_!8guq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 1272w, https://substackcdn.com/image/fetch/$s_!8guq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50bfce50-1260-4174-ab1e-b72aaed879fa_1554x1006.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If the number of researchers has increased by 10x, but the amount of space is only slightly larger, then the current landscape of research looks something like the bubble on the right &#8211; but at a larger scale. And maybe packed tighter.</p><p>This mental model explains why many AI researchers report experiencing <a href="https://kyunghyuncho.me/i-sensed-anxiety-and-frustration-at-neurips24/">anxiety and frustration</a> around the current research and publication process. 
There are so many people doing so many things that it&#8217;s simply hard to find any open space to occupy.</p><p>I think this anxiety also comes from a mindset that the amount of &#8216;research territory&#8217; is constant, and that good research comes from interpolating between existing ideas rather than pushing forward into the unknown. This is exactly what Hamming warns against.</p><p>And this also means that <strong>the best research expands the space</strong> of territory available to us to explore. And yet clearly most research is doing something different &#8211; most papers aren&#8217;t groundbreaking. They&#8217;re comfortably nestled somewhere in high-dimensional space, in some unoccupied pigeonhole between things that already exist.</p><p>Let&#8217;s think of research ideas in physical terms. New research can be territory-expanding, in that it opens up more space for other researchers than it consumes, or territory-consuming, in that it &#8216;eats&#8217; some of the available territory without giving anything back.</p><p>The best research is territory-expanding. It&#8217;s generous. It leaves things to the imagination, sparks more ideas than it explores. It&#8217;s explorative rather than exploitative. The most important research <em>always</em> has this property: it expands more space than it occupies.</p><p>So why aren&#8217;t more people doing this kind of research?</p><h2>Good research takes ambition</h2><p>My theory for why most research isn&#8217;t territory-expanding is that the odds of <strong>producing something groundbreaking with a project are directly proportional to its chance of failure</strong>. This means that if you set out to change the world, you will probably fail. 
The converse is that if you set your sights low, and choose a topic that&#8217;s more likely to bear fruit in some way, then your chances of &#8216;success&#8217; (publishing <em>something</em>) are higher, but the chances of it expanding the space of knowledge are smaller.</p><p>And thus, we end up with 25,000 NeurIPS submissions. This is what happens when the community optimizes too strongly for the number of papers submitted and too loosely for solving big problems.</p><p>I also want to recognize that for some groups, external factors are at play. I&#8217;ve been told that you need to publish at least one paper to get into a CS graduate program these days &#8211; meaning there are many people who just need to publish <em>something</em>, even if it isn&#8217;t their best work. And some tech companies prevent employees from publishing findings that feel too important, which gives their workers a perverse incentive to make their research sound small so that it has a chance to see the light of day. Let&#8217;s ignore both of these edge cases.</p><h2>Higher ambition means a higher rate of failure</h2><p>One question every researcher should ask: what are you optimizing for?</p><p>If your goal as a researcher is to publish as much as possible and reach the maximum h-index, then logic says to lower your ambition to the minimum level at which you can still publish. After a bit of practice, a good researcher with the singular goal of publishing can often write papers every few months. You&#8217;ll see this happen: some conferences even have PhD students publishing multiple papers in the same conference, indicating that they worked on multiple projects at the same time. (I&#8217;m not pointing any fingers, as I have also done this once. Also, it can happen for different timeline-related reasons.)</p><p>If your goal is to <em>move the field forward</em>, then you should plan differently. Raise your level of ambition. Try crazy things. 
</p><p>A consequence of this is that <em>you will publish less</em>. Deadlines will come around for which you have nothing to show. Perhaps you&#8217;ll find out that your idea has already been done, or doesn&#8217;t work at scale, or is doing things that you just don&#8217;t understand. The most likely explanation is that most great things take more than three months to execute.</p><p>(As an aside, I have been doing research for six years or so and have already found that periods without publishing are really just fine. People care most about the one or two projects you&#8217;ve ever worked on that are most exciting. They care much less about the six-month gaps where you were quietly trying to make something work. In my own career this has already happened several times: I spent my year in the Google AI Residency working on something that never quite worked. I worked on a music model for six months that never went anywhere. In the third year of my PhD I wanted to build a <a href="https://arxiv.org/abs/2410.02525">contextual embedding model</a>, which was a neat idea but took over a year to get the details right. 
The list does not stop there&#8230;)</p><h2>Most research doesn&#8217;t matter much in the long run</h2><p>But if you look back at the conference proceedings from ten years ago, they don&#8217;t feel exceptionally relevant:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J-xQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J-xQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 424w, https://substackcdn.com/image/fetch/$s_!J-xQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 848w, https://substackcdn.com/image/fetch/$s_!J-xQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 1272w, https://substackcdn.com/image/fetch/$s_!J-xQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J-xQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png" width="630" height="412.3557692307692" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:953,&quot;width&quot;:1456,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:1867854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.jxmo.io/i/165548690?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J-xQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 424w, https://substackcdn.com/image/fetch/$s_!J-xQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 848w, https://substackcdn.com/image/fetch/$s_!J-xQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 1272w, https://substackcdn.com/image/fetch/$s_!J-xQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd876e04b-2cc5-41f8-92eb-5fea5820f2e8_2048x1341.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A screenshot of the proceedings from <a href="https://papers.nips.cc/paper_files/paper/2015">NeurIPS 2015</a>.  Many of these papers did not stand the test of time.</figcaption></figure></div><p>These are interesting and important problems, but have almost nothing to do with the AI systems we use today. This makes one wonder how much of the content from this year&#8217;s 25,000 NeurIPS papers might still be relevant in 2035.</p><p>You might have your own hypotheses on why academic research tends toward small-but-eventually-forgotten problems. Predicting the future is hard. 
And most research moves in cycles that last less than a year, so it&#8217;s important to choose problems that are tractable within that timeframe.</p><h3>Research that stands the test of time</h3><p>For our purposes, we&#8217;re talking about research that affects the way we think about or design modern AI systems. This is an exceptionally small sliver of research papers. Under this definition, I&#8217;d guess <strong>less than 1% of published AI research &#8216;matters&#8217;.</strong></p><p>Why is this number so small? One explanation is that modern AI grew out of machine learning and probabilistic modeling, which is a much deeper subfield with many sub-areas of study that have nothing to do with modern AI. Here&#8217;s one example: if you took statistics in school, you know that <a href="https://en.wikipedia.org/wiki/Foundations_of_statistics">modern statistics is divided into two schools (Bayesian and frequentist)</a> but modern AI is totally frequentist. Even though <a href="https://proceedings.mlr.press/v37/hernandez-lobatoc15.html">lots</a> <a href="https://ieeexplore.ieee.org/abstract/document/9756596">of</a> <a href="https://proceedings.mlr.press/v139/izmailov21a.html">people</a> <a href="https://link.springer.com/chapter/10.1007/978-3-030-42553-1_3">have</a> <a href="https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html">researched</a> <a href="https://arxiv.org/abs/1903.05779">Bayesian</a> <a href="https://books.google.com/books?hl=en&amp;lr=&amp;id=LHHrBwAAQBAJ&amp;oi=fnd&amp;pg=PA1&amp;dq=bayesian+neural+networks&amp;ots=K6xgRO4x-a&amp;sig=tRHTOY4SxZVKub-vuMO7_psMa_o#v=onepage&amp;q=bayesian%20neural%20networks&amp;f=false">neural</a> <a href="https://seunghan96.github.io/assets/pdf/BNN/paper/05.Ensemble%20Learning%20in%20Bayesian%20Neural%20Networks.pdf">networks</a>, they&#8217;ve basically been abandoned in favor of learning everything from data without a prior.</p><p>So what makes up this 1%? 
What research matters? To answer this question, let&#8217;s first gather a bit of data. NeurIPS (the biggest AI conference) gives &#8220;Test of Time&#8221; awards to ten-year-old papers that have made an outsized impact over the preceding decade. I asked Deep Research to gather me a list of the papers that have won the award:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2P0F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2P0F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 424w, https://substackcdn.com/image/fetch/$s_!2P0F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 848w, https://substackcdn.com/image/fetch/$s_!2P0F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 1272w, https://substackcdn.com/image/fetch/$s_!2P0F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2P0F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png" width="506" height="695.0446096654275" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1478,&quot;width&quot;:1076,&quot;resizeWidth&quot;:506,&quot;bytes&quot;:355768,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2P0F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 424w, https://substackcdn.com/image/fetch/$s_!2P0F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 848w, https://substackcdn.com/image/fetch/$s_!2P0F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 1272w, https://substackcdn.com/image/fetch/$s_!2P0F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef9451b1-73c2-406a-a0d9-48f5019d3913_1076x1478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Papers that have won the &#8220;Test of Time&#8221; award at NeurIPS, prepared by Deep Research.</figcaption></figure></div><p>Out of the ten papers that have won the Test-of-Time award, I notice the following themes:</p><ul><li><p>These are simpler than the average NeurIPS paper</p></li><li><p>Each of them seems to leverage more data and more compute than was common at the time</p></li></ul><p>And let&#8217;s remember that basically all deep learning research changed with the ImageNet paper in 2012, which only won its test-of-time award in 2022. So only the awards from 2022&#8211;2024 reflect research from the &#8220;modern&#8221; (post-AlexNet) era.</p><p>We can clearly see that in the long run, the AI research that matters is research that (a) is simple and (b) <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">scales well</a>. 
In fact, <strong>modern AI systems like Claude 4 and Veo 3 are so simple that they use almost none of the decades of published ML research</strong>. And yet this is likely what Hamming would have predicted &#8211; in the end, most science is forgotten.</p><p>This is probably a good thing. Downstream effects are that AI is easy for newcomers to learn, and relatively simple to implement. And it seems quite reasonable to bet that the next big breakthroughs in AI will be simple.</p><p>Remind yourself that simple questions are often the most ambitious. They often also take the longest to mentally grapple with, and are the hardest to engineer experiments for.</p><h3>Putting Hamming&#8217;s advice into practice</h3><p>I suspect that the problem here isn&#8217;t a lack of ambitious ideas; it&#8217;s more a fear of failure. Most people who have done research for a long time confront big, messy, unanswered questions that continually resurface through day-to-day experimentation.</p><p>It seems the best strategy is to do whatever you can to <strong>confront those questions directly</strong>. There is no question too big to make into a research project. Einstein famously used the thought experiment of riding along on a beam of light to explore the ideas that ended up revolutionizing modern physics. What&#8217;s your version of that question?</p><p>Perhaps you want to imagine your own personal data being stored inside an LLM, or what it would be like to slide down a gradient during training, or some new mechanism for visualizing and manipulating the internals of an LLM as it processes a sequence during inference.</p><p>Hamming also notes that the best way to move forward is often by talking to capable people that you trust. He specifically mentions that solo daily brainstorming sessions haven&#8217;t proved fruitful in his career. 
The best way to make progress, he says, is to talk to someone and explain where you&#8217;re starting from all the way up to the point where you get stuck.</p><p>Maybe the specific question isn&#8217;t even that important, since so few people seem to be thinking like this. The bottom line is that we&#8217;d all be better off if most people worked on big, ambitious projects for a longer time. We would see fewer papers, but each individual paper would have more depth. By publishing less, we would move faster.</p><div><hr></div><p>Thanks to Ege Erdil and Jeffrey Emanuel for feedback on an early draft of this post.</p>]]></content:encoded></item><item><title><![CDATA[There Are No New Ideas in AI… Only New Datasets]]></title><description><![CDATA[LLMs were invented in four major developments... 
all of which were datasets]]></description><link>https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only</link><guid isPermaLink="false">https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Wed, 09 Apr 2025 21:32:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!56cS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most people know that AI has made unbelievable progress over the last fifteen years&#8211; especially in the last five. It might feel like that progress is <em>*inevitable*</em> &#8211; although large paradigm-shift-level breakthroughs are uncommon, we march on anyway through a stream of slow &amp; steady progress. In fact, some researchers have recently declared a <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">&#8220;Moore&#8217;s Law for AI&#8221;</a> where the computer&#8217;s ability to do certain things (in this case, certain types of coding tasks) increases exponentially with time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!56cS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!56cS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 424w, 
https://substackcdn.com/image/fetch/$s_!56cS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 848w, https://substackcdn.com/image/fetch/$s_!56cS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 1272w, https://substackcdn.com/image/fetch/$s_!56cS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!56cS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png" width="720" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Length of asks AIs can do is doubling every 7 months&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Length of asks AIs can do is doubling every 7 months" title="Length of asks AIs can do is doubling every 7 months" srcset="https://substackcdn.com/image/fetch/$s_!56cS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 424w, 
https://substackcdn.com/image/fetch/$s_!56cS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 848w, https://substackcdn.com/image/fetch/$s_!56cS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 1272w, https://substackcdn.com/image/fetch/$s_!56cS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6c8b571-bdbe-46cc-aa5c-8fd5e5555b01_720x430.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">the proposed &#8220;Moore&#8217;s Law for AI&#8221;. (by the way, anyone who thinks they can run an autonomous agent for an hour with no intervention as of April 2025 is fooling themselves)</figcaption></figure></div><p>Although I don&#8217;t really agree with this specific framing for a number of reasons, I can&#8217;t deny the trend of progress. Every year, our AIs get a little bit smarter, a little bit faster, and a little bit cheaper, with no end in sight.</p><p>Most people think that this continuous improvement comes from a steady supply of ideas from the research community across academia &#8211; mostly MIT, Stanford, CMU &#8211; and industry &#8211; mostly Meta, Google, and a handful of Chinese labs, with lots of research done at other places that we&#8217;ll never get to learn about.</p><p>And we certainly have made a lot of progress due to research, especially on the systems side of things. This is how we&#8217;ve made models cheaper in particular. 
Let me cherry-pick a few notable examples from the last couple of years:</p><p>- in 2022 Stanford researchers gave us <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>, a better way to utilize memory in language models that&#8217;s used literally everywhere;</p><p>- in 2023 Google researchers developed <a href="https://arxiv.org/abs/2211.17192">speculative decoding</a>, which all model providers use to speed up inference (also developed at <a href="https://arxiv.org/pdf/2302.01318">DeepMind</a>, I believe concurrently?)</p><p>- in 2024 a ragtag group of internet fanatics developed <a href="https://kellerjordan.github.io/posts/muon/">Muon</a>, which seems to be a better optimizer than SGD or Adam and may end up as the way we train language models in the future</p><p>- in 2025 DeepSeek released <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a>, an open-source model with reasoning power equivalent to similar closed-source models from AI labs (specifically Google and OpenAI)</p><p>So we&#8217;re definitely figuring stuff out. And the reality is actually cooler than that: we&#8217;re engaged in a decentralized, globalized exercise of Science, where findings are shared openly on <a href="https://arxiv.org/">ArXiv</a> and at conferences and on social media, and every month we&#8217;re getting incrementally smarter.</p><p>If we&#8217;re doing so much important research, why do some argue that progress is slowing down? <a href="https://www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-like-bullshit">People are still complaining</a>. The two most recent huge models, <a href="https://x.ai/news/grok-3">Grok 3</a> and <a href="https://openai.com/index/introducing-gpt-4-5/">GPT-4.5</a>, only obtained a marginal improvement over the capabilities of their predecessors. 
In one particularly salient example, when <a href="https://arxiv.org/abs/2503.21934v1">language models were evaluated on the latest math olympiad exam</a>, they scored only 5%, indicating that <a href="https://cdn.openai.com/o1-system-card-20241205.pdf">recent announcements may have been overblown</a> when reporting system ability.</p><p>And if we try to chronicle the <em>*big*</em> breakthroughs, the real paradigm shifts, they seem to be happening at a different rate. Let me go through a few that come to mind:</p><h3>LLMs in four breakthroughs</h3><p>1. Deep neural networks: these first took off after the AlexNet model won an image recognition competition in 2012</p><p>2. Transformers + LLMs: in 2017 Google proposed transformers in <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>, which led to <a href="https://arxiv.org/abs/1810.04805">BERT</a> (Google, 2018) and the original <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a> (OpenAI, 2018)</p><p>3. RLHF: first proposed for LLMs in the <a href="https://arxiv.org/abs/2203.02155">InstructGPT paper</a> <s>from OpenAI in 2022</s>*<br><strong>* Correction: Paul Christiano <a href="https://arxiv.org/abs/1706.03741">wrote a paper on RLHF in 2017</a>; the concept of aligning models to human preferences is much older. Thanks to everyone who pointed this out!</strong></p><p>4. Reasoning: in 2024 OpenAI released o1, which led to DeepSeek R1</p><p>If you squint just a little, these four things (DNNs &#8594; Transformer LMs &#8594; RLHF &#8594; Reasoning) summarize everything that&#8217;s happened in AI. 
We had DNNs (mostly image recognition systems), then we had text classifiers, then we had chatbots, now we have reasoning models (whatever those are).</p><p>Say we want to make a fifth such breakthrough; it could help to study the four cases we have here. What new research ideas led to these groundbreaking events?</p><p>It&#8217;s not crazy to argue that <strong>all the underlying mechanisms of these breakthroughs existed in the 1990s,</strong> if not before. We&#8217;re applying relatively simple neural network architectures and doing either supervised learning (1 and 2) or reinforcement learning (3 and 4).</p><p>Supervised learning via cross-entropy, the main way we pre-train language models, emerged from Claude Shannon&#8217;s work in the 1940s.</p><p>Reinforcement learning, the main way we post-train language models via RLHF and reasoning training, is slightly newer. It can be traced to the <a href="https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">introduction of policy-gradient methods in 1992</a> (and these ideas were certainly around for the first edition of the Sutton &amp; Barto &#8220;Reinforcement Learning&#8221; textbook in 1998).</p><h3>If our ideas aren&#8217;t new, then what is?</h3><p>Ok, let&#8217;s agree for now that these &#8220;major breakthroughs&#8221; were arguably fresh applications of things that we&#8217;d known for a while. First of all &#8211; this tells us something about the <em>*next*</em> major breakthrough (that &#8220;secret fifth thing&#8221; I mentioned above). Our breakthrough is probably not going to come from a completely new idea, rather it&#8217;ll be the resurfacing of something we&#8217;ve known for a while.</p><p>But there&#8217;s a missing piece here: each of these four breakthroughs <strong>enabled us to learn from a new data source:</strong></p><p>1. 
AlexNet and its follow-ups unlocked <a href="https://www.image-net.org/">ImageNet</a>, a large database of class-labeled images that drove fifteen years of progress in computer vision</p><p>2. Transformers unlocked training on &#8220;The Internet&#8221; and a race to download, categorize, and parse all the text on <a href="https://arxiv.org/abs/2101.00027">The Web</a> (which <a href="https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications">it seems</a> <a href="https://arxiv.org/abs/2305.16264">we&#8217;ve mostly done</a> <a href="https://arxiv.org/abs/2305.13230">by now</a>)</p><p>3. RLHF allowed us to learn from human labels indicating what &#8220;good text&#8221; is (mostly a vibes thing)</p><p>4. Reasoning seems to let us learn from <a href="http://incompleteideas.net/IncIdeas/KeytoAI.html">&#8220;verifiers&#8221;</a>, things like calculators and compilers that can evaluate the outputs of language models</p><p>Remind yourself that each of these milestones marks the first time the respective data source (ImageNet, The Web, Humans, Verifiers) was used at scale. Each milestone was followed by a frenzy of activity, in which researchers compete to (a) siphon up the remaining useful data from any and all available sources and (b) squeeze more out of the data we already have, with tricks that make our systems more efficient and less data-hungry. 
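</p><p><em>As a concrete reminder of how old these underlying mechanisms are, here is a toy sketch of both: the cross-entropy loss used in pre-training, and a bare-bones REINFORCE-style policy-gradient update on a two-action bandit. The names and numbers are illustrative, not taken from any real system:</em></p>

```python
import math, random

random.seed(0)

# --- Supervised learning via cross-entropy ---
# The pre-training loss is the negative log-probability the model
# assigns to the observed next token. (Toy distribution, made up.)
def cross_entropy(probs, token):
    return -math.log(probs[token])

loss = cross_entropy({"the": 0.5, "a": 0.3, "cat": 0.2}, "the")  # -log(0.5) ≈ 0.693

# --- Reinforcement learning via REINFORCE (policy gradients, 1992) ---
# A softmax policy over two actions, updated by reward-weighted
# log-probability gradients: d/dtheta_k log pi(a) = 1[k == a] - p[k].
theta = [0.0, 0.0]  # one logit per action

def policy():
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]

for _ in range(500):
    p = policy()
    a = 0 if random.random() < p[0] else 1
    reward = 1.0 if a == 1 else 0.0      # action 1 is the good one
    for k in range(2):
        theta[k] += 0.1 * reward * ((1.0 if k == a else 0.0) - p[k])

# After training, the policy strongly prefers the rewarding action.
```

<p>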
(I expect we&#8217;ll see this trend in reasoning models throughout 2025 and 2026 as researchers compete to find, categorize, and verify everything that might be verified.)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U_q8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U_q8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U_q8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U_q8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U_q8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U_q8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg" width="750" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:750,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;How to train and validate on Imagenet&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to train and validate on Imagenet" title="How to train and validate on Imagenet" srcset="https://substackcdn.com/image/fetch/$s_!U_q8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U_q8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 848w, https://substackcdn.com/image/fetch/$s_!U_q8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U_q8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd3045-3daa-4adf-8d06-c75b3ac6f436_750x300.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Progress in AI may have been inevitable once we gathered <a href="https://www.image-net.org/">ImageNet</a>, at the time the largest public collection of images from the Web</figcaption></figure></div><h3>How much do new ideas matter?</h3><p>There&#8217;s something to be said for the fact that our actual technical innovations may not make a huge difference in these cases. Consider the counterfactual. If we hadn&#8217;t invented AlexNet, maybe another architecture would have come along that could handle ImageNet. If we never discovered Transformers, perhaps we would&#8217;ve made do with LSTMs or SSMs, or found something else entirely to learn from the mass of useful training data available on the Web.</p><p>This jibes with the theory some people have that nothing matters but data. 
Some researchers have observed that for all the training techniques, modeling tricks, and hyperparameter tweaks we make, the thing that makes the biggest difference, by and large, is changing the data.</p><p>As one salient example, some researchers worked on <a href="https://arxiv.org/abs/2212.10544">developing a new BERT-like model using an architecture other than transformers</a>. They spent a year or so tweaking the architecture in hundreds of different ways, and managed to produce a different type of model (a state-space model, or &#8220;SSM&#8221;) that performed about equivalently to the original transformer when trained on the same data.</p><p>This equivalence is profound because it hints that <em>*there is an upper bound to what we might learn from a given dataset*</em>. All the training tricks and model upgrades in the world won&#8217;t get around that cold, hard fact.</p><p>And maybe this apathy to new ideas is what we were supposed to take away from <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a>. If data is the only thing that matters, why are 95% of people working on new methods?</p><h3>Where will our next paradigm shift come from? <em>(YouTube&#8230; maybe?)</em></h3><p>The obvious takeaway is that our next paradigm shift isn&#8217;t going to come from an improvement to RL or a fancy new type of neural net. It&#8217;s going to come when we unlock a source of data that we haven&#8217;t accessed before, or haven&#8217;t properly harnessed yet.</p><p>One obvious source of information that a lot of people are working towards harnessing is video. According to <a href="https://www.dexerto.com/entertainment/how-many-videos-are-there-on-youtube-2197264/">a random site on the Web</a>, about 500 hours of video footage are uploaded to YouTube *per minute*. 
This is a ridiculous amount of data, much more than is available as text on the entire internet. It&#8217;s potentially a much richer source of information, too: videos contain not just words but the inflection behind them, as well as rich information about physics and culture that just can&#8217;t be gleaned from text.</p><p>It&#8217;s safe to say that as soon as our models get efficient enough, or our computers grow beefy enough, Google is going to start training models on YouTube. They own the thing, after all; it would be silly not to use the data to their advantage.</p><p>A final contender for the next &#8220;big paradigm&#8221; in AI is a data-gathering system that is in some way <em>embodied</em> &#8211; or, in the words of a regular person, robots. We&#8217;re currently not able to gather and process information from cameras and sensors in a way that&#8217;s amenable to training large models on GPUs. If we could build smarter sensors, or scale our computers up until they can handle the massive influx of data from a robot with ease, we might be able to use this data in a beneficial way.</p><p>It&#8217;s hard to say whether YouTube or robots or something else will be the Next Big Thing for AI. We seem pretty deeply entrenched in the camp of language models right now, but we also seem to be running out of language data pretty quickly. 
But if we want to make progress in AI, maybe we should stop looking for new ideas, and start looking for new data.</p>]]></content:encoded></item><item><title><![CDATA[CoPilot for Everything]]></title><description><![CDATA[Our employers have all the data they need to train AI models to replace us. How long will it be until this actually happens?]]></description><link>https://blog.jxmo.io/p/copilot-for-everything</link><guid isPermaLink="false">https://blog.jxmo.io/p/copilot-for-everything</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Fri, 28 Feb 2025 13:45:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!37xC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Between 2020 and 2021 I worked full-time at Google. Although my original plan was to work from Google headquarters in Mountain View, California, due to a global pandemic, my year-long job ended up being fully remote from start to finish. I received my laptop in the mail before my first day and sent it back in a box when I left the company. </p><p>During this stint I never set foot on the Google campus. 
Every contribution I made was digital: a series of keypresses, touchpad movements, and mouse clicks made from my official Google laptop. I also provided audio and video inputs via my laptop&#8217;s camera and microphone during video meetings. But no one ever saw me in person. </p><p>It only recently dawned on me that my actions may have been recorded. Of course, certain things I already knew *were* recorded: all the lines of code I submitted to the Google monorepository, for example, are probably still there. I&#8217;d imagine my emails are stored somewhere too, as well as the notes I wrote via Google docs. But what about the rest of my actions on the company computer? </p><p>It&#8217;s entirely possible that my whole daily work process was documented. In theory, my employer had a right to collect every input I provided on my company computer. Every mouse click, every keypress. And they could have sent this all back to store in some data warehouse. This is a much richer data type than simply the lines of code I produced: these are *behavioral traces* that define how I solve problems from start to finish. </p><p>As I&#8217;ll explain later in the post, the thing that scares me about the existence of this data is that it seems well within the capabilities of current technology to train a model that can replicate *me*, in some sense. I&#8217;m calling it a *Copilot for Everything*: an assistant that can auto-complete entire tasks for me, based on the actions I&#8217;ve taken at work in the past. And this feels like such an economically useful tool that it would be crazy for it *not* to happen in the next few years. </p><p>I don&#8217;t intend to pick on Google specifically, and I don&#8217;t know whether they do this or even what their policies are; I&#8217;m just trying to use my own experience in the corporate world to speculate on what I imagine will be a much bigger issue in the future. 
<strong>With today&#8217;s technology, it is totally plausible that a corporation might train a large AI model to mimic the digital work of an employee, even after that employee has left the company.</strong> Is it ethical for my former employer to use my prior work output to create a digital version of me? Is it ethical for someone&#8217;s *current* employer to do this?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!37xC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!37xC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 424w, https://substackcdn.com/image/fetch/$s_!37xC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 848w, https://substackcdn.com/image/fetch/$s_!37xC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 1272w, https://substackcdn.com/image/fetch/$s_!37xC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!37xC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png" width="1250" 
height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:706875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://jxmnop.substack.com/i/158101095?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!37xC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 424w, https://substackcdn.com/image/fetch/$s_!37xC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 848w, https://substackcdn.com/image/fetch/$s_!37xC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 1272w, https://substackcdn.com/image/fetch/$s_!37xC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88499ba2-78c2-450c-ae68-55608682e52c_1250x833.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>Automating remote work with supervised learning</h3><p>Behavioral traces reduce the concept of automating work output to a well-defined learning problem for AI. The model inputs would be all the inputs the computer provides me: pixels on the screen, maybe audio. Outputs are the &#8220;actions&#8221; that I took on my computer: keystrokes, mouse movements, clicks. </p><p>At a large enough scale, it seems entirely feasible to train a large model that can predict actions from computer inputs. This is doable with the same technology that powers today&#8217;s large language models: supervised learning, which lets us predict actions from computer inputs, and transformers, a type of neural network that excels at learning from large amounts of data. 
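</p><p><em>As a toy illustration of this learning problem (with made-up observation and action names, and a simple frequency count standing in for the large model), behavioral cloning can be sketched like this:</em></p>

```python
from collections import Counter, defaultdict

# Each training example pairs an observation of the screen with the
# action the human took. A real system would use raw pixels and a large
# model; here the "model" is just the empirical action distribution
# observed for each screen state. All names are hypothetical.
traces = [
    ("inbox_open",  "click_reply"),
    ("inbox_open",  "click_reply"),
    ("inbox_open",  "click_archive"),
    ("editor_open", "type_code"),
]

counts = defaultdict(Counter)
for obs, action in traces:
    counts[obs][action] += 1

def predict(obs):
    # Greedy policy: take the most frequent action seen for this state.
    return counts[obs].most_common(1)[0][0]

print(predict("inbox_open"))   # → click_reply
```

<p>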
</p><p>From the company&#8217;s perspective, the process would be simple. Employer records you doing your work for some amount of time; employer trains model to replicate your output for your most monotonous and boring tasks. Or perhaps the company takes all the data they have and trains a model on the aggregate of *all* employees&#8217; outputs. That would be closer to what we&#8217;ve seen work well in other domains, like vision and language, where <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">training on more data is usually the right answer</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Krur!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Krur!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Krur!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Krur!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Krur!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Krur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Krur!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Krur!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Krur!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Krur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958a286a-ad26-4e9e-9ad3-ab3d73b7475b_1800x1200.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Can imperfect models of us improve our productivity?</h3><p>So that tells us how a company could train a model to mimic employees&#8217; actions. But how do we use these models? If we think about the resulting artifact, it wouldn&#8217;t be a drop-in replacement for an employee: instead it&#8217;d be a probabilistic model that tells us the likelihood of each possible action given some input. First of all, these models have issues. Neural networks are <a href="https://arxiv.org/abs/1312.6199">easily tricked by adversarial inputs</a> and generally don&#8217;t perform well if the inputs are too out-of-distribution. Neither of these is a problem for humans. </p><p>Second, how do we do sampling? 
The obvious answer is to be &#8220;greedy&#8221; and take the most likely action at every step. This would work well in very low-entropy situations. For example, performing a task that&#8217;s identical to one that you&#8217;ve performed many times would be easy, and a good model could do it perfectly with no supervision. And in the case where the model has also learned from *other* employees&#8217; actions, if it were to perform a rote task that someone else had done many times, that would be easy too. </p><p>But what about higher-entropy situations? In the case of higher uncertainty, the model would output a less useful distribution over potential actions. In this case, greedy action-sampling could lead to &#8220;exposure bias&#8221;, where the model does something weird, ends up in a situation it&#8217;s never seen before, and inevitably goes off the rails. </p><p>Even good models will sometimes end up in situations of significant uncertainty. What I&#8217;d imagine is that we&#8217;ll end up with a solution similar to what has worked in self-driving cars: the computer can &#8216;drive&#8217; by itself until it encounters something completely unexpected, at which point a human will intervene. Once the novel scenario is resolved and the task again resembles something that the computer knows how to do, the model can resume operation. </p><p>It&#8217;s hard to imagine what this might look like from a user interface perspective, but I&#8217;ll suggest two basic options: &#8220;fast-forward&#8221; style and &#8220;foreman&#8221; style. </p><p>In a fast-forward setup, the assistant would perform certain tasks for you very quickly while you wait. In theory the model could operate the computer at 100x speed; if input latency isn&#8217;t a limiting factor, the computer could complete tasks in the blink of an eye. 
Maybe a button on your screen will light up when the computer detects you&#8217;re about to perform a task that&#8217;s &#8220;understood&#8221; by the assistant; clicking the button would fast-forward through that task, accomplishing it at 100x speed. </p><p>In the foreman mode, one human would oversee *many* AIs working in parallel. This is only possible if the models can do most of their work without any intervention. In the foreman setup, one person would be responsible for the success of several AI assistants, and jump in to help them when the task reaches a certain level of uncertainty. This is similar to <a href="https://www.nytimes.com/interactive/2024/09/03/technology/zoox-self-driving-cars-remote-control.html">Zoox operating stations</a>, where human operators remotely pilot cars out of unexpected situations from hundreds of miles away.</p><h3>What does this mean for us?</h3><p>Even if none of the exact futures I&#8217;ve imagined here play out, note the underlying theme: <strong>progress increases the average entropy of digital work</strong>. This principle has nothing to do with neural networks. As we develop better tools, we&#8217;re able to do repetitive tasks more quickly. Better abstractions reduce the amount of input it takes for us to generate the same amount of output. One way to do this, the one that&#8217;s discussed here, is directly modeling employee behavior with big neural networks.</p><p>The reality is that every remote worker&#8217;s job tends to be repetitive sometimes. And your value comes from how you handle the *least* repetitive things you do: reacting to novel situations, adapting to change. New tools will reduce the amount of repetition in your work. Our day-to-day jobs will become *less* predictable as our most mundane and monotonous tasks are modeled away.</p><p>Given all this &#8212; how can you make yourself indispensable? The answer is by fighting entropy. 
If the most efficient company is the one where its employees are doing the least repetitive work at all times, then the most productive employee is the one who&#8217;s the least modelable. Focus on the things that are hard to gather training data for.</p>]]></content:encoded></item><item><title><![CDATA[Please Stop Talking About AGI]]></title><description><![CDATA[Why I think Yann Lecun was right about LLMs (but perhaps only by accident)]]></description><link>https://blog.jxmo.io/p/we-should-stop-talking-about-agi</link><guid isPermaLink="false">https://blog.jxmo.io/p/we-should-stop-talking-about-agi</guid><dc:creator><![CDATA[Jack Morris]]></dc:creator><pubDate>Fri, 21 Feb 2025 18:03:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!N3r5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s become very popular over the last few years to speculate about how close society might be to <em>Artificial General Intelligence</em> (AGI). What AGI actually means is murky and often debated, but mentioning AGI is usually a good jumping-off point for discussions of future artificial intelligences&#8217; capabilities. Many following the field maintain <em>AGI timelines</em>, rigorous guesses at the probability that this mythical intelligence will emerge at various future points. Those in the know might ask you for your timelines over coffee, classifying them as &#8220;long&#8221; &#8211; that it might be a decade or two before AIs are smart enough to take all of our jobs &#8211; or &#8220;short&#8221; &#8211; that it could happen any day now.</p><p>This isn&#8217;t the most useful way of thinking about the progression of AI capabilities. The existence of a <em>timeline</em> implies AGI has a rigorous definition and can be measured. 
It also implies that AGI is inevitable, the only question being when it will arrive.</p><p>What I see is not a march towards complete general intelligence, but rather a trend of increasing AI productivity per unit of human input. This trend holds across many disparate applications. Our AIs can label more data, write more code, do more math, and drive cars and pilot planes for longer, all with less intervention from us. It may be that we&#8217;ll <em>never</em> reach a point where AIs can run forever, uninterrupted, without human guidance. Rather, we&#8217;re pushing the boundary of how much we can get for what we give.</p><p>Instead of talking about the mythical final frontier of AGI, I think we should start thinking more realistically and measuring the ratio of human input to useful AI output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N3r5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N3r5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 424w, https://substackcdn.com/image/fetch/$s_!N3r5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 848w, https://substackcdn.com/image/fetch/$s_!N3r5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 1272w, 
https://substackcdn.com/image/fetch/$s_!N3r5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N3r5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png" width="579" height="332.0735294117647" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:702,&quot;width&quot;:1224,&quot;resizeWidth&quot;:579,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!N3r5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 424w, https://substackcdn.com/image/fetch/$s_!N3r5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 848w, https://substackcdn.com/image/fetch/$s_!N3r5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 1272w, 
https://substackcdn.com/image/fetch/$s_!N3r5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295078c0-2a16-4d53-8ba7-0d9cb5ce4fb5_1224x702.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">What will the future trend of human input per AI output look like?</figcaption></figure></div><p>Imagine for a moment the curve of how much input we have to provide for a unit of economic value the computer produces, and how this has changed over time. 
A very rough estimate is pictured above; one important open question is whether we&#8217;re approaching some unknowable carrying capacity, or if this figure will eventually decay to zero. (If this happens, it means that computers will be able to produce economic value with <em>zero human input</em>. This would be a frightening outcome.)</p><p>To understand what I mean better, let&#8217;s take a trip back in time to 2017&#8230;</p><h4><strong>We&#8217;ve seen this before (in self-driving cars)</strong></h4><p>If you&#8217;re new to the AI field, you should know that before language models, there was a previous AI craze circa 2017: the rise (and fall?) of the self-driving car.</p><p>If you&#8217;re not new to AI, let me remind you.</p><p>Around that time, several companies declared that within a year they would have Fully Self-Driving cars. Billions of dollars were raised. Millions of miles were driven. Many companies were founded, some of which eventually went bankrupt.</p><p>And years later, we&#8217;re still not quite at FSD. 
Teslas certainly can&#8217;t drive themselves; Waymos mostly can, within a pre-mapped area, but still have issues and intermittently require human intervention.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SgrO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SgrO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 424w, https://substackcdn.com/image/fetch/$s_!SgrO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 848w, https://substackcdn.com/image/fetch/$s_!SgrO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 1272w, https://substackcdn.com/image/fetch/$s_!SgrO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SgrO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png" width="606" height="346.79296875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b175eec3-875e-4f90-869e-adcd945434d9_1024x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1024,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SgrO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 424w, https://substackcdn.com/image/fetch/$s_!SgrO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 848w, https://substackcdn.com/image/fetch/$s_!SgrO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 1272w, https://substackcdn.com/image/fetch/$s_!SgrO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb175eec3-875e-4f90-869e-adcd945434d9_1024x586.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">In 2016, Tesla CEO Elon Musk promised that a Tesla would drive itself fully autonomously from Los Angeles to New York City by the end of the year. That still hasn&#8217;t happened. (Teslas are still sold with an optional &#8220;<a href="https://www.tesla.com/support/full-self-driving-subscriptions">Full Self-Driving</a>&#8221; subscription)</figcaption></figure></div><p>In response, the field has moved on from speculating about the exact point at which cars will be fully self-driving. People instead discuss miles-per-disengagement (or miles-per-human-intervention). How far can the car drive without a human getting involved? This new lens gives us something that we can measure and track over time. Better technology gives us more miles driven per necessary human action.</p><p>What does the future look like for FSD? 
A recent report said Teslas can drive<a href="https://arstechnica.com/cars/2024/09/tesla-full-self-driving-requires-human-intervention-every-13-miles/"> thirteen miles per human intervention</a>; this estimate feels a little low to me, but still seems pretty good. We can certainly drive this number up with bigger models, faster inference, more data, and improved overall engineering.</p><p>A crucial question is whether, with current technology, the miles-per-intervention number is bounded by some theoretical limit we don&#8217;t understand. We don&#8217;t know whether our models will keep getting better forever (approaching infinite miles driven with <em>no</em> interventions) or if there really is some amount of human intervention that will always be necessary.</p><h4><strong>Why Yann LeCun was wrong (kind of)</strong></h4><p>Now let&#8217;s apply this idea to today&#8217;s AI craze: language models.</p><p>A few years ago, Meta&#8217;s Chief AI Scientist Yann LeCun gave a talk about how language models won&#8217;t give us a direct path to human-level intelligence. 
He argued that because language models generate outputs token-by-token, and each token introduces a new probability of error, if we generate outputs that are too long, this per-token error will compound to inevitable failure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cdjK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cdjK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 424w, https://substackcdn.com/image/fetch/$s_!cdjK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 848w, https://substackcdn.com/image/fetch/$s_!cdjK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 1272w, https://substackcdn.com/image/fetch/$s_!cdjK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cdjK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png" width="580" height="301.328125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:1024,&quot;resizeWidth&quot;:580,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cdjK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 424w, https://substackcdn.com/image/fetch/$s_!cdjK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 848w, https://substackcdn.com/image/fetch/$s_!cdjK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 1272w, https://substackcdn.com/image/fetch/$s_!cdjK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F033f75cc-0873-4ab5-854d-22368a4713aa_1024x532.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Yann presenting his &#8220;unpopular opinion&#8221;: an argument for why researchers shouldn&#8217;t work on language models. </figcaption></figure></div><p>Yann has used this simple argument to explain to the masses why we shouldn&#8217;t work on language models if we care about achieving human-level AI. He presents this problem of compounding errors as a critical flaw in language models themselves, something that can&#8217;t be overcome without switching away from the current autoregressive paradigm.</p><p>But this has turned out to be wrong. A few new AI systems (notably OpenAI&#8217;s o1/o3 line and DeepSeek R1) contradict this theory. 
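Concretely, the compounding-error argument is just arithmetic: if each token is wrong with some independent probability eps, then an n-token output is entirely correct with probability (1 - eps)^n, which decays exponentially in length. Here is a minimal sketch (the 1% per-token error rate is an illustrative assumption, not a measured number):

```python
# The compounding-error model: if each generated token is wrong with
# independent probability eps, an n-token output is entirely correct
# with probability (1 - eps) ** n -- exponential decay in output length.
# eps = 0.01 below is an illustrative assumption, not a measured rate.

def p_all_tokens_correct(eps: float, n: int) -> float:
    """Probability that none of n tokens contains an error, assuming
    independent per-token errors (the naive model)."""
    return (1.0 - eps) ** n

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} tokens: {p_all_tokens_correct(0.01, n):.6f}")
```

Under this naive model, long outputs are doomed no matter how small eps is; shrinking eps only slows the decay. The new reasoning systems mentioned above break the independence assumption.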
They are autoregressive language models, but actually get better by generating longer outputs:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nNys!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nNys!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 424w, https://substackcdn.com/image/fetch/$s_!nNys!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 848w, https://substackcdn.com/image/fetch/$s_!nNys!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 1272w, https://substackcdn.com/image/fetch/$s_!nNys!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nNys!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png" width="464" height="294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:464,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nNys!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 424w, https://substackcdn.com/image/fetch/$s_!nNys!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 848w, https://substackcdn.com/image/fetch/$s_!nNys!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 1272w, https://substackcdn.com/image/fetch/$s_!nNys!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5980e6b-9bf0-49a6-9a68-bccd703ab694_464x294.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">A graph from the<a href="http://arxiv.org/abs/2501.12948"> DeepSeek R1 report</a> showing how their system generates longer outputs as it gets smarter. This directly contradicts the contention that language models will eventually fail if we let them &#8220;think&#8221; for too long.</figcaption></figure></div><p>The finding that language models can get better by generating longer outputs directly contradicts Yann&#8217;s hypothesis. I think the flaw in his logic comes from the idea that errors must compound per-token. Somehow, even if the model makes a mistake, it is able to correct itself and decrease the sequence-level error rate. 
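To see why self-correction changes the math, here is a deliberately crude extension of the compounding-error model (a toy model of my own, not anything from the R1 report): suppose the model catches and backtracks out of a fraction c of fresh errors before they become permanent, so only undetected errors are fatal. The effective per-token failure rate then drops from eps to eps * (1 - c).

```python
# Toy self-correction model (my own sketch, not from the R1 paper): the
# model still errs with probability eps per token, but catches and fixes
# a fraction c of fresh errors before they become permanent. Only the
# undetected errors are fatal, so the effective per-token failure rate
# falls from eps to eps * (1 - c).

def p_success(eps: float, n: int, c: float = 0.0) -> float:
    """Probability an n-token output ends up correct when a fraction c
    of new errors are caught and corrected before they propagate."""
    return (1.0 - eps * (1.0 - c)) ** n

eps = 0.01
print(f"no correction,   1,000 tokens: {p_success(eps, 1_000):.6f}")
print(f"95% caught,      1,000 tokens: {p_success(eps, 1_000, c=0.95):.6f}")
print(f"95% caught,    100,000 tokens: {p_success(eps, 100_000, c=0.95):.6f}")
```

Note that even with correction the success probability still decays toward zero as outputs grow; backtracking shrinks the exponent rather than eliminating it.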
This is an incredible development, and was not the case with prior generations of LLMs.</p><p>And it turns out that the models&#8217; mechanisms for correcting themselves are interesting and interpretable:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EPOU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EPOU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 424w, https://substackcdn.com/image/fetch/$s_!EPOU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 848w, https://substackcdn.com/image/fetch/$s_!EPOU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 1272w, https://substackcdn.com/image/fetch/$s_!EPOU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EPOU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png" width="598" height="329" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd180725-048f-4a74-84d8-472e46d420f9_598x329.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:329,&quot;width&quot;:598,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EPOU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 424w, https://substackcdn.com/image/fetch/$s_!EPOU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 848w, https://substackcdn.com/image/fetch/$s_!EPOU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 1272w, https://substackcdn.com/image/fetch/$s_!EPOU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd180725-048f-4a74-84d8-472e46d420f9_598x329.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">An example from the<a href="http://arxiv.org/abs/2501.12948"> DeepSeek R1 report</a> of a language model increasing its probability of success mid-sequence, which Yann LeCun has argued for several years is impossible.</figcaption></figure></div><p>As shown in the image, models can in fact increase their likelihood of success mid-sequence by generating specific strings of tokens. A cottage industry of research is emerging to characterize and induce these behaviors, such as &#8220;backtracking&#8221; to a better solution. (It&#8217;s worth noting here that we still don&#8217;t know how generalizable these techniques are outside of the types of problems these models were trained on, like coding and math problems.)</p><h4><strong>Why Yann LeCun was right (kind of)</strong></h4><p>Naturally, people have been upset about all this. 
One of the founding members of the field has been giving bad advice to early-stage researchers based on a busted intuition. It&#8217;s infuriating, right?</p><p>Well, not exactly. I think that people are taking Yann&#8217;s argument a little too literally. Yes, we&#8217;ve figured out a way to build language models that don&#8217;t strictly get worse as we use them to generate longer outputs. But the limiting behavior remains the same: <em>eventually, if we continue generating from a language model, the probability that we get the answer we want still goes to zero</em>.</p><p>The practical takeaway from this is that AIs can&#8217;t work on their own forever. Lots of people are working on building Agents, systems that use language models to accomplish tasks over long time horizons. But the quest for a fully autonomous agent feels similar to the quest for fully self-driving cars: it might <em>never</em> be possible to build this, at least with the current stack.</p><p>There may be a kind of <a href="https://en.wikipedia.org/wiki/Data_processing_inequality">data processing inequality</a> going on behind the scenes. In some sense, the highest-quality information fed into language models comes from the human-written prompts (and potentially inputs read in via tool use, like checking flight times or the weather). When the language models are left on their own to generate infinitely long chains of thought, that input &#8220;signal&#8221; attenuates to nothing; eventually, without further input from a human, those chains of thought lose all meaningful value. Improving our technology can delay this, and improve the quality and amount of work we can do with a single input prompt. But it doesn&#8217;t seem likely that I&#8217;ll wake up one day next year and this figure (work / prompt) will have spiked to infinity.</p><p>This is why measuring language models&#8217; progress in terms of AGI timelines is misguided. 
We should be thinking about language models the same way we think about cars: How long can a language model operate without needing human intervention to correct errors? Framing our inquiries into language models like this allows us to reconcile Yann&#8217;s valid concerns about the models with new advances from OpenAI and DeepSeek&#8212;and will also lead to more productive research and conversation in the language model field, as it has with cars.</p><p>Instead of waiting for FAA (fully-autonomous agents), we should understand that this is a continuum, and we&#8217;re consistently increasing the amount of useful work AIs can do without human intervention. Even if we never push this number to infinity, each increase represents a meaningful improvement in the amount of economic value that language models provide. It might not be AGI, but I&#8217;m happy with that.</p>]]></content:encoded></item></channel></rss>