It's funny, I had the exact same reaction to that cherry cake paper. What I thought it was about before I read it closely is almost exactly the same as your post.
Really interesting! This makes a lot of sense to me. It feels closer to how humans learn than pretraining (even if it's still a bit different). When I'm learning a difficult thing in a textbook, I spend a lot of time thinking in between reading sentences, and I feel like the models need that too. The current RL paradigm feels more similar to skimming and then repeatedly trying the problems at the end of the chapter.
As a general heuristic, I think that real gradients with backprop >>> approximating gradients with RL. I've yet to find a case where that isn't true -- why would that not be true here?
One possible reason is that the 'reasoning trace' is the real magic -- by giving the model a scratch pad to evolve some kind of reasoning, it will do something better. But why can we not do the same thing in backprop-based approaches?
I think this is an excellent question and I agree on all counts. RL should never be one’s first choice when approaching a machine learning problem.
The central issue is that we do not know what the “true” reasoning chain should look like for any given prediction. How many tokens should it be? And which tokens are those?
If we knew the true reasoning, we could certainly just use supervised learning. But we don't, so we resort to RL.
Sure, but you have those problems in the RL regime too -- we just take it for granted in the RL regime that we don't have real gradients for *anything*.
As a potential counterexample to illustrate the point, you could have a pre-training system that spits out 100 "reasoning" tokens in between every "real" token, with gradients generated only on the "real" token using normal backprop. Why do we expect that the 100 'reasoning' tokens in this regime wouldn't eventually evolve into something like what you'd expect from the RL regime? I'd argue they're exactly the same, except that in RL you have slightly worse gradients every 100-odd token outputs (because you're approximating them).
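A minimal sketch of that masked-loss setup (the function name, shapes, and masking scheme are all my own invention, just to make the idea concrete): cross-entropy is computed only at "real" positions, so the scratchpad tokens are never directly supervised and can only earn gradient through their downstream effect on real-token predictions.

```python
import numpy as np

def masked_next_token_loss(logits, targets, is_real):
    """Cross-entropy computed only at 'real' token positions.

    logits:  (seq_len, vocab) array of model outputs
    targets: (seq_len,) array of next-token ids
    is_real: (seq_len,) boolean mask; False marks scratchpad 'reasoning' tokens
    """
    # log-softmax over the vocab dimension (numerically stable)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    per_token = -logp[np.arange(len(targets)), targets]
    # Loss (and hence gradient) comes only from real positions; the
    # reasoning tokens are shaped indirectly, via attention from later
    # real positions, rather than by an RL-style reward.
    return (per_token * is_real).sum() / is_real.sum()
```

In a real transformer the masking would just be a weight on the per-position loss, exactly as here; everything else is ordinary backprop.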
I love your work. Please keep posting. I wait for your posts every week :)
Something like:
1) The model "reasons" by dumping a bunch of info into its scratchpad -- early on this would be meaningless noise, but eventually it would likely learn to use the scratchpad in a way that starts to include at least some language. If you are reasoning from the very ground up though, it may develop its own internal language and/or mix languages while it reasons.
2) The model then outputs a probability distribution of possible next tokens. From what I understand though, this probability distribution is "under the hood" in a sense?
3) Teach the model also to be "conscious" of this probability distribution -- that is, have it be able to generate output tokens along the lines of "my probability distribution looks like P(X) = ..."
4) Let the model choose how long it wants to reason for.
Then, have a reward function which:
1) Rewards the model for having the actual next token in its probability distribution. The higher the probability it assigned to the actual next token, the higher the reward.
2) Rewards the model for having a well-calibrated sense of the actual probability distribution. This teaches the model to "know what it knows" and helps with self-awareness, hallucinations, etc.
3) Puts a slight penalty on thinking longer so the model learns to be economical. The right function to use here would require experimentation.
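A toy version of that three-part reward, just to pin down the moving parts (the log-prob accuracy term, the KL-based calibration term, and all the weights are assumptions of mine, not anything settled):

```python
import numpy as np

def reasoning_reward(p_true, predicted_dist, actual_dist, n_think_tokens,
                     calib_weight=0.5, think_cost=0.001):
    """Sketch of the three-part reward described above.

    p_true:         probability the model assigned to the actual next token
    predicted_dist: the model's self-reported ('conscious') distribution
    actual_dist:    the distribution it actually produced under the hood
    n_think_tokens: how many scratchpad tokens it chose to spend
    """
    # 1) reward for putting probability mass on the true next token
    accuracy_term = np.log(p_true + 1e-9)
    # 2) calibration: penalize mismatch between the self-report and the
    #    real distribution (negative KL divergence, zero when they match)
    kl = np.sum(actual_dist * (np.log(actual_dist + 1e-9)
                               - np.log(predicted_dist + 1e-9)))
    calibration_term = -calib_weight * kl
    # 3) slight per-token penalty so reasoning length stays economical
    length_term = -think_cost * n_think_tokens
    return accuracy_term + calibration_term + length_term
```

As noted above, the relative weights (and whether the length penalty should be linear at all) would need experimentation.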
Is the “Grok 2 was trained on…” paragraph out of place, pre-training style? :)
Thanks Doug. A graphic had been dropped in the final version. I fixed it last night.
Check this paper out: https://arxiv.org/html/2408.15240v1
Your point about how much time to spend within each reasoning unit is the same issue Shuchao raised in his talk on how we still need to figure out the allocation of FLOPs per unit of intelligence.
And well said on "if something makes sense from first principles, we should keep working on it until we work out all the kinks" (cue Hyung Won's and Jason's graph where the x-axis is "efforts" instead of "compute").