28 Comments
akash

First, a minor correction:

> "3. RLHF: first proposed (to my knowledge) in the InstructGPT paper from OpenAI in 2022"

Deep reinforcement learning from human preferences by Christiano et al. (2017) is the foundational paper on RLHF. Link: https://arxiv.org/abs/1706.03741

Interesting perspective, and I do like the bigger question you are asking: what ended up mattering the most for the success of LLMs? Some quick thoughts and questions:

- I do think building a GPT-3-like system was certainly feasible in the 90s *if* we had the computing capacity back then (Gwern has a nice historical exposition on Moravec's predictions which I recommend: https://gwern.net/scaling-hypothesis)

- I am not convinced that just unlocking YT data would be the next big thing (for AGI, and I know you don't like AGI talk ... sorry). There is some evidence suggesting that the models are still not generalizing, but instead defaulting to a bag of heuristics and other poor learning strategies (https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1). Assuming this is true, I would expect that a YT-data-trained LLM would appear much smarter, crush the benchmarks, and have a better understanding of the world, but may not be transformative. Massively uncertain about this point, though.

- "perhaps we would’ve settled with LSTMs or SSMs" — are there any examples of LSTM-driven language models that are comparable to Transformer-based LLMs?

- Relatedly, I think the importance of adaptive optimizers is under-emphasized here. Without Adam, wouldn't LLM training be >2x more expensive and slower?
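On the adaptive-optimizer point: Adam (Kingma and Ba, 2015) rescales each step by running estimates of the gradient's first and second moments, which is a big part of why training converges faster than with plain SGD. A minimal sketch in plain Python on a toy quadratic (the objective, starting point, and hyperparameters are all illustrative, not anything from the post):

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

In real training the same update is applied independently to every parameter; libraries such as PyTorch ship it as `torch.optim.Adam`.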

Leo Benaharon

Maybe YouTube and robotics shouldn't be separated. If we can learn complex motions just by watching videos, why can't AI?

Erik Steiger

A baby learns by acting upon the world. Sure, at some point you have a good enough world model that you can use novel information from a video to expand it. But building up a good enough world model efficiently probably requires some kind of feedback.

DotProduct

The difference between language and other data (e.g. video) is that language is massively compressed. It's what we evolved to handle social and other world interactions with our limited compute. Perhaps LLMs can do what they do because they are using our distilled version of the world. Hence, rather than video, I might turn my attention to novel languages, e.g. between plants and animals, including extra-human sensory data: sights and sounds outside the normal human range. Whalesong, anyone?

Julie By Default

I loved this. Yes. Exactly. People talk about AI like it’s inventing things — but this cuts right through that. Most of what we call “generation” is really just recombination, powered by increasingly structured inputs — from us. The breakthroughs weren’t big ideas; they were new ways to learn from new kinds of data.

That’s what makes this piece so sharp: it’s not dismissive of research, just honest about where progress actually comes from. Not magic. Not models. Infrastructure. Access. The moment a new dataset becomes legible at scale, everything shifts — and we call it innovation.

And it’s not just AI. In product too, the surface gets all the credit, but the real leverage sits underneath — in what’s visible, counted, or quietly baked into the defaults.

attiq rahman

We need to explore new methods for the new data we have. With old methods, we cannot exploit new data sources like sensors and YouTube.

Melon Usk - e/uto

Interesting observation, Jack! The next big thing is a direct democratic simulated multiverse: you make a digital backup of Earth to have a vanilla virtual Earth, spin up a digital version with magic and public teleportation, and BOOM! You have a simple simulated multiverse with 2 Earths!

By the way, we recently wrote about how to align AIs: we spent 3 years modeling ultimate futures and found the safest way to the best one.

Steven Marlow

Adding language ability should be the last step in the process, but the AI industry is focused on product development, not core research. The real solution has to come from outside of industry.

Melon Usk - e/uto

You’re spot on! We’ve done exactly that for AI alignment: combined everything to solve it.

Daniel

This is such a refreshing take! You're absolutely right—we keep chasing shiny new architectures when the real breakthroughs have always been about unlocking new data sources. The YouTube angle is fascinating. Google sitting on that treasure trove while we debate which optimizer to use...

suman suhag

Provocative question! Having spent at least 25 years studying RL, ever since my first real job at IBM Research, where I explored the use of methods like Q-learning from 1990–93 to teach robots new tasks, I’ve watched the field through its various phases. In the early 1990s, when I got involved, it was restricted to a small handful of aficionados. I organized the first National Science Foundation workshop on RL (in 1995), to which about 50–60 senior researchers were invited.

Gradually, through the early part of the 2000s, the field gained popularity, but never seemed to become a mainstream research topic within ML. Then, wham! DeepMind did its thingie with the combination of deep learning and RL, applied to the visually appealing domain of Atari video games, and (deep) RL’s popularity went through the roof. Now it seems all the rage, and certainly many employers are hiring (in the Bay Area, it’s an area sought after by some of the labs doing autonomous driving). Google paid half a billion euros for DeepMind (supposedly!) on the basis of their deep RL Atari demo. So this looked like a real turning point, and RL came to life!

So, getting back to the question: is RL a “dead end”? In answering this provocative question, one has to clarify one’s point of view. Certainly, from the standpoint of the work going on at DeepMind and other places on using deep RL to play games like Go or chess, or to train a self-driving car given an accurate simulator of the world, RL is poised to become well-established technology, and its popularity is only going to increase. RL sessions at major AI and ML conferences are very well attended, and RL submissions are definitely increasing. In all these dimensions, RL is very much not at a “dead end”; in fact, its popularity is only increasing.

But, but, …. you knew there was a but coming there!

When you impose on RL the goal of “online learning in real time from the real world”, and not doing millions of simulation steps where agents can be killed thousands of times with no penalty, I fear RL is very much at a dead end. It is not clear to me that any extension of the au courant deep RL methods is going to lead to successes in the real world, in terms of a physical agent that can learn in real time with a small number of examples.

That is, if your goal is to build a model of how humans learn complex skills, such as driving, then RL to me is a very poor explanation of how such skills are acquired. One has only to look at the comparative results reported in the AAAI 2017 paper by Tsividis et al., comparing random humans on Amazon Mechanical Turk with the best deep RL programs at Atari video games, to see where deep RL simply flounders. Humans learn Atari video games, like Frostbite, about 1000x faster than the fastest deep RL methods.

A typical human learned Frostbite in 1 minute with a few hundred examples at most. DQN or other deep RL programs take days with millions of examples. It’s not even close, it’s like another galaxy in terms of the speed of learning differences. So, looking at this paper, I’d have to say I don’t see any way to capture such large differences with any incremental tweaking of deep RL methods, such as being reported annually in ICML or NIPS papers (of which I review a bunch each year, hoping against hope to see a new idea emerge, only to be disappointed!).

So, what’s to be done to “rescue RL”? I’m not sure there’s really a solution out there. I for one have stopped believing that we learn complex skills like driving by something that resembles “pure RL” (that is, from rewards alone). Humans learn to drive because they in fact “know” how to drive before they ever try to drive once. They’ve seen their parents, friends, lovers, Uber drivers, etc. drive many, many times, and they’ve seen driving behavior in movies for thousands of hours. So, when they finally get behind the wheel, they instinctively “know” what driving means, but of course, they have never actually controlled a physical car before. So, there is that all-important “last mile” of actual driving that needs to be learned.

But, since the driving program is largely already in place, built in by many thousands of hours of observation, not to mention active instruction by a driving teacher or an anxious parent, what needs to be “learned” are a few control parameters that tell the human brain how much to turn the wheel or press the brake, and more importantly, where to look on the road, etc. This is of course not trivial, which is why humans take a few weeks to get comfortable behind the wheel. But if you look at real hours of practice, humans learn to drive in a few hundred hours — for those paying for driving instruction, this is expensive, since you are charged by the hour.

Also, it is important to remember that when you impose the condition of learning in the real world, there can be “no cheating”! That is, unlike the ridiculous 2D world of Atari video games, like Enduro, where one is given a highly simplified 2D visual world and actions are limited to a few discrete choices, humans must drive in the full 3D real world and have the huge task of controlling both legs, both hands, neck, body, and so on (many hundreds of continuous degrees of freedom), as well as having to cope with an immense sensory space of stereo vision and binaural hearing.

The only way humans ever learn to drive in a few hundred hours is the simple fact that we already almost know driving, and we have obviously a fully working vision system, so we can read signs, recognize cars and pedestrians, and our hearing system also recognizes sirens, alerts, horns etc. So, if you look at the immensity of the whole driving task, I would claim more than 95% of the driving knowledge is already known, and the small remaining part has to be acquired from practice. This is the only explanation for how humans learn such a complex skill as driving in a few hundred hours. There is NO magic here.

So, in that sense, pure (deep) RL seems like a dead end. The pure (deep) RL problem formulation really does not hold much interest for me any more. What is needed in its place is a more complex model of how learning happens by combining observation, transfer learning, and many other types of behavior cloning from observed demonstration to the learner, and finally being able to take this knowledge, and then improve it with some actual trial and error RL.

One can generalize this to other modes of learning as well. The late Richard Feynman, who was arguably the most influential physicist after the Second World War, taught a classic introductory course at Caltech, which led to probably the best-selling college textbook of all time, the Feynman Lectures on Physics (still being sold almost 60 years later, in the nth edition). When he looked at how students handled his problem sets, Feynman was ultimately disappointed. He realized that even the extremely bright students at Caltech could not “learn” physics simply by sitting in his class and absorbing his lectures. So, he ended his preface to the textbook with a disappointing conclusion, quoting Gibbon (which I had long ago memorized):

“The power of instruction is seldom of much efficacy, except in those happy dispositions where it is almost superfluous”.

I realized the wisdom of this saying after spending two decades or more teaching machine learning to graduate students at several institutions. It seems almost paradoxical, but what Gibbon is saying, and what Feynman and I both discovered, is that learning from teaching only works when the learner “almost already knows” the subject.

But this is precisely what the various theoretical formulations of ML predict must be the case: there is no “free lunch” in terms of being able to learn. DeepMind’s DQN network takes millions and millions of steps to learn an apparently trivial task (to humans) like Frostbite, because initially DQN knows nothing. Humans, in contrast, learn Frostbite in < 1 minute because they have spent many, many hours building the background needed to learn Frostbite so quickly (e.g., vision, hand-eye coordination, general game-playing strategies).

Unfortunately, the prevailing currents in the field, at venues like NeurIPS (formerly NIPS), ICML, and AAAI, tend to “glorify” knowledge-free learning, so you end up with hundreds, if not thousands, of (deep) RL papers where agents take millions of time steps to learn apparently simple tasks. To me, this approach is ultimately a “dead end”, if your goal is to develop a computational model of how humans learn.

suman suhag

It can definitely be OK, but it depends on what you're trying to do, and on what "reality" is (i.e. what the most correct answer is). Adding variables that aren't needed won't help your model (particularly your estimates), but also might not matter much (e.g. for predictions). However, removing variables that are real, even if they don't meet significance, can really mess up your model.

Here are a few rules of thumb:

Include the variable if it is of interest beforehand, or if you want a direct estimate of its effect. If your business collaborators say to put it in, put it in. If they're looking for estimates of the holiday effects, put it in (although there might be some debate as to whether you should look at each holiday individually).

Include the variable if you have some prior knowledge that it should be relevant. This can be misleading, because it invites confirmation bias, but I'd say in most cases it makes sense to do so. Particularly for holiday effects (I assume this is something like sales or energy consumption), these are well known and documented, and those small but not-statistically-significant effects are real.

In general practice (i.e. most real-world situations), it's better to have a slightly overspecified model than an underspecified one. This is particularly true for the purposes of prediction, because the predicted response remains unbiased. This rule is very conditional, but the other bullets that favor overspecification tend to be more common in practice, especially in the business/applied world. Note that by saying that, I bring it back to the second bullet point, emphasizing business experience.

If you want a model that can generalize to many cases, you should favor fewer variables. An overfitted model works, but it tends to work only for a narrow inference space (i.e. the one reflected by your sample).

If you need precise (low variance) estimates, use fewer variables.

Just to re-emphasize: these are rules of thumb. There are plenty of exceptions. Judging by the limited information you've provided, you probably should include the non-significant "holiday" variable.

I've seen many saturated models (every term included) that perform extremely well. This isn't always true, but this works because, in a lot of business problems, reality is a complex response (so you should expect a lot of variables to be present), in addition to the lack of statistical bias from adding all these variables. Less relevant to this question, but relevant to this answer is that "Big data" also captures the power of the law of large numbers and the central limit theorem.

Variable selection is a long and complicated topic. Look up descriptions of the drawbacks of underspecification vs. overspecification, while remembering that the "right" model is the best - but unachievable. Determine if your interest is in the mean or the variance. There's a lot of focus on variances, especially in teaching and academia...but in practice and in most business settings, most people are more interested in the mean! This goes back to why overspecification in most real world cases should probably be favored.
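The danger of dropping a real variable can be shown with a tiny synthetic example (all data, coefficients, and the "holiday" setup below are invented for illustration): when the dropped variable is correlated with a variable you keep, the kept variable's estimate absorbs the omitted effect, the classic omitted-variable bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic "sales" data: a real holiday effect, with holidays
# deliberately made more likely when the main driver x is high.
x = rng.normal(size=n)
holiday = (rng.random(n) < 0.1 + 0.6 * (x > 0)).astype(float)
y = 2.0 + 3.0 * x + 1.0 * holiday + rng.normal(scale=1.0, size=n)

def ols(X, y):
    """Ordinary least squares via numpy's least-squares solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
full = ols(np.column_stack([ones, x, holiday]), y)  # correctly specified
reduced = ols(np.column_stack([ones, x]), y)        # holiday dropped

print("x coefficient, full model:   ", round(full[1], 2))     # close to 3.0
print("x coefficient, reduced model:", round(reduced[1], 2))  # biased upward
```

Prediction on new data drawn the same way stays roughly unbiased in the reduced model, but the coefficient on `x` no longer estimates the effect you think it does, which is exactly the estimates-vs-predictions distinction above.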

suman suhag

As the name suggests, GLM models are a generalization of the linear regression model. When we speak about this generalization, we mean that rather than forcing a linear relationship between the dependent and independent variables, it allows the dependent variable to be related to the independent variables through a link function.

E(Y) = μ = g⁻¹(Xβ)

Here, g is the link function through which we are relating the variables. Depending on the problem, we can choose the link function to be logit, probit, identity, etc. This gives us a lot of freedom in choosing the specification of the model for a given problem.
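As a minimal pure-Python sketch of the link-function idea (the coefficients and input below are made up for illustration), the logit link maps an unbounded linear predictor Xβ to a valid mean in (0, 1):

```python
import math

def inverse_logit(eta):
    """g^{-1} for the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# The linear predictor X @ beta can be any real number...
beta = [-1.0, 0.8]           # illustrative coefficients (intercept, slope)
x = 2.5
eta = beta[0] + beta[1] * x  # = 1.0

# ...but the link function maps it to a valid mean, here a probability:
mu = inverse_logit(eta)
print(round(mu, 3))  # → 0.731
```

For real fitting, libraries such as statsmodels expose this directly, e.g. `sm.GLM(y, X, family=sm.families.Binomial())` uses the logit link by default.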

noreson

100% agree.

We have the world's largest content-creator (YouTube) database, with 180 million+ channels and a billion videos. If you are interested, visit www.socialnetwork0.com or send an email to info@socialnetwork0.com

SQream

Your post makes a strong point: real progress in AI often comes from new data, not new algorithms. ImageNet, web-scale text for Transformers, and RLHF all show that breakthroughs follow access to richer datasets.

I also agree that the next leap will come from multimodal sources like video, robotics, and sensors. Models can only recombine what they have seen - so expanding what they can learn from matters more than tweaking architectures.

At SQream (www.sqream.com), we focus on how organizations work with massive and complex datasets - so your framing resonates deeply with us.

The Human Playbook

Yes, there is a reason why this is happening… I think you touch on a few things that resonated deeply with me and my writing here: https://thehumanplaybook.substack.com/p/the-prompt-world

Ashutosh

What I inferred is that Jack is talking more about what can be achieved than about the efficiency of achieving it. Of course adaptive optimizers make training more efficient, but without them you could still achieve the same results, just less efficiently.