The premise of this post and the one cited near the start (https://www.tobyord.com/writing/inefficiency-of-reinforcemen...) is that RL involves just 1 bit of learning per rollout: a single success/failure reward.
However, the way I'm seeing this is that an RL rollout may involve, say, 100 small decisions out of a pool of 1,000 possible decisions. Each training step will slightly upregulate/downregulate each of those decisions in its own context. There will be uncertainty about which decision was helpful/harmful -- we only have 1 bit of information, after all -- but this setup, where many small decisions are slowly learned across many examples, seems like it would lend itself well to generalization (e.g., instead of 1 bit in one context, you get a hundred 0.01-bit insights across 100 contexts). There may be benefits here that aren't captured by simply comparing bit counts against pretraining.
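To make that concrete, here's a minimal PyTorch-style sketch of that smearing effect (the 100-decision rollout, 1,000-option pool, and scalar reward are all toy stand-ins, not anyone's actual training setup):

    import torch

    # Toy rollout: 100 small decisions, each a softmax choice over a
    # pool of 1,000 options, conditioned on its own context vector.
    n_decisions, n_options, d_ctx = 100, 1000, 32
    policy = torch.nn.Linear(d_ctx, n_options)   # stand-in policy head
    contexts = torch.randn(n_decisions, d_ctx)   # one context per decision

    dist = torch.distributions.Categorical(logits=policy(contexts))
    actions = dist.sample()                      # the 100 decisions taken

    # One rollout -> one binary outcome: at most 1 bit of signal.
    reward = 1.0                                 # success (vs. 0.0/-1.0 on failure)

    # A REINFORCE-style update multiplies that same scalar into the
    # log-prob of *every* decision, nudging all 100 of them up (or down)
    # a little in their own contexts -- the "hundred 0.01-bit insights"
    # framing above.
    loss = -(reward * dist.log_prob(actions)).sum()
    loss.backward()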
As the blog says, "Fewer bits, sure, but very valuable bits"; that seems like a separate factor that's also at play. Learning these small decisions may be vastly more valuable for producing accurate outputs than what's learned through pretraining.
RL is very important: while it's inefficient and sucks at creating entirely new behaviors or features in LLMs, it excels at bringing existing features together and tuning them to perform well.
It's a bit like LLM glue. The glue isn't the main material - but it's the one that holds it all together.
RL before LLMs could very much learn new behaviors -- look at AlphaGo, or agents learning to drive in simulated environments. RL in LLMs is not learning the same way, so it can't create its own behaviors.
It is the same type of learning, fundamentally: increasing/decreasing token probabilities based on the left context. RL simply provides more training data from online sampling.
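A rough sketch of what "same type of learning" means in code (illustrative function names, a REINFORCE-style loss, not any particular lab's recipe): the RL objective is structurally a reward-weighted version of the pretraining cross-entropy.

    import torch.nn.functional as F

    # Pretraining: push up the probability of the observed next token,
    # given the left context. Every position carries a full target.
    def pretrain_loss(logits, target_tokens):
        return F.cross_entropy(logits, target_tokens)

    # RL fine-tuning (REINFORCE-style): same machinery, but the tokens
    # were sampled from the model itself, and the update is scaled by a
    # scalar reward/advantage instead of always being +1.
    def rl_loss(logits, sampled_tokens, advantage):
        log_probs = -F.cross_entropy(logits, sampled_tokens, reduction="none")
        return -(advantage * log_probs).mean()

Either way, the gradient moves token probabilities up or down given the left context; RL just changes where the data comes from and how the push is weighted.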
Dwarkesh's blogging confuses me, because I'm not sure whether he's free-associating or relaying information he's gathered.
ex. (A) how this reads if it is free-associating: "shower thought: RL on LLMs is kinda just 'did it work or not?', and the answer is just 'yes or no'; yes or no is a boolean, a boolean is 1 bit, then bring in the information-theory interpretation of that, therefore RL doesn't give nearly as much info as, like, a bunch of words in pretraining"
or
ex. (B) how this reads if it is relaying information gathered: "A common problem across people at companies who speak honestly with me off the air about the engineering side is figuring out how to get more out of RL. The biggest wall currently is the cross product of RL training being slowww and the lack of GPUs. More than one of them has shared with me that if you can crack the part where the model gets very little info out of one run, then the GPU problem goes away. You can't GPU your way out of how little info they get."
I'm continuing to assume it's much more A than B, given your thorough-sounding explanation and my prior that he's not shooting the shit off-air about specific technical problems with multiple grunts.
He is essentially expanding on an idea Andrej Karpathy raised on his podcast about a month prior.
Karpathy says that basically "RL sucks" and that it's like "sucking bits of supervision through a straw".
https://x.com/dwarkesh_sp/status/1979259041013731752/mediaVi...
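To put toy numbers on the straw metaphor (assuming, purely for illustration, a ~50k-token vocabulary and a 1,000-token rollout):

    import math

    vocab_size = 50_257     # e.g. the GPT-2 BPE vocabulary
    rollout_len = 1_000     # tokens in a hypothetical rollout

    bits_per_pretraining_token = math.log2(vocab_size)   # ~15.6 bits max
    bits_from_one_rl_rollout = 1.0                       # pass/fail outcome

    print(bits_per_pretraining_token * rollout_len)  # ~15,600-bit upper bound
    print(bits_from_one_rl_rollout)                  # 1 bit for the same tokens

These are loose upper bounds -- real text carries far fewer than log2(V) bits per token -- but the orders-of-magnitude gap is what the straw metaphor is pointing at.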
Dwarkesh has a CS degree, but zero academic training or real world experience in deep learning, so all of his blogging is just secondhand bullshitting to further siphon off a veneer of expertise from his podcast guests.
So grumpy! Please pick up the torch and educate the world better; it can only help.
Better to be honest than to say nothing; plenty of people say nothing. I asked a polite question that's near-impossible to answer without that level of honesty.
I thought your question was great. I read the Dwarkesh post as scratch space for working out his thinking - so, closer to a shower thought. But also an attempt to do what he’s really great at, which is distilling and summarizing at a “random engineer” level of complexity.
You can kind of hear him pull in these wildly different views on the future from very different sources, try to synthesize them, and come out with some of his own perspective this year - I think it’s interesting. At the very least, his perspective is hyper-informed - he’s got fairly high-trust access to a lot of decision makers and senior researchers - and he’s smart and curious.
This year we’ve had him bring in the 2027 folks (AI explosion on schedule), Hinton (LLMs are literally divorced from reality, and a total dead end), Ilya (we probably need emotions for superintelligence; also, I won’t tell you my plan), Karpathy, Dario (maybe twice?), and Gwern, all with very, very different perspectives on what’s coming and why.
So, I think if you read him as one of the chroniclers of this era, his own take is super interesting, and he’s in a position to be of great use precisely at synthesizing and (maybe) predicting; he should keep it up.
I teach and mentor lots of folks in my world. What I don’t do is feign expertise to rub shoulders with the people doing the actual work so I can soak money from rubes with ad rolls.