The inefficiency of RL, and implications for RLVR progress

dwarkesh.com

117 points

cubefox

5 days ago


46 comments

bogtog a day ago

The premise of this post and the one cited near the start (https://www.tobyord.com/writing/inefficiency-of-reinforcemen...) is that RL involves just 1 bit of learning for a rollout, rewarding success/failure.

However, the way I'm seeing this is that an RL rollout may involve, say, 100 small decisions out of a pool of 1,000 possible decisions. Each training step will slightly upregulate or downregulate each of those decisions in the context where it was made. There will be uncertainty about which decision was helpful or harmful -- we only have 1 bit of information, after all -- but this setup, where many small decisions are slowly learned across many examples, seems like it would lend itself well to generalization (e.g., instead of 1 bit in one context, you get a hundred 0.01-bit insights across 100 contexts). There may be some benefits not captured by comparing the number of bits relative to pretraining.

As the blog says, "Fewer bits, sure, but very valuable bits" -- this seems like a separate factor that also holds: learning these small decisions may be vastly more valuable for producing accurate outputs than learning through pretraining.
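
To make that concrete, here's a toy sketch (my own illustration, not from the post) of how a single scalar reward gets spread, REINFORCE-style, as a tiny nudge over every decision in a rollout:

    import torch

    # Toy rollout: 100 small decisions, each drawn from a pool of 1,000 options.
    vocab, seq_len, dim = 1000, 100, 16
    policy = torch.nn.Linear(dim, vocab)               # stand-in for a real policy network
    opt = torch.optim.SGD(policy.parameters(), lr=1e-3)

    contexts = torch.randn(seq_len, dim)               # the contexts the 100 decisions were made in
    dist = torch.distributions.Categorical(logits=policy(contexts))
    decisions = dist.sample()                          # the rollout's 100 decisions

    reward = 1.0                                       # the single success/failure bit for the whole rollout
    loss = -(reward * dist.log_prob(decisions)).sum()  # the same scalar nudges every decision a little
    opt.zero_grad(); loss.backward(); opt.step()

Per decision the credit is extremely noisy, but across many rollouts and contexts the nudges accumulate, which is the "hundred 0.01-bit insights" intuition above.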

  • ACCount37 a day ago

    RL is very important - because while it's inefficient, and sucks at creating entirely new behaviors or features in LLMs, it excels at bringing existing features together and tuning them to perform well.

    It's a bit like LLM glue. The glue isn't the main material - but it's the one that holds it all together.

    • elchananHaas 18 hours ago

      RL before LLMs can very much learn new behaviors. Take a look at AlphaGo for that. It can also learn to drive in simulated environments. RL in LLMs is not learning the same way, so it can't create its own behaviors.

  • macleginn a day ago

    It is the same type of learning, fundamentally: increasing/decreasing token probabilities based on the left context. RL simply provides more training data from online sampling.

  • refulgentis a day ago

    Dwarkesh's blogging confuses me, because I am not sure if the message is free-associating or relaying information gathered.

    ex. how this reads if it is free-associating: "shower thought: RL on LLMs is kinda just 'did it work or not?' and the answer is just 'yes or no', yes or no is a boolean, a boolean is 1 bit, then bring in information theory interpretation of that, therefore RL doesn't give nearly as much info as, like, a bunch of words in pretraining"

    or

    ex. how this reads if it is relaying information gathered: "A common problem across people at companies who speak honestly with me about the engineering side off the air is figuring out how to get more out of RL. The biggest wall currently is the cross product of RL training being slowww and lack of GPUs. More than one of them has shared with me that if you can crack the part where the model gets very little info out of one run, then the GPU problem goes away. You can't GPU your way out of how little info they get"

    I am continuing to assume it is much more A than B, given your thorough-sounding explanation and my prior that he's not shooting the shit about specific technical problems off-air with multiple grunts.

    • bugglebeetle a day ago

      Dwarkesh has a CS degree, but zero academic training or real world experience in deep learning, so all of his blogging is just secondhand bullshitting to further siphon off a veneer of expertise from his podcast guests.

      • vessenes a day ago

        So grumpy! Please pick up the torch and educate the world better; it can only help.

        • refulgentis a day ago

          Better to be honest than say nothing; plenty of people say nothing. I asked a polite question that's near-impossible to answer without that level of honesty.

          • vessenes 6 hours ago

            I thought your question was great. I read the Dwarkesh post as scratch space for working out his thinking - so, closer to a shower thought. But also, an attempt to do what he’s really great at, which is distill and summarize at a “random engineer” level of complexity.

            You can kind of hear him pull in these extremely differing views on the future from very different sources, try and synthesize them, and also come out with some of his own perspective this year - I think it’s interesting. At the very least, his perspective is hyper-informed - he’s got fairly high-trust access to a lot of decision makers and senior researchers - and he’s smart and curious.

            This year we’ve had him bring in the 2027 folks (AI explosion on schedule), Hinton (LLMs are literally divorced from reality, and a total dead-end), both Ilya (we probably need emotions for super intelligence, also I won’t tell you my plan), Karpathy and Dario (Dario maybe twice?), Gwen, all with very very different perspectives on what’s coming and why.

            So, I think if you read him as one of the chroniclers of this era his own take is super interesting, and he’s in a position to be of great use precisely at synthesizing and (maybe) predicting; he should keep it up.

        • bugglebeetle a day ago

          I teach and mentor lots of folks in my world. What I don’t do is feign expertise to rub shoulders with the people doing the actual work so I can soak money from rubes with ad rolls.

derbOac a day ago

There are some insights there about the base rate of correct responses and using pretraining to boost it. Basically, it's the difference between searching a suboptimal versus an optimal region of the model space, at a suboptimal versus an optimal rate.

I think the framing of the discussion in general is kind of misleading though, because it kind of avoids the question of "information inefficient about what?"

In RL, the model is becoming more informative about a stimulus-action-feedback space; in SL the model is becoming more informative about a stimulus-feedback space. RL is effectively "built for" searching a larger space.

In situations like the essay where you are directly comparing SL and RL, you're kind of saying for RL "the action space is restricted to dictionary X and the feedback space is binary yes or no" and for SL "the feedback space is restricted to dictionary X". So in a certain sense you're equating the RL action space to the SL feedback space.

In that case, maybe searching over suboptimal regions of the RL-action-SL-feedback space is inefficient. But the reason RL exists, I think, is that it generalizes to situations where the feedback and action spaces are bigger. Maybe you want to differentially associate different responses with different rewards, or sample from a response space that is so large you can't define it a priori. Then SL breaks down?

Maybe this is obvious but I guess I get a little uneasy about talking about information efficiency of RL and SL without a broader framework of equivalence and what information is being represented by the model in both cases. It seems to me RL is a kind of superset of SL in terms of what it is capable of representing, which maybe leads to inefficiencies when it's not being used to its fullest.
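
As a toy illustration of that "superset" point (my own sketch, hypothetical names): any supervised example can be recast as a one-step RL episode by making the action space the label space and rewarding agreement with the label; the extra machinery only pays off when the action or feedback space is genuinely larger.

    import torch

    def supervised_loss(predict_dist, x, label):
        # The feedback *is* the correct action: push probability straight onto it.
        return -predict_dist(x).log_prob(torch.tensor(label))

    def one_step_rl_loss(predict_dist, x, label):
        dist = predict_dist(x)
        action = dist.sample()                           # the model proposes an answer
        reward = 1.0 if action.item() == label else 0.0  # feedback collapses to a single bit
        return -reward * dist.log_prob(action)           # only lucky correct guesses carry signal

    net = torch.nn.Linear(4, 3)
    predict_dist = lambda x: torch.distributions.Categorical(logits=net(x))
    x, label = torch.randn(4), 2
    print(supervised_loss(predict_dist, x, label), one_step_rl_loss(predict_dist, x, label))

Same model, same example; the RL version just throws away which label was correct and keeps only the bit, which is where the inefficiency comes from, but it's also the only formulation left once there is no label to hand out.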

  • dash2 a day ago

    SL = supervised learning, right?

macleginn a day ago

In the limit, in the "happy" case (positive reward), policy gradients boil down to performing more or less the same update as the usual supervised strategy for each generated token (or some subset of those if we use sampling). In the unhappy case, they penalise the model for selecting particular tokens in particular circumstances -- this is not something you can normally do with supervised learning, but it is unclear to what extent it is helpful (if a bad and a good answer share a prefix, that prefix will be reinforced in one case and penalised in the other, not in exactly the same way, but still).

So during on-policy learning we desperately need the model to stumble on correct answers often enough, and this can only happen if the model already knows, to some extent, how to solve the problem; otherwise the search space is too big. In other words, while in supervised learning we moved away from providing models with inductive biases and towards trusting them to figure out everything by themselves, in RL this does not really seem possible.
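
A rough numerical way to see the "happy case" claim (my own sketch, not a description of any production RLHF/RLVR stack): with a reward of +1 and no baseline, the vanilla policy-gradient loss on a sampled completion is exactly the cross-entropy loss you'd get by treating that completion as a supervised target; a negative reward applies the same gradient with its sign flipped.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, seq_len, dim = 50, 12, 8
    lm_head = torch.nn.Linear(dim, vocab)            # stand-in LM head over a fixed left context
    states = torch.randn(seq_len, dim)               # hidden states at each generated position

    logits = lm_head(states)
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                          # the on-policy generated tokens

    reward = 1.0                                     # "happy case": the rollout was verified correct
    pg_loss = -(reward * dist.log_prob(sampled)).mean()    # vanilla policy gradient
    sft_loss = F.cross_entropy(logits, sampled)             # supervised loss on the same tokens
    print(torch.allclose(pg_loss, sft_loss))         # True: identical update direction
    # With reward = -1.0 the same gradient is applied with its sign flipped,
    # penalising every token of the bad answer, shared prefix included.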

  • sgsjchs a day ago

    The trick is to provide dense rewards, i.e. not only once the full goal is reached, but a little bit for every random flailing of the agent in approximately the right direction.
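
    A toy version of what "dense" means here (my own sketch; real process-reward models are far messier, as the reply below notes): reward partial progress instead of only the final goal.

        # Sparse vs dense reward for a toy code-generation task (hypothetical checks).
        def sparse_reward(candidate, tests):
            return 1.0 if all(t(candidate) for t in tests) else 0.0  # one bit, only at the end

        def dense_reward(candidate, tests):
            passed = sum(t(candidate) for t in tests)
            return passed / len(tests)                               # partial credit for progress

        tests = [lambda c: c(2) == 4, lambda c: c(3) == 9, lambda c: c(-1) == 1]
        candidate = lambda x: x * x if x > 0 else 0                  # flails on the negative case
        print(sparse_reward(candidate, tests), dense_reward(candidate, tests))  # 0.0 vs ~0.67

    The dense version gives the agent something to climb long before it produces a fully correct program; the catch, as the R1 reference below discusses, is that defining and checking partial credit reliably is much harder than it looks.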

    • thegeomaster a day ago

      The article talks about all of this and references the DeepSeek R1 paper[0], section 4.2 (first bullet point, on PRMs), for why this is much trickier to do than it appears.

      [0]: https://arxiv.org/abs/2501.12948

    • Jaxan a day ago

      How do you know the correct direction? Isn’t the point of learning that the right path is unknown to start with?

      • jsnell a day ago

        The correct solutions and the viable paths probably are known to the trainers, just not to the trainee. Training only on problems where the solution is unknown but verifiable sounds like the ultimate hard mode, and pretty hard to justify unless you have a model that's already saturated the space of problems with known solutions.

        (Actually, "pretty hard to justify" might be understating it. How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?)

        • robotresearcher a day ago

          Your hard mode is exactly the situation RL is used for, because it requires neither a corpus of correct examples nor insight into the structure of a good policy.

          > How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?

          You rule out all the stuff that doesn’t work.

          Yes this is difficult and usually very costly. Credit assignment is a deep problem. But if you didn’t find yourself in a hard mode situation, you wouldn’t be using RL.

hereme888 a day ago

Recent results like PEFT-Bench (arxiv.org/abs/2511.21285) found that while SFT is efficient for formatting, it actually degraded Llama-3-8B's reasoning on math and code tasks compared to the base model.

So is RL required to preserve those logic circuits?

There seems to be a trade-off between compute efficiency and output formatting on the one hand and intelligence on the other.

  • ACCount37 a day ago

    Not necessarily. The reason why SFT can hurt performance is often the gap between the data and the capabilities.

    Imagine forcing someone who has never used chopsticks to eat with chopsticks. The results wouldn't be good - the instruction "use chopsticks" has taken effect, but the underlying "chopstick use" capability isn't there.

    If your SFT data pushes your LLM too far past its capabilities? It'll teach it to try doing a thing it can't do.

    If your SFT traces assume your LLM can do 10-digit multiplication, the LLM won't learn 10-digit multiplication from them. It'll learn to attempt 10-digit multiplication, and it'll fail.

    • hereme888 a day ago

      Fair point regarding data quality, but in the PEFT-Bench study the base model actually outperformed the fine-tuned versions on those specific math/code tasks.

      So the "chopstick capability" was already there (at least partially), but the SFT process actively degraded it. It seems less about the data being too hard and more about the parameter-efficient methods (like LoRA) overwriting or interfering with delicate reasoning circuits just to satisfy the formatting loss.

      • yorwba a day ago

        I think they must've messed up validation somehow. The performance drops relative to the base model are sometimes quite dramatic, which should've been caught by corresponding deterioration in validation performance.

        They write "we utilize 10% randomly selected from the training set as a validation set and the original validation set as a test set for evaluation. During the validation phase, we measure validation loss and save the weights of the best validation loss for every 5% of the training steps. We train for 10 epochs with a batch size of 4." so it might be as simple as not including the base model in the validation checkpoints, meaning that the first validated checkpoint is after half an epoch, which is plenty of time to do damage if the fine-tuning method/hyperparameter configuration isn't chosen well. Unfortunately, they don't graph their training curves.

  • cubefox 17 hours ago

    With "supervised learning" he meant LLM pretraining, i.e., unsupervised / self-supervised learning. Not actual SFT.

a-dub a day ago

i think in order to make this kind of argument you would need to be able to show all of the trajectories that are effectively reachable as a result of pre-training, and then how much effective pruning takes place as a result of total adjustment of the weights in response to one RL sample.

scaredginger a day ago

Bit of a nitpick, but I think his terminology is wrong. Like RL, pretraining is also a form of *un*supervised learning.

  • cubefox a day ago

    Usual terminology for the three main learning paradigms:

    - Supervised learning (e.g. matching labels to pictures)

    - Unsupervised / self-supervised learning (pretraining)

    - Reinforcement learning

    Now the confusing thing is that Dwarkesh Patel instead calls pretraining "supervised learning" and you call reinforcement learning a form of unsupervised learning.

    • pavvell a day ago

      SL and SSL are very similar "algorithmically": both use gradient descent on a loss function of predicting labels, human-provided (SL) or auto-generated (SSL). Since LLMs are pretrained on human texts, you might say that the labels (i.e., next token to predict) were in fact human provided. So, I see how pretraining LLMs blurs the line between SL and SSL.

      In modern RL, we also train deep nets on some (often non-trivial) loss function. And RL generates its own training data, so it blurs the line with SSL. I'd say, however, it's more complex and more computationally expensive: you need many / long rollouts to find a signal to learn from. All of this process is automated, so from this perspective it blurs the line with UL too :-) Though its dependence on the reward is what makes the difference.

      Overall, going from more structured to less structured, I'd order the learning approaches: SL, SSL (pretraining), RL, UL.
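
      As a tiny illustration of the SL/SSL blurring (my own sketch): the "labels" for pretraining are just the text shifted by one position, so the supervised machinery is identical and only the provenance of the labels differs.

          # Self-supervised "labels" for next-token prediction come from the data itself.
          tokens = [101, 7, 42, 42, 9, 2]     # a tokenized human-written sentence (made-up IDs)
          inputs = tokens[:-1]                # what the model sees
          labels = tokens[1:]                 # what it must predict: the same text, shifted by one
          pairs = list(zip(inputs, labels))   # exactly the (x, y) pairs a supervised learner expects
          print(pairs)                        # [(101, 7), (7, 42), (42, 42), (42, 9), (9, 2)]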

    • intalentive a day ago

      A “pretrained” ResNet could easily have been trained through a supervised signal like ImageNet labels.

      “Pretraining” is not tied to a particular learning paradigm; it is defined relative to the “fine-tuning” stage that follows it.

      Also LLM pretraining is unsupervised. Dwarkesh is wrong.

    • thegeomaster a day ago

      You could think of supervised learning as learning against a known ground truth, which pretraining certainly is.

      • Davidzheng a day ago

        A large number of breakthroughs in AI are based on turning unsupervised learning into supervised learning (AlphaZero-style MCTS as a policy improver is also like this). So the confusion is kind of intrinsic.

andyjohnson0 a day ago

Since it is not explicitly stated, "RL" in this article means Reinforcement Learning.

https://en.wikipedia.org/wiki/Reinforcement_learning

  • quote a day ago

    I, too, started parsing this as RL=real life and that’s why I found the headline interesting

  • Angostura a day ago

    Thank god. Was driving me mad.

    • on_the_train a day ago

      [flagged]

      • dang a day ago

        "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

        https://news.ycombinator.com/newsguidelines.html

        • on_the_train a day ago

          That doesn't, or shouldn't, apply to the content itself, because we all know how prevalent clickbait is.

      • farresito a day ago

        This is the first time I've seen someone accuse an acronym of being ragebait. The acronym "RL" is very well known, and Dwarkesh's podcast is mostly AI-related, so it's no surprise that he uses acronyms freely. I think your take is very cynical.

      • jsnell a day ago

        That is a bizarre take. Dwarkesh Patel is publishing in a very specific domain, where RL is a very common and unambiguous acronym. I'd bet it was immediately clear to 99% of his normal audience, and to him it's such a high-frequency term that people finding it ambiguous would not even have crossed his mind.

        (Like, would you expect people to expand LLM or AGI in a title?)

      • gpvos a day ago

        [flagged]

        • sidibe a day ago

          Ok so now it's stupid or malicious to use RL as reinforcement learning on a blog about AI where everyone in the field has been referring to it as RL forever? Even wikipedia puts (RL) after reinforcement learning.

        • bbarnett a day ago

          There needs to be a new law, applicable to posts on the Internet of any kind.

          Because that law doesn't hold when malice has a massive profit motive and almost zero downside.

          Spammers, popups, spam, clickbait -- all of it and more: not stupid, but planned.

      • robrenaud a day ago

        RLVR is the more particular term of art in this domain.

        VR stands for verifiable rewards and refers to the single bit per rollout that is at the heart of the post. Maybe we can convince dang to update the title.
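
        For anyone unfamiliar, the "verifiable reward" part is typically nothing more than an automated check that emits that single bit (a schematic sketch, not any lab's actual harness):

            # Schematic RLVR reward: an automatic verifier reduces a whole rollout to one bit.
            def verifiable_reward(rollout_answer: str, ground_truth: str) -> float:
                return 1.0 if rollout_answer.strip() == ground_truth.strip() else 0.0

            print(verifiable_reward("  42", "42"))  # 1.0 - the entire learning signal for that rollout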

  • cheema33 a day ago

    Even though I knew which RL was being referred to here, the (ab)use of initials in this way annoys me to no end. I wish people did not do that.

    • vessenes a day ago

      Counterpoint: much of academia is creating and learning these shorthands. They are genuinely useful - humans have limited context space in their heads, so this compression allows them to work in larger problem spaces. Classic example: Einstein and tensors.

      Upshot - don’t hate - pick up the vocab, it’s part of the learning process.