RL doesn't completely "work" yet; it still has a scalability problem. Claude can write a small project, but as the project grows, Claude gets confused and starts making mistakes.
I used to think the problem was that models can't learn over time like humans, but maybe that can be worked around. Today's models have context windows large enough to fit a medium-sized project's complete code and documentation, and tomorrow's may be larger; good-enough world knowledge can be maintained by re-training every few months. The real problem is that even models with large context windows struggle with complexity more than humans do: they miss crucial details, then become very confused when trying to correct their mistakes, and/or miss other crucial details. (Humans sometimes miss crucial details too, but are usually able to spot and fix them without breaking something else.)
Reliability is another issue, but I think it's related to scalability: an LLM that cannot make reliable inferences from a small amount of input data cannot grow it into a larger output without introducing cascading hallucinations.
EDIT: creative control is also subsumed by reliability and scalability. You could generate any image imaginable with a reliable diffusion model by first generating something vague, then repeatedly refining it (specifying which details to change and which to keep), each refinement getting closer to what you're imagining. Except even GPT-4o isn't nearly reliable enough for this technique: it can handle a couple of refinements, but then it too starts losing details (changing unrelated things).
I wonder how much of this is that code is less explicit than written language in some ways.
With English, the meaning of a sentence is mostly self-contained. The words have inherent meaning, and if they’re not enough on their own, usually the surrounding sentences give enough context to infer the meaning.
Usually you don’t have to go looking back 4 chapters or look in another book to figure out the implications of the words you’re reading. When you DO need to do that (maybe reading a research paper for instance), the connected knowledge is all at the same level of abstraction.
But with code, despite it being very explicit at the token level, the “meaning” is all over the map, and depends a lot on the unwritten mental models the person was envisioning when they wrote it. Function names might be incorrect in subtle or not-so-subtle ways, and side effects and order of execution in one area could affect something in a whole other part of the system (not to mention across the network, but that seems like a separate case to worry about). There are implicit assumptions about timing and such. I don’t know how we’d represent all this other than having extensive and accurate comments everywhere, or maybe some kind of execution graph, but it seems like an important challenge to tackle if we want LLMs to get better at reasoning about larger code bases.
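To make the "unwritten mental model" point concrete, here's a tiny invented C sketch (all names hypothetical, not from any real codebase): the function name promises a pure read, but the body advances hidden state, so call order silently matters to every other part of the system.

```c
#include <stdint.h>

/* Hypothetical example: "get" suggests a pure read, but each call
 * advances hidden state, so how many times it's called -- and in
 * what order -- silently affects every other user of this counter.
 * Nothing in the signature records that assumption. */
static int32_t next_id = 0;

int32_t get_frame_id(void)
{
    return next_id++;  /* side effect hiding behind a "get" name */
}
```

Nothing about `get_frame_id`'s type tells you it's unsafe to call twice "just to check"; that contract lives only in the author's head.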
This is super insightful, and I think at least part of what you're thinking of already exists: an abstract syntax tree! Or at the very least one could include metadata about the token under scrutiny (similar to how most editors can show you git blame / number of references / number of tests passing in the current code you're looking at...)
It makes me think about things like... "what if we also provided not just the source code, but the abstract syntax tree or dependency graph", or at least the related nodes relevant to what code the LLM wants to change. In this way, you potentially have the true "full" context of the code, across all files / packages / whatever.
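As a rough sketch of what "source plus graph" might look like, here's an invented C struct (field names are made up, not from any real tool) for the per-symbol metadata an editor could hand to an LLM alongside the raw source:

```c
#include <stddef.h>

/* Invented sketch: one node per symbol, carrying the metadata an
 * editor already computes (definition site, usage count) plus edges
 * to the definitions this symbol depends on. */
typedef struct code_node
{
    const char        *symbol;       /* e.g. "quant_t" */
    const char        *defined_in;   /* file where it's declared */
    int                n_references; /* places that use it */
    struct code_node **deps;         /* related definitions */
    size_t             n_deps;
} code_node;
```

The idea being that instead of dumping whole files into context, you'd walk this graph outward from the code being changed and include only the reachable nodes.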
Yeah! I think an AST is sort of what I'm envisioning here, but with much broader metadata, including requirements and implicit assumptions and stuff.
As a concrete example, a random bit of code from the minih264 encoder:

```c
/**
 * Quantized/dequantized representation for 4x4 block
 */
typedef struct
{
    int16_t qv[16]; // quantized coefficient
    int16_t dq[16]; // dequantized
} quant_t;
```

Someone who's built an encoder or studied h264 probably knows what this is for (I have a very fuzzy idea). But even with the comment there are lots of questions. Are these arrays restricted to certain values? Can they span the full int16 range, or are there limits, or are the bits packed in an interesting way? Can they be negative? Why would you want to store these 2 numbers together in a struct, why not separately? Do they get populated at the same time, or at different phases of the pipeline, or are they built up over multiple passes? Are all of these questions ridiculous because I don't really understand enough about how h264 works (probably)?
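One way to make answers to questions like these explicit is to attach them to the data as checked code. A hedged sketch (the setter and its checks are invented, not from minih264; only the 16-element bound comes from the struct itself):

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
    int16_t qv[16]; /* quantized coefficient */
    int16_t dq[16]; /* dequantized */
} quant_t;

/* Hypothetical checked setter: the index bound is real (the arrays
 * are 16 long), but whether qv may be negative is exactly the kind
 * of unstated invariant the questions above are about -- if the
 * codec has such a rule, an assert here would be where it belongs. */
void quant_set_qv(quant_t *q, int i, int16_t v)
{
    assert(i >= 0 && i < 16); /* structural invariant, visible here */
    q->qv[i] = v;
}
```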
LLMs already have a lot of this knowledge, and could probably answer if prompted, but my point is more that the code doesn't explicitly lay out all of these things unless you carefully trace the execution, and even then, some of the requirements might not be evident. Maybe negative numbers aren't valid here (I don't actually know) but the reason that invariant gets upheld is an abs() call 6 levels up the call stack, or the data read from the file is always positive so we just don't have to worry about it. I dunno.
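The "abs() several levels up the call stack" situation can be sketched like this (all names invented):

```c
#include <stdlib.h>

/* process() silently assumes a non-negative input; nothing at the
 * use site enforces or even documents it. */
static int process(int coeff)
{
    /* caller must guarantee coeff >= 0 -- not checked here */
    return coeff * 2;
}

int pipeline(int raw)
{
    return process(abs(raw)); /* the invariant is established here,
                                 far from where it's relied upon */
}
```

A reader (or a model) looking at `process` in isolation has no way to see that the non-negativity guarantee exists, or where it comes from.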
Anyway I imagine LLMs could be even more useful if they knew more about all this implicit context somehow, and I think this is the kind of stuff that just piles up as a codebase gets larger.
Not really true.
You can have a book where in the last chapter you have a phrase "She was not his kid."
Knowing nothing else, you can only infer the self-contained details. But in the book context this could be the phrase which turns everything upside down, and it could refer to a lot of context.
The whole book could be the surrounding context, not just a sentence or two, and I think that still fits the point I wanted to make: written words are more linear, more "in the same plane," compared to code, which is more "multidimensional" in a sense once you start considering the reasons behind the code, the order of execution, things being executed multiple times, etc.
Claude and 4o aren’t RL trained IIRC? Also, who’s using these for code? You’re cool not being able to train on your chat logs used to develop your own codebase? Sounds pretty sus