The key insight here is that DGM solves the Gödel Machine's impossibility problem by replacing mathematical proof with empirical validation - essentially admitting that predicting code improvements is undecidable and just trying things instead, which is the practical and smart move.
Three observations worth noting:
- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why maintaining "failed" branches matters: the system is exploring a non-convex optimization landscape where today's dead ends can still seed later breakthroughs (see the sketch after this list).
- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.
- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
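To make the archive point concrete, here is a minimal sketch of archive-based selection, assuming hypothetical `propose_modification` and `evaluate` callables (my own illustration, not DGM's actual code): every evaluated variant stays in the archive, and parents are sampled from the whole archive rather than only the current best, so a temporarily worse branch can still be expanded later.

```python
import random

def evolve(initial_agent, propose_modification, evaluate, iterations=100):
    # Archive keeps every evaluated variant, including ones that scored
    # worse than their parent.
    archive = [(initial_agent, evaluate(initial_agent))]
    for _ in range(iterations):
        # Sample a parent from the full archive, biased toward higher scores
        # but never discarding low scorers outright.
        weights = [0.1 + score for _, score in archive]
        parent, _ = random.choices(archive, weights=weights, k=1)[0]
        child = propose_modification(parent)   # e.g. an LLM rewriting the agent's code
        score = evaluate(child)                # empirical benchmark run, not a proof
        archive.append((child, score))         # kept regardless of how it scored
    return max(archive, key=lambda pair: pair[1])
```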
The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.
> gaming the evaluation
Co-evolution is the answer here. The evaluator itself must be evolving.
"Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure" (Danny Hillis, 1991):
https://csmgeo.csm.jmu.edu/geollab/complexevolutionarysystem...
And in Reinforcement Learning:
POET (Paired Open-Ended Trailblazer): https://www.uber.com/en-DE/blog/poet-open-ended-deep-learnin...
SCoE (Scenario co-evolution): https://dl.acm.org/doi/10.1145/3321707.3321831
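A minimal sketch of the co-evolution idea in the spirit of the Hillis paper (the function names and the `passes(solver, test)` predicate are my own assumptions, not taken from any of the works above): the test cases are themselves an evolving population, rewarded for breaking solvers, so the evaluation is a moving target rather than a fixed benchmark to game.

```python
def coevolve(solvers, tests, mutate_solver, mutate_test, passes, generations=50):
    """Hillis-style two-population co-evolution (illustrative sketch only)."""
    for _ in range(generations):
        # Solvers are rewarded for passing tests; tests ("parasites") are
        # rewarded for failing solvers.
        solver_fit = [sum(passes(s, t) for t in tests) for s in solvers]
        test_fit = [sum(not passes(s, t) for s in solvers) for t in tests]
        solvers = select_and_mutate(solvers, solver_fit, mutate_solver)
        tests = select_and_mutate(tests, test_fit, mutate_test)
    return solvers, tests

def select_and_mutate(population, fitness, mutate):
    # Keep the fitter half and refill the population with their mutants.
    ranked = [x for _, x in sorted(zip(fitness, population), key=lambda p: -p[0])]
    survivors = ranked[:max(1, len(ranked) // 2)]
    return survivors + [mutate(x) for x in survivors]
```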
The "Goedel Machine" is an interesting definition, but wildly impractical (though I wouldn't say it's impossible, since it only has to find some improvement, not "the best" improvement; e.g. it could optimise its search procedure in a way that's largely orthogonal to the predicted rewards).
Schmidhuber later defined "PowerPlay" as a framework for building up capabilities in a more practical way, which is more adaptive than just measuring the score on a fixed benchmark. A PowerPlay system searches for (problem, replacement) pairs, where it switches to the replacement if (a) the current system cannot solve that problem, (b) the replacement can solve that problem, and (c) the replacement can also solve all the problems that caused previous replacements (maintained in a list).
I formalised that in Coq many years ago ( http://www.chriswarbo.net/projects/powerplay ), and the general idea can be extended to (a) include these genetic-programming approaches, rather than using a single instance; and (b) be seeded with desirable benchmarks, etc. to guide the system in a useful direction (so its "self-invented" problems can include things like "achieves X% on benchmark Y").
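For concreteness, here is a rough sketch of that acceptance rule in Python (my own naming and a hypothetical `solves(system, problem)` predicate, not Schmidhuber's formalism or the Coq development above):

```python
def powerplay_step(current, solved_so_far, propose_pair, solves):
    # One PowerPlay step: the search proposes a (problem, replacement) pair,
    # which is accepted only if it adds a capability without losing any.
    problem, replacement = propose_pair(current)
    if solves(current, problem):          # (a) current system must NOT solve it
        return current, solved_so_far
    if not solves(replacement, problem):  # (b) replacement must solve it
        return current, solved_so_far
    if not all(solves(replacement, p) for p in solved_so_far):  # (c) no regressions
        return current, solved_so_far
    return replacement, solved_so_far + [problem]  # switch and record the new problem
```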