These posts are going to be a constant for the next year, because there's no objective way to compare models (beyond low-level numbers like token generation speed, average reasoning-token count, parameter count, number of active experts, etc.). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one.
But there is no best one. There's just the best one for you, based on whatever your criteria are. It's likely we'll end up in a "Windows vs MacOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.
The news isn't about how to compare models; it's that Kimi K2.6 (and I'd add DeepSeek v4 Pro) is more or less equivalent to Opus, and that's already pretty big.
They are open source and cost waaaay less per token than American models.
I'm using them right now on the $20 Ollama cloud plan, and I can actually work with them on my side projects without hitting the limits too often. With the Claude Pro $20 plan, my usage can barely survive one or two prompts.
And I chose Ollama cloud just because their CLI is convenient to use, but there are a lot of other providers for those models, so you aren't even stuck with shitty conditions and usage rules.
To me that's a pretty bad thing for the American economy.
They're nowhere near as good as Opus yet. But as good as Sonnet, yes. I'm using all of them in real life.
Or maybe it is a pretty good thing for the American economy that you can get AI at cost rather than monopoly pricing.
You know, for the rest of the economy that is not big tech.
It's not good for the current administration. American AI growth is the only thing keeping GDP from looking terrible.
And investors pumping money into the circular US AI money flow just make innovation everywhere else slower. If not for the GPU/memory drought, running stuff locally (or just in a competing cloud) would be far cheaper.
- [deleted]
> for the American economy.
There is more to the American economy than big tech.
And that's precisely why this has started: https://www.wired.com/story/super-pac-backed-by-openai-and-p...
>There is more to the American economy than big tech.
Most of the stock market valuation is big tech, and most people's retirements are in the stock market, so... if the AI bubble bursts, a lot of the US will be affected.
I do not know why this is downvoted. This is true.
There are objective ways to compare models. They involve repeated sampling and statistical analysis to determine whether the results are likely to hold up in the future or whether they're just a fluke. If you fine-tune each model to achieve its full potential on the task you expect to be giving it, the rankings produced by different benchmarks even agree to a high degree: https://arxiv.org/abs/2507.05195
The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then they chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.
LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.
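To put rough numbers on the coin-flip analogy, here's a back-of-the-envelope sketch (the 60%/50% success rates are made up for illustration):

```python
import math

# Hypothetical example: model A truly succeeds 60% of the time on a
# task, model B 50%. How informative is one run of each?
p_a, p_b = 0.60, 0.50

# Chance the *worse* model looks strictly better in a single-run
# shootout (B succeeds while A fails):
p_b_looks_better = (1 - p_a) * p_b          # 0.20

# Chance a single run is a tie (both succeed or both fail):
p_tie = p_a * p_b + (1 - p_a) * (1 - p_b)   # 0.50

# With n runs per model, the standard error of the observed difference
# in success rates shrinks like 1/sqrt(n):
for n in (1, 20, 100):
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    print(f"n={n:>3}: true gap 0.10, standard error of the gap {se:.3f}")
```

With one run each, the worse model "wins" outright 20% of the time and the result is a tie half the time; you need dozens of runs before a 10-point gap reliably clears the noise.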
Fine-tuning for a specific task is even less realistic than the benchmarks shown in TFA.
Most people who have computers could run inference for even the biggest LLMs, albeit very slowly when they do not fit in fast memory.
On the other hand, training or even fine tuning requires both more capable hardware and more competent users. Moreover the effort may not be worthwhile when diverse tasks must be performed.
Instead of attempting fine-tuning, a much simpler and more feasible strategy is to keep multiple open-weights LLMs and run them all for a given task, then choose the best solution.
This can be done at little cost with open-weights models, but it can be prohibitively expensive with proprietary models.
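A minimal best-of-N sketch of that strategy might look like the following. Note that `run_model` and `score` are hypothetical placeholders, not any real API; in practice you'd call a locally hosted or cloud open-weights model and score with your test suite or a judge model:

```python
def run_model(model: str, task: str) -> str:
    # Placeholder: would call a locally hosted or cloud open-weights
    # model here (e.g. via an OpenAI-compatible chat endpoint).
    return f"{model}: proposed solution for {task!r}"

def score(solution: str) -> float:
    # Placeholder: would run the test suite, a linter, or a judge
    # model and return a numeric quality score.
    return float(len(solution))  # dummy criterion for the sketch

def best_of(models: list[str], task: str) -> tuple[str, str]:
    # Run every model on the same task, keep the highest-scoring answer.
    candidates = {m: run_model(m, task) for m in models}
    best = max(candidates, key=lambda m: score(candidates[m]))
    return best, candidates[best]

winner, answer = best_of(["kimi-k2.6", "deepseek-v4", "qwen3"], "fix the bug")
```

The design point is that selection happens after generation, so the per-model cost is just N inference passes — cheap with open weights, painful at proprietary per-token prices.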
Those are objective metrics, not an objective way to compare; the comparison lies in which metrics you choose to include.
That's exactly why there's a ton of different benchmarking suites used for evaluating hardware performance.
I reckon we'll have similar suites comparing different aspects of models.
And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, as happened before with hardware. Some say that's already happening with the pelican test.
While I partially agree with you, there IS work being done to make the metrics comparable. Eg:
https://ghzhang233.github.io/blog/2026/03/05/train-before-te...
It just hasn't been widely adopted yet. And it might be in each of their particular interests that it continues to stay so for a while. It's basically like p-hacking.
Unfortunately, you're probably right, but the cock measuring contest is going to keep escalating because the billionaires and VC backers need to _win_. And the Psychosis is going to produce some horrible collateral damage.
A pretty simple one would be to have every model try to one-shot every ticket your company has, and then measure the acceptance rate of each model.
Except that if you tried one-shotting your ticket twenty times, at different hours of the day and on different days of the week, you would see enough variation to move the benchmark results even if you used the same model every time. Much more so if you fiddled with the thinking budget or changed the prompt.
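For a sense of scale, here's the normal-approximation 95% confidence interval on an acceptance rate measured over only 20 tickets (a rough sketch; the 12/20 figure is illustrative):

```python
import math

def ci95(p_hat: float, n: int) -> tuple[float, float]:
    # Normal-approximation 95% confidence interval for a proportion.
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

lo, hi = ci95(0.60, 20)    # say a model got 12 of 20 tickets accepted
print(f"12/20 accepted -> 95% CI roughly {lo:.2f} to {hi:.2f}")
```

That comes out to roughly 0.39 to 0.81: any model whose true acceptance rate sits anywhere in that band is statistically indistinguishable on a 20-ticket sample.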
Because they're non-deterministic, because of constant updates and changes, and because the models are throttled according to the number of users, releases, etc.
You never get "the same" Steph Curry, he might be tired, annoyed by a fan, getting older... but if he and I were to throw 100 3-pointers, we could all correctly guess who will perform better.
My theory is we will end up in a similar spot to hiring people. You can look at a CV (benchmarks) but you won't know for sure until you've worked with them for six months.
We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?
Yes, the entire field of software engineering ran aground on not being able to test how well people can write software.
But I'm more optimistic about testing programming models. You can run repeated tests, and compare median performance. You can run long tests, like hundreds of hours, while getting more than a few humans to complete half-day tests is a huge project. And you can do ablation testing, where you remove some feature of the environment or tools and see how much it helps/hurts.
The CV-to-six-months analogy is actually exactly right and it's also why benchmarks for hiring people stopped being useful. The signal that holds up is what you see when something breaks, which is hard to compress into a number.
this smells like an ai-generated comment so much
Not many things are as thoroughly broken as hiring is these days. I hope we don't end up there.
You do not interview 1000 rounds on problems you're actually solving. If you did, hiring would be fine. Minus the social fit aspect, which isn't as relevant for a model.
That was my thought too.
> The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space.
Just last week my superior asked me to implement that for a customer. /s
Maybe some real, real-world task would be good? Add some database, some REST endpoints, some random JS framework, and let it figure out a full-stack task instead of creating some rectangles?
[flagged]
So like Open Router?