If you're new to this: All of the open source models are playing benchmark optimization games. Every new open-weight model comes with promises of being as good as something SOTA from a few months ago, and then it always disappoints in actual use.
I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.
They are impressive, but they are not performing at Sonnet 4.5 level in my experience.
I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.
That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.
Respectfully, from my experience and a few billion tokens consumed, some open source models really are strong and useful. Specifically StepFun-3.5-flash https://github.com/stepfun-ai/Step-3.5-Flash
I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.
I have no relation to stepfun, and I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.
What coding agent do you use with StepFun-3.5-flash? I just tried it from SiliconFlow's API with opencode. The tool calling is broken: AI_InvalidResponseDataError: Expected 'function.name' to be a string.
I use pi, but I'm almost done writing a better alternative that doesn't have pi's stability issues. 80K Rust SLOC and a few hundred tests btw.
Any place we can look for you to release this?
Yeah, my github is in the profile. Soon (tm). Feel free to follow.
Are you using stepfun mostly because it's free, or is it better than other models at some things?
I think we are at this point where the hard ceiling of a strong model is pretty hard to delineate reliably (at least in coding, in research work it's clearer ofc) - and in a good sense, meaning with suitable task decomposition or a test harness or a good abstraction you can make the model do what you thought it could not. StepFun is a strong model and I really enjoyed studying and comparing it to others by coding pretty complex projects semi-autonomously (will do a write up on this soon tm).
Even purely pragmatically, StepFun covers 95% of my research+SWE coding needs, and for the remaining 5% I can access the large frontier models. I was surprised StepFun is even decent at planning and research, so it is possible to get by with it and nothing else (1), but ofc for minmaxing the best frontier model is still the best planner (although the latest deepseek is surprisingly good too).
Finally we are at a point where there is a clear separation of labor between frontier & strong+fast models, but tbh shoehorning StepFun into this "strong+fast" category feels limiting, I think it has greater potential.
I pay for copilot to access anthropic, google and openai models.
Claude code always give me rate limits. Claude through Copilot is a bit slow and has constant network request issues or something, but at least I don't get rate limited as often.
At least local models always work, are faster (50+ tps with Qwen3.5 35B A4B on a 4090), and most importantly never hit a rate limit.
> Claude code always give me rate limits
> 50+ tps with qwen3.5 35b a4b on a 4090
But qwen3.5 35b is worse than even Claude Haiku 4.5. You could switch Claude Code to use Haiku and never hit rate limits. It also gets a similar 50 tps.
I haven't tried Haiku 4.5 much, but I was not impressed with previous Haiku versions.
My goto proprietary model in copilot for general tasks is gemini 3 flash which is priced the same as haiku.
The qwen model is in my experience close to gemini 3 flash, but gemini flash is still better.
Maybe it's somewhat related to what we're using them for. In my case I'm mostly using LLMs to code Lua. One case is a typed LuaJIT language and the other is a 3D framework written entirely in LuaJIT.
I forget exactly how many tps I get with Qwen, but GLM 4.7 Flash, which is really good (for a local model), gets me 120 tps and a 120K context.
Don't get me wrong, proprietary models are superior, but local models are getting really good AND useful for a lot of real work.
I also started playing with 3.5 Flash and was impressed.
It’s 2× faster than its competitors. For tasks where “one-shotting” is unrealistic, a fast iteration loop makes a measurable difference in productivity.
TDD is really the delineation between being successful or not when using [local] LLMs.
> some opensource models really are strong and useful
To be clear I never said they weren’t strong or useful. I use them for some small tasks too.
I said they’re not equivalent to SOTA models from 6 months ago, which is what is always claimed.
Then it turns into a Motte-and-Bailey game where that argument is replaced with the simpler claim that they're useful for open-weights models. I'm not disagreeing with that part. I'm disagreeing with the first assertion that they're equivalent to Sonnet 4.5.
They are not equivalent 1:1, esp. in knowledge coverage (given OOM param size difference) and in taste (Sonnet wins, but for taste one can also use Kimi K2.5), but in my hardcore use (high-performance realtime simulations of various kinds) I would prefer StepFun-3.5-Flash to Sonnet 4 strongly and to 4.5 often enough without a decisive advantage in using exclusively Sonnet 4.5. For truly hard tasks or specifications I would turn to 5.2 or 5.3-codex of course - but one KPI for quality of my work as a lead engineer is to ensure that truly hard tasks are known, bounded and planned-for in advance.
Maybe my detailed, requirement-based/spec-based prompting style makes the difference between anthropic's and OSS models smaller and people just like how good Anthropic's models are at reading the programmer's intent from short concise prompts.
Frankly, I think the 1:1 equivalent is an impossible standard given the set of priorities and decisions frontier labs make when setting up their pre-, mid- and post-training pipelines, and benchmark-wise it is achievable for a smaller OSS model to align with Sonnet 4.5 even on hard benchmarks.
Given the relatively underwhelming Sonnet 4.5 benchmarks [1], I think StepFun might have an edge over it esp. in Math/STEM [2] - even an old deepseek-3.2 (not speciale!) had a similar aggregate score. With 4.6 Anthropic ofc vastly improved their benchmark game, and it now truly looks like a frontier model.
1. https://artificialanalysis.ai/models/claude-4-5-sonnet-think... 2. https://matharena.ai/models/stepfun_3_5_flash
What are you running that model on?
I just use OpenRouter, it's free for now. But I would pay $30-100 to use it 24/7.
Ah, I thought you meant you were running it locally.
Have you tried Minimax M2.5? How did it compare?
Much worse - from my experience minimax is not suitable for high autonomy on hard projects. The real distant second in my experience is mimo flash v2 (but I did not try the latest version, might be closer to parity). I would not use minimax for serious work.
StepFun 3.5 Flash is better compared to google's gemini 3 flash which is surprisingly good and pretty costly, and to GLM-5.
I find this outcome ironic given MiniMax's more aggressive marketing, and given that Anthropic's large-scale distillation accusations specifically named MiniMax but not StepFun.
I can only wonder about the true underlying reasons, but deducing from public information I suspect that minimax simply has weaker, benchmaxx-targeting post-training R&D and leans more on distillation of western frontier models, while StepFun has extensive post-training with lots of hard-won custom R&D and internal large-scale distillation teachers.
Interesting. I'm surprised you feel that it's better than GLM 5 - these models are in different weight classes after all.
I tried it out a bunch and it seems good. I can't really tell if it's better or worse than most of these other models in such a short time though.
I don't think it's strictly better than GLM 5, more like they are peers (but in math competitions StepFun is stronger than most), and in my experience have similar coding/bugfix ceiling where world knowledge is not the deciding factor. But I didn't test GLM 5 for more than 30 hours, and my agentic harness (opencode) might be suboptimal - I'm open to the idea that GLM 5 with the right agentic harness is ready for ultra-long autonomy, but I have yet to see it myself.
Where GLM 5 is strictly worse for me though, compared to StepFun, is long-form content generation (planning, research documents) - but this can be said about geminis too and these are obviously very smart models.
Given the free option I'd explore GLM 5 more, but if I had to pay for it myself ofc I'd choose stepfun every time. Basically I think right now the optimal configuration for maximizing output of correct software features per dollar involves using StepFun or its future class competitor for bulk coding and first stage code review.
Maybe I need to write a blogpost about it after all.
A 3 bit quant will run on a 128gb MacBook Pro, it works pretty well.
A 3 bit quant is quite a lot weaker than the OpenRouter version the OP is using.
Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models. But some models, especially GLM-5, really have captured whatever circuitry drives pattern matching in the models they were trained off of.
I like this benchmark that competes models against one another in competitive environments, which seems like it can't really be gamed: https://gertlabs.com
> Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models
That’s exactly what I said, though. The headline we’re commenting under claims they’re Sonnet 4.5 level but they’re not.
I don’t disagree that they’re powerful for open models. I’m pointing out that anyone reading these headlines who expects a cheap or local Sonnet 4.5 is going to discover that it’s not true.
All models are doing that. Not only the open source ones.
I bet the cloud ones are doing it a lot more because they can also affect the runtime side which the open source ones can't.
I wouldn't mind them benchmaxing my queries.
I'm using Qwen 3.5 27b on my 4090 and let me tell you. This is the first time I am seriously blown away by coding performance on a local model. They are almost always unusable. Not this time though...
The 27B dense model is probably the best of the 3.5 lot, not absolutely but for perf:size. It's also pretty good at prose, which is a rarity for a Qwen.
You don't need a coding version of the model from Qwen? The base 3.5 works?
Are there any up-to-date offline/private agentic coding benchmark leaderboards?
If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over optimization.
Edit: These look decent and generally match my expectations:
"When a measure becomes a target, it ceases to be a good measure."
Goodhart's law shows up with people, in system design, in processor design, in education...
Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.
This is because of the forbidden argument in statistics. Any statistic, even something so basic as an average, ONLY works if you can guarantee the independence of the individual facts it measures.
But there's a problem with that: of course the existence of the statistical measure itself is very much a link between all those individual facts. In other words: if there is ANY causal link between the statistical measure and the events measured ... it has now become bullshit (because the law of large numbers doesn't apply anymore).
So let's put it in practice, say there's a running contest, and you display the minimum, maximum and average time of all runners that have had their turns. We all know what happens: of course the result is that the average trends up. And yet, that's exactly what statistics guarantees won't happen. The average should go up and down with roughly 50% odds when a new runner is added. This is because showing the average causes behavior changes in the next runner.
This means, of course, that basing a decision on something as trivial as what the average running time was last year can only be mathematically defensible ONCE. The second time the average is wrong, and you're basing your decision on wrong information.
But of course, not only will most people actually deny this is the case, this is also how 99.9% of human policy making works. And it's mathematically wrong! Simple, fast ... and wrong.
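The feedback effect described above can be illustrated with a toy simulation (all numbers made up purely for illustration): one run draws finishing times iid, the textbook assumption; the other lets each runner see the displayed average and, in this toy model, ease off toward a beatable average. The published statistic feeding back into behavior is exactly the broken-independence scenario.

```python
import random

def simulate(feedback, n=2000, seed=0):
    """Mean finishing time (minutes) over n runners.

    feedback=False: times are iid draws (independence holds).
    feedback=True : each runner sees the running average and paces
    against it, so the measure influences what it measures.
    """
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        t = rng.gauss(60.0, 5.0)           # iid "natural" time
        if feedback and times:
            avg = sum(times) / len(times)  # the displayed statistic
            # seeing a beatable average changes behavior: ease off
            t = max(t, 0.7 * t + 0.3 * avg + 1.0)
        times.append(t)
    return sum(times) / len(times)
```

With the same random seed, the feedback run drifts upward while the iid run stays near its true mean, which is the sense in which a statistic computed under feedback stops being the statistic you think you have.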
Depends on what you expect from the model. For coding/agentic tasks there is SWE Bench https://www.swebench.com/ which gives a better picture. MiniMax, GLM and Kimi K2 seem to be better models for this purpose than Qwen. And it matches my (limited) actual experience.
> they always disappoint in actual use.
I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.
Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.
At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.
Just going to echo this. Been using K2.5 in opencode as a switch away from Opus because it was too expensive for the sorts of things I was playing with, and it's been great. There's definitely a bit of learning to get the hang of what sort of prompts to give it and to make sure there's enough documentation in the project for it, but it's remarkably capable once you're in the swing of it.
Are you saying that the benchmarks are flawed?
And could quantization maybe partially explain the worse than expected results?
No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up. So you add specific problems to training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.
I have two of my own comments to add to that. The first is that there is a problem-alignment issue at play. Specifically, the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open-ended, with messy prompts and much steerage. The second is that it would be interesting to test older models on brand new benchmarks to see how those compare.
> No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.
That's a much better way to say it than I did.
These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have Product Managers and PR and marketing people under pressure to get people using them.
This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.
It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.
There should be a way to turn the questions we ask LLMs into benchmarks.
That way, we can have a benchmark that is always up to date.
There are a few “updating” benchmarks out there. I periodically take a look at these two:
The models outperform on the benchmarks relative to general tasks.
The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.
> And could quantization maybe explain the worse than expected results?
You can use the models through various providers on OpenRouter cheaply without quantization.
Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal as to the model's actual ability in practice.
Quantisation doesn't help, but even running full fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding uses: at least in my experience.
It's not just the open source ones.
The only benchmarks worth anything are dynamic ones which can be scaled up.
Hmm, I second this. Haven't compared Qwen3.5 122B yet, but played around with OpenCode + Qwen3-Coder-Next yesterday and did manual comparisons with Claude Code and Claude Code is still far ahead in general felt "intelligence quality".
Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.
How much compute do you need to make them work like Claude's Sonnet 4.5, but locally?
I've been trying to get these things running locally and using tools. Am I right in understanding that it's impossible for these things to use tools from within llama.cpp? Do I need another "thing" to run the models? What exactly is the mechanism by which the models become aware that they're somewhere where they have tools available? So many questions...
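For what it's worth, as I understand it: llama.cpp on its own just completes text, and tool use comes from the layer on top. `llama-server` exposes an OpenAI-compatible endpoint (recent builds support OpenAI-style tool calling, I believe behind the `--jinja` flag), the client sends JSON tool schemas with each request, the server's chat template renders those schemas into the prompt (that's the entire mechanism by which the model "knows" tools exist), and the model replies with a structured call that the client must parse and execute. A minimal sketch of such a request body, with a made-up example tool and no network involved:

```python
import json

# Hypothetical tool schema. The model never "discovers" tools: the
# client declares them like this on every request, the chat template
# injects them into the prompt, and the client runs whatever call
# the model emits in response.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # made-up example tool
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

request_body = {
    "model": "local",  # single-model server; name is mostly informational
    "messages": [{"role": "user", "content": "Open src/main.rs"}],
    "tools": tools,
}

payload = json.dumps(request_body)  # POST this to /v1/chat/completions
```

So yes, you generally need another "thing": an agent (opencode, aider, etc.) that sends the schemas, executes the calls, and feeds results back as messages.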
they're distilling claude and openai obviously.
that said, sonnet 4.5 is not a good model today, March 1st 2026. (it blew my mind on its release day, September 29th, 2025.)
> That said, they are impressive for open source models.
there is nothing open "source" about them. They are open weights, that's all.
Very good point. I'm playing with them too and got to the same conclusion.