Very preliminary testing is very promising: Haiku 4.5 seems far more precise in its code changes than the GPT-5 models, not ingesting code sections irrelevant to the task at hand, which tends to make GPT-5 as a coding assistant take longer than expected. With that being the case, it is possible that in actual day-to-day use Haiku 4.5 may be less expensive than the raw cost breakdown initially suggests, though the price increase over Haiku 3 is significant.
Branding is the true issue Anthropic has, though. Haiku 4.5 may (not saying it is, far too early to tell) be roughly equivalent in code output quality to Sonnet 4, which would serve a lot of users amazingly well, but given the connotations smaller models carry, alongside recent performance degradations making users more suspicious than before, getting them to adopt Haiku 4.5 over even Sonnet 4.5 will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters, and of course nerdy old me would like that to be public information for all models, but in fairness to the companies, many users would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance, and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus it seems after all.
For reference:
Haiku 3: I $0.25/M, O $1.25/M
Haiku 4.5: I $1.00/M, O $5.00/M
GPT-5: I $1.25/M, O $10.00/M
GPT-5-mini: I $0.25/M, O $2.00/M
GPT-5-nano: I $0.05/M, O $0.40/M
GLM-4.6: I $0.60/M, O $2.20/M
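As a purely illustrative sketch of how "less expensive in day-to-day use" could play out per task: the token counts below are hypothetical (a targeted read vs. a broad ingest of a repo); only the unit prices come from the list above.

    # Illustrative only: per-task cost at made-up token counts,
    # using the list prices above (USD per million input/output tokens).
    PRICES = {
        "Haiku 4.5":  (1.00, 5.00),
        "GPT-5":      (1.25, 10.00),
        "GPT-5-mini": (0.25, 2.00),
        "GLM-4.6":    (0.60, 2.20),
    }

    def task_cost(model: str, input_tok: int, output_tok: int) -> float:
        """Cost in USD of a single task at the given token counts."""
        inp, out = PRICES[model]
        return (input_tok * inp + output_tok * out) / 1_000_000

    # Hypothetical: 40k input tokens read with targeted ingestion vs. 120k read
    # indiscriminately, both producing 4k output tokens.
    print(f"Haiku 4.5, targeted read: ${task_cost('Haiku 4.5', 40_000, 4_000):.2f}")  # ~$0.06
    print(f"GPT-5, broad ingest:      ${task_cost('GPT-5', 120_000, 4_000):.2f}")     # ~$0.19

A higher unit price can still come out cheaper per task if the model reads less of the codebase; whether Haiku 4.5 actually behaves that way is exactly what needs more testing.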
One of the main issues I had with Claude Code (maybe it's the harness?) was that the agent tends to NOT read enough relevant code before it makes a change.
This leads to unnecessary helper functions instead of using existing helper functions and so on.
Not sure if it is an issue with the models, with the system prompts, or both.
This may have been fixed as of yesterday... Version 2.0.17 added a built-in "Explore" sub-agent that it seems to call quite a lot.
Helps solve the inherent tradeoff between reading more files (and filling up context) and keeping the context nice and tight (but maybe missing relevant stuff).
You might get better results with https://github.com/oraios/serena
I sometimes use it, but I've found that just adding something like "if you ever refactor code, try searching around the codebase to see if there is an existing function you can use or extend" to my claude.md works well enough.
> I sometimes use it, but I've found that just adding something like "if you ever refactor code, try searching around the codebase to see if there is an existing function you can use or extend" to my claude.md works well enough.
Wouldn't that consume a ton of tokens, though? After all, if you don't want it to recreate function `foo(int bar)`, it will need to find it, which means either running grep (takes time on large codebases) or actually loading all your code into context.
Maybe it would be better to create an index of your code and let it run some shell command that greps your ctags file, so it can quickly jump to the possible functions that it is considering recreating.
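A minimal sketch of that idea, assuming a `tags` file was generated beforehand with universal-ctags (e.g. by running `ctags -R -f tags .`); the symbol name is just illustrative:

    def lookup_symbol(name: str, tags_path: str = "tags") -> list[str]:
        """Return ctags entries (symbol<TAB>file<TAB>pattern) whose tag equals `name`."""
        hits = []
        with open(tags_path, encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("!"):  # skip the !_TAG_ metadata header lines
                    continue
                if line.split("\t", 1)[0] == name:
                    hits.append(line.rstrip("\n"))
        return hits

    # e.g. lookup_symbol("foo") tells the agent where foo() already lives,
    # without grepping the whole repo or loading every file into context.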
I agree, Claude is an impressive agent, but it seems like it's impatient and trying to do its own thing; it makes its own tests when I already have them, etc. Maybe it's better suited to a new project.
GPT-5 (at least with Cline) reads whatever you give it, then laser-targets the required changes.
With reasoning on High, as long as I actually provide enough relevant context, it usually one-shots the solution and sometimes even finds things I left out.
The only downside for me is it's extremely slow, but I still use it on anything nuanced.
> I agree, Claude is an impressive agent, but it seems like it's impatient and trying to do its own thing; it makes its own tests when I already have them, etc. Maybe it's better suited to a new project.
Nope, Claude will deviate from its own project as well.
Claude is brilliant but needs hard rules. You have to treat it like, and make it feel like, the robot it really is. Feed it a bit too much human prose in your instructions and it will start to behave like a teen.
I regularly use the @ key to add files to context for tasks where I know which files need edits or which patterns I want Claude to follow. It adds a few extra keystrokes, but in most cases the quality improvement is worth it.
Helper functions have exploded over the last few releases, I'd say. Very often I have to state: "combine this into one function".
Another thing I've seen start in the last few days: Claude now always draws ASCII art instead of a graphical image when explaining something, and the ASCII art is completely useless.
You can put in your rules what type of output to use for diagrams. Personally I prefer Mermaid: it can be rendered into an image, and read and modified by AI easily.
You need to plan tasks down to the function level, really, and review things.
Just writing code is often faster.
Not when you stink at writing code but you’re really good at writing specs
In that case you would be much more valued as a business analyst than as a developer.
People that can and want to write specs are very rare.
Update: Haiku 4.5 is not just very targeted in its changes but also really fast. Averaging 220 token/sec, it is almost double the speed of most other models I'd consider comparable (though again, far too early to make a proper judgement), and if this can be kept up, that is a massive value add over other models. For context, that is nearly Gemini 2.5 Flash Lite speed.
Yes, Groq and Cerebras get up to 1000 token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has historically been the most consistent at living up to its public benchmark results on my personal benchmarks, for what that is worth, so I am optimistic.
If speed, performance, and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer chains of tasks; beyond roughly 7 minutes, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well; if not, this really is a solid step in terms of efficiency, but I have not done any longer task testing yet.
That being said, it seems Anthropic once again has a rather severe issue casting a shadow over this release. From what I am seeing and what others are reporting, Claude Code currently counts Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also have not yet updated the Claude Code support pages to reflect the new model's usage limits [0]. I really think such information should be public by launch day, and I hope they can improve their tooling and overall testing; issues like this continue to detract from their impressive models.
[0] https://support.claude.com/en/articles/11145838-using-claude...
It's insanely fast. I didn't know it had even been released, but I went to select the Copilot SWE test model in VS Code and it was missing, and Haiku 4.5 was there instead. I asked for a huge change to a web app, and the output from Haiku scrolled faster than Windows could keep up. From a cold start. It wrote a huge chunk of code in about 40 seconds. Unreal.
P.S. It also got the code 100% correct on the one-shot.
P.P.S. Microsoft is pricing it at 30% of the cost of frontier models (e.g. Sonnet 4.5, GPT-5).
Hey! I work on the Claude Code team. Both PAYG and Subscription usage look to be configured correctly in accordance with the price for Haiku 4.5 ($1/$5 per M I/O tok).
Feel free to DM me your account info on twitter (https://x.com/katchu11) and I can dig deeper!
lol, I don't know if you work there or not, but directing folks to send their account info to a random Twitter address is not considered best practice.
Being charitable, let's assume parent wasn't talking about secrets.
What best practice? He can choose whether to send it or not. The guy is just offering some extra help here.
What's wrong with sending a username to someone?
Generally, there's nothing inherently wrong with sending a username, but directing people to a third-party social media platform rather than an official Anthropic email or support system does nothing to build trust that they actually work there.
Where do you get the 220 tokens/second? Genuinely curious, as that would be very impressive for a model comparable to Sonnet 4. OpenRouter is currently publishing around 116 tps [1].
I was just about to post that Haiku 4.5 does something I have never encountered before [0]: there is a massive delta in token/sec depending on the query. Some variance, including task-specific variance, is of course nothing new, but never as pronounced and reproducible as here.
A few examples, prompted at UTC 21:30-23:00 via T3 Chat [0]:
Prompt 1 — 120.65 token/sec — https://t3.chat/share/tgqp1dr0la
Prompt 2 — 118.58 token/sec — https://t3.chat/share/86d93w093a
Prompt 3 — 203.20 token/sec — https://t3.chat/share/h39nct9fp5
Prompt 4 — 91.43 token/sec — https://t3.chat/share/mqu1edzffq
Prompt 5 — 167.66 token/sec — https://t3.chat/share/gingktrf2m
Prompt 6 — 161.51 token/sec — https://t3.chat/share/qg6uxkdgy0
Prompt 7 — 168.11 token/sec — https://t3.chat/share/qiutu67ebc
Prompt 8 — 203.68 token/sec — https://t3.chat/share/zziplhpw0d
Prompt 9 — 102.86 token/sec — https://t3.chat/share/s3hldh5nxs
Prompt 10 — 174.66 token/sec — https://t3.chat/share/dyyfyc458m
Prompt 11 — 199.07 token/sec — https://t3.chat/share/7t29sx87cd
Prompt 12 — 82.13 token/sec — https://t3.chat/share/5ati3nvvdx
Prompt 13 — 94.96 token/sec — https://t3.chat/share/q3ig7k117z
Prompt 14 — 190.02 token/sec — https://t3.chat/share/hp5kjeujy7
Prompt 15 — 190.16 token/sec — https://t3.chat/share/77vs6yxcfa
Prompt 16 — 92.45 token/sec — https://t3.chat/share/i0qrsvp29i
Prompt 17 — 190.26 token/sec — https://t3.chat/share/berx0aq3qo
Prompt 18 — 187.31 token/sec — https://t3.chat/share/0wyuk0zzfc
Prompt 19 — 204.31 token/sec — https://t3.chat/share/6vuawveaqu
Prompt 20 — 135.55 token/sec — https://t3.chat/share/b0a11i4gfq
Prompt 21 — 208.97 token/sec — https://t3.chat/share/al54aha9zk
Prompt 22 — 188.07 token/sec — https://t3.chat/share/wu3k8q67qc
Prompt 23 — 198.17 token/sec — https://t3.chat/share/0bt1qrynve
Prompt 24 — 196.25 token/sec — https://t3.chat/share/nhnmp0hlc5
Prompt 25 — 185.09 token/sec — https://t3.chat/share/ifh6j4d8t5
I ran each prompt three times and got the same token/sec results for the respective prompt (within expected variance, meaning plus or minus less than 5%). Each used Claude Haiku 4.5 with "High reasoning". I will continue testing, but this is beyond odd. I will add that my very early evals leaned heavily into pure code output, where 200 token/sec is consistently possible at the moment, but it is certainly not the average as claimed before; there I was mistaken. That being said, even across a wider range of challenges, we are above 160 token/sec, and if you focus solely on coding, whether Rust or React, Haiku 4.5 is very swift.
[0] I normally don't use T3 Chat for evals; it's just easier to share prompts this way, though I was disappointed to find that the model information (token/sec, TTF, etc.) can't be enabled without an account. Also, these aren't the prompts I usually use for evals. Those I try to keep somewhat out of training by only using the paid-for API for benchmarks. As anything on Hacker News is most assuredly part of model training, I decided to write some quick and dirty prompts to highlight what I have been seeing.
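For anyone who wants to sanity-check token/sec numbers against the API directly rather than relying on a chat frontend's readout, a minimal sketch (assuming the official anthropic Python SDK; the model id string is a guess and may differ):

    import time
    import anthropic  # official Anthropic Python SDK

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def measure_tps(prompt: str, model: str = "claude-haiku-4-5") -> float:
        """Rough output tokens/sec for one streamed completion (includes time-to-first-token,
        so it is slightly conservative compared to pure generation speed)."""
        start = time.monotonic()
        with client.messages.stream(
            model=model,  # assumed model id; check Anthropic's model list
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for _ in stream.text_stream:  # drain the stream as it arrives
                pass
            final = stream.get_final_message()
        elapsed = time.monotonic() - start
        return final.usage.output_tokens / elapsed

    # e.g. print(round(measure_tps("Write a small Rust CLI that counts words."), 1))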
Interesting, and if they are using speculative decoding, that variance would make sense. Also, your numbers line up with what OpenRouter is now publishing: 169.1 tps [1].
Anthropic mentioned this model is more than twice as fast as Claude Sonnet 4 [2], which OpenRouter averaged at 61.72 tps [3]. If these numbers hold, we're really looking at an almost 3x improvement in throughput and less than half the initial latency.
[1] https://openrouter.ai/anthropic/claude-haiku-4.5 [2] https://www.anthropic.com/news/claude-haiku-4-5 [3] https://openrouter.ai/anthropic/claude-sonnet-4
That's what you get when you use speculative decoding and focus/overfit the draft model on coding. Then, when the answer is out of distribution for the draft model, you get increased token rejections by the main model and throughput suffers. This probably still makes sense for them if they expect a lot of their load to come from Claude Code and they need to make it economical.
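A back-of-the-envelope sketch of why acceptance rate matters so much here; the acceptance rates and draft length are made-up illustrative numbers, not anything Anthropic has published:

    def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
        """Expected tokens emitted per verification pass of the large model,
        assuming i.i.d. per-token acceptance (standard speculative decoding analysis)."""
        a, k = acceptance_rate, draft_len
        if a >= 1.0:
            return k + 1
        return (1.0 - a ** (k + 1)) / (1.0 - a)

    # Hypothetical draft model overfit to code: high acceptance on coding prompts,
    # much lower acceptance out of distribution, so fewer tokens per expensive pass.
    for label, a in [("in-distribution (code)", 0.85), ("out-of-distribution", 0.45)]:
        print(f"{label}: ~{expected_tokens_per_pass(a, draft_len=4):.1f} tokens per large-model pass")

With those made-up numbers the in-distribution case yields roughly twice as many tokens per large-model pass, which is about the spread in token/sec reported above.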
I'm curious to know if Anthropic mentions anywhere that they use speculative decoding. For OpenAI they do seem to use it based on this tweet [1].
> Everyone believes their task requires no less than Opus it seems after all.
I have solid evidence that it does. I have been using Opus daily, locally and on Terragonlabs, for Rust work since June (on the Max plan), and for a bit more than a week now I have been forced to use Sonnet 4.5 most of the time, because of [1] (see also my comments there, same handle as on HN).
Letting Sonnet do tasks on Terry unsupervised is kinda useless, as the fixes I have to do afterwards eat the time I saved by giving it the task in the first place.
TL;DR: Sonnet 4.5 sucks compared to Opus 4.1, at least for the type of work I do.
Because of the recent Opus usage restrictions Anthropic introduced on Max, I use Codex for planning/evaluation/detailed back-and-forth, then Sonnet for writing code, and then Opus during the small ~5h window each week to "fix" what Sonnet wrote.
I.e. turn its code from something that compiles and passes tests, mostly, into canonical, DRY, good Rust code that passes all tests.
Also: for simpler tasks, Opus-generated Rust code felt like I only needed to glance at it when reviewing. Sonnet-generated Rust code, as a matter of fact, requires line-by-line, full-focus checking.
This is an interesting perspective to me. For my work, Sonnet 4.5 is almost always better than Opus 4.1. Opus might still have a slight edge when it comes to complex edge-cases or niche topics, but that's about it.
And this is coming from someone who used to use Opus exclusively over Sonnet 4, as I found it was better in pretty much all ways other than speed. I no longer believe that with Sonnet 4.5. So it is interesting to hear that there may still be areas where Opus wins. But I would definitely say that this does not apply to my work on bash scripts, web dev, and a C codebase. I am loving using Sonnet 4.5.
I'm doing computer graphics code: 2D, 3D, all CPU (not GPU), and VFX-related. It's a niche topic for sure, often relating to research papers that come without code.
I.e. I can tell from the generated code on this vs. other 'topics' that the model has not seen much or any "prior art".
I could have phrased that a bit better, but I did mean that while there are use cases in which the delta between Haiku, Sonnet, Opus, or another provider's model is clear, this is not the case for every task.
In my experience, yes, Opus 4 and 4.1 are significantly more reliable for producing C and Rust code. But just because that is the case doesn't mean these should be the models everyone reaches for. Rather, we should make a judgement based on use case, and for simpler coding tasks with a focus on TypeScript, the delta between Sonnet 4.5 and Opus 4.1 (still too early to verifiably throw Haiku 4.5 into the ring) is not big enough in my testing to justify consistently reaching for the latter over the former.
This issue has been exacerbated by the recent performance degradations across multiple Sonnet and Opus models, during which many users switched between the two in an attempt to rectify the problem. Because the issue was sticky (once it affected a user, it was likely to continue to do so due to the backend setup), some users saw a significant jump in performance switching from e.g. Sonnet 4.5 to Opus 4.1, leading them to conclude that what they were doing must require the Opus model, despite their tasks not justifying that had Sonnet not been degraded.
I did not comment on that while it was going on, as I was fortunate enough not to be affected and thus could not replicate it, but it was clear that something was wrong: the prompts and outputs of those with degraded performance were commonly shared, and I could verify to my satisfaction that this was not merely bad prompting on their part. In any case, this experience strengthened some users' belief that a project which may be served equally well by e.g. Sonnet 4.5 in its now-fixed state does necessitate Opus 4.1, which leads to them not benefiting from the better pricing. With Haiku being an even cheaper (and, in the eyes of some, automatically worse) model, and with Haiku's past versions not being very performant at coding tasks, this may lead many to forgo it by default.
Lastly, lest we forget, I think it is fair to say that the gap between the most in-the-weeds and the least informed developers is very different for Rust than for React+TS ("vibe coding" completely off to the side).
There are amazing TS devs, incredibly knowledgeable and truly capable, who will take the time and have the interest to properly evaluate and select tools, including models, based on their experience and needs. And there are TS devs who just use this as a means to create a product, are not that experienced, tend to ask a model to "setup vite projet superthink" rather than run the command themselves, regularly reinvent TDD as if solid practices were something only needed for LLM assistance, and may just continue to use Opus 4.1 because during a few-week window people said it was better, even if they may have started their project after the degradation had already been fixed. Path dependence: doing things because others did them, so we just continue doing them ...
The average Rust or (even more so) C dev, I think it is fair to say, will have a more comprehensive understanding, and I'd argue is less likely to choose e.g. Opus over Sonnet simply because they "believe" that is what they need. Like you, they will do a fair evaluation and then make an informed rather than a gut decision.
The best devs in any language are likely not that dissimilar in the experience and care with which they approach new tooling (if they are so inclined, which is a topic for another day), but the less skilled devs are likely very different in this regard depending on the language.
Essentially, it was a bit of hyperbole and never meant to apply to literally every dev in every situation, regardless of their tech stack, skill, or willingness to evaluate. Anyone who consistently tests models on their specific needs and goes for what they have the most consistent success with, rather than simply selecting the biggest, most modern, or most expensive model for every situation, is an exception to that overly broad statement.
Been waiting for the Haiku update, as I still do a lot of dumb work with the old one, and it is darn cheap for what you get out of it with smart prompting. Very neat that they finally released this; updating all my bots... sorry, agents :)
Those numbers don’t mean anything without average token usage stats.
Exactly. Token-per-dollar rates are useful, but without knowing the typical input/output token distribution for each model on this specific task, the numbers alone don't give a full picture of cost.
That's how they lie to us. Companies can advertise cheap prices to lure you in, but they know very well how many tokens you're going to use on average, so they will still make more profit than ever, especially if you're using any kind of reasoning model, which is just a blank check for them to print money.
I don't think any of them are profitable, are they? We're in the losing-money-to-gain-market-share phase of this industry.
Fair point, of course, and it is still far too early to make a definitive statement, but in my still limited experience throughout the night, I have seen Haiku 4.5 be far better than e.g. the GPT-5 models at using what I'd consider a justifiable amount of input tokens. Sonnet's recent versions had also been better on this front than OpenAI's current best, but I try (and do not always succeed) to take prior experience and expectation out of the equation when evaluating models.
Additionally, the Artificial Analysis cost-to-run-the-benchmark-suite numbers are very encouraging [0], and Haiku 4.5 without reasoning is always an option too. I have tested that even less, but there is some indication that reasoning may not be necessary for reasonable output performance [1][2][3].
In retrospect, I perhaps would have been served better by starting with "reasoning" disabled; I will have to do some self-blinded comparisons between model outputs over the coming weeks to rectify that. I am trying my best not to make a judgement yet, but compared to other recent releases, Haiku 4.5 strikes a very interesting, even balance.
GPT-5 models were and continue to be encouraging for price/performance, with a reliable 400k context window and good prompt adherence over multi-minute (beyond 10) tasks, but from the start they weren't the fastest and they ingest every token there is in a codebase with reckless abandon.
No Grok model has ever performed for me the way it seemed to during the initial hype.
GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc., so if you can afford something more reliable I'd go for that, but it is encouraging.
Recent Anthropic releases were good in code output quality, but not as reliable beyond 200k as GPT-5, not exactly fast either when looking at token/sec (though task completion generally takes less time due to more efficient ingestion than GPT-5), and of course rather expensive.
Haiku 4.5, if they can continue to offer it at such speeds, with such low latency, and at this price, coupled with encouraging initial output quality and efficient ingestion of repos, seems to be designed in a far more balanced manner, which I welcome. Of course, with 200k being a hard limit, that is a clear downside compared to GPT-5 (and Gemini 2.5 Pro, though that has its own reliability issues in tool calling), and I have yet to test whether it can go beyond 8 minutes on chains of tool calls with intermittent code changes without suffering degradation similar to other recent Anthropic models, but I am seeing the potential for solid value here.
[0] https://artificialanalysis.ai/?models=gpt-5-codex%2Cgpt-5-mi...
[1] Claude 4.5 Haiku 198.72 tok/sec 2382 tokens Time-to-First: 1.0 sec https://t3.chat/share/35iusmgsw9
[2] Claude 4.5 Haiku 197.51 tok/sec 3128 tokens Time-to-First: 0.91 sec https://t3.chat/share/17mxerzlj1
[3] Claude 4.5 Haiku 154.75 tok/sec 2341 tokens Time-to-First: 0.50 sec https://t3.chat/share/96wfkxzsdk
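As a side note on running it without reasoning: with the anthropic Python SDK, extended thinking is opt-in per request, so comparing the two modes is just toggling one parameter. The model id and token budgets below are assumptions, not official recommendations:

    import anthropic

    client = anthropic.Anthropic()

    def ask(prompt: str, reasoning: bool):
        kwargs = dict(
            model="claude-haiku-4-5",  # assumed model id; check Anthropic's model list
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        if reasoning:
            # Extended thinking is opt-in; omitting this block runs the model without reasoning.
            kwargs["thinking"] = {"type": "enabled", "budget_tokens": 2048}
        return client.messages.create(**kwargs)

    # e.g. compare ask(p, reasoning=True) vs. ask(p, reasoning=False) on the same prompt.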
> GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc., so if you can afford something more reliable I'd go for that, but it is encouraging.
Funny you should say that, because while it is a large model, GLM 4.5 is at the top of Berkeley's Function Calling Leaderboard [0] and has one of the lowest costs. I can't comment on speed compared to those smaller models, but the Air version of 4.5 is similarly highly ranked.