Gemma 4, in my view, is good enough to do things similar to Gemini 2.5 Flash, meaning if I point it at code and ask for help and there is a problem with the code, it'll answer correctly in terms of suggestions. But it's not great at using all the tools, or at one-shotting things that require a lot of context or "expert knowledge".
If, a couple more iterations of this down the line, say Gemma 6 is as good as current Opus and runs completely locally on a Mac, I won't really bother with the cloud models.
That’s a problem.
For the others anyway.
I agree. At first I was really turned off by the Gemma 4 line of models because they didn’t function with coding agents as well as the qwen3.5 line of models. However, I found that for other use cases Gemma 4 was very good.
EDIT: I just saw this: "Ollama 0.20.6 is here with improved Gemma 4 tool calling!" I will rerun my tests after breakfast.
Even Gemma 4 E2B is more useful than you'd think if you give it the right harness. I've been running it on Android via llama.rn and it handles function calling natively — the model outputs structured tool calls without any prompt engineering. Won't replace Opus for hard reasoning but for a mobile app that needs to pick a tool and run it, the cost math is hard to argue with. $0/query forever.
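To be concrete about what I mean by "the right harness", here's roughly its shape, sketched in Python with a hypothetical generate() standing in for whatever local runtime you use (llama.rn, llama.cpp, etc.); the tool names and prompt are made up for illustration:

    import json

    # Hypothetical stand-in for the local runtime (llama.rn, llama.cpp, ...).
    def generate(prompt: str) -> str:
        raise NotImplementedError("wire up your local model here")

    # Tiny tool registry the small model picks from (made-up example tools).
    TOOLS = {
        "get_battery_level": lambda args: {"percent": 82},
        "set_alarm": lambda args: {"ok": True, "time": args["time"]},
    }

    SYSTEM = (
        "You can call tools. Reply ONLY with JSON of the form "
        '{"tool": "<name>", "arguments": {...}}. Available tools: '
        + ", ".join(TOOLS)
    )

    def run_turn(user_msg: str) -> dict:
        raw = generate(f"{SYSTEM}\n\nUser: {user_msg}\nAssistant:")
        try:
            call = json.loads(raw)  # small models need this kept strict
            result = TOOLS[call["tool"]](call.get("arguments", {}))
        except (json.JSONDecodeError, KeyError) as e:
            return {"error": f"bad tool call: {e}", "raw": raw}
        return {"tool": call["tool"], "result": result}

With Gemma-class models, the strict "JSON only" contract plus a retry on parse failure is what makes this reliable, not any cleverness in the model itself.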
similar vibes as "640k ought to be enough for anybody"
I think the difference is that with LLMs, in a lot of cases you do see some diminishing returns.
I won't deny that the latest Claude models are fantastic at just one shotting loads of problems. But we have an internal proxy to a load of models running on Vertex AI and I accidentally started using Opus/Sonnet 4 instead of 4.6. I genuinely didn't know until I checked my configuration.
AI models will get to this point where for 99% of problems, something like Gemma is gonna work great for people. Pair it up with an agentic harness on the device that lets it open apps and click buttons and we're done.
I still can't fathom that we're in 2026 in the AI boom and I still can't ask Gemini to turn shuffle mode on in Spotify. I don't think model intelligence is as much of an issue as people think it is.
100% agree here. The actual practical bottleneck is harness and agentic abilities for most tasks.
It's the biggest thing that stuck out to me using local AI with open source projects vs Claude's client. The model itself is good enough I think - Gemma 4 would be fine if it could be used with something as capable as Claude.
And that's gonna stay locked down unfortunately, especially on mobile and in cars - it needs access to APIs to do that stuff, and not just regular APIs that were built for traditional invocation.
The same way that websites are getting llms.txt files, I think APIs will also evolve.
I'm not sure I understand your last paragraph? The two sentences seem to contradict?
GPT 3.5 was intelligent enough to understand that command and turn it into a correctly shaped JSON object; the platforms don't have tight enough integration to take advantage of the intelligence.
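For example (the function name and schema here are invented for illustration, not any real Spotify or Gemini API), a 3.5-class model can already reliably emit something like:

    # What the model produces from "turn on shuffle in Spotify".
    # The missing piece is a platform hook that routes this to the Spotify app.
    tool_call = {
        "name": "spotify_set_shuffle",
        "arguments": {"enabled": True},
    }

The hard part isn't generating that object; it's that nothing on the phone is wired up to receive it.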
Agree on the diminishing returns, the Opus 4.6 anecdote is a good signal.
I think security is the issue: AI is good at circumventing it. For example, AI can read paywalled articles you cannot. Do you really want AI to have 'free range'?
I mean, to me even the difference between Opus and Sonnet is as clear as day and night, and likewise between Opus and the best GPT model. Opus 4.6 just seems much more reliable in terms of me asking it to do something and that actually happening.
It depends what you're asking it though. Sure, in a software development environment the difference between those two models is noticeable.
But think about the general user. They're using the free Gemini or ChatGPT. They're not using the latest and greatest. And they're happy using it.
And I am willing to bet that a lot of paying users would be served perfectly fine by the free models.
If a capable model is able to live on device and solve 99% of people's problems, then why would the average person ever need to pay for ChatGPT or Gemini?
But there are also other tasks, like research, where dates are important, little details and connections are important, and reasoning is important: background research activities or usage of tools outside of software development. This is where I am finding LLMs most useful in my life.
Even Opus makes mistakes with dates, or fails to understand news correctly in its chronological context, and it would be even worse with smaller, less capable models.
Scheduling, planning, researching products, shopping, trip plans, etc...
You're quick to say "to me" in your comparison.
My experience is very different than yours. Codex and CC yield very different results, both because of the harness differences and the model differences, but neither is noticeably better than the other.
Personally, I like Codex better just because I don't have to mess with any sort of planning mode. If I imply that it shouldn't change code yet, it doesn't. CC is too impatient to get started.
I guess yes, that's a harness difference, and you can also configure CC as a harness to behave very differently. But still, with the same harness and guidance, "to me" there's still a difference between Opus 4.6 and e.g. GPT 5.4 (or which GPT model do you use?). I've been using Claude Code, Codex and OpenCode as harnesses presently, but for serious long-running implementation I feel like I can only really rely on CC + Opus 4.6.
Yes 5.4
Perhaps Opus is superior and I'm just jaded.
I come from Cursor before having adopted the TUI tools. Opus was nothing short of pathetic in their environment compared to the -codex models. I would only use it for investigations and planning because it was faster.
Like you've said, though, that could just be a harness issue.
I have the opposite experience. Codex gets to work much faster than Claude Code. Also I've never seen the need to use planning mode for Claude. If it thinks it needs a plan it will make one automatically.
I'll drink to the idea that it's all in my head.
Well you can do a lot with 640k…if you try. We have 16G in base machines and very few people know how to try anymore.
The world has moved on, that code-golf time is now spent on ad algorithms or whatever.
Escaping the constraint delivered a different future than anticipated.
> you can do a lot with 640k…if you try.
it is economically not viable to try anymore.
"XYZ Corp" won't allow their developers to write their desktop app in Rust because they want to consume only 16MB RAM, then another implementation for mobile with Swift and/or Kotlin, when they can release good enough solution with React + Electron consuming 4GB RAM and reuse components with React Native.
Strangely enough, AI could turn this on its head. You can have your cake and eat it too, because you can tell Claude/Codex/whatever to build you a full-featured Swift version for iOS and Kotlin for Android and whatever you want on Windows and Mac. There's still QA for the different builds, but you already have to QA each platform separately anyway if you really care that they all work, so in theory that doesn't change.
Of course, it's never that simple in reality; you need developers who know each platform for that to work, because you must run the builds and tell the AI what it's doing wrong and iterate. Currently, you can probably get away with churning out Electron slop and waiting for users to complain about problems instead of QAing every platform. Sad!
My Commodore 64 begs to differ.
Especially if the 640k are "in your hand" and the rest is "in the cloud"
The simple fact is that a 16 GB RAM stick costs much less than the development time to make the app run on less.
> The simple fact is that a 16 GB RAM stick costs much less than the development time to make the app run on less.
The costs are borne by different people: development by the company, RAM sticks by the customer.
A company is potentially (silently?) adding to the cost of the product/service that the customer has to bear, by their needing to have more RAM (or having the same amount, but not being able to do as much with it).
Yep, and since companies care about TCO, they reward the software with the lower TCO, which happens to be the one that uses more RAM but is cheaper to produce.
Until RAM prices increase significantly.
One stick does. How about all the sticks needed for all the people who want to run the software?
Still cheaper, since it amortizes over all the software.
Some software has millions or even billions of users. The cost of 16 GB multiplied by millions or billions would pay for a lot of refactoring.
That said, I think it’s more of a collective action problem. The person who could pay for the refactor to operate in 640 K is not the same person who has to pay for the 16 GB. And yes, the 16 GB is cheap enough in comparison to other costs that the latter group doesn’t necessarily notice that they are subsidizing inefficient development.
I think stavros means amortization on an individual level - if all software is bloated and requires 16GB to run then my expense for a 16GB stick is not caused by a single piece of software, but everything I use.
Not that I agree of course :) I’m talking more of the net negative of everyone needing to buy 16gb sticks so developers can YOLO vibe-coded unoptimized garbage. But at least I think the former explanation is what stavros meant :)
People get hung up on bad optimization. If you are working at sufficiently large scale, yes, thinking about bytes might be a good use of your time.
But most likely, it's not. At a system level we don't want people to do that. It's a waste of resources. Making a virtue out of it is bad, unless you care more about bytes than humans.
These bytes are human lives. The bytes and the CPU cycles translate to software that takes longer to run, that is more frustrating, that makes people accomplish less in more time than they could, or should. Take too much, and you prevent them from using other software in parallel, compounding the problem. Or you're forcing them to upgrade hardware early, taking away money they could better spend in different areas of their lives. All this scales with the number of users, so for most software with any user base, not caring about bytes and cycles wastes far more people-hours than it saves in dev time.
Creating people able to do these optimizations also costs human life, which is then not spent on other things, like building the unoptimized version of another product.
We're not talking about writing assembly by hand here. If your software has a million daily users and wastes a minute of their day, that's about 9 work-years of labour wasted every single day.
In a 5-year lifecycle that's about 10,000 years of human labour wasted. Yes, I had to quadruple-check this myself.
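The back-of-the-envelope version, assuming a ~2,000-hour work-year and ~250 workdays a year (which is roughly what gets you to the 10,000 figure):

    users = 1_000_000          # daily users
    wasted_min_per_day = 1     # minutes wasted per user per day
    work_year_hours = 2_000    # rough hours in a work-year
    workdays_per_year = 250

    hours_per_day = users * wasted_min_per_day / 60
    work_years_per_day = hours_per_day / work_year_hours        # ~8.3 work-years/day
    lifecycle = 5 * workdays_per_year * work_years_per_day      # ~10,400 over 5 years
    print(round(work_years_per_day, 1), round(lifecycle))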
Does it take 10,000 work-years of effort, per project, to train its developers to write reasonably performant code?
Of course not all of this would translate into actual productivity gains but it doesn't have to.
What world are you living in where the median piece of software has a million users? Or even a hundredth of that?
I'm not talking about the median piece of software with 2 users and 0.1 developers (I made that up).
The ones that stick out are actively maintained, widely used, and well funded. It doesn't have to be a million active users, but they should be the first to get their act together.
It's what many software companies dream of and aim for. But the same argument works with 10k users, or even 1k users.
You are failing to consider the opportunity cost of how much more work-years can be saved by making a new feature.
Look at the whole history of computing. How many times has the pendulum swung from thin to fat clients and back?
I don't think it's even mildly controversial to say that there will be an inflection point where local models get Good Enough and this iteration of the pendulum shall swing to fat clients again.
Assuming improvements in LLMs follow a sigmoid curve, even if the cloud models are always slightly ahead in terms of raw performance it won't make much of a difference to most people, most of the time.
The local models have their own advantages (privacy, no -as-a-service model) that, for many people and orgs, will offset a small performance advantage. And, of course, you can always fall back on the cloud models should you hit something particularly chewy.
(All IMO - we're all just guessing. For example, good marketing or an as-yet-undiscovered network effect of cloud LLMs might distort this landscape).
More than "a 3 year old laptop is fine"
My thinkpad is nearly 10 years old, I upgraded it to 32GB of ram and have replaced the battery a couple of times, but it's absolutely fine apart from that.
If AI which was leading edge in 2023 can run on a 2026 laptop, then presumably AI which is leading edge in 2026 will run on a 2029 laptop. Given that 2023 was world-changing, that capacity is now on today's laptop.
Either AI grows exponentially in which case it doesn't matter as all work will be done by AI by 2035, or it plateaus in say 2032 in which case by 2035 those models will run on a typical laptop.
The economy is, more or less, a competition.
If someone gets a really great axe and are happy with it, that’s great for them.
But then, other people will be on bulldozers.
They can say they are happy with the axe, but then they are not in the competition at that point.
I think the article was wondering how many billion dollar bulldozers the world needs. My local hardware store sells a variety of axes. I myself am a happy ax user. I even replace them.
> it’s not great at using all tools
Glad it wasn't just me - I was impressed with the quality of Gemma 4 - it just couldn't write the changes to file 9/10 times when using it with opencode.
https://huggingface.co/google/gemma-4-31B-it/commit/e51e7dcd...
There was an update to tool calling 3 days ago. I haven't tested it myself but hope it helps.
Wow, that is so much better! I didn't exactly test it extensively, but my issues are gone.
Hmm, is there an updated ONNX?
> it just couldnt write the changes to file 9/10 times when using it with opencode
You might want to give this a try, it dramatically improves Edit tool accuracy without changing the model: https://blog.can.ac/2026/02/12/the-harness-problem/
Yep, and to be honest we don't really need local models for intensive tasks. At least not yet. You can use OpenRouter (and others) to consume a wide variety of open models which are capable of using tools in an agentic workflow, close to the SOTA models, and which are essentially commodities: many providers, each serving the same model and competing with each other on uptime, throughput, and price. At some point we will be able to run them on commodity hardware, but for now the fact that we can have competition between providers is enough to ensure that rug pulls aren't possible.
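To make it concrete, this is all it takes today to hit an open model through OpenRouter's OpenAI-compatible endpoint (the model slug and key below are placeholders; pick whichever open model you like):

    from openai import OpenAI  # pip install openai

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",  # placeholder
    )

    resp = client.chat.completions.create(
        model="some-open-model",  # placeholder slug; any open model served by multiple providers
        messages=[{"role": "user", "content": "Summarize this diff and draft a commit message."}],
    )
    print(resp.choices[0].message.content)

Because several providers serve the same weights behind the same slug, switching providers doesn't change a line of your code.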
Plus having Gemma on my device for general chat ensures I will always have a privacy respecting offline oracle which fulfils all of the non-programming tasks I could ever want. We are already at the point where the moat for these hyper scalers has basically dissolved for the general public's use case.
If I was OpenAI or Anthropic I would be shitting my pants right now and trying every unethical dark pattern in the book to lock in my customers. And they are trying hard. It won't work. And I won't shed a single tear for them.
Local models seem somewhere between 9 and 24 months behind. I'm not saying I won't be impressed with what online models will be able to do in two years, but I'm pretty satisfied with the prediction that I won't really need them in a couple of years.
We still aren't going to be putting 200gb ram on a phone in a couple years to run those local models.
HBF (high-bandwidth flash) is coming fast, with the first examples expected to be sampling to users this year.
Flash memory as a storage technology can be optimized to be as fast as, and more energy-efficient than, DRAM at large linear reads; there was just little demand before, because doing so costs you ~half of your density and doesn't improve your writes at all. All the flash memory manufacturers realized that this is a huge opportunity for model weights and are now chasing it.
Or in other words, after the initial price peak stabilizes in a few years, it will be reasonable to put ~500GB of weights into a device for ~$100 in memory costs.
We don’t need 200gb of RAM on a phone to run big models. Just 200 GB of storage thanks to Apple’s “LLM in a flash” research.
Yes, I agree that this is the right solution, because for a locally-hosted model I value the quality of the output more than the speed with which it is produced, so I prefer the models as they were originally trained, not with further quantizations.
While that paper praises the Apple advantage in SSD speed, which allows a decent performance for inference with huge models, nowadays SSD speeds equal or greater than that can be achieved in any desktop PC that has dual PCIe 5.0 SSDs, or even one PCIe 5.0 and one PCIe 4.0 SSDs.
Because I had also independently reached this conclusion, like I presume many others, I started working a week ago on modifying llama.cpp to use weights stored on SSDs in an optimal manner, while also batching many tasks so that they share each pass over the SSDs. I assume that in the following months we will see more projects in this direction, so the local hosting of very large models will become easier and more widespread, allowing the avoidance of the high risks associated with external providers, like the recent enshittification of Claude Code.
> While that paper praises the Apple advantage in SSD speed, which allows a decent performance for inference with huge models, nowadays SSD speeds equal or greater than that can be achieved in any desktop PC that has dual PCIe 5.0 SSDs, or even one PCIe 5.0 and one PCIe 4.0 SSDs.
Apple’s advantage is their unified memory architecture, where the CPU, GPU and Neural Engine share the same memory and the SSD is directly connected to the SoC, with less latency than PCIe. Memory bandwidth starts at 300+ GB/s.
In an optimized implementation of model inference, the latency of SSD access has no importance, because no random accesses are done.
The purpose of optimizing model inference for weights stored on SSDs is to achieve a continuous reading from SSDs at the maximum throughput provided by hardware, taking care that any computations and any accesses to the main memory are overlapped over the SSDs reading.
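Rough numbers for why this is viable at all (every figure below is an assumption for the sketch, not a measurement):

    # Ballpark, not a benchmark: streaming model weights from SSD per token.
    ssd_bandwidth_gb_s = 25        # assumed: two PCIe 5.0 SSDs read in parallel
    active_bytes_per_token = 20e9  # assumed: weights that must be streamed per token
                                   # (active experts of a large MoE, ~8-bit quantized)

    seconds_per_token = active_bytes_per_token / (ssd_bandwidth_gb_s * 1e9)
    print(f"~{1 / seconds_per_token:.1f} tokens/s if reads fully overlap compute")

Batching several prompts through the same pass over the weights then raises aggregate throughput further, which is the point of the llama.cpp modification mentioned above.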
That amount of RAM won’t be necessary. Gemma 4 and comparably sized Qwen 3.5 models are already better than the very best, biggest frontier models were just 12-18 months ago. Now in an 18-36GB footprint, depending on quantization.
A lot of people are making the mistake of noticing that local models have been 12-24 months behind SotA ones for a good portion of the last couple years, and then drawing a dotted line assuming that continues to hold.
It simply.. doesn't. The SotA models are enormous now, and there's no free lunch on compression/quantization here.
Opus 4.6 capabilities are not coming to your (even 64-128gb) laptop or phone in the popular architecture that current LLMs use.
Now, that doesn't mean that a much narrower-scoped model with very impressive results can't be delivered. But that narrower model won't have the same breadth of knowledge, and TBD if it's possible to get the quality/outcomes seen with these models without that broad "world" knowledge.
It also doesn't preclude a new architecture or other breakthrough. I'm simply stating it doesn't happen with the current way of building these.
edit: forgot to mention the notion of ASIC-style models on a chip. I haven't been following this closely, but last I saw the power requirements are too steep for a mobile device.
Don’t underestimate the march of technology. Just look at your phone, it has more FLOPS than there were in the entire world 40 years ago.
And I think it's very likely that with improved methods you could get Opus 4.6 level performance on a wrist watch in a few years.
You needed a supercomputer to win at chess, until you didn't.
Currently, local models' performance in natural language is much better than any algorithm running on a supercomputer cluster just a few years ago.
Yeah, but that's the current state of the art after decades of aggressive optimizations; there's no foreseeable future where we'll ever be able to cram several orders of magnitude more RAM into a phone.
We already cram orders of magnitude more flash storage into a phone than RAM (e.g. my phone has 16 GB RAM but 1 TB storage); even now, with some smart coding, if you don't need all that data at the same time for random access at sub-millisecond speed, it's hard to tell the difference.
Agreed. Apple sells an iPhone Pro Max with 2 TB of storage.
But it doesn't have that many more FLOPS than it did a couple of years ago.
Would the model even need that breadth of knowledge? Humans just look things up in books or on Wikipedia, which you can store on a plain old HDD, not VRAM. All books ever written fit into about 60TB if you OCR them, and the useful information in them probably into a lot less; that's well within the range of consumer technology.
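Quick sanity check on that 60TB figure, with rough (not authoritative) inputs:

    distinct_books = 130e6    # roughly the old Google Books estimate of titles ever published
    ocr_text_mb = 0.5         # ~0.5 MB of plain text per average book
    total_tb = distinct_books * ocr_text_mb / 1e6
    print(total_tb, "TB")     # ~65 TB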
Pretty sure there’s at least a couple orders of magnitude in purely algorithmic areas of LLM inference; maybe training, too, though I’m less confident here. Rationale: meat computers run on 20W, though pretraining took a billion years or so.
There's been plenty of free lunch shrinking models thus far with regards to capability vs parameter count.
Contradicting that trend takes more than "It simply.. doesn't."
There's plenty of room for RAM sizes to double along with bus speed. It idled for a long time as a result of limited need for more.
The gap between SOTA models and open / local models continues to diminish as SOTA is seeing diminishing returns on scaling (and that seems to be the main way they are "improving"), whereas local models are making real jumps. I'm actually more optimistic local models will catch up completely than I am SOTA will be taking any great leaps forward.
> We still aren't going to be putting 200gb ram on a phone in a couple years to run those local models.
You can already buy an iPhone with 2 TB of storage. The CPU, GPU and Neural Engine all share the same pool of RAM and the SSD is directly connected to all of this. You won’t need 200 GB of RAM to run local models when you essentially have 500 GB of virtual memory.
> if I point it at code and ask for help and there is a problem with the code it'll answer correctly in terms of suggestions
Could I ask how you do that? I installed openclaw and set it to use Gemma 4, but it didn't act in an agent mode at all; it only responded in the chat window while doing nothing, and didn't read any files or do anything that you wrote (though I see you do mention that it's not great at using all tools). What are you using exactly?
I had the same issues. I had to tell it to use sub agents explicitly, and instead of saying set a cron say set an openclaw cron.
I generally do like the model, it’s not a great agent though.
It’s good for summarization tasks, small tool use, and has pretty good world knowledge, though it does hallucinate.
But that difference atm is the difference between it being OK on its own with a team of subagents given good enough feedback / review mechanisms or having to babysit it prompt by prompt.
By the time Gemma 6 allows you to do the above, the proprietary models supposedly will already be on the next step change. It just depends on whether you need to ride the bleeding edge, but especially because it's "intelligence", there's an obvious advantage in using the best version, and it's easy to hype it up and generate FOMO.
> But that difference atm is the difference between it being OK on its own with a team of subagents given good enough feedback
Do people actually build meaningful things like that?
It's basically impossible to leave any AI agent unsupervised, even with an amazing harness (which is incredibly hard to build). The code slowly rots and drifts over time if not fully reviewed and refactored constantly.
Even if teams of agents working almost fully autonomously were reliable from a functional perspective (they would build a functional product), the end product would have ever increasing chaos structurally over time.
I'd be happy to be proven wrong.
When that happens, you'll have fomo from not using opus 5.x. The numbers that they showed for Mythos show that the frontier is still steadily moving (and maybe even at a faster pace than before)
I would be surprised by that behavior even for 10% of people doing real AI-usable work. Very few people buy a new motherboard or CPU or graphics card every 3 months.
Even now just because the latest Anthropic is super great doesn't mean people are not using other models. Not everyone is subscribed to only the best.
There is a cognitive ceiling for what you can do with smaller models. Animals with simpler neural pathways often outperform what we think they are capable of, but there's no substitute for scale. I don't think you'll ever get a 4B or 8B model equivalent to Opus 4.6. Maybe just for coding tasks, but certainly not Opus' breadth.
The only thing that we are sure can't be highly compressed is knowledge, because you can only fit so much information in a given entropy budget without losing fidelity.
The minimal size limits of reasoning abilities are not clear at all. It could be that you don't need all that many parameters. In which case the door is open for small focused models to converge to parity with larger models in reasoning ability.
If that happens we may end up with people using small local models most of the time, and only calling out to large models when they actually need the extra knowledge.
> and only calling out to large models when they actually need the extra knowledge
When would you want a lossy encoding of lots of data bundled together with your reasoning? If it is true that reasoning can be done efficiently with fewer parameters, it seems like you would always want it operating normal data search and retrieval tools to access knowledge, rather than risk hallucination.
And re: this discussion of large data centers versus local models, do recall that we already know it's possible to make a pretty darn clever reasoning model that's small and portable and made out of meat.
I find it difficult to understand the distinction between parametric knowledge and reasoning skills in LLMs. I still think of them as distinct, but I understand there is significant overlap. Arguably, they are the same thing in LLMs. So I would assume that if reasoning is high quality, using RAG could be logical (if much slower). However, if the lack of parametric knowledge impacts reasoning, then use of larger models seems warranted. A dumb LLM wouldn't offer sufficient results even with all the RAG in the world.
> we already know it's possible to make a pretty darn clever reasoning model
There is a problem though: we know that it is possible, but we don't know how (at least not yet, and as far as I am aware). So we know the answer to the "what?" question, but we don't know the answer to the "how?" question.
I would call brains with the needed support infrastructure small.
I think you underestimate the amount of knowledge needed to deal with the complexities of language in general as opposed to specific applications. We had algorithms to do complex mathematical reasoning before we had LLMs, the drawback being that they require input in restricted formal languages. Removing that restriction is what LLMs brought to the table.
Once the difficult problem of figuring out what the input is supposed to mean was somewhat solved, bolting on reasoning was easy in comparison. It basically fell out with just a bit of prompting, "let's think step by step."
If you want to remove that knowledge to shrink the model, we're back to contorting our input into a restricted language to get the output we want, i.e. programming.
except you don't want knowledge in the model, and most of that "size" comes from "encoded knowledge", i.e. overfitting. The goal should be to have only language handling in the model, and the knowledge in a database you can actually update, analyze, etc. It's just really hard to do.
"world models" (for cars) maybe make sense for self driving, but they are also just a crude workaround to have a physics simulation to push understanding of physics. Through in difference to most topics, basic, physics tend to not change randomly and it's based on observation of reality, so it probably can work.
Law, health advice, programming stuff etc., on the other hand, change all the time and are all based on what humans wrote about them. Which in some areas (e.g. law or health) is very commonly outdated, wrong, or at least incomplete in a dangerous way. And programming changes all the time.
Having this separation of language processing and knowledge sources is... hard; language is messy and often interleaved with information.
But this is most likely achievable with smaller models. Actually, it might even be easier with a small model. (Though whether the necessary knowledge bases can fit and run on a Mac is another topic...)
And this should be the goal of AI companies, as it's the only long term sustainable approach as far as I can tell.
I say should because it may not be, because if they solve it that way and someone manages to clone their success, then they lose all their moat for specialized areas, as people can create knowledge bases for those areas with know-how OpenAI simply doesn't have access to. (Which would be a preferable outcome, as it means actual competition and a potentially fair working market.)
as a concrete outdated case:
TLS cipher X25519MLKEM768 is recommended to be enabled on servers which do support it
last time I checked, AI didn't even list it when you asked it for a list of TLS 1.3 ciphers (though it has been widely supported since even before it was fully standardized)
this isn't surprising as most input sources AI can use for training are outdated and also don't list it
maybe someone at OpenAI will spot this and feed it explicitly into the next training cycle, or people will cover it more and through that it gets fed in implicitly
but what about all the niche but important information with just a handful of outdated stack overflow posts or similar? (which are unlikely to get updated now that everyone uses AI instead)
The current "lets just train bigger models with more encoded data approach" just doesn't work, it can get you quite far, tho. But then hits a ceiling. And trying to fix it by giving it also additional knowledge "it can ask if it doesn't know" has so far not worked because it reliably doesn't realize it doesn't know if it has enough outdated/incomplete/wrong information encoded in the model. Only by assuring it doesn't have any specialized domain knowledge can you make sure that approach works IMHO.
I think you are underestimating the strength a small model can get from tool use. There may be no substitute for scale, but that scale can live outside of the model and be queried using tools.
In the worst case a smaller model could use a tool that involves a bigger model to do something.
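A minimal sketch of that escalation pattern (both model calls are hypothetical stand-ins for whatever local runtime and remote API you actually use):

    def small_model(prompt: str) -> str:
        raise NotImplementedError("local small-model call goes here")

    def big_model(prompt: str) -> str:
        raise NotImplementedError("remote frontier-model call goes here")

    def answer(question: str) -> str:
        draft = small_model(
            "Answer only if you are confident. If not, reply with exactly ESCALATE.\n\n"
            + question
        )
        if draft.strip() == "ESCALATE":
            return big_model(question)  # pay for scale only when the small model punts
        return draft

Whether a small model can judge its own confidence well enough for this to work is the open question, but the plumbing itself is trivial.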
Small models are bad at tool use. I have liquidai doing it in the browser but it’s super fragile.
I don’t really understand this, but I hear it a lot so I know it’s just confusion on my part.
I’m running little models on a laptop. I have a custom tool service made available to a simple little agent that uses the small models (I’ve used a few). It’s able to search for necessary tool functions and execute them, just fine.
My biggest problem has been the llm choosing not to use tools at all, favoring its ability to guess with training data. And once in a while those guesses are junk.
Is that the problem people refer to when they say that small models have problems with tool use? Or is it something bigger that I wouldn’t have run into yet?