My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)

simonwillison.net

・

569 points

・

simonw

・

3 days ago

413 comments

NitpickLawyer ・ 3 days ago

> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.

Yes, the open-models have surpassed my expectations in both quality and speed of release. For a bit of context, when chatgpt launched in Dec22, the "best" open models were GPT-J(~6-7B) and GPT-neoX (~22B?). I actually had an app running live, with users, using gpt-j for ~1 month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model will follow along) and so on.

And then something happened, LLama models got "leaked" (I still think it was an on-purpose leak - don't sue us, we never meant to release, etc), and the rest is history. With L1 we got lots of optimisations like quantised models, fine-tuning and so on, L2 really saw fine-tuning go off (most of the fine-tunes were better than what meta released), we got alpaca showing off LoRA, and then a bunch of really strong models came out (mistrals, mixtrals, L3, gemmas, qwens, deepseeks, glms, granites, etc.)

By some estimations the open models are ~6mo behind what SotA labs have released. (note that doesn't mean the labs are releasing their best models, it's likely they keep those in house to use for the next runs' data curation, synthetic datasets, distilling, etc). Being 6mo behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2years to reach gpt3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.

genewitch ・ 3 days ago

I'll bite. How do i train/make and/or use LoRA, or, separately, how do i fine-tune? I've been asking this for months, and no one has a decent answer. websearch on my end is seo/geo-spam, with no real instructions.
I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?
- techwizrd ・ 3 days ago
  
  We have been fine-tuning models using Axolotl and Unsloth, with a slight preference for Axolotl. Check out the docs [0] and fine-tune or quantize your first model. There is a lot to be learned in this space, but it's exciting.
  0: https://axolotl.ai/ and https://docs.axolotl.ai/
  
  arkmm ・ 3 days ago
  
  When do you think fine tuning is worth it over prompt engineering a base model?
  I imagine with the finetunes you have to worry about self-hosting, model utilization, and then also retraining the model as new base models come out. I'm curious under what circumstances you've found that the benefits outweigh the downsides.
  
  reissbaker ・ 3 days ago
  
  For self-hosting, there are a few companies that offer per-token pricing for LoRA finetunes (LoRAs are basically efficient-to-train, efficient-to-host finetunes) of certain base models:
  - (shameless plug) My company, Synthetic, supports LoRAs for Llama 3.1 8b and 70b: https://synthetic.new All you need to do is give us the Hugging Face repo and we take care of the rest. If you want other people to try your model, we charge usage to them rather than to you. (We can also host full finetunes of anything vLLM supports, although we charge by GPU-minute for full finetunes rather than the cheaper per-token pricing for supported base model LoRAs.)
  - Together.ai supports a slightly wider number of base models than we do, with a bit more config required, and any usage is charged to you.
  - Fireworks does the same as Together, although they quantize the models more heavily (FP4 for the higher-end models). However, they support Llama 4, which is pretty nice although fairly resource-intensive to train.
  If you have reasonably good data for your task, and your task is relatively "narrow" (i.e. find a specific kind of bug, rather than general-purpose coding; extract a specific kind of data from legal documents rather than general-purpose reasoning about social and legal matters; etc), finetunes of even a very small model like an 8b will typically outperform — by a pretty wide margin — even very large SOTA models while being a lot cheaper to run. For example, if you find yourself hand-coding heuristics to fix some problem you're seeing with an LLM's responses, it's probably more robust to just train a small model finetune on the data and have the finetuned model fix the issues rather than writing hardcoded heuristics. On the other hand, no amount of finetuning will make an 8b model a better general-purpose coding agent than Claude 4 Sonnet.
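
A minimal sketch of why a LoRA is "efficient-to-train, efficient-to-host": the base weight matrix stays frozen and only two small low-rank factors are trained. This is plain Python for illustration, not any library's API; the function names, the 4096x4096 layer size, and the rank of 8 are assumptions chosen for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight seen once the adapter is applied."""
    r = len(a)                      # rank = number of rows of A
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Parameter count of the full matrix versus the two LoRA factors."""
    return d_out * d_in, r * (d_in + d_out)

rng = random.Random(0)
d_out, d_in, r = 6, 4, 2
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weights
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable factor A
b = [[0.0] * r for _ in range(d_out)]                                # trainable factor B, starts at zero

# With B at zero the adapter changes nothing, so training starts from the base model.
assert lora_effective_weight(w, a, b, alpha=16) == w

full, lora = trainable_params(4096, 4096, 8)
print(f"full matrix: {full} params, LoRA factors: {lora} params")
```

Only the two small factors need to be stored and shipped per finetune, which is roughly why hosts can serve many LoRAs on top of one shared base model.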
  
  delijati ・ 3 days ago
  
  Do you maybe know if there is a company in the EU that hosts models (DeepSeek, Qwen3, Kimi)?
  
  reissbaker ・ 3 days ago
  
  Most inference companies (Synthetic included) host in a mix of the U.S. and EU — I don't know of any that promise EU-only hosting, though. Even Mistral doesn't promise EU-only AFAIK, despite being a French company. I think at that point you're probably looking at on-prem hosting, or buying a maxed-out Mac Studio and running the big models quantized to Q4 (although even that couldn't run Kimi: you might be able to get it working over ethernet with two Mac Studios, but the tokens/sec will be pretty rough).
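
A rough back-of-the-envelope check of the Q4 sizing above. The parameter counts (~671B for DeepSeek, ~1000B for Kimi), the ~4.5 bits per weight for a Q4-style format with its scales, and the 512 GB memory ceiling are all assumptions for illustration, and the estimate covers weights only (no KV cache or runtime overhead).

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return params_billions * bits_per_weight / 8

# Assumed, approximate total parameter counts in billions.
MODELS = {"DeepSeek (~671B)": 671, "Kimi (~1000B)": 1000}
UNIFIED_MEMORY_GB = 512   # assumed top Mac Studio configuration
BITS_PER_WEIGHT = 4.5     # assumed effective size of a Q4-style format

for name, params in MODELS.items():
    need = weight_memory_gb(params, BITS_PER_WEIGHT)
    verdict = "fits" if need < UNIFIED_MEMORY_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights, {verdict} in {UNIFIED_MEMORY_GB} GB")
```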
  
  tough ・ 3 days ago
  
  only for narrow applications where your fine tune can let you use a smaller model locally , specialised and trained for your specific use-case mostly
  
  whimsicalism ・ 3 days ago
  
  finetuning rarely makes sense unless you are an enterprise, and even there it generally doesn't in most cases.
  
  syntaxing ・ 3 days ago
  
  What hardware do you train on using axolotl? I use unsloth with Google colab pro
- notpublic ・ 3 days ago
  
  https://github.com/unslothai/unsloth
  I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful.
- qcnguy ・ 3 days ago
  
  LLM fine tuning tends to destroy the model's capabilities if you aren't very careful. It's not as easy or effective as with image generation.
  
  nxobject ・ 2 days ago
  
  My very cursory understanding -- at least from Unsloth's recommendations -- is that you have to work very hard to preserve reasoning/instruct capabilities [1]: for example to "preserve" Qwen3's reasoning capabilities (however that's operationalized), they suggest a fine-tuning corpus that's 75% chain of thought to 25% non-reasoning. Is that a significant issue for orgs/projects that currently rely on fine-tuning?
  [1] https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tun...
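
A sketch of what a 75% reasoning / 25% non-reasoning mix could look like when assembling a fine-tuning set. This is not Unsloth's code; `mix_datasets` and the record layout are hypothetical, and the only thing taken from the comment above is the 75/25 ratio.

```python
import random

def mix_datasets(reasoning, non_reasoning, reasoning_fraction=0.75, seed=0):
    """Build a shuffled training set with the requested share of reasoning examples.

    The total size is limited by whichever pool runs out first at the target ratio.
    """
    if not 0 < reasoning_fraction < 1:
        raise ValueError("reasoning_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Largest total such that both pools can still supply their share.
    total = int(min(len(reasoning) / reasoning_fraction,
                    len(non_reasoning) / (1 - reasoning_fraction)))
    n_reasoning = min(round(total * reasoning_fraction), len(reasoning))
    n_plain = min(total - n_reasoning, len(non_reasoning))
    mixed = rng.sample(reasoning, n_reasoning) + rng.sample(non_reasoning, n_plain)
    rng.shuffle(mixed)
    return mixed

# Hypothetical pools: 300 chain-of-thought examples and 200 plain ones.
cot = [{"kind": "cot", "id": i} for i in range(300)]
plain = [{"kind": "plain", "id": i} for i in range(200)]
mixed = mix_datasets(cot, plain)
share = sum(1 for ex in mixed if ex["kind"] == "cot") / len(mixed)
print(len(mixed), share)
```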
  
  israrkhan ・ 3 days ago
  
  do you have a suggestion or a way to measure if model capabilities are getting destroyed? how does one measure it objectively?
  
  mensetmanusman ・ 2 days ago
  
  These are now questions at the cutting edge of academic research. It might be computationally unknowable until checked.
  
  RALaBarge ・ 3 days ago
  
  Ask it a series of the same questions after you train that you posed before training started. Is the quality lower?
  
  israrkhan ・ 3 days ago
  
  That series of questions will measure only a particular area. I am concerned about destroying model capabilities in some other area that I do not pay attention to, and have no way of knowing.
  
  simonh ・ 3 days ago
  
  Isn’t that a general problem with LLMs? The only way to know how good it is at something is to test it.
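
One way to make the before/after comparison less narrow is to score a fixed question set across several categories, not just the one being fine-tuned for, and flag any category that drops. A minimal sketch, assuming you already have per-category accuracy from your own eval; `find_regressions`, the category names, and the scores are all hypothetical.

```python
def find_regressions(before, after, tolerance=0.02):
    """Return categories whose score dropped by more than `tolerance`.

    `before` and `after` map a category name to accuracy in [0, 1],
    measured on the same fixed question set.
    """
    regressions = {}
    for category, old_score in before.items():
        new_score = after.get(category)
        if new_score is None:
            continue  # category was not re-measured, so it cannot be judged
        drop = old_score - new_score
        if drop > tolerance:
            regressions[category] = round(drop, 4)
    return regressions

# Hypothetical scores before and after a fine-tune aimed at "target_task".
before = {"target_task": 0.61, "math": 0.72, "instruction_following": 0.88, "coding": 0.55}
after = {"target_task": 0.93, "math": 0.71, "instruction_following": 0.64, "coding": 0.54}
print(find_regressions(before, after))
```

This only catches regressions in categories you thought to include, which is exactly the limitation raised above; it narrows the blind spot, it does not remove it.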
- svachalek ・ 3 days ago
  
  For completeness, for Apple hardware MLX is the way to go.
  
  w10-1 ・ 3 days ago
  
  MLX github: https://github.com/ml-explore/mlx
  get started: https://developer.apple.com/videos/play/wwdc2025/315/
  details: https://developer.apple.com/videos/play/wwdc2025/298/
- minimaxir ・ 3 days ago
  
  If you're using Hugging Face transformers, the library you want to use is peft: https://huggingface.co/docs/peft/en/quicktour
  There are Colab Notebook tutorials around training models with it as well.
- otabdeveloper4 ・ 3 days ago
  
  > So what's the big secret about LLM LoRA?
  No clear use case for LLMs yet. ("Spicy" aka pornography finetunes are the only ones with broad adoption, but we don't talk about that in polite society here.)
  
  AlecSchueler ・ 2 days ago
  
  Where do we speak about it? It feels like the biggest use for these models right now is for deep fakes and other harassment but few people in the industry want to talk about it while continuing to enable it.
- jasonjmcghee ・ 3 days ago
  
  brev.dev made an easy to follow guide a while ago but apparently Nvidia took it down or something when they bought them?
  So here's the original
  https://web.archive.org/web/20231127123701/https://brev.dev/...
- electroglyph ・ 3 days ago
  
  unsloth is the easiest way to finetune due to the lower memory requirements
- pdntspa ・ 3 days ago
  
  Have you tried asking an LLM?
Nesco ・ 3 days ago

Zuck wouldn’t have leaked it on 4chan of all the places
- eckelhesten ・ 2 days ago
  
  It got leaked as a PR with a URL to a magnet (torrent) afaik.
- tough ・ 3 days ago
  
  prob just told an employee to get it done no?
- vaenaes ・ 3 days ago
  
  Why not?
tonyhart7 ・ 3 days ago

is GLM 4.5 better than Qwen3 coder??
- diggan ・ 3 days ago
  
  For what? It's really hard to say what model is "generally" better than another, as they're all better/worse at specific things.
  My own benchmark has a bunch of different tasks I use various local models for, and I run it when I wanna see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for what task.
  They're being sold as general purpose things that are better/worse at everything but reality doesn't reflect this, they all have very specific tasks they're worse/better at, and the only way to find that out is by having a private benchmark you run yourself.
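
A minimal sketch of the "markdown table of which model is best for what task" idea. This is not diggan's actual benchmark; the task names, model names, and scores are made up, and how each score gets produced (running the model, grading the output) is left out.

```python
def best_model_table(scores):
    """Render a markdown table naming the top-scoring model for each task.

    `scores` maps task name -> {model name -> score}, where higher is better.
    """
    lines = ["| Task | Best model | Score |", "| --- | --- | --- |"]
    for task in sorted(scores):
        model, score = max(scores[task].items(), key=lambda item: item[1])
        lines.append(f"| {task} | {model} | {score:.2f} |")
    return "\n".join(lines)

# Hypothetical results from a private benchmark run.
scores = {
    "sql generation": {"model-a": 0.81, "model-b": 0.77},
    "commit messages": {"model-a": 0.64, "model-b": 0.90},
}
table = best_model_table(scores)
print(table)
```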
  
  kelvinjps10 ・ 3 days ago
  
  coding? they are coding models? what specific tasks is one performing better than the other?
  
  diggan ・ 3 days ago
  
  They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code", coding isn't one homogeneous activity that one model beats all the other models at.
  > what specific tasks is one performing better than the other?
  That's exactly why you create your own benchmark, so you can figure that out by just having a list of models, instead of testing each individually and basing it on "feels better".
  
  reverius42 ・ 3 days ago
  
  > coding isn't one homogeneous activity that one model beats all the other models at
  If you can't even replace one coding model with another, it's hard to imagine you can replace human coders with coding models.
  
  Philpax ・ 2 days ago
  
  You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?
  
  reverius42 ・ 2 days ago
  
  This was my point -- if programmers are not fungible, how can companies claim to be replacing them by the thousands with AI?
  
  Philpax ・ 2 days ago
  
  You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.
  
  diggan ・ 2 days ago
  
  What do you mean "can't even replace"? You can, nothing in my comment says you cannot?
  
  whimsicalism ・ 3 days ago
  
  glm 4.5 is not a coding model
  
  simonw ・ 3 days ago
  
  It may not be code-only, but it was trained extensively for coding:
  > Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.
  From my notes here: https://simonwillison.net/2025/Jul/28/glm-45/
  
  whimsicalism ・ 3 days ago
  
  yes, all reasoning models currently are, but it’s not like ds coder or qwen coder
  
  simonw ・ 3 days ago
  
  I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.
  Am I missing something?
  
  whimsicalism ・ 3 days ago
  
  I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. And the difference there is that while both do tons of code RL, in one they do not monitor performance on anything else for forgetting/regression and focus fully on code post-training pipelines and it is not meant for other tasks.
- NitpickLawyer ・ 3 days ago
  
  I haven't tried them (released yesterday I think?). The benchmarks look good (similar I'd say) but that's not saying much these days. The best test you can do is have a couple of cases that match your needs, and run them yourself w/ the cradle that you are using (aider, cline, roo, any of the CLI tools, etc). Openrouter usually has them up soon after launch, and you can run a quick test really cheap (and only deal with one provider for billing & stuff).
throwzasdf ・ 2 days ago

[dead]

bob1029 ・ 3 days ago

> still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.

I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.

It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.
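
A toy illustration of cutting weight precision after training: symmetric round-to-nearest quantization with a single scale. This is only a sketch; real formats (GGUF, MLX, etc.) use per-block scales and smarter schemes, and the Gaussian "weights" here are synthetic stand-ins, not taken from any model.

```python
import random

def quantize(weights, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    levels = 2 ** (bits - 1) - 1                          # e.g. 127 for 8-bit, 7 for 4-bit
    scale = (max(abs(w) for w in weights) / levels) or 1.0  # avoid dividing by zero
    codes = [round(w / scale) for w in weights]           # small integers to store
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

rng = random.Random(0)
weights = [rng.gauss(0, 0.02) for _ in range(10_000)]     # synthetic stand-in weights

errors = {}
for bits in (8, 4, 3):
    codes, scale = quantize(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit: mean abs error {errors[bits]:.5f}")
```

The error grows as bits shrink, but each step also cuts storage, which is the trade-off that lets a model fit in laptop memory at all.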

Breza ・ 6 hours ago

Very well put. There's a lot to be gained from using smaller models and existing hardware. So many enterprise PMs skip straight to using a cutting edge LLM via API. There are many tasks where a self-hosted LLM or even a finetuned small language model can either complete a preliminary step or even handle the full task for much less money. And if a self-hosted model can do the job today, imagine what you'll be able to do in a year or five when you have more powerful hardware and even better models.
jonas21 ・ 3 days ago

Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).
- itsalotoffun ・ 3 days ago
  
  I think GP means that if you internalize the bitter lesson (more data more compute wins), you stop imagining how to squeeze SOTA minus 1 performance out of constrained compute environments.
  
  reactordev ・ 3 days ago
  
  This. When we ran out of speed on the CPU, we moved to the GPU. Same thing here. The more we work with (22T) models, quants, and decimating precision - the more we learn and find more novel ways to do things.
- yahoozoo ・ 3 days ago
  
  What does that have to do with quantizing?

righthand ・ 3 days ago

Did you understand the implementation or just that it produced a result?

I would hope an LLM could spit out a cobbled form of answer to a common interview question.

Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why did they not just pipe the JSON into our already working app that displays this data?

People around me for the most part are using LLMs to enhance their presentations, not to actually implement anything useful. I have been watching my coworkers use it that way for months.

Another example? A different coworker wanted to build a document macro to perform bulk updates on courseware content. Swapping old words for new words. To build the macro they first wrote a rubric to prompt an LLM correctly inside of a word doc.

That filled rubric is then used to generate a program template for the macro. To define the requirements for the macro the coworker then used a slideshow slide to list bullet points of functionality, in this case to Find+Replace words in courseware slides/documents using a list of words from another text document. Due to the complexity of the system, I can’t believe my colleague saved any time. The presentation was interesting though and that is what they got compliments on.

However the solutions are absolutely useless for anyone else but the implementer.

simonw ・ 3 days ago

I scanned the code and understood what it was doing, but I didn't spend much time on it once I'd seen that it worked.
If I'm writing code for production systems using LLMs I still review every single line - my personal rule is I need to be able to explain how it works to someone else before I'm willing to commit it.
I wrote a whole lot more about my approach to using LLMs to help write "real" code here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
- photon_lines ・ 3 days ago
  
  This is why I love using the Deep-Seek chain of reason output ... I can actually go through and read what it's 'thinking' to validate whether it's basing its solution on valid facts / assumptions. Either way thanks for all of your valuable write-ups on these models I really appreciate them Simon!
  
  vessenes ・ 3 days ago
  
  Nota bene - there is a fair amount of research that indicates models' outputs and ‘thoughts’ do not necessarily align with their chain of reasoning output.
  You can validate this pretty easily by asking some logic or coding questions: you will likely note that a final output is not necessarily the logical output of the end of the thinking; sometimes significantly orthogonal to it, or returning to reasoning in the middle.
  All that to say - good idea to read it, but stay vigilant on outputs.
  
  Breza ・ 6 hours ago
  
  That's a good note. I use DeepSeek for early planning of a project because of how valuable its reasoning output can be. It's common that I'll describe my problem and first draft architecture and see something in the output like "Since this has to be mobile optimized..." Then I'll stop generation, edit the original prompt to specify that I don't have to worry about mobile, and run it again.
- larodi ・ 3 days ago
  
  I think this is the right way to do it. Produce with LLM, debug and read every line. Delete lots of it.
  Many people fear this approach for production, but it is reasonable compared to someone with a single course in Coursera writing production JS code.
  Yet, we tend to say the LLM wrote this and that, which implies the model did all the work. In reality it should be understood as a complex heavy lifting machine which is expected to be operated by a very well prepared operator.
  The fact I got a Kango and drilled some holes does not make me an engineer, right? And it takes an engineer to sign off the building even though it was archicad doing the math.
- shortrounddev2 ・ 3 days ago
  
  Serious question: if you have to read every line of code in order to validate it in production, why not just write every line of code instead?
  
  simonw ・ 3 days ago
  
  Because it's much, much faster to review a hundred lines of code than it is to write a hundred lines of code.
  (I'm experienced at reading and reviewing code.)
  
  yencabulator ・ 2 days ago
  
  This sounds like a recipe for destructive bugs and security vulnerabilities to slip into production.
  Reviewing is really hard to do well. Like, on a psychological level. Your brain just starts nodding and humming along, pretending to understand. Humans have to consciously "perform review" to actually review. For example, https://en.wikipedia.org/wiki/Pointing_and_calling and checklists in aviation and health care, Tom Gilb's "Inspection" JPL-inspired spec review processes.
  Even HN gets a steady drip of "look at my vibecoded project" -- "umm, you just leaked your API keys".
  It's just that reviewing doesn't matter for a space invaders clone.
  
  simonw ・ 2 days ago
  
  Reviewing isn't nearly as hard if you told the model exactly what to write already: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#t...
  
  paufernandez ・ 3 days ago
  
  Simon, don't you fear "atrophy" in your writing ability?
  
  simonw ・ 3 days ago
  
  I think it will happen a bit, but I'm not worried about it.
  My ability to write with a pen has suffered enormously now that I do most of my writing on a phone or laptop - but I'm writing way more.
  I expect I'll become slower at writing code without an LLM, but the volume of (useful) code I produce will be worth the trade off.
  
  DonHopkins ・ 3 days ago
  
  Reading other people's (or llm's) code is one of the best ways of improving your own coding abilities. Lazy people using llms to avoid reading any code is called "vibe coding", and their abilities atrophy no matter who or what wrote the code they refuse to read.
  
  otabdeveloper4 ・ 3 days ago
  
  Absolutely false for anything but the most braindead corporate CRUD code.
  We hate reading code and will avoid the hassle every time, but that doesn't mean it is easy.
  
  DonHopkins ・ 3 days ago
  
  >We hate reading code and will avoid the hassle every time, but that doesn't mean it is easy.
  Speak for yourself. I love reading code! It's hard and it takes a lot of energy, but if you hate it, maybe you should find something else to do.
  Being a programmer who hates reading code is like being a bus driver who hates looking at the road: dangerous and menacing to the public and your customers.
- th0ma5 ・ 3 days ago
  
  [flagged]
  
  dang ・ 3 days ago
  
  Please don't cross into personal attack in HN comments.
  https://news.ycombinator.com/newsguidelines.html
  Edit: twice is already a pattern - https://news.ycombinator.com/item?id=44110785. No more of this, please.
  Edit 2: I only just realized that you've been frequently posting abusive replies in a way that crosses into harangue if not harassment:
  https://news.ycombinator.com/item?id=44725284 (July 2025)
  https://news.ycombinator.com/item?id=44725227 (July 2025)
  https://news.ycombinator.com/item?id=44725190 (July 2025)
  https://news.ycombinator.com/item?id=44525830 (July 2025)
  https://news.ycombinator.com/item?id=44441154 (July 2025)
  https://news.ycombinator.com/item?id=44110817 (May 2025)
  https://news.ycombinator.com/item?id=44110785 (May 2025)
  https://news.ycombinator.com/item?id=44018000 (May 2025)
  https://news.ycombinator.com/item?id=44008533 (May 2025)
  https://news.ycombinator.com/item?id=43779758 (April 2025)
  https://news.ycombinator.com/item?id=43474204 (March 2025)
  https://news.ycombinator.com/item?id=43465383 (March 2025)
  https://news.ycombinator.com/item?id=42960299 (Feb 2025)
  https://news.ycombinator.com/item?id=42942818 (Feb 2025)
  https://news.ycombinator.com/item?id=42706415 (Jan 2025)
  https://news.ycombinator.com/item?id=42562036 (Dec 2024)
  https://news.ycombinator.com/item?id=42483664 (Dec 2024)
  https://news.ycombinator.com/item?id=42021665 (Nov 2024)
  https://news.ycombinator.com/item?id=41992383 (Oct 2024)
  That's abusive, unacceptable, and not even a complete list!
  You can't go after another user like this on HN, regardless of how right you are or feel you are or who you have a problem with. If you keep doing this, we're going to end up banning you, so please stop now.
  
  SirChud ・ 3 days ago
  
  [flagged]
  
  ajcp ・ 3 days ago
  
  They said "production systems", not "critical production applications".
  Also the 'if' doesn't negate anything as they say "I still", meaning the behavior is actively happening or ongoing; they don't use a hypothetical or conditional after "still", as in "I still would".
  
  bnchrch ・ 3 days ago
  
  You do realize you're talking to the creator of Django, Datasette, and Lanyrd right?
  
  otabdeveloper4 ・ 3 days ago
  
  Offtopic, but Django is really bad and a huge pile of code smell. (Not a Django programmer. I manage them and can compare Django-infected projects to normal projects.)
  
  tough ・ 3 days ago
  
  that made me chuckle
  
  CamperBob2 ・ 3 days ago
  
  I missed the part where he said he was going to put the Space Invaders game into production. Link?
bsder ・ 3 days ago

> However the solutions are absolutely useless for anyone else but the implementer.
Disposable code is where AI shines.
AI generating the boilerplate code for an obtuse build system? Yes, please. AI generating an animation? Ganbatte. (Look at how much work 3Blue1Brown had to put into that--if AI can help that kind of thing, it has my blessings). AI enabling someone who doesn't program to generate some prototype that they can then point at an actual programmer? Excellent.
This is fine because you don't need to understand the result. You have a concrete pass/fail gate and don't care about underneath. This is real value. The problem is that it isn't gigabuck value.
The stuff that would be gigabuck value is unfortunately where AI falls down. Fix this bug in a product. Add this feature to an existing codebase. etc.
AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.
- giantrobot ・ 2 days ago
  
  > AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.
  It's also giving PHBs the ability to hand ill-conceived ideas to a magic robot, receive "code" they can't understand, and throw it into production. All the while firing what real developers they had on staff.
  
  yencabulator ・ 2 days ago
  
  I expect many of those companies to fail in the 3mo-2y timeline, so in many ways I welcome PHBs to embrace their full stupidity. Same for the people who funded them.
  I do feel semi-sorry for anyone who paid for the services by those companies, though. Maybe something good will arise from that too, in the end; for example, it'd be nice if US society taught more critical reading skills to its members.
  The interesting game for the non-PHBs among us is figuring out if/how we can use LLMs in less risky ways, and what all is possible there. For example, I'd love to see work put into LLMs helping with formal correctness of software; there's a hard backstop there where either the proof checks or it doesn't. Code changes needed to enable less-painful proofs would hopefully largely be refactorings, where reviews should be easier and it might even work out to fuzz test that the old and new implementations return matching output for same input. Or similarly, LLM-powered test coverage improver that only writes new tests (old school/branch-based/mutation-based, there's plenty of room there).
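
A minimal sketch of the "fuzz test that the old and new implementations return matching output for same input" idea. The two `dedupe` functions stand in for a pre- and post-refactor version of some routine; they and the input generator are hypothetical, and a real setup would more likely use a property-testing library.

```python
import random

def old_dedupe(items):
    """Original implementation: quadratic, keeps first occurrences in order."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

def new_dedupe(items):
    """Refactored implementation: linear, relies on dicts preserving insertion order."""
    return list(dict.fromkeys(items))

def fuzz_equivalence(old, new, trials=1000, seed=0):
    """Feed both implementations the same random inputs; return the first mismatching input, if any."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        # Pass copies so neither implementation can affect the other's input.
        if old(list(data)) != new(list(data)):
            return data
    return None

counterexample = fuzz_equivalence(old_dedupe, new_dedupe)
print("no mismatch found" if counterexample is None else f"mismatch on {counterexample}")
```

The backstop is mechanical: either the outputs match on every generated input or you get a concrete counterexample to look at, which is much easier to review than a diff.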
magic_hamster ・ 3 days ago

The LLM is the solution.

AlexeyBrin ・ 3 days ago

Most likely its training data included countless Space Invaders in various programming languages.

gblargg ・ 3 days ago

The real test is if you can have it tweak things. Have the ship shoot down. Have the space invaders come from the left and right. Add two player simultaneous mode with two ships.
- wizzwizz4 ・ 3 days ago
  
  It can usually tweak things, if given specific instruction, but it doesn't know when to refactor (and can't reliably preserve functionality when it does), so the program gets further and further away from something sensible until it can't make edits any more.
  
  simonw ・ 3 days ago
  
  For serious projects you can address that by writing (or having it write) unit tests along the way, that way it can run in a loop and avoid breaking existing functionality when it adds new changes.
  
  greesil ・ 3 days ago
  
  Okay ask it to write unit tests for space invaders next time :)
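
For what it's worth, unit tests for a Space Invaders clone are not that exotic if the game logic is kept separate from rendering. A sketch in Python for brevity (the game in the article is JavaScript); `rects_overlap`, `resolve_hits`, and the rectangle layout are hypothetical, not taken from the generated game.

```python
def rects_overlap(a, b):
    """Axis-aligned overlap test; each rect is (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def resolve_hits(bullets, invaders):
    """Return (remaining_bullets, remaining_invaders, hits) after one collision pass."""
    remaining_invaders = list(invaders)
    remaining_bullets = []
    hits = 0
    for bullet in bullets:
        target = next((inv for inv in remaining_invaders if rects_overlap(bullet, inv)), None)
        if target is None:
            remaining_bullets.append(bullet)
        else:
            remaining_invaders.remove(target)  # a bullet destroys at most one invader
            hits += 1
    return remaining_bullets, remaining_invaders, hits

def test_bullet_destroys_overlapping_invader():
    result = resolve_hits([(10, 10, 2, 6)], [(8, 8, 16, 12), (40, 8, 16, 12)])
    assert result == ([], [(40, 8, 16, 12)], 1)

def test_miss_leaves_everything_in_place():
    result = resolve_hits([(100, 10, 2, 6)], [(8, 8, 16, 12)])
    assert result == ([(100, 10, 2, 6)], [(8, 8, 16, 12)], 0)

if __name__ == "__main__":
    test_bullet_destroys_overlapping_invader()
    test_miss_leaves_everything_in_place()
    print("all tests passed")
```

With tests like these in place, an agent asked to "make the ship shoot down" can rerun them after each change and notice when it breaks collision handling.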
quantumHazer ・ 3 days ago

and probably some synthetic data are generated copies of the games already in the dataset?
i have this feeling with LLM-generated react frontends, they all look the same
- tshaddox ・ 3 days ago
  
  To be fair, the human-generated user interfaces all look the same too.
  
- cchance ・ 3 days ago
  
  Have you used the internet? that's how the internet looks, they're all fuckin react and the same layouts and styles 90% shadcn lol
- tw1984 ・ 3 days ago
  
  most human generated methods look the same. in fact, in SWE, we reward people for generating code that looks & feels the same, they call it "work as a team".
- bayindirh ・ 3 days ago
  
  Last time somebody asked for a "premium camera app for iOS", and the model (re)generated Halide.
  Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...
  
  Uehreka ・ 3 days ago
  
  > Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...
  People really need to stop saying this. I get that it was the Smart Guy Thing To Say in 2023, but by this point it’s pretty clear that it’s not true in any way that matters for most practical purposes.
  Coding LLMs have clearly been trained on conversations where a piece of code is shown, a transformation is requested (rewrite this from Python to Go), and then the transformed code is shown. It’s not that they’re just learning codebases, they’re learning what working with code looks like.
  Thus you can ask an LLM to refactor a program in a language it has never seen, and it will “know” what refactoring means, because it has seen it done many times, and it will stand a good chance of doing the right thing.
  That’s why they’re useful. They’re doing something way more sophisticated than just “recombining codebases from their training data”, and anyone chirping 2023 sound bites is going to miss that.
  
  cztomsik ・ 3 days ago
  
  I don't know, I have mixed-bag experiences and it's not really improving. It greatly varies depending on the programming language and the kind of problem which I'm trying to solve.
  The tasks where it works great are things I'd expect to be part of dataset (github, blog posts), or they are "classic" LM tasks (understand + copy-paste/patch). The actual intelligence, in my opinion, is still very limited. So while it's true it's not "just recall" it still might be "mostly recall".
  BTW: Copy-paste is something which works great in any attention-based model. On the other hand, models like RWKV usually fail and are not suited for this IMHO (but I think they have much better potential for the AGI)
  
  yencabulator ・ 2 days ago
  
  > It’s not that they’re just learning codebases, they’re learning what working with code looks like.
  Working in any not-in-training-set environment very quickly shows the shortcomings of this belief.
  For example, Cloudflare Workers is V8 but it sure ain't Node, and the local sqlite in a Durable Object has a sync API with very different guarantees than a typical client-server SQL setup.
  Even in a more standard setting, it's really hard to even get an LLM to use the current-stable APIs when its training data contains now-deprecated examples. Your local rules, llms.txt mentions, corrections etc slip out of the context pretty fast and it goes back to trained data.
  The LLM can perhaps "read any code" but it really really prefers writing only code that was in its training set.
  
  FeepingCreature ・ 3 days ago
  ・ 11 more
  
  True where trivial; where nontrivial, false.
  Trivially, humans don't emit something they don't know either. You don't spontaneously figure out JavaScript from first principles, you put together your existing knowledge into new shapes.
  Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times. Will it be put together from smaller fragments? Yes, this is called "experience" or if the fragments are small enough, "understanding".
  
  phkahler ・ 3 days ago
  ・ 2 more
  
  >> Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times.
  I think most people writing software today are reinventing a wheel, even in corporate environments for internal tools. Everyone wants their own tweak or thinks their idea is unique and nobody wants to share code publicly, so everyone pays programmers to develop buggy bespoke custom versions of the same stuff that's been done 100 times before.
  I guess what I'm saying is that your requirements are probably not new, and to the extent they are yes an LLM can fill in the blanks due to its fluency in languages.
  
  FeepingCreature ・ 2 days ago
  
  Nothing is truly and completely new. I'm not formulating my requirements in an extinct language. My point is "filling in the blanks" and "do new things" are a spectrum.
  LLMs have their limits, but they really can understand and productively contribute to programs that achieve a purpose that no program on the internet has done yet. What they are doing is not interpolation at the highest level. It may be interpolation/extrapolation at a lower level, but this goes for any skill learnt by anyone ever.
  
  bayindirh ・ 3 days ago
  ・ 8 more
  
  Humans can observe ants and invent ant colony optimization. AIs can’t.
  Humans can explore what they don’t know. AIs can’t.
  
  falcor84 ・ 3 days ago
  ・ 3 more
  
  What makes you categorically say that "AIs can't"?
  Based on my experience with present day AIs, I personally wouldn't be surprised at all if you showed Gemini 2.5 Pro a video of an insect colony, asked it "Take a look at the way they organize and see if that gives you inspiration for an optimization algorithm", and it spat something interesting out.
  
  sarchertech ・ 3 days ago
  ・ 2 more
  
  It will 100% have something in its training set discussing a human doing this and will almost definitely spit out something similar.
  
  fc417fc802 ・ 3 days ago
  
  That's a good point but all it means is that we can't test the hypothesis one way or the other due to never being entirely certain that a given task isn't anywhere in the training data. Supposing that "AIs can't" is then just as invalid as supposing that "AIs can".
  
  FeepingCreature ・ 3 days ago
  
  What makes you categorically say that "humans can"?
  I couldn't do that with an ant colony. I would have to train on ant research first.
  (Oh, and AIs can absolutely explore what they don't know. Watch a Claude Code instance look at a new repository. Exploration is a convergent skill in long-horizon RL.)
  
  ben_w ・ 3 days ago
  
  > Humans can observe ants and invent ant colony optimization. AIs can’t.
  Surely this is exactly what current AI do? Observe stuff and apply that observation? Isn't this the exact criticism, that they aren't inventing ant colonies from first principles without ever seeing one?
  > Humans can explore what they don’t know. AIs can’t.
  We only learned to decode Egyptian hieroglyphs because of the Rosetta Stone. There's no translation for North Sentinelese, the Voynich manuscript, or Linear A.
  We're not magic.
  
  CamperBob2 ・ 3 days ago
  
  That's what benchmarks like ARC-AGI are designed to test. The models are getting better at it, and you aren't.
  Nothing ultimately matters in this business except the first couple of time derivatives.
  
  numpad0 ・ 3 days ago
  
  humans also eat
  
  satvikpendem ・ 3 days ago
  ・ 8 more
  
  This doesn't make sense thermodynamically because models are far smaller than the training data they purport to hold and recall, so there must be some level of "understanding" going on. Whether that's the same as human understanding is a different matter.
  
  Eggpants ・ 3 days ago
  ・ 7 more
  
  It’s a lossy text compression technique. It’s clever applied statistics. Basically an advanced association rules algorithm which has been around for decades but modified to consider order and relative positions.
  There is no understanding, regardless of the wants of all the capital investors in this domain.
  
  simonw ・ 3 days ago
  ・ 4 more
  
  I don't care if it can "understand" anything, as long as I can use it to achieve useful things.
  
  Eggpants ・ 3 days ago
  ・ 3 more
  
  “useful things“ like poorly drawing birds on bikes? ;)
  (I have much respect for what you have done and are currently doing, but you did walk right into that one)
  
  msephton ・ 3 days ago
  ・ 2 more
  
  The pelican on a bicycle is a very useful test.
  
  dfedbeef ・ 2 days ago
  
  Yeah what if you need a drawing of a pelican on a bicycle
  
  CamperBob2 ・ 3 days ago
  
  > It’s a lossy text compression technique.
  That is a much, much bigger deal than you make it sound like.
  Compression may, in fact, be all we need. For that matter, it may be all there is.
  
  undefined ・ 3 days ago
  
  [deleted]
  
  mr_toad ・ 3 days ago
  ・ 4 more
  
  > They remix and rewrite what they know. There's no invention, just recall...
  If they only recalled they wouldn’t “hallucinate”. What’s a lie if not an invention? So clearly they can come up with data that they weren’t trained on, for better or worse.
  
  0x457 ・ 3 days ago
  ・ 3 more
  
  Because internally, there isn't a difference between a correctly "recalled" token and an incorrectly recalled (hallucinated) one.
  
  pbhjpbhj ・ 2 days ago
  ・ 2 more
  
  Depends on the training? If there was eg RLHF then those connections are stronger and more likely; that's a difference (but not a category difference).
  
  0x457 ・ 5 hours ago
  
  Yes, but I thought we're talking about category difference.
  Proper RLHF surely boosts "predicted next token until it couldn't" to feel more like "actually recalled".
NitpickLawyer ・ 3 days ago

This comment is ~3 years late. Every model since gpt3 has had the entirety of available code in its training data. That's not a gotcha anymore.
We went from chatgpt's "oh, look, it looks like python code but everything is wrong" to "here's a full stack boilerplate app that does what you asked and works in 0-shot" inside 2 years. That's the kicker. And the sauce isn't just in the training set, models now do post-training and RL and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (first models were 2/4k max), agentic stuff, and so on.
These kinds of comments are really missing the point.
- haar ・ 3 days ago
  
  I've had little success with Agentic coding, and what success I have had has been paired with hours of frustration, where I'd have been better off doing it myself for anything but the most basic tasks.
  Even then, when you start to build up complexity within a codebase - the results have often been worse than "I'll start generating it all from scratch again, and include this as an addition to the initial longtail specification prompt as well", and even then... it's been a crapshoot.
  I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.
  
  aschobel ・ 3 days ago
  ・ 42 more
  
  Bingo, it's magical but the learning curve is very very steep. The METR study on open-source productivity alluded to this a bit.
  I am definitely at a point where I am more productive with it, but it took a bunch of effort.
  
  haar ・ 3 days ago
  ・ 2 more
  
  Apologies if I was unclear.
  The more I've used it, the more I've disliked how poor its results are, and the more I've realised I would have been better served by doing it myself and following a methodical path for things that I didn't have experience with.
  It's easier to step through a problem as I'm learning and making small changes than an LLM going "It's done, and production ready!" where it just straight up doesn't work for 101 different tiny reasons.
  
  airspresso ・ 3 days ago
  
  My preferred approach to avoid that outcome is to divide & conquer the problem. Ask the LLM to implement each small bit in the order you'd implement it yourself given what you know about the codebase.
  
  devmor ・ 3 days ago
  ・ 39 more
  
  The subjects in the study you are referencing also believed that they were more productive with it. What metrics do you have to convince yourself you aren't under the same illusionary bias they were?
  
  simonw ・ 3 days ago
  ・ 38 more
  
  Yesterday I used ffmpeg to extract the frame at the 13 second mark of a video out as a JPEG.
  If I didn't have an LLM to figure that out for me I wouldn't have done it at all.
  
  throwworhtthrow ・ 3 days ago
  ・ 13 more
  
  LLMs still give subpar results with ffmpeg. For example when I asked Sonnet to trim a long video with ffmpeg, it put the input file parameter before the start time parameter, which triggers an unnecessary decode of the video file. [1]
  Sure, use the LLM to get over the initial hump. But ffmpeg's no exception to the rule that LLMs produce subpar code. It's worth spending a couple minutes reading the docs to understand what it did so you can do it better, and unassisted, next time.
  [1] https://ffmpeg.org/ffmpeg.html#:~:text=ss%20position
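A minimal sketch of the ordering issue, in Python purely for illustration. Per the ffmpeg documentation linked above, `-ss` placed before `-i` is an input option (ffmpeg seeks in the input), while `-ss` placed after `-i` is an output option (ffmpeg decodes and discards frames until it reaches the position). The helper name `trim_command` and the file names are hypothetical; the sketch only builds the argument lists and does not run ffmpeg.

```python
import shlex


def trim_command(src, dst, start, duration, input_seek=True):
    """Build an ffmpeg argument list that keeps `duration` seconds from `start`.

    input_seek=True  puts -ss before -i (seek in the input file).
    input_seek=False puts -ss after -i (decode and discard up to `start`).
    """
    seek = ["-ss", str(start)]
    cmd = ["ffmpeg"]
    if input_seek:
        cmd += seek + ["-i", src]
    else:
        cmd += ["-i", src] + seek
    cmd += ["-t", str(duration), dst]
    return cmd


# Hypothetical file names; both commands describe the same 30 second clip.
fast = trim_command("input.mp4", "clip.mp4", 90, 30, input_seek=True)
slow = trim_command("input.mp4", "clip.mp4", 90, 30, input_seek=False)
print(shlex.join(fast))
print(shlex.join(slow))
```

The two lists differ only in where `-ss` sits relative to `-i`, which is exactly the detail that is easy to miss when copying a generated command.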
  
  CamperBob2 ・ 3 days ago
  ・ 12 more
  
  That says more about suboptimal design on ffmpeg's part than it does about the LLM. Most humans can't deal with ffmpeg command lines, so it's not surprising that the LLM misses a few tricks.
  
  nottorp ・ 3 days ago
  ・ 11 more
  
  Had an LLM generate 3 lines of working C++ code that was "only" one order of magnitude slower than what I edited the code to in 10 minutes.
  If you're happy with results like that, sure, LLMs miss "a few tricks"...
  
  ben_w ・ 3 days ago
  ・ 10 more
  
  You don't have to leave LLM code alone, it's fine to change it — unless, I guess, you're doing some kind of LLM vibe-code-golfing?
  But this does remind me of a previous co-worker. Wrote something to convert from a custom data store to a database, his version took 20 minutes on some inputs. Swore it couldn't possibly be improved. Obviously ridiculous because it didn't take 20 minutes to load from the old data store, nor to load from the new database. Over the next few hours of looking at very mediocre code, I realised it was doing an unnecessary O(n^2) check, confirmed with the CTO it wasn't business-critical, got rid of it, and the same conversion on the same data ran in something like 200ms.
  Over a decade before LLMs.
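The comment doesn't say what the check actually was, so the following is a purely hypothetical Python illustration of how an accidental O(n^2) check tends to look, next to a linear version that returns the same result. The function names and the duplicate-detection task are assumptions made for the example, and the linear version assumes the records are hashable.

```python
def find_duplicates_quadratic(records):
    """Compare each record with every earlier one: O(n^2) comparisons."""
    dupes = []
    for i, rec in enumerate(records):
        for earlier in records[:i]:
            if rec == earlier:
                dupes.append(rec)
                break
    return dupes


def find_duplicates_linear(records):
    """Same result using a set of records seen so far: O(n) on average."""
    seen = set()
    dupes = []
    for rec in records:
        if rec in seen:
            dupes.append(rec)
        else:
            seen.add(rec)
    return dupes


sample = ["a", "b", "a", "c", "b", "a"]
print(find_duplicates_quadratic(sample))
print(find_duplicates_linear(sample))
```

In the story above the fix was simply deleting the check once it was confirmed to be unnecessary; the sketch only shows why such a loop can dominate the runtime on large inputs.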
  
  nottorp ・ 3 days ago
  ・ 9 more
  
  We all do that, sometimes where it’s time critical sometimes where it isn’t.
  But I keep being told “AI” is the second coming of Ahura Mazda so it shouldn’t do stuff like that right?
  
  ben_w ・ 3 days ago
  ・ 6 more
  
  > Ahura Mazda
  Niche reference, I like it.
  But… I only hear of scammers who say, and psychosis sufferers who think, LLMs are *already* that competent.
  Future AI? Sure, lots of sane-seeming people also think it could go far beyond us. Special purpose ones have in very narrow domains. But current LLMs are only good enough to be useful and potentially economically disruptive, they're not even close to wildly superhuman like Stockfish is.
  
  CamperBob2 ・ 3 days ago
  ・ 5 more
  
  Sure. If you ask ChatGPT to play chess, it will put up an amateur-level effort at best. Stockfish will indeed wipe the floor with it. But what happens when you ask Stockfish to write a Space Invaders game?
  ChatGPT will get better at chess over time. Stockfish will not get better at anything except chess. That's kind of a big difference.
  
  ben_w ・ 3 days ago
  ・ 4 more
  
  > ChatGPT will get better at chess over time
  Oddly, LLMs got worse at specifically chess: https://dynomight.net/chess/
  But even to the general point, there's absolutely no agreement how much better the current architectures can ultimately get, nor how quickly they can get there.
  Do they have potential for unbounded improvements, albeit at exponential cost for each linear incremental improvement? Or will they asymptotically approach someone with 5 years experience, 10 years experience, a lifetime of experience, or a higher level than any human?
  If I had to bet, I'd say current models have an asymptotic growth converging to a merely "ok" performance; and separately claim that even if they're actually unbounded with exponential cost for linear returns, we can't afford the training cost needed to make them act like someone with even just 6 years professional experience in any given subject.
  Which is still a lot. Especially as it would be acting like it had about as much experience in every other subject at the same time. Just… not a literal Ahura Mazda.
  
  CamperBob2 ・ 3 days ago
  ・ 3 more
  
  > If I had to bet, I'd say current models have an asymptotic growth converging to a merely "ok" performance
  (Shrug) People with actual money to spend are betting twelve figures that you're wrong.
  Should be fun to watch it shake out from up here in the cheap seats.
  
  ben_w ・ 3 days ago
  
  Nah, trillion dollars is about right for "ok". Percentage point of the global economy in cost, automate 2 percent and get a huge margin. We literally set more than that on actual fire each year.
  For "pretty good", it would be worth 14 figures, over two years. The global GDP is 14 figures. Even if this only automated 10% of the economy, it pays for itself after a decade.
  For "Ahura Mazda", it would easily be worth 16 figures, what with that being the principal God and god of the sky in Zoroastrianism, and the only reason it stops at 16 is the implausibility of people staying organised for longer to get it done.
  
  nottorp ・ 3 days ago
  
  > People with actual money to spend are betting
  ... but those "people with actual money to spend" have burned money on fads before. Including on "AI", several times before the current hysterics.
  If you're a good actor/psychologist, it's probably a good business model to figure out how to get VC money and how to justify your startup failing so they give you money for the next startup.
  
  CamperBob2 ・ 3 days ago
  ・ 2 more
  
  "I'm taking this talking dog right back to the pound. It told me to short NVDA, and you should see the buffer overflow bugs in the C++ code it wrote. Totally overhyped. I don't get it."
  
  nottorp ・ 3 days ago
  
  "We hear you have been calling our deity a talking dog. Please enter the red door for reeducation."
  
  dingnuts ・ 3 days ago
  ・ 10 more
  
  It is nice to use LLMs to generate ffmpeg commands, because those can be pretty tricky, but really, you wouldn't have just used the man page before?
  That explains a lot about Django that the author is allergic to man pages lol
  
  ben_w ・ 3 days ago
  ・ 4 more
  
  I remember when I was a kid, people asking a teacher how to spell a word, and the answer was generally "look it up in a dictionary"… which you can only do if you already have a shortlist of possible spellings.
  *nix man pages are the same: if you already know which tool can solve your problem, they're easy to use. But you have to already have a shortlist of tools that can solve your problem, before you even know which man pages to read.
  
  adastra22 ・ 3 days ago
  
  That’s what GNU info is for, of course.
  
  082349872349872 ・ 2 days ago
  ・ 2 more
  
  man -k (or apropos)
  
  ben_w ・ 2 days ago
  
  `apropos` would itself be an example of a *nix tool that I didn't know existed and therefore wouldn't have known to find out more about.
  
  simonw ・ 3 days ago
  ・ 5 more
  
  I just took a look, and the man page DOES explain how to do that!
  ... on line 3,218: https://gist.github.com/simonw/6fc05ea7392c5fb8a5621d65e0ed0...
  (I am very confident I am not the only person who has been deterred by ffmpeg's legendarily complex command-line interface. I feel no shame about this at all.)
  
  lexh ・ 3 days ago
  ・ 2 more
  
  To be a little more fair... that example is tidily slotted into the EXAMPLES section, under the heading "You can extract images from a video, or create a video from many images".
  I don't think most people read the man pages top to bottom. And even if they did, then for as much grief as you're giving ffmpeg, llm has an even larger burden... no man page and the docs weigh in at over 8k lines.
  I get the general point that ffmpeg is a powerful, complex tool... but this is a weird fight to pick.
  
  simonw ・ 3 days ago
  
  I could not be more confident that "ffmpeg is difficult to figure out" is not a weird fight to pick. It's notorious!
  
  quesera ・ 3 days ago
  
  Ffmpeg is genuinely complicated! And the CLI is convoluted (in justifiable, and unfortunate ways).
  But if you approach ffmpeg from the perspective of "I know this is possible", you are always correct, and can almost always reach the "how" in a handful of minutes.
  Whether that's worth it or not, will vary. :)
  
  otabdeveloper4 ・ 3 days ago
  
  The correct solution here would have been to feed the man page to an LLM summarizer.
  Alas instead of correct and easy solutions to problems we are focused on sci-fi robot assistant bullshit.
  
  devmor ・ 3 days ago
  ・ 11 more
  
  You wouldn't have just typed "extract frame at timestamp as jpeg ffmpeg" into Google and used the StackExchange result that comes up first that gives you a command to do exactly that?
  
  simonw ・ 3 days ago
  ・ 10 more
  
  Before LLMs made ffmpeg no-longer-frustrating-to-use I genuinely didn't know that ffmpeg COULD do things like that.
  
  devmor ・ 3 days ago
  ・ 9 more
  
  I'm not really sure what you're saying an LLM did in this case. Inspired a lost sense of curiosity?
  
  simonw ・ 3 days ago
  ・ 2 more
  
  My general point is that people say things like "yeah, but this one study showed that programmers over-estimate the productivity gain they get from LLMs so how can you really be sure?"
  Meanwhile I've spent the past two years constantly building and implementing things I never would have done because of the reduction in friction LLM assistance gives me.
  I wrote about this first two years ago - AI-enhanced development makes me more ambitious with my projects - https://simonwillison.net/2023/Mar/27/ai-enhanced-developmen... - when I realized I was hacking on things with tech like AppleScript and jq that I'd previously avoided.
  It's hard to measure the productivity boost you get from "wouldn't have built that thing" to "actually built that thing".
  
  aschobel ・ 2 days ago
  
  "You can just do things".
  Agreed on all fronts. jq and AppleScript are a total syntax mystery to me, but now I use them all the time since Claude Code has figured them out.
  It's so powerful knowing the shape of a solution and not having to care about the details.
  
  Philpax ・ 3 days ago
  ・ 5 more
  
  Translated a vague natural language query ("cli, extract frame 13s into video") into something immediately actionable with specific examples and explanations, surfacing information that I would otherwise not know how to search for.
  That's what I've done with my ffmpeg LLM queries, anyway - can't speak for simonw!
  
  wizzwizz4 ・ 3 days ago
  ・ 4 more
  
  DuckDuckGo search results for "cli, extract frame 13s into video" (no quotes):
  • https://stackoverflow.com/questions/10957412/fastest-way-to-...
  • https://superuser.com/questions/984850/linux-how-to-extract-...
  • https://www.aleksandrhovhannisyan.com/notes/video-cli-cheat-...
  • https://www.baeldung.com/linux/ffmpeg-extract-video-frames
  • https://ottverse.com/extract-frames-using-ffmpeg-a-comprehen...
  Search engines have been able to translate "vague natural language queries" into search results for a decade, now. This pre-existing infrastructure accounts for the vast majority of ChatGPT's apparent ability to find answers.
  
  stelonix ・ 3 days ago
  ・ 3 more
  
  Yet the interface is fundamentally different, the output feels much more like bro pages[0] and it's within a click of clipboarding, one CTRL V away from extracting the 13th second screenshot. I've been using Google the past 24 years and my google-fu has always left people amazed; yet I can no longer bother to go through Stack Exchange's results when an LLM not only spits it out so nicely, but also does the equivalent of an explainshell[1].
  Not comparable and I fail to see why going through Google's ads/results would be better?
  [0] https://github.com/pombadev/bropages
  [1] https://github.com/idank/explainshell
  
  wizzwizz4 ・ 3 days ago
  ・ 2 more
  
  DuckDuckGo insists on shoving "AI Assist" entries in its results, so I have a reasonable idea of how often LLMs are completely wrong even given search results. The answer's still "more than one time in five".
  I did not suggest using Google Search (the company's on record as deliberately making Google Search worse), but there are other search engines. My preferred search engines don't do the fancy "interpret natural language queries" pre-processing, because I'm quite good at doing that in my head and often want to research niche stuff, but there are many still-decent search engines that do, and don't have ads in the results.
  Heck, you can even pay for a good search engine! And you can have it redirect you to the relevant section of the top search result automatically: Google used to call this "I'm feeling lucky!" (although it was before URI text fragments, so it would just send you to the top of the page). All the properties you're after, much more cheaply, and you keep the information about provenance, and your answer is more-reliably accurate.
  
  delian66 ・ 2 days ago
  
  > Heck, you can even pay for a good search engine!
  Can you recommend one?
  
  0x457 ・ 3 days ago
  
  LLM somewhat understood ffmpeg documentation? Not sure what is not clear here.
  
  dfedbeef ・ 2 days ago
  ・ 3 more
  
  Was the answer:
  ffmpeg -ss 00:00:13.00 -i myvideo.avi -frames:v 1 myimage.jpeg
  Because this is on stack overflow and it took maybe one second to find.
  I've found reading the man page for a tool is usually a better way to learn what a tool can do for you now and also in the future.
  
  kamranjon ・ 2 days ago
  ・ 2 more
  
  This is the rub for me… people are so quick to forget the original source for a lot of the data these models were trained on, and how easy and useful these platforms were. Now Google will summarize this question for you in an AI overview before you even land on Stack Overflow. It’s killing the network effect of the open web and destroying our crowd sourced platforms in favor of a lossy compression algorithm that will eventually be regurgitating its own entrails.
  
  dfedbeef ・ 2 days ago
  
  Well, maybe. People will just stop using them and will make fun of people who do. You can only bullshit people for so long.
- jan_Sate ・ 3 days ago
  
  Not exactly. The real utility value of LLM for programming is to come up with something new. For Space Invaders, instead of using LLM for that, I might as well just manually search for the code online and use that.
  To show that LLM actually can provide value for one-shot programming, you need to find a problem for which there's no fully working sample code available online. I'm not trying to say that LLM couldn't do that. But just because LLM can come up with a perfectly-working Space Invaders doesn't mean that it could do that.
  
  devmor ・ 3 days ago
  ・ 9 more
  
  > The real utility value of LLM for programming is to come up with something new.
  That's the goal for these projects anyways. I don't know that it's true or feasible. I find the RAG models much more interesting myself, I see the technology as having far more value in search than generation.
  Rather than write some markov-chain reminiscent frankenstein function when I ask it how to solve a problem, I would like to see it direct me to the original sources it would use to build those tokens, so that I can see their implementations in context and use my judgement.
  
  simonw ・ 3 days ago
  ・ 8 more
  
  "I would like to see it direct me to the original sources it would use to build those tokens"
  Sadly that's not feasible with transformer-based LLMs: those original sources are long gone by the time you actually get to use the model, scrambled a billion times into a trained set of weights.
  One thing that helped me understand this is understanding that every single token output by an LLM is the result of a calculation that considers all X billion parameters that are baked into that model (or a subset of that in the case of MoE models, but it's still billions of floating point calculations for every token.)
  You can get an imitation of that if you tell the model "use your search tool and find example code for this problem and build new code based on that", but that's a pretty unconventional way to use a model. A key component of the value of these things is that they can spit out completely new code based on the statistical patterns they learned through training.
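As a toy illustration of the point about every output token being the result of arithmetic over the weights (and not a lookup into stored source documents), here is a minimal Python sketch. Everything in it is hypothetical: the four-word vocabulary, the weight values, and the three-number "hidden state" are made up, and a real transformer has many layers and billions of parameters.

```python
import math

# Hypothetical toy vocabulary and "parameters": one row of weights per token.
VOCAB = ["space", "invaders", "pelican", "bicycle"]
WEIGHTS = [
    [0.2, -0.1, 0.4],
    [0.9, 0.3, -0.2],
    [-0.5, 0.8, 0.1],
    [0.0, -0.3, 0.6],
]


def next_token_distribution(hidden):
    """Every weight contributes to the logits; softmax turns them into probabilities."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in WEIGHTS]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up hidden state standing in for "everything the model has read so far".
probs = next_token_distribution([1.0, 0.5, -1.0])
best = max(range(len(VOCAB)), key=probs.__getitem__)
print(VOCAB[best], round(probs[best], 3))
```

Nothing in the computation refers back to a training document: the only inputs are the current state and the weights, which is why the model cannot point to "the original source" of a given token.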
  
  devmor ・ 3 days ago
  ・ 7 more
  
  I am aware, and that's exactly why I don't think they're anywhere near as useful for this type of work as the people pushing them want them to be.
  I tried to push for this type of model when an org I worked with over a decade ago was first exploring using the first generation of Tensorflow to drive customer service chatbots and was sadly ignored.
  
  simonw ・ 3 days ago
  ・ 6 more
  
  I don't understand. For code, why would I want to remix existing code snippets?
  I totally get the value of RAG style patterns for information retrieval against factual information - for those I don't want the LLM to answer my question directly, I want it to run a search and show me a citation and directly quote a credible source as part of answering.
  For code I just want code that works - I can test it myself to make sure it does what it's supposed to.
  
  devmor ・ 3 days ago
  ・ 5 more
  
  > I don't understand. For code, why would I want to remix existing code snippets?
  That is what you're doing already. You're just relying on a vector compression and search engine to hide it from you and hoping the output is what you expect, instead of having it direct you to where it remixed those snippets from so you can see how they work to start with and make sure it's properly implemented from the get-go.
  We all want code that works, but understanding that code is a critical part of that for anything but a throw-away one time use script.
  I don't really get this desire to replace critical thought with hoping and testing. It sounds like the pipe dream of a middle manager, not a tool for a programmer.
  
  stavros ・ 3 days ago
  ・ 4 more
  
  I don't understand your point. You seem to be saying that we should be getting code from the source, then adapting it to our project ourselves, instead of getting adapted code to begin with.
  I'm going to review the code anyway, why would I not want to save myself some of the work? I can "see how they work" after the LLM gives them to me just fine.
  
  devmor ・ 3 days ago
  ・ 3 more
  
  The work that you are "saving" is the work of using your brain to determine the solution to the problem. Whatever the LLM gives you doesn't have a context it is used in other than your prompt - you don't even know what it does until after you evaluate it.
  If you instead have a set of sources related to your problem, they immediately come with context, usage and in many cases, developer notes and even change history to show you mistakes and adaptations.
  You're ultimately creating more work for yourself* by trying to avoid work, and possibly ending up with an inferior solution in the process. Where is your sense of efficiency? Where is your pride as an intellectual?
  * Yes, you are most likely creating more work for yourself even if you think you are capable of telling otherwise. [1]
  1. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
  
  simonw ・ 3 days ago
  
  It sounds like you care deeply about learning as much as you can. I care about that too.
  I would encourage you to consider that even LLM-generated code can teach you a ton of useful new things.
  Go read the source code for my dumb, zero-effort space invaders clone: https://github.com/simonw/tools/blob/main/space-invaders-GLM...
  There's a bunch of useful lessons to be picked up even from that!
  - Examples of CSS gradients, box shadows and flexbox layout
  - CSS keyframe animation
  - How to implement keyboard events in JavaScript
  - A simple but effective pattern for game loops against a Canvas element, using requestAnimationFrame
  - How to implement basic collision detection
  If you've written games like this before these may not be new to you, but I found them pretty interesting.
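For readers who haven't met the last item before, here is a minimal sketch of basic collision detection using axis-aligned bounding boxes, written in Python and not taken from the linked file (which is JavaScript and may well do it differently). The `Rect` class, the `step` function and the coordinates are all hypothetical; `step` stands in for the body of one frame of a game loop.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


def overlaps(a: Rect, b: Rect) -> bool:
    """Axis-aligned bounding-box test: true when the boxes overlap on both axes."""
    return (a.x < b.x + b.w and a.x + a.w > b.x and
            a.y < b.y + b.h and a.y + a.h > b.y)


def step(bullets, invaders, dy=-5.0):
    """One frame: move bullets up, then remove each bullet/invader pair that collides."""
    for b in bullets:
        b.y += dy
    hit_bullets, hit_invaders = set(), set()
    for i, b in enumerate(bullets):
        for j, inv in enumerate(invaders):
            if j not in hit_invaders and overlaps(b, inv):
                hit_bullets.add(i)
                hit_invaders.add(j)
                break  # a bullet destroys at most one invader
    bullets[:] = [b for i, b in enumerate(bullets) if i not in hit_bullets]
    invaders[:] = [v for j, v in enumerate(invaders) if j not in hit_invaders]
    return len(hit_invaders)


bullets = [Rect(12, 40, 2, 6)]
invaders = [Rect(10, 30, 20, 10), Rect(50, 30, 20, 10)]
hits = step(bullets, invaders)
print(hits, len(bullets), len(invaders))
```

In a browser game the equivalent of `step` would typically be called once per frame from a `requestAnimationFrame` callback, followed by redrawing the canvas.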
  
  stavros ・ 3 days ago
  
  Thanks for the concern, but I'm perfectly able to judge for myself whether I'm creating more work or delivering an inferior product.
  
  tracker1 ・ 3 days ago
  ・ 3 more
  
  I have a friend who has been doing just that... usually with his company he manages a handful of projects where a bulk of the development is outsourced overseas. This past year, he's outpaced the 6 devs he's had working on misc projects just with his own efforts and AI. Most of this being a relatively unique combination of UX with features that are less common.
  He's using AI with note taking apps for meetings to enhance notes and flesh out technology ideas at a higher level, then refining those ideas into working experiments.
  It's actually impressive to see. My personal experience has been far more disappointing to say the least. I can't speak to the code quality, consistency or even structure in terms of most people being able to maintain such applications though. I've asked to shadow him through a few of his vibe coding sessions to see his workflow. It feels rather alien to me, again my experience is much more disappointing in having to correct AI errors.
  
  nottorp ・ 3 days ago
  ・ 2 more
  
  Is this the same person who posted about launching 17 "products" in one year a few days ago on HN? :)
  
  tracker1 ・ 3 days ago
  
  No, he's been working on building a larger eLearning solution with some interesting workflow analytics around courseware evaluation and grading. He's been involved in some of the newer LRS specifications and some implementation details to bridge training as well as real world exposure scenarios. Working a lot with first responders, incident response training etc.
  I've worked with him off and on for years from simulating aircraft diagnostics hardware to incident command simulation and setting up core infrastructure for F100 learning management backends.
- AlexeyBrin ・ 3 days ago
  
  You are reading too much into my comment. My point was that the test (a Space Invaders clone) used to assess the model has been irrelevant for some time now. I could have gotten a similar result with Mistral Small a few months ago.
- stolencode ・ 3 days ago
  
  It's amazing that none of you even try to falsify your claims anymore. You can literally just put some of the code in a search engine and find the prior art example:
  https://www.web-leb.com/en/code/2108
  Your "AI tools" are just "copyright whitewashing machines."
  These kinds of comments are really ignoring reality.
- MyOutfitIsVague ・ 3 days ago
  
  I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).
  It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.
  Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.
  For a heavily explored space, it's like being impressed that your 2.5 year old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.
  
  NitpickLawyer ・ 3 days ago
  
  > because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated
  I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with gpt3. The models have advanced from that, and can now "generalise" over the context in ways they could not do ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though they never saw that particular scripting language anywhere in their training set. Try it. You'll be surprised.
- jayd16 ・ 3 days ago
  
  I think you're missing the point.
  Showing off moderately complicated results that are actually not indicative of performance because they are sniped by the training data turns this from a cool demo to a parlor trick.
  Stating that, aha, joke's on you, that's the status quo, is an even bigger indictment.
- Aurornis ・ 3 days ago
  
  > These kinds of comments are really missing the point.
  I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.
  Asking them to produce novel output that doesn’t match the training set produces very different results.
  When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.
  It shows the reality of working with LLMs and it’s an important consideration.
phkahler ・ 3 days ago

I find the visual similarity to breakout kind of interesting.
Conflonto ・ 3 days ago

That sounds so dismissive.
I was not able to just download an 8-16GB file and then have it generate A LOT of different tools, games etc. for me in multiple programming languages while in parallel ELI5 me research papers, generate svgs and a lot lot lot more.
But hey.
elif ・ 3 days ago

Most likely this comment included countless similar comments in its training data, likely all synthetic without any actual tether to real analysis.

alankarmisra ・ 3 days ago

I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.

That said, for something like this, I’d probably get more out of simply finding an existing implementation on github or the like and downloading that.

When it comes to specialized and narrow domains like Space Invaders, the training set is likely to be extremely small and the model's vector space will have limited room to generalize. You'll get code that is more or less identical to the original source and you also have to wait for it to 'type' the code and the value add seems very low. I would rather ask it to point me to known Space Invaders implementations in language X on github (or search there).

Note that ChatGPT gets very nervous if I put this into GPT to clean up the grammar. It wants very badly for me to stress that LLMs don't memorize and overfitting is very unlikely (I believe neither).

tossandthrow ・ 3 days ago

Interesting, I cannot reproduce these warnings in ChatGPT - though this is something that really interests me, as it represents immense political power to be able to interject such warnings (explicitly, or implicitly by slight reformulations)

xianshou ・ 3 days ago

I initially read the title as "My 2.5 year old can write Space Invaders in JavaScript now (GLM-4.5 Air)."

Though I suppose, given a few years, that may also be true!

DonHopkins ・ 2 days ago

Given a few years your 2.5 year old will be a 5.5 year old, too!
- Breza ・ 4 hours ago
  
  Ugh don't remind me. My daughter's fifth birthday is tomorrow and with how fast she's growing I feel like her 15th is on Thursday.

simonw ・ 3 days ago

There's a new model from Qwen today - Qwen3-30B-A3B-Instruct-2507 - that also runs comfortably on my Mac (using about 30GB of RAM with an 8bit quantization).

I tried the "Write an HTML and JavaScript page implementing space invaders" prompt against it and didn't quite get a working game with a single shot, but it was still an interesting result: https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct...

pyman ・ 3 days ago

I was talking about the new open models with a group of people yesterday, and saying how good they're getting. The big question is:
Can any company now compete with the big players? Or even more interesting, like you showed in your research, are proprietary models becoming less relevant now that anyone can run these models locally?
This trend of better open models that run locally is really picking up. Do you think we'll get to a point where we won't need to buy AI tokens anymore?
- simonw ・ 2 days ago
  
  The problem is cost. A machine that can run a decent local model costs thousands of dollars to buy and won't produce results as good as a model that runs on $30,000+ dedicated servers. Meanwhile you can rent access to LLMs running on those expensive machines for fractions of a cent (because you are sharing them with thousands of other users).
  I don't think cost will be a reason to use local models for a very long time, if ever.

lxgr ・ 3 days ago

This raises an interesting question I’ve seen occasionally addressed in science fiction before:

Could today’s consumer hardware run a future superintelligence (or, as a weaker hypothesis, at least contain some lower-level agent that can bootstrap something on other hardware via networking or hyperpersuasion) if the binary dropped out of a wormhole?

bob1029 ・ 3 days ago

This is the premise of all of the ML research I've been into. The only difference is to replace the wormhole with linear genetic programming, neuroevolution, et al. The size of programs in the demoscene is what originally sent me down this path.
The biggest question I keep asking myself - What is the Kolmogorov complexity of a binary image that provides the exact same capabilities as the current generation LLMs? What are the chances this could run on the machine under my desk right now?
I know how many AAA frames per second my machine is capable of rendering. I refuse to believe the gap between running CS2 at 400fps and getting ~100b/s of UTF8 text out of a NLP black box is this big.
- bgirard ・ 3 days ago
  
  > ~100b/s of UTF8 text out of a NLP black box is this big
  That's not a good measure. NP problem solutions are only a single bit, but they are much harder to solve than CS2 frames for large N. If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.
  
  bob1029 ・ 3 days ago
  
  > If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.
  Exactly. This is what compels me to try.
switchbak ・ 3 days ago

This is what I find fascinating. What hidden capabilities exist, and how far could it be exploited? Especially on exotic or novel hardware.
I think much of our progress is limited by the capacity of the human brain, and we mostly proceed via abstraction which allows people to focus on narrow slices. That abstraction has a cost, sometimes a high one, and it’s interesting to think about what the full potential could be without those limitations.
- lxgr ・ 3 days ago
  
  Abstraction, or efficient modeling of a given system, is probably a feature, not a bug, given the strong similarity between intelligence and compression and all that.
  A concise description of the right abstractions for our universe is probably not too far removed from the weights of a superintelligence, modulo a few transformations :)
tw1984 ・ 2 days ago

Could today's seemingly "superintelligent" models run on 10-20 year old hardware? Probably.

pulkitsh1234 ・ 3 days ago

Is there any website to see the minimum/recommended hardware required for running local LLMs? Much like 'system requirements' mentioned for games.

svachalek ・ 3 days ago

In addition to the tools other people responded with, a good rule of thumb is that most local models work best* at q4 quants, meaning the memory for the model in GB is a little over half the number of parameters in billions, e.g. a 14b model may be 8gb. Add some more for context and maybe you want 10gb VRAM for a 14b model. That will at least put you in the right ballpark for what models to consider for your hardware.
(*best performance/size ratio, generally if the model easily fits at q4 you're better off going to a higher parameter count than going for a larger quant, and vice versa)
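
The rule of thumb above can be turned into a rough calculator. This is only a sketch: the 4.5 bits per weight (a q4 quant plus a little overhead, matching "a little over half") and the 2 GB allowance for context are assumptions picked to land near the 8gb and 10gb figures quoted, not measured values, and `estimate_memory_gb` is a made-up helper name.

```python
def estimate_memory_gb(params_billions, bits_per_weight=4.5, context_overhead_gb=2.0):
    """Rough memory estimate for a quantized local model, in GB.

    bits_per_weight and context_overhead_gb are illustrative assumptions,
    not properties of any particular model or runtime.
    """
    # Weights: (billions of parameters) * (bits per weight) / (8 bits per byte)
    weights_gb = params_billions * bits_per_weight / 8
    # Add a flat allowance for context (KV cache) and runtime buffers.
    total_gb = weights_gb + context_overhead_gb
    return weights_gb, total_gb


weights_gb, total_gb = estimate_memory_gb(14)
print(f"14b at ~q4: weights ~{weights_gb:.1f} GB, with context ~{total_gb:.1f} GB")
```

Real usage depends on the quant format, the context length and the runtime, so treat the result as a ballpark for shortlisting models.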
- nottorp ・ 3 days ago
  
  > maybe you want 10gb VRAM for a 14b model
  ... or if you have Apple hardware with their unified memory, whatever the assholes soldered in is your limit.
CharlesW ・ 3 days ago

> Is there any website to see the minimum/recommended hardware required for running local LLMs?
LM Studio (not exclusively, I'm sure) makes it a no-brainer to pick models that'll work on your hardware.
GaggiX ・ 3 days ago

https://apxml.com/tools/vram-calculator
This one is very good in my opinion.
- jxf ・ 3 days ago
  
  Don't think it has the GLM series on there yet.
qingcharles ・ 3 days ago

This can be a useful resource too:
https://www.reddit.com/r/LocalLLaMA/
knowaveragejoe ・ 3 days ago

If you have a HuggingFace account, you can specify the hardware you have and it will show on any given model's page what you can run.

ddtaylor ・ 3 days ago

My brain is running legacy COBOL and first read this as

> My 2.5 year old with their laptop can write Space Invaders

For a few hundred milliseconds there I was thinking "these damn kids are getting good with tablets"

Imustaskforhelp ・ 3 days ago

Don't worry I guess my brain is running bleeding edge typescript with react (I am in high school for context) and the first time I also read it this way...
But I am without my glasses, but still I have hackernews at 250%, I think I am a little cooked lol.
- OldfieldFund ・ 3 days ago
  
  We are all cooked at this point :)

petercooper ・ 3 days ago

I ran the same experiment on the full size model. It used a custom 80s style font (from Google Fonts) and gave 'eyes' and more differences to the enemies but otherwise had a similar vibe to Simon's. An interesting visual demonstration of what quantization does though! Screenshot: https://peterc.org/img/aliens.png

stpedgwdgfhgdd ・ 3 days ago

Aside from the fact that space invaders from scratch is not representative for real engineering, it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine (no usage tier per hour or week), let’s say, one year from now. At $200 per month for 2 years I can buy a decent Mx with 64GB (or perhaps even 128GB taking residual value into account)

falcor84 ・ 3 days ago

How come it's "not representative for real engineering"? Other than copy-pasting existing code (which is not what an LLM does), I don't see how you can create a space invaders game without applying "engineering".
- hbn ・ 3 days ago
  
  The prompt was
  > Write an HTML and JavaScript page implementing space invaders
  It may not be "copy pasting" but it's generating output as best it can be recreated from its training on looking at Space Invaders source code.
  The engineers at Taito that originally developed Space Invaders were not told "make Space Invaders" and then did their best to recall all the source code they've looked at in their life to re-type the source code to an existing game. From a logistics standpoint, where the source code already exists and is accessible, you may as well have copy-pasted it and fudged a few things around.
  
  simonw ・ 3 days ago
  
  The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.
  I used that prompt because it's the shortest possible prompt that tells the model to build a game with a specific set of features. If I wanted to build a custom game I would have had to write a prompt that was many paragraphs longer than that.
  The aim of this piece isn't "OMG look, LLMs can build space invaders" - at this point that shouldn't be a surprise to anyone. What's interesting is that my laptop can run a model that is capable of that now.
  
  sarchertech ・ 3 days ago
  
  > The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.
  Sure but that doesn’t impact the OP's point at all because there are numerous copies of reverse engineered source code available.
  There are numerous copies of the reverse engineered source code already translated to JavaScript in your model's training set.
  
  hbn ・ 3 days ago
  
  The discussion I replied to was just regarding whether or not what the LLM did should be considered "engineering"
  It doesn't really matter whether or not the original code was published. In fact that original source code on its own probably wouldn't be that useful, since I imagine it wouldn't have tipped the weights enough to be "recallable" from the model, not to mention it was tasked with implementing it in web technologies.
  
  nottorp ・ 3 days ago
  
  > What's interesting is that my laptop can run a model that is capable of that now.
  I'm afraid no one cared much about your point :)
  You'll only get "OMG look how good LLMs are they'll get us all fired!" comments and "LLMs suck" comments.
  This is how it goes with religion...
  
  dfedbeef ・ 2 days ago
  
  I have good news about how games were programmed in the 70's. What if I told you the disassembled ROM is the code.
- sharkjacobs ・ 3 days ago
  
  Making a space invaders game is not representative of normal engineering because you're reproducing an existing game with well known specs and requirements. There are probably hundreds of thousands of words describing and discussing Space Invaders in GLM-4.5's training data
  It's like using an LLM to implement a red black tree. Red black trees are in the training data, so you don't need to explain or describe what you mean beyond naming it.
  "Real engineering" with LLMs usually requires a bunch of up front work creating specs and outlines and unit tests. "Context engineering"
  
  jasonvorhe ・ 3 days ago
  
  Smells like moving the goalposts. What's real engineering going to be in 2028? Implementing Google's infra stack in your homelab?
- phkahler ・ 3 days ago
  
  >> Other than copy-pasting existing code (which is not what an LLM does)
  I'd like to see someone try to prove this. How many space invaders projects exist on the internet? It'd be hard to compare model "generated" code to everything out there looking for plagiarism, but I bet there are lots of snippets pulled in. These things are NOT smart, they are huge and articulate information repositories.
  
  simonw ・ 3 days ago
  
  Go for it. https://www.google.com/search?client=firefox-b-1-d&q=github+... has a bunch of results. Here's the source code GLM-4.5 Air spat out for me on my laptop: https://github.com/simonw/tools/blob/main/space-invaders-GLM...
  Based on my mental model of how these things work I'll be genuinely surprised if you can find even a few lines of code duplicated from one of those projects into the code that GLM-4.5 wrote for me.
  
  phkahler ・ 3 days ago
  
  So I scanned the beginning of the generated code, picked line 83:
  animation: glow 2s ease-in-out infinite;
  stuffed it verbatim into google and found a stack overflow discussion that contained this:
  animation: glow .5s infinite alternate;
  in under one minute. Then I found this page of CSS effects:
  https://alvarotrigo.com/blog/animated-backgrounds-css/
  Another page has examples and contains:
  animation: float 15s infinite ease-in-out;
  There is just too much internet to scan for an exact match or a match of larger size.
  
  simonw ・ 3 days ago
  
  That's not an example of copying from an existing Space Invaders implementation. That's an LLM using a CSS animation pattern - one that it's seen thousands (probably millions) of times in the training data.
  That's what I expect these things to do: they break down Space Invaders into the components they need to build, then mix and match thousands of different coding patterns (like "animation: glow 2s ease-in-out infinite;") to implement different aspects of that game.
  You can see that in the "reasoning" trace here: https://gist.github.com/simonw/9f515c8e32fb791549aeb88304550... - "I'll use a modern design with smooth animations, particle effects, and a retro-futuristic aesthetic."
  
  threeducks ・ 3 days ago
  
  I think LLMs are adapting higher level concepts. For example, the following JavaScript code generated by GLM (https://github.com/simonw/tools/blob/9e04fd9895fae1aa9ac78b8...) is clearly inspired by this C++ code (https://github.com/portapack-mayhem/mayhem-firmware/blob/28e...), but it is not an exact copy.
  
  simonw ・ 3 days ago
  
  This is a really good spot.
  That code certainly looks similar, but I have trouble imagining how else you would implement very basic collision detection between a projectile and a player object in a game of this nature.
  
  threeducks ・ 3 days ago
  
  A human would likely have refactored the two collision checks between bullet/enemy and enemyBullet/player in the JavaScript code into its own function, perhaps something like "areRectanglesOverlapping". The C++ code only does one collision check like that, so it has not been refactored there, but as a human, I certainly would not want to write that twice.
  More importantly, it is not just the collision check that is similar. Almost the entire sequence of operations is identical on a higher level:
  1. enemyBullet/player collision check
  2. same comment "// Player hit!" (this is how I found the code)
  3. remove enemy bullet from array
  4. decrement lives
  5. update lives UI
  6. (createParticle only exists in JS code)
  7. if lives are <= 0, gameOver
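
The refactor suggested above, pulling the repeated overlap test into one helper, might look roughly like this. It is a Python sketch rather than the JavaScript GLM produced; the `Rect` type and the sample coordinates are hypothetical, and only the name is borrowed from "areRectanglesOverlapping".

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle (hypothetical stand-in for a game object)."""
    x: float
    y: float
    width: float
    height: float


def are_rectangles_overlapping(a: Rect, b: Rect) -> bool:
    # Two axis-aligned rectangles overlap when they overlap on both axes.
    return (a.x < b.x + b.width and a.x + a.width > b.x
            and a.y < b.y + b.height and a.y + a.height > b.y)


# The same helper serves both checks: bullet/enemy and enemyBullet/player.
bullet = Rect(10, 10, 4, 10)
enemy = Rect(8, 15, 30, 20)
player = Rect(100, 200, 40, 20)
print(are_rectangles_overlapping(bullet, enemy))   # True
print(are_rectangles_overlapping(bullet, player))  # False
```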
  
  ben_w ・ 3 days ago
  
  So, your example of it copying snippets is… using the same API with fairly different parameters in a different order?
  
  falcor84 ・ 3 days ago
  
  The parent said
  > find even a few lines of code duplicated from one of those projects
  I'm pretty sure they meant multiple lines copied verbatim from a single project implementing space invaders, rather than individual lines copied (or likely just accidentally identical) across different unrelated projects.
  
  sejje ・ 3 days ago
  
  Is this some kind of joke?
  That's how you write css. The examples aren't the same at all, they just use the same css feature.
  It feels like you aren't a coder--you've sabotaged your own point.
  
  ben_w ・ 3 days ago
  
  Sorites paradox. Where's the distinction between "snippet" and "a design pattern"?
  Compressing a few petabytes into a few gigabytes requires that they can't be like this about all of the things they're accused of simply copy-pasting, from code to newspaper articles to novels. There's not enough space.
dmortin ・ 3 days ago

" it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine "
Most people won't bother with buying powerful hardware for this, they will keep using SaaS solutions, so Anthropic can be in trouble if cheaper SaaS solutions come out.
qingcharles ・ 3 days ago

The frontier models are always going to tempt you with their higher quality and quicker generation, IMO.
- kasey_junk ・ 3 days ago
  
  I’ve been mentally mapping the models to the history of db.
  Most db in the early days you had to pay for. There are still for pay db that are just better than ones you don’t pay for. Some teams think that the cost is worth the improvements and there is a (tough) business there. Fortunes were made in the early days.
  But eventually open source models became good enough for many use cases and they have their own advantages. So lots of teams use them.
  I think coding models might have a similar trajectory.
  
  qingcharles ・ 3 days ago
  
  You make a good point -- a majority of applications are now using open source or free versions[1] of DBs.
  My only feedback is: are these the same animal? Can we compare an O/S DB vs. paid/closed DB to me running an LLM locally? The biggest issue right now with LLMs is simply the cost of the hardware to run one locally, not the quality of the actual software (the model).
  [1] e.g. SQL Server Express is good enough for a lot of tasks, and I guess would be roughly equivalent to the upcoming open versions of GPT vs. the frontier version.
  
  qcnguy ・ 3 days ago
  
  A majority of apps nowadays are using proprietary forks of open source DBs running in the cloud, where their feature set is (slightly) rounded out and smoothed off by the cloud vendors.
  Not that many projects are doing fully self-hosted RDBMS at this point. So ultimately proprietary databases still win out, they just (ab)use the Postgresql trademark to make people think they're using open source.
  LLMs might go the same way. The big clouds offering proprietary fine tunes of models given away by AI labs using investor money?
  
  qingcharles ・ 3 days ago
  
  That's definitely true. I could see more of the "running open source models on other people's hardware" model.
  I dislike running local LLMs right now because I find the software kinda janky still, you often have to tweak settings, find the right model files. Basically have a bunch of domain knowledge I don't have space for in my head. On top of maintaining a high-spec piece of hardware and paying for the power costs.
- zarzavat ・ 3 days ago
  
  Closed doesn't always win over open. People said the same thing about Windows vs Linux, but even Microsoft was forced to admit defeat and support Linux.
  All it takes is some large companies commoditizing their complements. For Linux it was Google, etc. For AI it's Meta and China.
  The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.
  
  airspresso ・ 3 days ago
  
  > The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.
  Disagree. Anthropic have a unique approach to how they post-train their models and tune them to be the way they want them. No other lab has managed to reproduce the style and personality of Claude yet, which is currently a key reason why coders prefer it. And since post-training data is secret, it'll take other providers a lot of focused effort to get close to that.
rafaelmn ・ 3 days ago

What about power used and support hardware? Also, a card going down means you are down until you get warranty service.
- skeezyboy ・ 3 days ago
  
  why are you doing anything locally then?
  
  rafaelmn ・ 3 days ago
  
  Latency and tooling support? UX of cloud based LLM vs local is much better for the cloud option - not so much for dev tooling.
  I tried using remote workstations - I am not a fan of lugging a beefy client machine to do my work - would much rather use something that's super light and power efficient.
tptacek ・ 3 days ago

OK, go write Space Invaders by hand.
- LandR ・ 3 days ago
  
  I'd hope most professional software engineers could do this in an afternoon or so?
  
  sejje ・ 3 days ago
  
  Most professional software engineers have never written a game and don't do web work, so I somehow doubt that.
  
  anthk ・ 3 days ago
  
  With TCL/TK it's a matter of less than 2 hours.
  
  Mashimo ・ 3 days ago
  
  Depends on the rules. Can I look up other space invaders games on github first? Can I use a game framework?
  Just JS / HTML docs I probably could not.
  
  pharrington ・ 2 days ago
  
  No preexisting framework.

indigodaddy ・ 3 days ago

Did pretty well with a boggle clone. I like that it tries to do a single html file (I didn't ask for that but was pleasantly surprised). It didn't include dictionary validation so needed a couple of prompts. Touch selection on mobile isn't the greatest but I've seen plenty worse

https://chat.z.ai/space/z0gcn6qtu8s1-art

https://chat.z.ai/s/74fe4ddc-f528-4d21-9405-0a8b15a96520

Keyframe ・ 3 days ago

I went the other route with a Tetris clone the other day. It's definitely not a single prompt. It took me a solid 15 hours to get to this stage, most of that me thinking.. BUT, except for one small trivial thing (space invader logo in a pre tag) I haven't touched the code - just looked at it. I made it mandatory for myself to see if I can first greenfield myself into this project and then brownfield features and fixes.. It's definitely a ton of work on my end, but it's also not something I'd be able to do in ~2 working days or less. As a cherry on top, even though it's still not done yet, I put in AI-generated music singing about the project itself. https://www.susmel.com/stacky/
Definitely a ton of things I learned about how to "develop" "with" AI along the way.
JKCalhoun ・ 3 days ago

Cool — if only diagonals were easier. ;-) (Hopefully I'm being constructive here.)
- indigodaddy ・ 3 days ago
  
  Yep I tried to have it improve that but actually didn't use the word 'diagonal' in the prompt. I bet it would have done better if I had..
  
  indigodaddy ・ 3 days ago
  
  Had it try to improve Diagonal selection but didn't seem to help much
  https://chat.z.ai/space/b01dc65rg2p0-art

dust42 ・ 3 days ago

I tried with Claude Sonnet 4 and it does *not* work. So looks like GLM-4.5 Air in 3bit quant is ahead.

Chat is here: https://claude.ai/share/dc9eccbf-b34a-4e2b-af86-ec2dd83687ea

Claude Opus 4 does work but is far behind Simon's GLM-4.5: https://claude.ai/share/5ddc0e94-3429-4c35-ad3f-2c9a2499fb5d

maksimur ・ 3 days ago

A $xxxx 2.5 year old laptop, one that's probably much more powerful than an average laptop bought today and probably next year as well. I don't think it's a fair reference point.

bprew ・ 3 days ago

His point isn't that you can run a model on an average laptop, but that the same laptop can still run frontier models.
It speaks to the advancements in models that aren't just throwing more compute/ram at it.
Also, his laptop isn't that fancy.
> It claims to be small enough to run on consumer hardware. I just ran the 7B and 13B models on my 64GB M2 MacBook Pro!
From: https://simonwillison.net/2023/Mar/11/llama/
parsimo2010 ・ 3 days ago

The article is pretty good overall, but the title did irk me a little. I assumed when reading "2.5 year old" that it was fairly low-spec only to find out it was an M2 MacBook Pro with 64 GB of unified memory, so it can run models bigger than what an Nvidia 5090 can handle.
I suppose that it could be intended to be read as "my laptop is only 2.5 years old, and therefore fairly modern/powerful" but I doubt that was the intention.
- simonw ・ 3 days ago
  
  The reason I emphasize the laptop's age is that it is the same laptop I have been using ever since the first LLaMA release.
  This makes it a great way to illustrate how much better the models have got without requiring new hardware to unlock those improved abilities.
nh43215rgb ・ 3 days ago

About $3700 laptop...

efitz ・ 3 days ago

I missed the word “laptop” in the title at first glance and thought this was a “I taught my toddler to code” article.

juliangoetze ・ 3 days ago

I thought I was the only one.
below43 ・ 3 days ago

Same here. Pretty impressive LLM.

joelthelion ・ 3 days ago

Apart from using a Mac, what can you use for inference with reasonable performance? Is a Mac the only realistic option at the moment?

reilly3000 ・ 3 days ago

The top 3 approaches I see a lot on r/localllama are:
1. 2-4x 3090+ nvidia cards. Some are getting Chinese 48GB cards. There is a ceiling on VRAM that prevents the biggest models from loading, but most models can run at most quants at great speeds
2. Epyc servers running CPU inference with lots of RAM at as high a memory bandwidth as is available. With these setups people are getting like 5-10 t/s but are able to run 450B parameter models.
3. High RAM Macs with as much memory bandwidth as possible. They are the best balanced approach and surprisingly reasonable relative to other options.
badsectoracula ・ 3 days ago

An Nvidia GPU is the most common answer, but personally i've done all my LLM use locally using mainly Mistral Small 3.1/3.2-based models and llama.cpp with an AMD RX 7900 XTX GPU. It only gives you ~4.71 tokens per second, but that is fast enough for a lot of uses. For example last month or so i wrote a raytracer[0][1] in C with Devstral Small 1.0 (based on Mistral Small 3.1). It wasn't "vibe coding" as much as a "co-op" where i'd go back and forth with a chat interface (koboldcpp) and i'd, e.g. ask the LLM to implement some feature, then i'd switch to the editor and start writing code using that feature while the LLM was generating it in the background. Or, more often, i'd fix bugs in the LLM's code :-P.
FWIW GPU aside, my PC isn't particularly new - it is a 5-6 year old PC that was the cheapest money could buy originally and became "decent" at the time i upgraded it ~5 years ago and i only added the GPU around Christmas as prices were dropping since AMD was about to release the new GPUs.
[0] https://i.imgur.com/FevOm0o.png
[1] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...
AlexeyBrin ・ 3 days ago

A gaming PC with an NVIDIA 4090/5090 will be more than adequate for running local models.
Where a Mac may beat the above is on the memory side, if a model requires more than 24/32 GB of GPU memory you are usually better off with a Mac with 64/128 GB of RAM. On a Mac the memory is shared between CPU and GPU, so the GPU can load larger models.
whimsicalism ・ 3 days ago

you are almost certainly better off renting GPUs, but i understand self-hosting is an HN touchstone
- mrinterweb ・ 3 days ago
  
  I don't know about that. I've had my RTX 4090 for nearly 3 years now. Suppose I had a script that provisioned and deprovisioned a rented 4090 at $0.70/hr for an 8 hour work day, 20 work days per month, and assume I get 2 paid weeks off per year plus normal holidays. Over 3 years that comes to:
  0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662
  I bought my RTX 4090 for about $2200. I also had the pleasure of being able to use it for gaming when I wasn't working. To be fair, the VRAM requirements for local models keep climbing and my 4090 isn't able to run many of the latest LLMs. Also, I omitted the cost of electricity from my local LLM server cost. I have not been measuring total watts consumed by just that machine.
  One nice thing about renting is that it gives you flexibility in terms of what you want to try.
  If you're really looking for the best deals look at 3rd party hosts serving open models for the API-based pricing, or honestly a Claude subscription can easily be worth it if you use LLMs a fair bit.
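  The rent-versus-buy arithmetic above can be replayed as a small script. This is only a sketch using the figures quoted in the comment ($0.70/hr, 8 hours a day, 218 working days a year, a $2200 card); the function names are illustrative, and electricity and resale value are left out, as they are in the comment.

```python
HOURLY_RATE = 0.70                      # $/hr for a rented 4090 (from the comment)
HOURS_PER_DAY = 8
WORK_DAYS_PER_YEAR = 20 * 12 - 8 - 14   # 218 days, same expression as the comment
PURCHASE_PRICE = 2200                   # $ paid for the card (from the comment)


def rental_cost(years: float) -> float:
    """Total rental spend over the given number of years."""
    return HOURLY_RATE * HOURS_PER_DAY * WORK_DAYS_PER_YEAR * years


def break_even_years() -> float:
    """Years of renting after which buying would have been cheaper."""
    return PURCHASE_PRICE / rental_cost(1)


if __name__ == "__main__":
    print(f"3 years of renting: ${rental_cost(3):.0f}")
    print(f"break-even after about {break_even_years():.1f} years")
```

  With these inputs the purchase pays for itself after a bit under two years of full working-day use; lighter usage pushes the break-even point out proportionally.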
  
  whimsicalism ・ 3 days ago
  
  1. I agree - there are absolutely scenarios in which it can make sense to buy a GPU and run it yourself. If you are managing a software firm with multiple employees, you very well might break even in less than a few years. But I would wager this is not the case for 90%+ of people self-hosting these models, unless they have some other good reason (like gaming) to buy a GPU.
  2. I basically agree with your caveats - excluding electricity is a pretty big exclusion and I don't think that you've had 3 years of really high-value self-hostable models. I would really only say the last year, and I'm somewhat skeptical of how good the ones that can be hosted in 24gb vram are. 4x4090 is a different story.
- qingcharles ・ 3 days ago
  
  This. Especially if you just want to try a bunch of different things out. Renting is insanely cheap -- to the point where I don't understand how the renters are making their money back unless they stole the hardware and power.
  It can really help you figure a ton of things out before you blow the cash on your own hardware.
  
  4b11b4 ・ 3 days ago
  
  Recommended sites to rent from?
  
  whimsicalism ・ 3 days ago
  
  runpod, vast, hyperbolic, prime intellect. if all you're doing is going to be running LLMs, you can pay per token on openrouter or some of the providers listed there
  
  doormatt ・ 3 days ago
  
  runpod.io
regularfry ・ 3 days ago

This one should just about fit on a box with an RTX 4090 and 64GB RAM (which is what I've got) at q4. Don't know what the performance will be yet. I'm hoping for an unsloth dynamic quant to get the most out of it.
- weberer ・ 3 days ago
  
  What's important is VRAM, not system RAM. The 4090 has 16gb of VRAM so you'll be limited to smaller models at decent speeds. Of course, you can run models from system memory, but your tokens/second will be orders of magnitude slower. ARM Macs are the exception since they have unified memory, allowing high bandwidth between the GPU and the system's RAM.
  
  regularfry ・ 3 days ago
  
  Yes and no. The 4090 has 24GB, not 16; but with a big MoE you're not getting everything in there anyway. In that case you really want all the weights in RAM so that swapping experts in isn't a load from disk.
  It's not as good as unified RAM, but it's also workable.
  
  throwaway0123_5 ・ 3 days ago
  
  iirc 4090s have 24GB
thenaturalist ・ 3 days ago

This guy [0] does a ton of in-depth HW comparison/ benchmarking, including against Mac mini clusters and an M3 ultra.
0: https://www.youtube.com/@AZisk

h-bradio ・ 3 days ago

Thanks so much for this! I updated LM Studio, and it picked up the mlx-lm update required. After a small tweak to tool-calling in the prompt, it works great with Zed!

torarnv ・ 3 days ago

Could you describe the tweak you did, and possibly the general setup you have with zed working with LM Studio? Do you use a custom system prompt? What context size do you use? Temperature? Thanks!
- h-bradio ・ 2 days ago
  
  Here is how my prompt ended up! https://gist.github.com/hbradio/2f504c3fdb6f7113181b2d8c6862... I just asked an LLM to make it similar to a working Qwen prompt.
  To make the LLMs able to use tools, I had to configure them in the Zed settings like this: https://gist.github.com/hbradio/fa4b456658a8d250e6ccc69ae9b3...
  Also, I had to go into LM Studio and increase the max context size for each model I wanted to use in Zed. Otherwise it gives a parsing error on the response. I set it to the max allowable value.
  I start LM Studio, start the LM Studio server, then go to Zed's AI config and tell it to connect to LM Studio. I put it in Agent mode, and it seems to work!
  I don't know much about temperature, and I didn't use any other system prompt.
  Good luck!

jauntywundrkind ・ 3 days ago

MLX does have decent/good software support among ML stacks. Targeting both iOS and mac is a big win in itself.

I wonder what's possible, what the software situation is today with the PC NPU's. AMD's XDNA has been around for a while, XDNA2 jumps from 10->40 TOps. AMD iGPU can access huge memory: is it similar here? The "AMDXDNA" driver merged in 6.14 last winter: where are we now?

But not seeing any evidence that there's popular support in any of the main frameworks. https://github.com/ggml-org/llama.cpp/issues/1499 https://github.com/ollama/ollama/issues/5186

Good news, AMD has an initial implementation for llama.cpp. I don't particularly know what it means, but the first gen supports W4ABF16 quantization, newer chips support W8A16. https://github.com/ggml-org/llama.cpp/issues/14377 . I'm not sure what it's good for, but there is a Linux "xdna-driver", https://github.com/amd/xdna-driver . IREE has an experimental backend: https://github.com/nod-ai/iree-amd-aie

There's a lot of other folks also starting on their NPU journeys. ARM's Ethos, and Rockchip's RKNN recently shipped Linux kernel drivers, but it feels like that's just a start? https://www.phoronix.com/news/Arm-Ethos-NPU-Accel-Driver https://www.phoronix.com/news/Rockchip-NPU-Driver-RKNN-2025

captainregex ・ 2 days ago

What is the biggest group of people that uses local AI? Students who don’t want to pay but somehow have hardware? Devs who are price conscious?

I kinda think about it like Linux- there’s an audience for it but most people will just use Windows. Are AI capabilities in a local device necessary even in whatever future self serving state some companies are pushing?

It feels like we do it because we can more than because it makes sense- which I am all for! I just wonder if i’m missing some kind of major use case all around me

lifestyleguru ・ 3 days ago

> my 2.5 year old laptop (a 64GB MacBook Pro M2) i

My MacBook has 16GB of RAM and it is from a period when everyone was fiercely insisting that 8GB base model is all I'll ever need.

tracker1 ・ 3 days ago

I'm kind of with you... while I've run 128gb on my desktop, and am currently at 96gb with DDR5 being what it is, it's far less common for typical laptops. I'm a bit curious how the Ryzen 395+ with 128gb will handle some of these models. The 200gb options feel completely out of reach.

chickenzzzzu ・ 3 days ago

"2.5 year old laptop" is potentially the most useless way of describing a 64GB M2, as it could be confused with virtually any other configuration of laptop.

simonw ・ 3 days ago

The thing I find most notable here is that this is the same laptop I've used to run every open weights model since the original LLaMA.
The models have got so much better without me needing to upgrade my hardware.
- chickenzzzzu ・ 3 days ago
  
  That's great! Why can't we say that instead?
  No need to overly quantize our headlines.
  "64GB M2 makes Space Invaders-- can be bought for under $xxxx"
OJFord ・ 3 days ago

I think the point is just that it doesn't require absolute cutting edge nor server hardware.
- jphoward ・ 3 days ago
  
  No but 64 GB of unified memory provides almost as much GPU RAM capacity as two RTX 5090s (only less due to the unified nature) - top of the range GPUs - so it's a truly exceptional laptop in this regard.
  
  turnsout ・ 3 days ago
  
  Except that it is not exceptional at all; it's an older-generation MacBook Pro with 64GB of RAM. There's nothing particularly unusual about it.
  
  jphoward ・ 3 days ago
  
  64 GB of RAM which is addressable by a GPU is exceptional for a laptop - this is not just system RAM.
  
  turnsout ・ 3 days ago
  
  I understand, but that is not exceptional for a Mac laptop. You could say all Apple Silicon Macs are exceptional, and I guess I agree in the context of the broader PC community. But I would not point at an individual MacBook Pro with 64 GB of RAM and say "whoa, that's exceptional." It's literally just a standard option when you buy the computer. It does bump the price pretty high, but the point of the MBP is to cater to higher-end workflows.
  
  chickenzzzzu ・ 3 days ago
  
  To emphasize this point further, at least with my efforts, it is not even possible to buy a 64GB M4 Pro right now. 32GB, 64GB, and 128GB are all sold out.
  We can say that 64GB addressable by a GPU is not exceptional when compared to 128GB and it still costs less than a month's pay for a FAANG engineer, but the fact that they aren't actually purchasable right now shows that it's not as easy as driving to Best Buy and grabbing one off the shelf.
  
  turnsout ・ 3 days ago
  
  They're not sold out—Apple's configurator (and chip naming) is just confusing. The MacBook Pro with M4 Pro is only available in 24 or 48 GB configurations. To get 64 or 128 GB, you need to upgrade to the M4 Max.
  If you're looking for the cheapest way into 64 GB of unified memory, the Mac mini is available with an M4 Pro and 64GB at $1999.
  So, truly, not "exceptional" unless you consider the price to be exorbitant (it's not, as evidenced by the long useful life of an M-series Mac).
  
  chickenzzzzu ・ 3 days ago
  
  thank you for providing that extra info! i agree that $2000-4000 is not an absolutely earth shattering price, but i still wonder what the benefit one receives is when they say "2.5 year old laptop" instead of "64GB M2 laptop"
- tantalor ・ 3 days ago
  
  It was also something he already had lying around. Did not need to buy something new to get new functionality.

aplzr ・ 3 days ago

I really like talking to Claude (free tier) instead of using a search engine when I'm stumbling upon a random topic that interests me. For example, this morning I had it explain the differences between pass by value, pass by reference, and pass by sharing, the last of which I wasn't aware of until then.

Is this kind of thing also possible with one of these self-hosted models in a comparable way, or are they mostly good for coding?

pmarreck ・ 3 days ago

I have an M4 Mac with 128GB RAM and I'm currently downloading GLM-4.5-Air-q5-hi-mlx via LM Studio (80GB) and will report back!

rexreed ・ 3 days ago

How is it going? Intrigued enough to possibly get an M4 Mac with 128GB RAM if it's worthwhile...
- ls-a ・ 3 days ago
  
  Apple is going to make so much money if they keep pushing on-device LLMs. It makes absolute sense to sell more macbook pros
  
  pmarreck ・ 2 days ago
  
  Assuming they keep improving, yes. The one I tried (I responded to the other reply with some output, which was great) is as fast as the cloud ones and nearly as good.
- pmarreck ・ 2 days ago
  
  Pretty impressive. Spit out perfectly-working Asteroids on the first try. https://gist.github.com/pmarreck/db782fdb68053292ca746d6c756...
  I want to hook it up to Zed next and see how that goes

xkcd1963 ・ 2 days ago

Standalone mini projects like that are also a good way to train students. But I believe LLMs are still a long way from being able to solve problems that require combining solutions across different environments, software, circumstances, projects, ...

Aurornis ・ 3 days ago

This is very cool. The blog had to run it from the main branch of the mlx-lm library and a custom script. Can someone up to date on the local LLM tools let us know which mainstream tools we should be watching for an easier way to run this on MLX? The space moves so fast that it's hard to keep up.

simonw ・ 3 days ago

I expect LM Studio will have this pretty soon - I imagine they are waiting on the next stable release of mlx-lm which will include the change I needed to get this to work.

andai ・ 3 days ago

I got almost the same result with a 4B model (Qwen3-4B), about 20x smaller than OP's ~100B model.

https://jsbin.com/lejunenezu/edit?html,output

Its pelican was a total fail though.

andai ・ 3 days ago

Update: It failed to make Flappy Bird though (several attempts).
This surprises me, I thought it would be simpler than Space Invaders.

GardenLetter27 ・ 3 days ago

Crazy how Apple is still the only option for this consumer hardware.

airspresso ・ 3 days ago

Framework Desktop with AMD Strix Halo [1] is getting there as a viable alternative. Offering up to 96 GB of unified RAM at the moment, so still a gap up to the beefiest Mac Studio alternatives though.
[1]: https://frame.work/desktop
- GardenLetter27 ・ 2 days ago
  
  Surprisingly competitive pricing there though. It sucks that they're all priced around $3k in total though (with my Europoor VAT), but it's not as bad vs. Apple as I thought it would be.

lherron ・ 3 days ago

With the Anthropic rug pull on quotas for Max, I feel the short-mid term value sweet spot will be a Frankensteined together “Claude as orchestrator/coder, falling back to local models as quota limits approach” tool suite.

4b11b4 ・ 3 days ago

Was thinking this one might backfire on Anthropic in the end...
People are going to explore and get comfortable with alternatives.
There may have been other ways to deal with the cases they were worried about.

sneak ・ 3 days ago

What is the SOTA for benchmarking all of the models you can run on your local machine vs a test suite?

Surely this must exist, no? I want to generate a local leaderboard and perhaps write new test cases.

slimebot80 ・ 3 days ago

(novice question)

64gb is pure RAM? I thought Apple Silicon was efficient at paging SSD as memory storage - how important is RAM if you've got a fast SSD?

ethan_smith ・ 3 days ago

While Apple Silicon's memory compression and SSD swapping are efficient, RAM access is still ~100x faster than SSD, so sufficient physical RAM remains crucial for memory-intensive workloads like running large LLMs.
nicce ・ 3 days ago

Memory speed is the most important factor with LLMs and SSD is very slow when compared to RAM.
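A common rule of thumb behind this: when generating one token at a time, the weights that are used have to be streamed from memory for every token, so memory bandwidth divided by the bytes read per token gives a rough ceiling on tokens per second. The sketch below applies that approximation to the ~500GBps and ~50GBps figures quoted elsewhere in this thread; the 15 GB of weights read per token and the ~5 GB/s SSD figure are assumptions for illustration, and real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Rough upper bound on decode speed if each token streams the weights once."""
    return bandwidth_gb_s / weights_read_gb


if __name__ == "__main__":
    WEIGHTS_READ_GB = 15  # assumed: weights touched per token for a hypothetical model
    tiers = {
        "high-end unified memory (~500 GB/s)": 500,
        "dual-channel DDR5 (~50 GB/s)": 50,
        "fast NVMe SSD (~5 GB/s, assumed)": 5,
    }
    for name, bandwidth in tiers.items():
        ceiling = max_tokens_per_second(bandwidth, WEIGHTS_READ_GB)
        print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

For a mixture-of-experts model only the active experts are read for each token, so the bytes-per-token figure is much smaller than the full model size, which is part of why such models are usable on this class of hardware.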

skeezyboy ・ 3 days ago

But aren't we still decades away from running our own video-creating AIs locally? Have we plateaued with this current generation of techniques?

svachalek ・ 3 days ago

It's more a question of, how long do you want it to take to create a video locally?
- skeezyboy ・ 3 days ago
  
  nah, i definitely want to know what i asked
  
  sejje ・ 3 days ago
  
  His answer implies you can run them locally now, just not in a useful timeframe.

neutronicus ・ 3 days ago

If I understand correctly, the author is managing to run this model on a laptop with 64GB of RAM?

So a home workstation with 64GB+ of RAM could get similar results?

simonw ・ 3 days ago

Only if that RAM is available to a GPU, or you're willing to tolerate extremely slow responses.
The neat thing about Apple Silicon is the system RAM is available to the GPU. On most other systems you would need ~48GB of VRAM.
- xrd ・ 3 days ago
  
  Aren't there non-macOS laptops which also support sharing the VRAM and regular RAM, i.e. iGPU?
  https://www.reddit.com/r/GamingLaptops/comments/1akj5aw/what...
  I personally want to run linux and feel like I'll get a better price/GB offering that way. But, it is confusing to know how local models will actually work on those and the drawbacks of iGPU.
  
  mft_ ・ 3 days ago
  
  iGPUs are typically weak, and/or aren't capable of running the LLM so the CPU is used instead. You can run things this way, but it's not fast, and it gets slower as the models go up in size.
  If you want things to run quickly, then aside from Macs, there's the 2025 ASUS Flow z13 which (afaik) is the only laptop with AMD's new Ryzen Max+ 395 processor. This is powerful and has up to 128GB of RAM that can be shared with the GPU, but they're very rare (and Mac-expensive) at the moment.
  The other variable for running LLMs quickly is memory bandwidth; the Max+ 395 has 256GB/s, which is similar to the M4 Pro; the M4 Max chips are considerably higher. Apple fell on their feet on this one.
- sagarm ・ 3 days ago
  
  LLM evaluation on GPU and CPU is memory bandwidth constrained. The highest-end Apple machines are good for this because they have ~500GBps memory bandwidth and up to ~128GB, not just because they can share that memory with the GPU (which any iGPU does). Most consumer machines are limited to 2xDDR5 channels (~50GBps).
NitpickLawyer ・ 3 days ago

> So a home workstation with 64GB+ of RAM could get similar results?
Similar in quality, but CPU generation will be slower than what macs can do.
What you can do with MoEs (GLMs and Qwens) is to run some experts (the shared ones usually) on a GPU (even a 12GB/16GB will do) and the rest from RAM on CPU. That will speed things up considerably (especially prompt processing). If you're interested in this, look up llama.cpp and especially ik_llama, which is a fork dedicated to this kind of selective offloading of experts.
simlevesque ・ 3 days ago

Not so sure. The MBP uses hybrid memory, the ram is shared with the cpu and gpu.
Your 64gb workstation doesn't share the ram with your gpu.
0x457 ・ 3 days ago

You can run, it will just run on CPU and will be pretty slow. Macs, like everyone in this thread said, use unified memory, so it's 64GB between CPU and GPU, while for you it's just 64GB for the CPU.
lynndotpy ・ 3 days ago

The laptop has "unified RAM", so that's like 64GB of VRAM.

anthk ・ 3 days ago

Writing a Z80 emulator with the original Space Invaders ROM will make you more fulfilled.

Either with SDL2+C, or even Tcl/Tk, or Python with Tkinter.

accrual ・ 3 days ago

Very impressive model! The SVG pelican designed by GLM 4.5 in Simon's adjacent article is the most accurate I've seen yet.

4b11b4 ・ 3 days ago

Quick, someone knit a quilt with all the different SVG pelicans

bgwalter ・ 3 days ago

The GLM-4.5 model utterly fails at creating ASCII art or factorizing numbers. It can "write" Space Invaders because there are literally thousands of open source projects out there.

This is another example of LLMs being dumb copiers that do understand human prompts.

But there is one positive side to this: If this photocopying business can be run locally, the stocks of OpenAI etc. should go to zero.

simonw ・ 3 days ago

Why would you use an LLM to factorize numbers?
- bgwalter ・ 3 days ago
  
  Because we are told that they can solve IMO problems. Yet they fail at basic math problems, not only at factorization but also when probing them with relatively basic symbolic math that would not require the invocation of an external program.
  Also, you know, if they fail they could say so instead of giving a hallucinated answer. First the models lie and say that a 20 digit number takes vast amounts of computing. Then, if pointed to a factorization program they pretend to execute it and lie about the output.
  There is no intelligence or flexibility apart from stealing other people's open source code.
  
  simonw ・ 3 days ago
  
  That's why the IMO results were so notable: that was one of those moments where new models were demonstrated doing something that they had previously been unable to do.
  
  ducktective ・ 3 days ago
  
  I can't fathom why more people aren't talking about the IMO story. Apparently the model they used is not just an LLM but some RL is involved too. If a model wins gold at IMO, is it still merely a "statistical parrot"?
  
  sejje ・ 3 days ago
  
  Stochastic parrot is the term.
  I don't think it's ever been accurate.
  
  bgwalter ・ 3 days ago
  
  The results were private and the methodology was not revealed. Even Tao, who was bullish on "AI", is starting to question the process.
  
  simonw ・ 3 days ago
  
  The same thing has also been achieved by a Google DeepMind team and at least one group of independent researchers using publicly available models and careful prompting tricks.

another_one_112 ・ 3 days ago

Crazy to think that you can have a mostly-competent oracle even when disconnected from the grid.

dfedbeef ・ 2 days ago

Finally I can get a maybe correct answer about what to do about my snake bite from my laptop

polynomial ・ 3 days ago

At first I read this as "My 2.5 year old can write Space Invaders in JavaScript now"

matt3210 ・ 3 days ago

Is this more than ‘import space invaders; run_space_invaders()’?

__mharrison__ ・ 3 days ago

Time to get a new laptop. My MBP only has 16 gigs.

Looking forward to trying this with Aider.

dcchambers ・ 3 days ago

Amazing. There really is no secret sauce that the frontier models have.

msikora ・ 3 days ago

With 48GB MacBook Pro M3 I'm probably out of luck, right?

simonw ・ 3 days ago

For this particular model, yes.
This new one from Qwen should fit though - it looks like that only needs ~30GB of RAM: https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-Inst...
- omneity ・ 3 days ago
  
  It takes ~17-20GB on Q4 depending on context length & settings (running it as we speak)
  ~30GB in Q8 sure, but it's a minimal gain for double the VRAM usage.

joshstrange ・ 3 days ago

My next MBP is going to need the next size up SSD (RIP bank account) so it can hold all the models I want to play with locally and my data. Thankfully I already have been maxing out the RAM so that isn't something new I also have to do.

ygritte ・ 3 days ago

Can you host that model locally with ollama?

simonw ・ 3 days ago

I haven't seen a GGUF for it yet, I imagine one will show up on Hugging Face soon which will probably work with Ollama.
- pyman ・ 3 days ago
  
  Do you think local LLMs combined with P2P networks could become a thing? Imagine people adding datasets to an open model, the same way they add blocks to a blockchain, which is around 500GB in size.
  It could help decentralise power and reduce our dependency on the big players.
  
  simonw ・ 2 days ago
  
  There have been ambitions to do that kind of thing with LoRA - see the leaked "no moat" Google memo from a couple of years ago for one example: https://simonwillison.net/2023/May/4/no-moat/
  It hasn't really happened though. I suspect that's because it turns out techniques like RAG or tool calling are massively easier and more effective than trying to teach models new information through shared model weights.

asadm ・ 3 days ago

How good is this model with tool calling?

deadbabe ・ 3 days ago

You can overtrain a neural network to write a space invaders clone. The final weights might take up less disk space than the output code.

wslh ・ 3 days ago

Here's a sci-fi twist: suppose Space Invaders and similar early games were seeded by a future intelligence. (•_•)⌐■-■

bradly ・ 3 days ago

I appreciate you sharing both the chat log and the full source code. I would be interested to see a followup post on how adding moderately-sized features like High Score goes.

Also, IANAL but Space Invaders is owned IP. I have no idea the legality of a blog post describing steps to create and releasing an existing game, but I've seen headlines on HN of engs in trouble for things I would not expect to be problematic. Maybe Space Invaders is in q-tip/band-aid territory at this point?, but if this was Zelda instead of Space Invaders, I could see things being more dicey.

sowbug ・ 3 days ago

It doesn't infringe any kind of intellectual property.
This isn't copyright infringement; it isn't based on the original assembly code or artwork. A game concept can't be copyrighted. Even if one of SI's game mechanics were patented, it would have long expired. Trade secret doesn't apply in this situation.
That leaves trademark. No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.
- 9rx ・ 3 days ago
  
  > No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.
  There may be no reasonable confusion, but trademark holders also have to protect against dilution of their brand, if they want to retain their trademark. With use like this, people might come to think of Space Invaders as a generic term for all games of this type, not the brand of a specific game.
  (there is a strong case to be made that they already do, granted)
Joker_vD ・ 3 days ago

> Space Invaders is owned IP
So is Tetris. And I believe that Snake is also an owned IP although I could be wrong on this one.

jus3sixty ・ 3 days ago

I recently let go of my 2.5 year old vacuum. It was just collecting dust.

falcor84 ・ 3 days ago

Thinking about it, the measure of whether a vacuum is being sufficiently used is probably that the circulation of dust within it over the last year is greater than the circulation of dust on its external boundary over that time period.

croes ・ 3 days ago

I bet the training data included enough Space Invaders clones in JS

jplrssn ・ 3 days ago

I also wouldn't be surprised if labs were starting to mix in a few pelican SVGs into their training data.
- diggan ・ 3 days ago
  
  Even "accidentally" it makes sense that "SVGs of pelicans riding bikes" are now included in datasets used for training as it has spread like wildfire on the internet, making it less useful as a simple benchmark.
  This is why I keep all my benchmarks private and don't share anything about them publicly, as soon as you write about them anywhere publicly they'll stop being useful in some months.
  
  toyg ・ 3 days ago
  
  > This is why I keep all my benchmarks private
  This is also why, if I were an artist or anyone commercially relying on creative output of any kind, I wouldn't be posting anything on the internet anymore, ever. The minute you make anything public, the engines will clone it to death and turn it into a commodity.
  
  debugnik ・ 3 days ago
  
  That makes it so much harder to show art to people and market yourself though.
  I considered experimenting with web DRM for art sites/portfolios, on the assumption that scrapers won't bother with the analog loophole (and dedicated art-style cloners would hopefully be disappointed by the quality), but gave up because of limited compatible devices for the strongest DRM levels, and HDCP being broken on those levels anyway. If the DRM technique caught on it would take attackers, at most, a few bucks and hours once to bypass it, and I don't think users would truly understand that upfront.
  
  __mharrison__ ・ 3 days ago
  
  Somewhat defeats the purpose of being an artist, doesn't it?
  
  diggan ・ 2 days ago
  
  Worth noting, in case people weren't aware, but a bunch of people have different motivations for being artists, not everyone wants everything they do to be as widely shared as possible, some are happy playing/drawing/whatever for a small group of people, and not even making money off it.
  
  toyg ・ 3 days ago
  
  Defeating the purpose of creating almost anything, really.
  AI is definitely breaking the whole "labor for money" architecture of our world.
  
  zhengyi13 ・ 3 days ago
  
  Eeeehhhh.
  Maybe the thing to do is provide public, physical exhibits of your art in search of patronage.
- simonw ・ 3 days ago
  
  I'll believe they are doing that when one of the models draws me an SVG that actually looks like a pelican.
  
  __mharrison__ ・ 3 days ago
  
  Someone needs to craft a beautiful bike ridden by a pelican, throw in some seo, and see how long it takes a model to replicate it.
  Simon probably wouldn't be happy about killing his multi-year evaluation metric though...
  
  simonw ・ 3 days ago
  
  I would be delighted.
  My pelican on a bicycle benchmark is a long con. The goal is to finally get a good SVG of a pelican riding a bicycle, and if I can trick AI labs into investing significant effort in cheating on my benchmark then fine, that gets me my pelican!
- quantumHazer ・ 3 days ago
  
  SVG benchmarking is a thing since GPT-4, so probably all major labs are overfitting on some dataset of SVG images for sure
shermantanktop ・ 3 days ago

How about an SVG of 9.11 pelicans riding bicycles and counting the three Rs in “strawberry”?
gchamonlive ・ 3 days ago

Which would make this disappointing if it was only good at cloning space invaders. If it can reproduce all the clones it has ever seen it would still be an impressive feat.
I just think we should stop to appreciate exactly how awesome language models are. It's compressing and correctly reproducing a lot of data with meaningful context between each token and the rest of the context window. It's still amazing, especially with smaller models like this, because even if it's reproducing a clone, you can still ask questions about it and it should perform reasonably well explaining to you what it does and how you can take it over to further develop that clone.
- croes ・ 3 days ago
  
  But that would still be copy and paste with extra steps.
  Like all these vibe-coded to-do apps, one of the most used starting problems of programming courses.
  It’s great that an AI can do that but it could stall progress if we get limited to existing tools and programs.

vFunct ・ 3 days ago

please please apple give us a M5 MacBook Pro laptop with 2TB of unified memory please please

amelius ・ 3 days ago

Wake me up when I can apt-get install the llm.

Kurtz79 ・ 3 days ago

You can install ollama with a script fetched with curl and run a llm model with a grand total of two bash commands (including curl).

Strawberry76 ・ 3 days ago

[dead]

th0ma5 ・ 3 days ago

[flagged]

simonw ・ 3 days ago

Which bit of this post did you find condescending or infantilizing?

karenbass ・ 3 days ago

[dead]

pamelafox ・ 3 days ago

Alas, my 3 year old Mac has only 16 GB RAM, and can barely run a browser without running out of memory. It's a work-issued Mac, and we only get upgrades every 4/5 years. I must be content with 8B parameters models from Ollama (some of which are quite good, like llama3.1:8b).

dreamer7 ・ 3 days ago

I am able to run Gemma 3 12B on my M1 MBP 16GB. It is pretty good at logic and reasoning!
__mharrison__ ・ 3 days ago

Odd. My MBP has 16 GB and I routinely have 5 browser windows open. Most of them have 5-20 tabs. I'm also routinely running vi, vscode and editing videos with davinci resolve without issue.
My only memory issue that I can remember is an OBS memory leak, otherwise these MBPs are incredible hardware. I wish any other company could actually deliver a comparable machine.
- pamelafox ・ 3 days ago
  
  I was exaggerating slightly - I think it's some combo of the apps I use: Edge, Teams, Discord, VS Code, Docker. When I get the RAM popup once a week, I typically have to close a few of those, whichever is using the most memory according to Activity Monitor. I've also got very little hard drive space on my machine, about 15 GB free, so that makes it harder for me to download the larger models. I keep trying to clear space, even using CleanMyMac, but I somehow keep filling it up.
GaggiX ・ 3 days ago

Reasoning models like qwen3 are even better, and they have more options, for example you can choose the 14B model (at the usual Q4_K_M quantization) instead of the 8B model.
- pamelafox ・ 3 days ago
  
  Are they quantized more effectively than the non-reasoning models for some reason?
  
  GaggiX ・ 3 days ago
  
  There is no difference, you can choose a 6-bit quantization if you prefer, at that point it's essentially lossless.
e1gen-v ・ 3 days ago

Just download more ram!

larodi ・ 3 days ago

It is probably more correct to say - my 2.5 year old laptop can RETELL space invaders. Pretty sure it cannot write a game it has never seen, so you can even say - my old laptop can now do this fancy extraction of data from a smart probabilistic blob, where the original things are retold in new colours and forms :)

simonw ・ 3 days ago

I know these models can build games and apps they've never seen before because I've already observed them doing exactly that time and time again.
If you haven't seen that yourself yet I suggest firing up the free, no registration required GLM-4.5 Air on https://chat.z.ai/ and seeing if you can prove yourself wrong.
- larodi ・ 3 days ago
  
  I’m using all major models as a daily driver and none of these can create anything I have not spent excessive amount of time explaining.
  It works for me on the architectural level, but does not change the fact that you're expounding on prior information and not a new one.
  Not sure though why I’m getting downvoted.
  
  simonw ・ 2 days ago
  
  You said: "Pretty sure it cannot write a game it has never seen"
  Then you said: "none of these can create anything I have not spent excessive amount of time explaining"
  Presumably then they CAN create a game they have never seen if you spend an "excessive amount of time" explaining that game?
  So are you talking about game design here? It can implement anything, but it can't design entirely new games without human input?
- th0ma5 ・ 3 days ago
  
  [flagged]
  
  dang ・ 3 days ago
  
  Please see https://news.ycombinator.com/item?id=44726957 and please stop posting these.
uludag ・ 3 days ago

It's unfortunate that the ideas of things to test first are exactly the things more likely to be contained in training data. Hence why the pelican on a bicycle was such a good test, until it became viral.
oceanplexian ・ 3 days ago

So you're saying it works exactly the same way as humans, who copied Space Invaders from Breakout which came out in 1976.
MattRix ・ 3 days ago

No, that would be incorrect, nobody uses “retell” like that.
The impressive thing about these models is their ability to write working code, not their ability to come up with unique ideas. These LLMs actually can come up with unique ideas as well, though I think it’s more exciting that they can help people execute human ideas instead.