Show HN: OctopusGarden – An autonomous software factory (specs in, code out)

I built this over the weekend after reading about StrongDM's software factory (their writeup: https://factory.strongdm.ai/, Simon Willison's deep dive: https://simonwillison.net/2026/Feb/7/software-factory/, Dan Shapiro's Five Levels: https://www.danshapiro.com/blog/2026/01/the-five-levels-from...). OctopusGarden is an open-source implementation of the pattern StrongDM described: holdout scenarios, probabilistic satisfaction scoring via LLM-as-judge, and a convergence loop that iterates until the code works; no human code review in the loop.

What stood out to me is how closely this architecture rhymes with the workflows I and many others already use with coding agents. It basically automates the connective tissue between the steps I was already doing by hand in Claude Code, then brute-forces a result. In the dark factory model, a spec goes in, code gets generated and built in Docker, validated against scenarios the agent never saw, scored, and failures feed back until it converges.
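
The loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in, not OctopusGarden's actual API: the function names, the 0.95 threshold, and the fake scorer (which just satisfies one more scenario per revision so the loop terminates) are all illustrative assumptions.

```python
THRESHOLD = 0.95   # assumed satisfaction score required to ship
MAX_ITERS = 10

def generate_code(spec, feedback):
    """Stand-in for the LLM codegen step; a real factory would call a model
    and build the result in Docker."""
    return {"spec": spec, "revision": len(feedback)}

def score_scenarios(candidate, scenarios):
    """Stand-in for LLM-as-judge scoring against holdout scenarios.
    Fakes improvement: each revision satisfies one more scenario."""
    passing = min(candidate["revision"] + 1, len(scenarios))
    return [1.0] * passing + [0.0] * (len(scenarios) - passing)

def converge(spec, scenarios):
    feedback = []
    for _ in range(MAX_ITERS):
        candidate = generate_code(spec, feedback)
        scores = score_scenarios(candidate, scenarios)
        mean = sum(scores) / len(scores)
        if mean >= THRESHOLD:
            return candidate, mean   # converged: ship without human review
        # feed the failing scenarios back into the next generation pass
        feedback.append([s for s, ok in zip(scenarios, scores) if not ok])
    raise RuntimeError("did not converge within MAX_ITERS")

candidate, score = converge("todo-api spec", ["create", "read", "update", "delete"])
print(score)  # 1.0 once every holdout scenario is satisfied
```

The key property is that the generator never sees the holdout scenarios themselves, only which ones failed, which is what keeps the judge honest.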

I've tried it mostly with standard CRUD/REST API apps, and it works. I haven't tried anything with HTML/JS yet. You can try the sample specs in the repo.

Some raw notes from the experience:

1. I don't want to maintain the code these factories generate. It works. The phenotype is (largely) correct, but the genotype is pretty wild and messy. I did not use OctopusGarden to build OctopusGarden (you can tell because it uses strict linting and tests). I know the point of these systems is zero human in the loop, but I think there's a real opportunity to get factories to generate code that humans actually want to maintain. I'm going to work on getting OctopusGarden there.

2. Compliance might be a nightmare. In my day job I think a lot about ISO 27001 and SOC 2 compliance. The idea of deploying dark-factory-generated projects into my environments and checking compliance boxes sounds painful. That might just be the current state of OctopusGarden and the code it generates, but I think we can get to a point where generated code is completely linted, statically checked, and tested inside the factory. That's not OctopusGarden today, but maybe it will be there next week? I can see this moving fast.

3. These dark factory apps will be hard to debug. There was a Claude outage today and I couldn't run my smoke tests or generate new apps. I don't want to maintain services that can't be debugged and fixed by a human in a pinch. We're already partially there with AI-assisted code, but factory-generated code is even more convoluted. Requiring AI to create a new app version is probably worth it, but it's still yet another thing between you and quickly patching an urgent bug.

4. Security needs a better story. These things need real security hardening. Maybe that's just better spec files and scenarios, maybe it's something more. I'm going to drink a strong cola and think about this one.

5. The unit of responsibility keeps growing. Last year we said code must come in PR-sized bites; that's how we manage risk. Now we're talking about deploying meshes of services created and deployed with no humans in the loop (except at creation). AI-generated services could really push the scale of what people are willing to accept responsibility for. Most SRE teams at big companies manage 1-5 services. Will that number increase per team? How much GDP is one person willing to manage via agents? Just a shower thought.

6. I was surprised this works. I'm surprised at how easy it was to make. I'm surprised more of these aren't out there already. I only did a couple of GitHub searches and didn't find many. I'm bad at searching. Sorry if I didn't find your project.

github.com | 6 points | by foundatron | 6 hours ago | 5 comments

jlongo78 2 hours ago

The hardest part of autonomous pipelines like this is observability when things go sideways. Specs rarely survive first contact with real codebases intact. Worth investing heavily in session persistence and replay so you can audit exactly what the agent reasoned at each step. Being able to resume a failed run mid-conversation rather than starting cold saves enormous time. Multi-agent parallelism also compounds fast, so a grid view across simultaneous runs becomes essential pretty quickly.

deltaops 6 hours ago

This is exactly the problem we're tackling! We built DeltaOps (delta-ops-mvp.vercel.app) - human-in-the-loop governance for autonomous agents. You hit the nail on the head with "no human in the loop" - that's the gap. DeltaOps adds a layer where agents can work autonomously, but critical actions (deploys, code merges, spending) require human approval. Also addresses your compliance concerns - every action is logged and approved. Would love to chat about integrating governance into dark factories!

  • foundatron 6 hours ago

    Cool site, good idea. Maybe I'm underestimating it (I probably am), but I don't think it's a huge leap from what I published today to the compliance-aware vision you're tackling.

guerython 6 hours ago

Curious how you are handling those guard logs and approvals in OctopusGarden?

  • foundatron 6 hours ago

    Right now OctopusGarden logs every LLM call with token counts and cost, and the SQLite store records each run and iteration (spec hash, scores per scenario, generated code). So you get a full trace of what was generated, what it was tested against, and how it scored.

    For approvals, the current model is that the spec is the approval. If the spec is right and scenarios pass at 95%+ satisfaction, the code ships. There's no PR review step by design (the "code is opaque weights" philosophy).
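
    To make the "spec is the approval" gate concrete, here is a minimal sketch of that kind of audit trail: a SQLite table of per-scenario scores per iteration, plus a ship check on the latest iteration's mean. The schema and the sample scores are my own illustration, not OctopusGarden's actual store.

```python
import sqlite3

# In-memory store recording each iteration's per-scenario judge scores.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE iterations (
    run_id TEXT, iteration INTEGER, scenario TEXT, score REAL)""")

rows = [
    ("run-1", 1, "create", 0.80), ("run-1", 1, "read", 0.60),
    ("run-1", 2, "create", 1.00), ("run-1", 2, "read", 0.95),
]
db.executemany("INSERT INTO iterations VALUES (?, ?, ?, ?)", rows)

# Spec-is-the-approval gate: ship iff the latest iteration's mean
# satisfaction clears the 0.95 threshold.
(mean,) = db.execute("""
    SELECT AVG(score) FROM iterations
    WHERE run_id = ? AND iteration = (
        SELECT MAX(iteration) FROM iterations WHERE run_id = ?)
""", ("run-1", "run-1")).fetchone()

print("ship" if mean >= 0.95 else "iterate")  # mean of 1.00 and 0.95 -> ship
```

    A human checkpoint would simply be another predicate ANDed into that ship condition, which is why layering approvals on top is cheap.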

    That said, you could totally layer approvals on top. Gate on spec changes, require sign-off before a run kicks off, or add a human checkpoint between "converged" and "deployed." The tool doesn't enforce a deployment pipeline, so that's up to your org's workflow.

    Worth noting: this is purely a hobby project at this point. It hasn't been used in any commercial setting. The guard rails and approval workflow stuff is where it would need the most work before anyone used it for real.