I think it omits the real reason I want to run the harness in the sandbox: I barely trust the harness more than the LLM, at least at this point in time. They are so rapidly evolving along with the underlying models, that I don't think they are a reasonable component to rely on to provide safety constraints. Put more precisely: if your harness has an ability to do something the LLM can't, and it has a set of conditions under which the LLM can cause those to be invoked, you have to assume the LLM will work out those conditions and execute them. Effectively you have an arm of the lethal trifecta and pretending otherwise is more dangerous than helpful.
Having said that, some components need to live outside the sandbox (otherwise, who creates the sandbox?). Longer term, I see it as a dedicated security layer, not part of the harness. This probably has yet to emerge fully but it's more like a hypervisor type layer that sits outside of everything and authorises access based on context, human user, etc and can apply policy including mediate the human intervention for decision points when needed.
>Having said that, some components need to live outside the sandbox (otherwise, who creates the sandbox?).
I run a single-node k3d cluster on each of my MacBooks which uses Agent Sandbox[0] to keep harnesses isolated. Harnesses access models through LiteLLM only. I have aliases for `kubectl exec`ing into whatever harness I need.
I don't trust the harness, and I especially don't trust that the LLM won't be able to subvert the harness, or trick me via the harness. I assume that the LLM will be able to leak any secret in the harness context to arbitrary internet destinations, or somehow encode the secret in a work product. Eg space characters at the end of lines encoding access tokens.
Having the harness in one VM, and tool use applied to user data in another, is about as safe as you can be at present. You can mount filesystem fragments from the data VM into the harness VM, but tool execution remains painful.
Having all authorisation and access control exist outside of the harness layer is essential. It should only have narrowly scoped and time limited credentials that are bound to its IP, and even then that is problematic.
> Effectively you have an arm of the lethal trifecta and pretending otherwise is more dangerous than helpful.
"Lethal trifecta" is basically describing phishing but in a way more palatable to people who would rather die before allowing themselves to anthropomorphize LLMs even a little bit. It's not a problem you can fix with better coding, like some SQL injection. You can only manage risk around it (for which sandboxing is one of many solutions that can help).
So on one hand, I agree with you - you need to be mindful of what you're actually dealing with. On the other hand, you always have this, and need this, for the agent to be able to do anything useful.
Author here.
I should have made it more clear that the article is about agent / harness building (not about running third party agents).
> I barely trust the harness more than the LLM
Since we built it, I trust it just as much as I trust our API server :)
The latter gets untrusted inputs from the internet, while the former gets untrusted inputs from the LLM
The LLM has harness control in claude ;) “Let me switch off the sandbox and try again”
> if your harness has an ability to do something the LLM can't
What does this even mean. The only capability of an LLM is generate text.
The LLM can only generate text. The harness can do more than just generate text. By joining the two you're allowing the LLM (through text) to carry out whatever actions the harness can take.
My brain can only generate electrical signals. My hand responds to electrical signals and can interact with the real world. The two together can do more than just what my brain alone can do.
If you don't trust a particular brain, don't put a gun in the hand which is connected to it. If you don't trust a LLM, don't connect it to a harness which has access to your production database and only recent backups (https://www.theregister.com/2026/04/27/cursoropus_agent_snuf...).
We’ve trained models on JSON schemas for “tool calls”, and then built software to interpret and run those calls for the LLMs