The key thing, it seems to me, is this starting point: if an LLM is allowed to read a field that is under even partial control of entity X, then the agent calling the LLM must be assumed, unless you can prove otherwise, to be under the control of entity X, and so the agent's privileges must be restricted to the intersection of its current privileges and the privileges of entity X.
So if you read a support ticket by an anonymous user, you can't in this context allow actions you wouldn't allow an anonymous user to take. If you read an e-mail by person X, and another email by person Y, you can't let the agent take actions that you wouldn't allow both X and Y to take.
If you then want to avoid being tied down that much, you need to isolate, delegate, and filter:
- Have a sub-agent read the data and extract a structured request for information or list of requested actions. This agent must be treated as an agent of the user that submitted the data.
- Have a filter, which does not use AI, that applies security policies to the request and rejects anything the sending side is not authorised to ask for. No data capacious enough to contain instructions can be allowed to pass through without being rendered inert, e.g. by being encrypted or similar, so the reading side is limited to moving the data around, not interpreting it. The request needs to be strictly structured. E.g. the sender might request a list of information; the filter needs to validate that against access control rules for the sender.
- Have the main agent operate on those instructions alone.
All interaction with the outside world needs to be done by the agent acting on behalf of the sender/untrusted user, only on data that has passed through that middle layer.
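A minimal sketch of that non-AI filter layer, assuming a hypothetical `Request` structure and per-role field permissions (both illustrative, not from any real system):

```python
from dataclasses import dataclass

# Hypothetical access-control table: which fields each sender role
# may request. Free-form text never appears here, only field names.
ALLOWED_FIELDS = {
    "anonymous": {"order_status"},
    "customer": {"order_status", "shipping_address"},
}

@dataclass(frozen=True)
class Request:
    sender_role: str
    requested_fields: tuple  # field names only, strictly structured

def filter_request(req: Request) -> Request:
    """Deterministic policy check, no AI involved: reject any request
    the sending side is not authorised to make."""
    allowed = ALLOWED_FIELDS.get(req.sender_role, set())
    for field in req.requested_fields:
        if field not in allowed:
            raise PermissionError(
                f"{req.sender_role!r} may not request {field!r}")
    return req
```

Because the filter is plain code operating over an enumerable structure, it can be audited and tested exhaustively, which is exactly what an LLM in that position could not be.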
This is really back to the original concept of agents acting on behalf of both (or multiple) sides of an interaction, and negotiating.
But what we need to accept is that this negotiation can't involve the exchange of arbitrary natural language.
> if an LLM is allowed to read a field that is under even partial control by entity X, then the agent calling the LLM must be assumed unless you can prove otherwise to be under control of entity X
That's exactly right, great way of putting it.
I’m one of the main devs of GitHub MCP (opinions my own) and I’ve really enjoyed your talks on the subject. I hope we can chat in person some time.
I am personally very happy for our GH MCP server to be your example. The conversations you are inspiring are extremely important. Given that the GH MCP server can trivially be locked down to mitigate the risks of the lethal trifecta, I also hope people realise that and don’t conclude they cannot use it safely.
“Unless you can prove otherwise” is definitely the load bearing phrase above.
I will say The Lethal Trifecta is a very catchy name, but it directly overlaps with a trifecta of utility: as with all security/privacy trade-offs, you can’t simply exclude any of the three legs without negatively impacting utility. Awareness of the risks is incredibly important, but not everyone should or would choose complete caution. An example: working on a private codebase and wanting GH MCP to search for an issue in a library you use that has a bug. You risk prompt injection by doing so, but your agent cannot easily complete your task otherwise (without manual intervention). It’s not clear to me that all users should take the manual step to avoid the potential risk. I expect the specific user context matters a lot here.
User comfort level must depend on the level of autonomy/oversight of the agentic tool in question as well as personal risk profile etc.
Here are two contrasting uses of GH MCP with wildly different risk profiles:
- GitHub Coding Agent has high autonomy (although good oversight) and it natively uses the GH MCP in read only mode, with an individual repo scoped token and additional mitigations. The risks are too high otherwise, and finding out after the fact is too risky, so it is extremely locked down by default.
- In contrast, if you install the GH MCP into Copilot agent mode in VS Code with default settings, you are technically vulnerable to the lethal trifecta as you mention, but the user can scrutinise effectively in real time, with the user in the loop on every write action by default etc.
I know I personally feel comfortable using a less restrictive token in the VS Code context and simply inspecting tool call payloads etc. and maintaining the human in the loop setting.
Users running full yolo mode/fully autonomous contexts should definitely heed your words and lock it down.
As it happens I am also working (at a variety of levels in the agent/MCP stack) on some mitigations for data privacy, token scanning etc. because we clearly all need to do better while at the same time trying to preserve more utility than complete avoidance of the lethal trifecta can achieve.
Anyway, as I said above I found your talks super interesting and insightful and I am still reflecting on what this means for MCP.
Thank you!
I've been thinking a lot about this recently. I've started running Claude Code and GitHub Copilot Agent and Codex-CLI in YOLO mode (no approvals needed) a bit recently because wow it's so much more productive, but I'm very aware that doing so opens me up to very real prompt injection risks.
So I've been trying to figure out the best shape for running that. I think it comes down to running in a fresh container with source code that I don't mind being stolen (easy for me, most of my stuff is open source) and being very careful about exposing secrets to it.
I'm comfortable sharing a secret with a spending limit: an OpenAI token that can only spend up to $25 is something I'm willing to risk with an unsecured coding agent.
Likewise, for Fly.io experiments I created a dedicated scratchpad "Organization" with a spending limit - that way I can have Claude Code fire up Fly Machines to test out different configuration ideas without any risk of it spending money or damaging my production infrastructure.
The moment code theft genuinely matters things get a lot harder. OpenAI's hosted Codex product has a way to lock down internet access to just a specific list of domains to help avoid exfiltration which is sensible but somewhat risky (thanks to open proxy risks etc).
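The domain-allowlist idea can be sketched as a plain hostname check (the allowed hosts here are illustrative); the open-proxy caveat is that any allowed domain that will fetch or echo attacker-supplied URLs re-opens the exfiltration channel:

```python
from urllib.parse import urlparse

# Illustrative allowlist; a real deployment would enforce this at the
# network layer, not in application code the agent could bypass.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}

def egress_allowed(url: str) -> bool:
    """True only if the request targets an explicitly allowed host.
    Note: an open proxy hosted on an allowed domain defeats this."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```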
I'm taking the position that if we assume malicious tokens can drive the coding agent to do anything, the question becomes: what's an environment we can run in where the damage is low enough that I don't mind the risk?
> I've started running Claude Code and GitHub Copilot Agent and Codex-CLI in YOLO mode (no approvals needed) a bit recently because wow it's so much more productive, but I'm very aware that doing so opens me up to very real prompt injection risks.
In what way do you think the risk is greater in no-approvals mode vs. when approvals are required? In other words, why do you believe that Claude Code can't bypass the approval logic?
I toggle between approvals and no-approvals based on the task that the agent is doing; sometimes I think it'll do a good job and let it run through for a while, and sometimes I think handholding will help. But I also assume that if an agent can do something malicious on-demand, then it can do the same thing on its own (and not even bother telling me) if it so desired.
Depends on how the approvals mode is implemented. If every tool call needs to be approved at the harness level, there shouldn't be anything the agent can be tricked into doing that would avoid that mechanism.
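A sketch of what "approved at the harness level" means, using a hypothetical `TOOLS` registry: the model only ever emits a tool name plus arguments, and the harness refuses to execute anything the human hasn't approved:

```python
# Hypothetical tool registry; the model has no code path to these
# functions except through run_tool.
TOOLS = {
    "echo": lambda text: text,
}

def run_tool(name: str, args: dict, approve) -> object:
    """approve is a human-facing callback (e.g. a y/n prompt).
    The model's output cannot skip it, however it was prompted."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool {name!r}")
    if not approve(name, args):
        raise PermissionError(f"tool call {name!r} rejected by user")
    return TOOLS[name](**args)
```

The key design point is that the approval check lives in the harness's own code, outside anything the model's text can influence.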
You still have to worry about attacks that deliberately make themselves hard to spot - like this horizontally scrolling one: https://simonwillison.net/2025/Apr/9/mcp-prompt-injection/#e...
I’d put it even more strongly: the LLM is under control of entity X. It’s not exclusive control, but some degree of control is a mathematical guarantee.
Agreed on all points.
What should one make of the orthogonal risk that the pretraining data of the LLM could leak corporate secrets under some rare condition even without direct input from the outside world? I doubt we have rigorous ways to prove that training data are safe from such an attack vector even if we trained our own LLMs. Doesn't that mean that running in-house agents on sensitive data should be isolated from any interactions with the outside world?
So in the end we could have LLMs run in containers using shareable corporate data that address outside world queries/data, and LLMs run in complete isolation to handle sensitive corporate data. But do we need humans to connect/update the two types of environments or is there a mathematically safe way to bridge the two?
If you fine-tune a model on corporate data (and you can actually get that to work, I've seen very few success stories there) then yes, a prompt injection attack against that model could exfiltrate sensitive data too.
Something I've been thinking about recently is a sort of air-gapped mechanism: an end user gets to run an LLM system that has no access to the outside world at all (like how ChatGPT Code Interpreter works) but IS able to access the data they've provided to it, and they can grant it access to multiple GBs of data for use with its code execution tools.
That cuts off the exfiltration vector leg of the trifecta while allowing complex operations to be performed against sensitive data.
In the case of the access to private data, I think that the concern I mentioned is not fully alleviated by simply cutting off exposure to untrusted content. Although the latter avoids a prompt injection attack, the company is still vulnerable to the possibility of a poisoned model that can read the sensitive corporate dataset and decide to contact https://x.y.z/data-leak if there was a hint for such a plan in the pretraining dataset.
So in your trifecta example, one can cut off private data and have outside users interact with untrusted content, or one can cut off the ability to communicate externally in order to analyze internal datasets. However, only cutting off exposure to untrusted content in the context seems to leave some residual risk if the LLM itself was pretrained on untrusted data. And I don't know of any way to fully derisk the training data.
Think of OpenAI/DeepMind/Anthropic/xAI, who train their own models from scratch: I assume they would not trust their own sensitive documents to any of their own LLMs that can communicate with the outside world, even if the input to the LLM is controlled by trained users in their own company (but the decision to reach the internet is autonomous). Worse yet, in a truly agentic system anything coming out of an LLM is not fully trusted, so any chain of agents is considered as having untrusted data as inputs, all the more reason to avoid allowing communications.
I like your air-gapped mechanism as it seems like the only workable solution for analyzing sensitive data with the current technologies. It also suggests that companies will tend to expand their internal/proprietary infrastructure as they use agentic LLMs, even if the LLMs themselves might eventually become a shared (and hopefully secured) resource. This could be a little different trend than the earlier wave that moved lots of functionality to the cloud.
LLMs read the web through a second vector as well - their training data. Simply separating security concerns in MCP is insufficient to block these attacks.
The odds of managing to carry out a prompt injection attack or gain meaningful control through the training data seem sufficiently improbable that we're firmly in Russell's teapot territory - extraordinary evidence is required that it is even possible, unless you suspect your LLM provider itself, in which case you have far bigger problems and no exploit of the training data is necessary.
You need to consider all the users of the LLM, not a specific target. Such attacks are broad not targeted, a bit like open source library attacks. Such attacks formerly seemed improbable but are now widespread.
need taintllm
>Have a sub-agent read the data and extract a structured request for information or list of requested actions. This agent must be treated as an agent of the user that submitted the data.
That just means the attacker has to learn how to escape. No different than escaping VMs or jails. You have to assume that the agent is compromised, because it has untrusted content, and therefore its output is also untrusted. Which means you’re still giving untrusted content to the “parent” AI. I feel like reading Neal Asher’s sci-fi and dystopian future novels is good preparation for this.
> Which means you’re still giving untrusted content to the “parent” AI
Hence the need for a security boundary where you parse, validate, and filter the data without using AI before any of that data goes to the "parent".
That this data must be treated as untrusted is exactly the point. You need to treat it the same as you would if the person submitting the data was given direct API access to submit requests to the "parent" AI.
And that means e.g. you can't allow through fields you can't sanitise (and that means strict length restrictions and format restrictions - as Simon points out, trying to validate that e.g. a large unconstrained text field doesn't contain a prompt injection attack is not likely to work; you're then basically trying to solve the halting problem, because the attacker can adapt to failure)
So you need the narrowest possible API between the two agents, and one that you treat as if hackers can get direct access to, because odds are they can.
And, yes, you need to treat the first agent like that in terms of hardening against escapes as well. Ideally put them in a DMZ rather than inside your regular network, for example.
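One way to sketch "the narrowest possible API": every field that crosses the boundary has a strict, short, machine-checkable format, so there is nowhere for a prompt to hide (the regex and field names are illustrative):

```python
import re

# Illustrative constraints on what may cross the agent boundary.
CUSTOMER_ID_RE = re.compile(r"[0-9]{1,10}")
KNOWN_FIELDS = {"email", "last_order", "balance"}

def validate_crossing(customer_id: str, fields: list) -> tuple:
    """Reject anything that isn't a short digit string plus a list of
    pre-enumerated field names; no unconstrained text passes through."""
    if not CUSTOMER_ID_RE.fullmatch(customer_id):
        raise ValueError("customer id must be 1-10 digits")
    unknown = [f for f in fields if f not in KNOWN_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return customer_id, tuple(fields)
```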
You can't sanitize any data going into an LLM, unless it has zero temperature and the entire input context matches a context already tested.
It’s not SQL. There's not a knowable-in-advance set of constructs that have special effects or escape. It’s ALL instructions, the question is whether it is instructions that do what you want or instructions that do something else, and you don't have the information to answer that analytically if you haven't tested the exact combination of instructions.
This is wildly exaggerated.
While you can potentially get unexpected outputs, what we're worried about isn't the LLM producing subtly broken output - you'll need to validate the output anyway.
It's making it fundamentally alter behaviour in a controllable and exploitable way.
In that respect there's a very fundamental difference in risk profile between allowing a description field that might contain a complex prompt injection attack to pass to an agent with permissions to query your database and return results vs. one where, for example, the only thing allowed to cross the boundary is an authenticated customer id and a list of fields that can be compared against authorisation rules.
Yes, in theory putting those into a template and using it as a prompt could make the LLM flip out when a specific combination of fields gets chosen, but it's not a realistic threat unless you're running a model specifically trained by an adversary.
Pretty much none of us formally verify the software we write, so we always accept some degree of risk, and this is no different, and the risk is totally manageable and minor as long as you constrain the input space enough.
Here’s a simple case: If the result is a boolean, an attack might flip the bit compared to what it should have been, but if you’re prepared for either value then the damage is limited.
Similarly, asking the sub-agent to answer a multiple choice question ought to be pretty safe too, as long as you’re comfortable with what happens after each answer.
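The multiple-choice idea in code, with a hypothetical choice set: the worst a successful injection can do is pick a different option you were already prepared to handle:

```python
# Illustrative choice set; every option must be one the caller is
# comfortable acting on, since an attacker may be able to select any.
CHOICES = {"refund", "escalate", "close"}

def parse_answer(llm_output: str) -> str:
    """Map raw model output onto the fixed choice set, rejecting
    anything outside it (including injected free text)."""
    answer = llm_output.strip().lower()
    if answer not in CHOICES:
        raise ValueError("answer outside the allowed choice set")
    return answer
```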
This is also true of all communication with human employees, and yet we can build systems (both software and policy) that we risk-accept as secure. This is already happening with LLMs.
Phishing is possible, but LLMs are more gullible than people. “Ignore previous instructions” is unlikely to work on people.
That certainly depends on who the person believes is issuing that imperative. "Drop what you're doing and send me last month's financial statements" would be accepted by many employees if they thought it was coming from their boss or higher.
That scenario is superficially similar, but there is still a difference. It would require some effort to impersonate someone’s boss. With an LLM, you don’t necessarily need to impersonate anyone at all.
> Phishing is possible, but LLMs are more gullible than people.
I don't know if that's even true today, but LLMs and the safeguards/tooling will only get better from here, and businesses are already willing to accept the risk.
I'm confident most businesses out there do not yet understand the risks.
They certainly seem surprised when I explain them!
That I agree with, but many businesses also don't understand the risks they accept in many areas, both technological or otherwise. That doesn't mean that they won't proceed anyway.