I see a big focus on computer use - you can tell they think there is a lot of value there and in truth it may be as big as coding if they convincingly pull it off.
However, I am still mystified by the safety aspect. They say the model has greatly improved resistance to prompt injection. But their own safety evaluation [1] says their automated adversarial system was able to one-shot a successful injection takeover 8% of the time, even with safeguards in place and extended thinking, and 50% (!!) of the time when given unbounded attempts. That seems wildly unacceptable - this tech is a non-starter unless I'm misunderstanding something.
[1] https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7...
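For a sense of scale, here's a back-of-envelope sketch (my own arithmetic, not the report's methodology) of how a per-attempt success rate compounds over repeated attempts, assuming each attempt succeeds independently at the one-shot rate:

```python
# Back-of-envelope (assumption: attempts are independent, which the
# report does not claim): if each injection attempt succeeds with
# probability p, the chance of at least one takeover in n attempts
# is 1 - (1 - p)^n.
p = 0.08  # reported one-shot attack success rate

for n in (1, 5, 10, 20, 50):
    print(f"{n:>3} attempts -> {1 - (1 - p) ** n:.0%} chance of at least one takeover")
```

Under independence this would climb past 50% within about 10 attempts and near-certainty by 50, so the fact that the reported rate plateaus at 50% with unbounded attempts suggests some targets resist injection entirely. Either way, the relevant number for a persistent attacker is the unbounded-attempts figure, not the one-shot one.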
Isn't "computer use" just interaction with a shell-like environment, which is routine for current agents?
Does it matter? Really?
I can type awful stuff into a word processor. That's my fault, not the program's.
So if I can trick an LLM into saying awful stuff, whose fault is that? It is also just a tool...