One of the things that makes it very difficult to have reasonable conversations about what you can do with LLMs is that the effort-to-outcome curve is basically exponential - with almost no effort, you can get 70% of the way there. This looks amazing, so people (mostly executives) look at it and think, “this changes everything!”
The problem is the remaining 30% - the next 10-20% starts to require things like multi-agent judge setups, external memory, and context management, and that gets you to something that’s probably working but that you sure shouldn’t ship to production. As for the last 10% - I’ve seen agentic workflows with hundreds of different agents, multiple models, and fantastically complex evaluation frameworks trying to push the error rate below the ~10% mark. At a certain point the infrastructure and LLM calls are running into several hundred dollars per run, and you’re still not getting guaranteed reliable output.
If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.
Yes, it's essentially the Pareto principle [0]. The LLM community has treated the 80% as difficult, complicated work, when it was essentially boilerplate. Allegedly LLMs have saved us from that drudgery, but I personally have found that (without the complicated setups you mention) the 80%-done project you get from a one-shot prompt is in reality more like 50% done, because it is built on an unstable foundation, and that final 20% involves a lot of complicated reworking of the code. There's still plenty of value, but I think it is less than proponents would want you to believe.
Anecdotally, I have found that even if you type out paragraph after paragraph describing everything you need the agent to take care of, by the time you can finally send the prompt off it feels like you could have written a lot of the code yourself with the help of a good IDE.
Yeah, my mental model at this point is that there are two components to building a system: writing the code and understanding the system. When you’re the one writing the code, you get the understanding at the same time. When you’re not, you still need to put in that work to deeply grok the system. You can do it ahead of time while writing the prompts, you can do it while reviewing the code, you can do it while writing the test suite, or you can do it when the system is on fire during an outage, but the work to understand the system can’t be outsourced to the LLM.
This can't really be the full story, or else people would have already come up with the "first-line developer", analogous to first-line support. Some dumbass or executive creates that first 70 or 80%, then hands the whole thing off to a professional developer to keep working on it.
The AI people sure don't want that; that's too telling about its limitations and value.
Except in the past, I'd perhaps have had to hire a junior engineer to do that 80%. Now I don't need to do that.
> If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.
I imagine using their embeddings and training a classifier on top of that is probably a lot more effective?
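Something like this rough sketch is what I have in mind, assuming sentence-transformers and scikit-learn (the example texts and labels are made up):

```python
# Embed the texts once, then train a cheap classifier on top of the embeddings,
# instead of asking an LLM to categorize each input at inference time.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = ["cancel my subscription", "how do I reset my password", "I was double charged"]
labels = ["billing", "account", "billing"]

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, fast embedding model
X = model.encode(texts)                           # one vector per text

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(model.encode(["why was I billed twice?"])))  # -> ['billing']
```

Once it's trained, inference is deterministic, fast, and costs basically nothing per call.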
I've personally found agentic LLM workflows most effective as extremely sophisticated autocomplete. Instead of autocompleting the next few tokens, I tell it precisely how to edit my code at a high level. You can't hand it a whole feature, but telling it how to implement the feature saves me a ton of time.
> I imagine using their embeddings and training a classifier on top of that is probably a lot more effective?
I’d be interested in seeing this in action - I think vector embeddings are underused generally - but my understanding is that’d be for something closer to sentiment analysis? In this case I’m talking about a setup where you’ve got an LLM agent with a set of tools, interpreting a user’s request to identify which of those tools are the right ones to use. The requests can be complex and involve multiple tool runs or chaining. If that’s doable by more deterministic mechanisms, I’d (genuinely) love to hear about it.
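To be concrete, this is roughly the shape of setup I mean (call_llm and the tools are hypothetical placeholders, not any particular library):

```python
import json

# Hypothetical tool registry; the real ones do actual work.
TOOLS = {
    "search_orders": "Look up a customer's orders by email.",
    "issue_refund":  "Refund a specific order by order id.",
    "send_email":    "Send a templated email to the customer.",
}

def plan(request: str) -> list[dict]:
    """Ask the model to map a free-form request onto a sequence of tool calls."""
    prompt = (
        "Pick the tools needed for this request and return JSON like "
        '[{"tool": "...", "args": {...}}]. Available tools:\n'
        + "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
        + f"\n\nRequest: {request}"
    )
    # call_llm is a stand-in for your model client; validate the plan before executing anything
    return json.loads(call_llm(prompt))

# plan("jane@example.com says she was double charged, please fix it")
# might come back as: search_orders -> issue_refund -> send_email
```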
If you're only working on one problem that's very valuable to solve, then taking the time to train a classifier is great.
The beauty of LLMs is that you can run a ton of experiments, notebooks, demos, etc. because you can write classifiers and structure unstructured data so fast, in a reasonably accurate way (at the moment it seems roughly in line with, say, hiring an intern to label things).
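For what it's worth, the "intern-grade" version for a quick experiment can be a few lines (call_llm is a placeholder for whatever client you use; the categories and tickets are invented):

```python
def label(ticket: str) -> str:
    # Zero-shot labeling: good enough to bootstrap a dataset for a demo,
    # or to train a proper classifier on later.
    prompt = (
        "Classify this support ticket as exactly one of: "
        "billing, bug, feature_request, other.\n\n" + ticket
    )
    return call_llm(prompt).strip().lower()

tickets = ["app crashes on login", "please add dark mode", "charged twice this month"]
labeled = [(t, label(t)) for t in tickets]
```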
I now think the key is to avoid long-running conversations. If the piece didn’t work out by the time you hit 200k context on Claude, you’re going to start over. Take whatever wins you learned from the first stab and give those insights to the model on round two, but throw the code out.
Maybe Claude’s long-running agent should just be hunting for any wins during the first 200k, chucking everything else away, and seeing whether those wins change the initial goal.
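Mechanically that loop is something like this (run_agent, extract_insights, and task are placeholders for whatever harness you're using):

```python
insights: list[str] = []
for attempt in range(3):
    result = run_agent(task, notes=insights)   # fresh context every round
    if result.passes_tests:
        break
    # keep the lessons (failed approaches, gotchas), throw the code away
    insights.extend(extract_insights(result.transcript))
```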
Just to get a frame of reference, how many people were involved, over how much time, in building a workflow with hundreds of agents?
I’ve seen a couple of solo efforts and a couple of teams, but usually a few months either way. It tends to evolve as a kind of whack-a-mole situation - “we solved that failure case/hallucination, now we’re getting this one.”
My sense is that these are organizations that probably recreated, with some minor details changed, the same problems they already had. Under-planned and over-engineered in the attempt to fix them, and if things ever work, it's more from some awful metastable chaos than anything else.