Ask HN: How are you preventing runaway LLM workflows in production?

We’ve been pushing LLM-backed workflows into production and are starting to hit reliability edge cases that observability alone doesn’t solve.

Things like:

- loops that don’t terminate cleanly

- retries cascading across tool calls

- cost creeping up inside a single workflow

- agents making technically “allowed” but undesirable calls

Monitoring isn’t the problem. We can see what’s happening. The harder part is deciding where the enforcement boundary actually lives.

Right now, most of our shutdown paths still feel manual: feature flags, revoking keys, upstream rate limiting, and so on.
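For reference, the feature-flag variant of those manual shutdown paths amounts to checking a flag before every tool call. A minimal sketch (all names here are hypothetical, not from any specific flag service):

```python
# Minimal sketch of a feature-flag kill switch: the enforcement unit
# is the individual tool call, so flipping the flag stops a workflow
# at its next step. FLAGS stands in for a real flag service.

class WorkflowKilled(Exception):
    pass

FLAGS = {"agent-workflows-enabled": True}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def guarded_tool_call(tool, *args, **kwargs):
    # Check the flag on every call, not once at workflow start,
    # so a running workflow can still be shut down mid-flight.
    if not is_enabled("agent-workflows-enabled"):
        raise WorkflowKilled("kill switch flipped mid-workflow")
    return tool(*args, **kwargs)
```

The catch, as noted above, is that someone still has to flip the flag by hand.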

Curious how others are handling these problems in practice:

- What’s your enforcement unit? Tool call, workflow, container, something else?

- Do you have automated kill conditions?

- Did you build this layer internally?

- Did you have to revisit it multiple times as complexity increased?

- Does it get worse as workflows span more tools or services?

Would appreciate any concrete experiences from teams running agents in production. Really just trying to figure out how to scale.

1 point | HenryM12 | 4 hours ago | 1 comment

guerython an hour ago

A plan doc now drives everything. Every request becomes a four-bullet plan (inputs, connectors, guards, metric) before any workflow runs, so the agent has a concrete target instead of open-ended questions.
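The four-bullet plan above might look something like this as data; the field names follow the comment, but the concrete values and the validate_plan helper are made up for illustration:

```python
# Hypothetical shape of the four-bullet plan: inputs, connectors,
# guards, metric. Values are illustrative only.
plan = {
    "inputs": ["ticket_id", "customer_tier"],
    "connectors": ["crm_lookup", "billing_api"],
    "guards": {"max_retries": 3, "retry_window_s": 30,
               "max_loop_depth": 4, "cost_multiplier": 2.0},
    "metric": "resolution_time_s",
}

REQUIRED = {"inputs", "connectors", "guards", "metric"}

def validate_plan(p: dict) -> None:
    # Refuse to run any workflow whose plan is missing a bullet.
    missing = REQUIRED - p.keys()
    if missing:
        raise ValueError(f"plan missing: {sorted(missing)}")
```

Gating execution on a validated plan is what gives the agent a concrete target before anything runs.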

Each connector gets a watcher that expects {status:'ok'}, logs the sessionId, and enforces the guard thresholds: more than 3 retries in 30 seconds, loop depth over 4, or cost above 2x baseline. When a guard trips, we pause the plan, stream the watcher log to the manual gate, and only let the next run continue after a human approves the diff. That keeps automation fast while keeping the ops team in control.
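A sketch of how such a watcher could enforce those three thresholds, assuming the implementation details (class and method names, the sliding retry window) are my own invention:

```python
import time

class GuardTripped(Exception):
    pass

class Watcher:
    """Tracks one connector's retries, loop depth, and cumulative cost
    against the thresholds described above (hypothetical sketch)."""

    def __init__(self, baseline_cost, max_retries=3, retry_window_s=30.0,
                 max_loop_depth=4, cost_multiplier=2.0):
        self.max_retries = max_retries
        self.retry_window_s = retry_window_s
        self.max_loop_depth = max_loop_depth
        self.cost_cap = baseline_cost * cost_multiplier
        self.retry_times = []
        self.cost = 0.0

    def check(self, response, loop_depth, call_cost):
        now = time.monotonic()
        self.cost += call_cost
        # Anything other than {status: 'ok'} counts as a retry.
        if response.get("status") != "ok":
            self.retry_times.append(now)
        # Keep only retries inside the sliding 30s window.
        self.retry_times = [t for t in self.retry_times
                            if now - t <= self.retry_window_s]
        if len(self.retry_times) > self.max_retries:
            raise GuardTripped("retries > 3 in 30s")
        if loop_depth > self.max_loop_depth:
            raise GuardTripped("loop depth > 4")
        if self.cost > self.cost_cap:
            raise GuardTripped("cost > 2x baseline")
```

The runner would catch GuardTripped, pause the plan, and route the watcher log to the manual approval gate rather than retrying on its own.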