Evaluating Agents

aunhumano.com

・

42 points

・

mfalcon

・

2 days ago

9 comments

localbuilder ・ 2 days ago

> There’s one issue with this, you’ll have to be careful to keep the “N – 1” interactions updated whenever you make some changes because you will be “simulating” something that will never happen again in your agent.

This is the biggest problem I've encountered with evals for agents so far. Especially with agents that might do multiple turns of user input > perform task > more user input > perform another task > etc.

Creating evals for these flows has been difficult because I've found mocking the conversation to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to create dynamic responses to points that require additional user input in E2E flows, which adds its own levels of complexity and indeterministic behavior. Both approaches are time consuming and difficult to setup in their own ways.

mfalcon ・ 2 days ago

Yes, and these problems are more present in the first iterations, when you are still trying to get a good enough agent behaviour.
I'm still thinking about good ways to mitigate this issue, will share.

mailswept_dev ・ 2 days ago

Totally agree with this — especially the part about end-to-end evals. I’ve seen too many teams rely only on manual testing and miss obvious regressions. Checkpoints + lightweight e2e evals feel like the sweet spot before things get too costly.

mfalcon ・ 2 days ago

Hey fellow hners, OP here. Been working on agents for a while so I started sharing some things.

The idea is to keep updating this post with a few more approaches I'd been using.

CuriouslyC ・ 2 days ago

Feed your failure traces into gemini to get a distillate then use DSPy to optimize the tools/prompts that are failing.

yuzhun ・ 2 days ago

I'm a beginner user. My current agent is built using Java. I'm hesitant whether to use Python to call the api for evaluation or to introduce some tools into the Java project for evaluation, such as those related to OpenTelemetry.

mfalcon ・ 2 days ago

You can evaluate with your programming language of choice.

codazoda ・ 2 days ago

Would love to see some examples

mfalcon ・ 2 days ago

Good idea for a follow up post :)