What real evaluation looks like for agents on real data
Calling an agent good because it feels good is not how you decide if it is right for you. It is also not how you decide if it beats a general tool like Claude Code. Evaluating these systems is hard, but that's not a reason to fall back on vibes.
This is harder for agents that work on real data. Software engineering has a deep set of eval systems and benchmarks to lean on. For data agents, that set barely exists yet. So vibes often fill the gap.
Here are a few ways I've seen a vibes check slip in, and what a real evaluation does instead.
People rely on the feeling of being productive
Here's a typical example. You connect an analytics agent to your team chat. Overnight the whole company feels much faster, because anyone can ask a business question without waiting on the data team. That feeling is not proof.
Often it is just read access to the warehouse with a chat box on top. It cannot break a pipeline, so it feels safe. It can still break a decision, the first time it returns a confident wrong number and someone acts on it.
A real evaluation sets a starting point to compare against, then measures the same task before and after. It does not ask people how they feel.
And remember, measuring productivity is a long-standing, very hard problem. People have invested heavily in it without definitively solving it. That should tell you how much weight your own feeling of productivity deserves.
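As a minimal sketch of that before-and-after comparison, assume you logged how many minutes the same tasks took to reach a verified-correct answer, once before the agent arrived and once after. The task names and numbers below are made up for illustration, and `compare_to_baseline` is a hypothetical helper, not part of any library.

```python
from statistics import median


def compare_to_baseline(before, after):
    """Compare the same tasks measured before and after the agent was introduced.

    Both arguments map a task id to minutes until a verified-correct answer.
    Tasks that were not measured in both periods are ignored.
    """
    shared = sorted(set(before) & set(after))
    if not shared:
        raise ValueError("no task was measured in both periods")
    deltas = [after[task] - before[task] for task in shared]
    return {
        "tasks_compared": len(shared),
        "median_change_minutes": median(deltas),
        "tasks_faster": sum(d < 0 for d in deltas),
        "tasks_slower": sum(d > 0 for d in deltas),
    }


# Made-up measurements, for illustration only.
before = {"weekly_revenue": 45.0, "churn_by_plan": 120.0, "top_accounts": 30.0}
after = {"weekly_revenue": 10.0, "churn_by_plan": 150.0, "top_accounts": 5.0}

report = compare_to_baseline(before, after)
print(report)
```

With these made-up numbers, two tasks got faster and one got slower. That is the kind of detail a general feeling of speed hides.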
An answer looks right, so people assume it is right
Example: The agent's query ran without errors, used the right tables, and looked clean in the demo. But some data errors do not show up as crashes. Errors in how rows are grouped or joined can pass every test and change the numbers in quiet ways. A wrong number on a screen is annoying. A wrong business number can break a real decision.
This becomes an even bigger problem when agents state their answers with a lot of confidence.
A real evaluation checks the output against an answer you already know is correct.
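Here is a small sketch of a quiet join error, and of the check that catches it. It uses an in-memory SQLite database. The `orders` and `shipments` tables, the two queries, and the `matches_known_answer` helper are all hypothetical, chosen only to show the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 was split into two shipments.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
    """
)

# The answer we already know is correct: 100.0 + 50.0.
KNOWN_SHIPPED_REVENUE = 150.0

# Runs cleanly and uses the right tables, but counts order 1 twice.
naive_sql = (
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN shipments s ON s.order_id = o.order_id"
)

# Counts each shipped order once.
safe_sql = (
    "SELECT SUM(amount) FROM orders "
    "WHERE order_id IN (SELECT order_id FROM shipments)"
)


def matches_known_answer(sql, expected, tol=1e-9):
    """Run a query and compare its single value to the known-correct answer."""
    (value,) = conn.execute(sql).fetchone()
    return value, abs(value - expected) <= tol


print(matches_known_answer(naive_sql, KNOWN_SHIPPED_REVENUE))  # (250.0, False)
print(matches_known_answer(safe_sql, KNOWN_SHIPPED_REVENUE))   # (150.0, True)
```

Both queries run without errors. Only the comparison against the known answer tells them apart.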
People compare two agents by feel
Without the right tooling and enough time, people fall back on comparing two agents by feel instead of designing and implementing the right evals and benchmarks for their use case.
They run the same request through both, read the two outputs, and pick the one that looks right or better. Too often, that choice rests on a single run.
But these systems are not deterministic: they can give a different answer each time you run them. One run tells you what happened once. It tells you very little about the next hundred runs you will get in production.
That's why a real evaluation runs each agent many times and compares the spread of results, not one output.
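A rough sketch of what "many runs" can look like in code. The two agents here are stubs that replay a fixed cycle of answers to imitate run-to-run variation. In practice you would swap in your real agent calls. `make_stub_agent` and `run_many` are hypothetical names, and the answer mixes are invented.

```python
from collections import Counter
from itertools import cycle


def make_stub_agent(answers):
    """Hypothetical stand-in for a real agent: replays a fixed cycle of answers."""
    replay = cycle(answers)
    return lambda question: next(replay)


def run_many(agent, question, expected, n_runs):
    """Ask the same question n_runs times and summarize the spread of answers."""
    outputs = [agent(question) for _ in range(n_runs)]
    counts = Counter(outputs)
    return {
        "pass_rate": counts[expected] / n_runs,
        "distinct_answers": len(counts),
        "counts": counts,
    }


# Invented behavior: agent A is right 9 times out of 10, agent B 6 out of 10.
agent_a = make_stub_agent([150.0] * 9 + [250.0])
agent_b = make_stub_agent([150.0] * 6 + [250.0] * 3 + [0.0])

question = "What was shipped revenue last week?"
result_a = run_many(agent_a, question, expected=150.0, n_runs=100)
result_b = run_many(agent_b, question, expected=150.0, n_runs=100)

print(result_a["pass_rate"], result_a["distinct_answers"])  # 0.9 2
print(result_b["pass_rate"], result_b["distinct_answers"])  # 0.6 3
```

Both stubs return the correct answer on their first run, so a single side-by-side comparison would call it a tie. The gap only shows up across many runs.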
A gut feeling decides what counts as correct
This is especially true in cases where correctness is genuinely hard to quantify.
But here's the thing: the hard part of designing agent evaluations is not getting an answer. It is deciding what a correct answer is.
Two main approaches hold up.
When the task allows it, use a fixed correct answer and check against it. When the task is fuzzy, such as a written explanation, write a clear scoring guide and score against it. The technical name for the thing that decides correctness is an oracle. A gut feeling is not one.
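The two approaches can be sketched as two small oracles. This is an illustration under assumptions, not a standard implementation. The rubric criteria are invented, and the keyword checks are a crude stand-in for a human grader or a more careful automated one.

```python
import math


def exact_oracle(expected, rel_tol=1e-9):
    """Oracle for tasks with a fixed correct answer."""
    def check(output):
        return math.isclose(output, expected, rel_tol=rel_tol)
    return check


# Hypothetical scoring guide for a written explanation.
# Each criterion is a name plus a check that returns True or False.
RUBRIC = [
    ("states the time period", lambda text: "last week" in text.lower()),
    ("names the metric", lambda text: "revenue" in text.lower()),
    ("mentions what is excluded", lambda text: "excludes" in text.lower()),
]


def rubric_oracle(rubric):
    """Oracle for fuzzy tasks: score a text against every rubric criterion."""
    def score(text):
        results = {name: bool(check(text)) for name, check in rubric}
        return sum(results.values()) / len(results), results
    return score


check_revenue = exact_oracle(150.0)
score_explanation = rubric_oracle(RUBRIC)

print(check_revenue(150.0))  # True
print(check_revenue(250.0))  # False

good_score, _ = score_explanation("Revenue last week was 150.0. This excludes refunds.")
weak_score, _ = score_explanation("It went up.")
print(good_score, weak_score)  # 1.0 0.0
```

Either way, the decision about what counts as correct is written down before the agent runs, so it cannot drift toward whatever the output happens to look like.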
Have you experienced any of these in your organization? If so, how do you protect your team from them? I'd love to hear your experience and compare notes.
