Context Engineering is becoming Harness Engineering

That was my biggest takeaway from the AI Agents for Discovery in the Wild workshop last week.

The accepted papers (40 in total) span many different domains, from compilers to data systems and harness optimization, but a consistent pattern emerges from all of them.

To reach the next stage of LLM performance, we have to do more than edit prompts. We need to engineer the systems around the model, based on the domain we are working in and the goals we have.

What systems are we talking about?

You might have heard that an agent is just an LLM in a loop that also has access to tools. This reductionist description is simplistic, but it is accurate enough to help us understand what we mean by systems around models.

In this simple description the systems are the environment that puts the LLM into the loop and the set of tools that the model has access to.
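To make that description concrete, here is a minimal sketch of "an LLM in a loop with tools". Everything in it is an assumption for illustration: `fake_model` is a scripted stand-in for a real model call, and `word_count` is a made-up tool. The point is only to show the two systems named above, the loop and the tool set.

```python
from typing import Callable

# Hypothetical tool registry: the tool name and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}

def fake_model(context: list[str]) -> dict:
    """Stand-in for an LLM call: requests a tool once, then answers."""
    tool_results = [m for m in context if m.startswith("tool:")]
    if not tool_results:
        return {"type": "tool_call", "name": "word_count", "arg": context[0]}
    count = tool_results[-1][len("tool:"):]
    return {"type": "final", "text": f"The input has {count} words."}

def run_agent(task: str, model=fake_model, max_steps: int = 5) -> str:
    context = [task]                    # the context window, as a list of messages
    for _ in range(max_steps):          # the loop
        action = model(context)
        if action["type"] == "final":
            return action["text"]
        result = TOOLS[action["name"]](action["arg"])  # the tool call
        context.append(f"tool:{result}")               # tool output becomes context
    return "step budget exhausted"
```

Calling `run_agent("one two three")` walks the loop twice: one tool call, then a final answer. Even in this toy, the only thing that changes between iterations is the context.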

Fast forward to today and these systems have become more complex, and they have been joined by others that are increasingly needed, both because of the complexity of the tasks we are going after and because of the need to put things into production reliably.

When we treated an agent as a model in a loop with a way to execute code, the knobs we had to care about were more limited. The most impactful action we could take was to change the context of the model between loop iterations. Naturally, the focus was on techniques and tooling for working on prompts, and that's how things like skills and the agent markdown documents became a thing.

But as we move into longer-horizon tasks and demand more independence from our agents, these knobs are starting to show their limitations, and the accepted papers at the workshop provide some amazing evidence for that.

Today we need more control over the context, and we also define context in a much richer way. What does that mean?

The blueprint of a harness

What are the components of a harness?

Based on the systems I've seen in the accepted papers, here are the main components that I identified.

First, we have the context of the model itself. This is not going away, but how it is defined and where it comes from are getting more complicated. The main difference is that things are now even more dynamic. We started with the context window being updated primarily by the interactions between a human and the model; then we moved to MCP and skills, where context is ingested dynamically at runtime to surgically add and remove parts of the context window.

Next we have the tools. As engineers we instinctively think of tools as "API calls", but tools are a major mechanism for interacting with context. They affect it through the descriptions we have to provide so the model knows what to do and, more importantly, through the tokens that their return values add to the context. All of this happens in a highly unpredictable way, because the model decides when and how to invoke these tools.

Keep in mind also that models are being trained to work with specific tools (that's what RL is being applied to), which adds another layer of complexity to incorporating tools correctly in the agentic system you build.

Then there's memory, and this is quite a rabbit hole. By memory we mean generated information that we decide to store long term and make searchable by the model with, guess what, more tools. There are many different types of memory and a lot of marketing noise around them, but at the end of the day memory is a sequence of tokens relevant enough to the context of your problem that you want to keep it available to the models.
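As a rough illustration of memory as "stored tokens plus a search tool", here is a toy sketch. The `MemoryStore` class, the keyword-overlap scoring, and the example notes are all assumptions of mine; real systems typically use embeddings or a proper search index, but the shape is the same.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy long-term memory: stored notes, ranked by keyword overlap."""
    notes: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def search(self, query: str, k: int = 2) -> list[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(n.lower().split())), n) for n in self.notes]
        hits = [pair for pair in scored if pair[0] > 0]   # drop irrelevant notes
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in hits[:k]]

# Hypothetical notes an agent might have stored during earlier runs.
memory = MemoryStore()
memory.remember("orders table deduplicated nightly")
memory.remember("revenue metric excludes refunds")

def memory_search_tool(query: str) -> str:
    """The tool exposed to the model: its return value lands in the context."""
    return "\n".join(memory.search(query)) or "no relevant memory"
```

Note that the memory only reaches the model through `memory_search_tool`, so its output competes for the same context window as every other tool result.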

Then there's a special type of tool that we were more or less getting for free with coding agents: validation tooling. For the loop to make sense at all, we need a way to exit or continue it, and this is usually done through some form of validation. When you write code, validation is typically straightforward: you use the compiler and your tests. But this is not always the case, and the moment we leave coding behind, validation becomes a very difficult problem.
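Here is a small sketch of the coding case, where validation decides whether the loop exits or continues. The `scripted_model` function and its three canned attempts are hypothetical stand-ins for a real model, and the single `add(2, 3) == 5` check stands in for a test suite. A real harness would also run candidates in a sandbox rather than calling `exec` directly.

```python
from typing import Callable, Optional

def validate(source: str) -> Optional[str]:
    """Return None if the candidate passes, otherwise an error message."""
    try:
        code = compile(source, "<candidate>", "exec")   # the "compiler" step
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    namespace: dict = {}
    try:
        exec(code, namespace)                           # unsafe outside a sandbox
        assert namespace["add"](2, 3) == 5              # the "test" step
    except Exception as exc:
        return f"test failed: {type(exc).__name__}"
    return None

def refine(propose: Callable[[Optional[str]], str], max_attempts: int = 3):
    """Loop until validation passes or the attempt budget runs out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)    # validation feedback goes back to the model
        feedback = validate(candidate)
        if feedback is None:
            return candidate, attempt    # validation passed: exit the loop
    return None, max_attempts

# Hypothetical stand-in for a model: two bad attempts, then a good one.
attempts = iter([
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
])

def scripted_model(feedback: Optional[str]) -> str:
    return next(attempts)
```

The hard part outside of coding is writing `validate` at all: there is often no compiler and no test suite to lean on.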

Finally we have instrumentation, which is required by a higher-level loop that we also need: the one that improves the system itself. To improve the system we build, we need to collect traces that represent the behavior of the agent and then use them to decide if something needs to change. All the sub-systems we talked about earlier are potential knobs that can be used to improve the system or fix a problem that the system has.
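A minimal sketch of what collecting and using traces could look like follows. The event kinds (`tool_call`, `tool_result`, `final`) and the summary fields are assumptions chosen for illustration, not a standard trace format.

```python
import time
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Records one agent run as an ordered list of events."""
    events: list[dict] = field(default_factory=list)

    def record(self, kind: str, **details) -> None:
        self.events.append({"ts": time.time(), "kind": kind, **details})

def summarize(traces: list[Trace]) -> dict:
    """Aggregate traces into signals for the outer improvement loop."""
    tool_errors: Counter = Counter()
    steps = []
    for trace in traces:
        steps.append(len(trace.events))
        for event in trace.events:
            # Count failed tool results per tool; missing "ok" is treated as success.
            if event["kind"] == "tool_result" and not event.get("ok", True):
                tool_errors[event["tool"]] += 1
    return {
        "runs": len(traces),
        "avg_steps": sum(steps) / len(steps) if steps else 0.0,
        "tool_errors": dict(tool_errors),
    }
```

A summary such as "one tool fails in most runs" is the kind of signal that tells you which knob to turn: the tool description, the context policy, or the tool itself.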

But all I want to do is write some SQL, so why should I care?

You might argue that data work is also software engineering. At the end of the day, a data engineer or a data scientist writes code in SQL, Python, or YAML files.

Naturally, people think that you can use Claude Code or Codex for data-related tasks and achieve the same things that a front-end engineer can when developing a web app, but that's exactly where the harnesses we have today for coding start showing their limitations.

As in the domains the accepted papers investigated, working on data platforms requires rethinking all the components of a harness to make the agents effective. A bunch of skills and an AGENTS.md file will help with poking around your dbt project, but none of that will tell the model:

what a metric means
which table is the source of truth
what grain a dataset has
how lineage flows through dbt/Airflow/BI
which downstream assets might break
whether a proposed SQL/code change is safe
what evidence should be shown before a human trusts it
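To give a feel for a few of these items, here is a rough sketch of a harness component that knows grain, source of truth, and lineage. The dataset names, the `CATALOG` and `LINEAGE` structures, and the helper functions are all hypothetical. In practice this information would come from your dbt manifest, orchestrator, or catalog rather than from hand-written dictionaries.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetInfo:
    name: str
    grain: str               # e.g. "one row per order"
    source_of_truth: bool

# Hypothetical catalog and lineage edges (upstream -> downstream).
CATALOG = {
    "raw.orders": DatasetInfo("raw.orders", "one row per order event", False),
    "core.orders": DatasetInfo("core.orders", "one row per order", True),
    "marts.revenue_daily": DatasetInfo("marts.revenue_daily", "one row per day", False),
}
LINEAGE = {
    "raw.orders": ["core.orders"],
    "core.orders": ["marts.revenue_daily", "dash.revenue"],
    "marts.revenue_daily": ["dash.revenue"],
}

def downstream_assets(changed: str) -> list[str]:
    """Breadth-first walk over lineage: every asset a change could break."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

def describe_for_model(name: str) -> str:
    """Render catalog facts as text that a tool could return to the model."""
    info = CATALOG[name]
    role = "source of truth" if info.source_of_truth else "derived or raw"
    impacted = ", ".join(downstream_assets(name)) or "none"
    return f"{name}: grain={info.grain}; role={role}; downstream={impacted}"
```

For example, `downstream_assets("raw.orders")` returns the core table, the daily mart, and the dashboard, which is exactly the blast radius a human reviewer would want to see before trusting a proposed change.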

To provide all of the above, we need to rethink and rebuild the harness for our models to turn them into effective data agents.

If you are part of a data platform team thinking about how to add AI to your capabilities, sooner or later you will end up trying to build or re-invent a harness. Keep that in mind when you scope your work.

What the workshop brought into focus

The workshop ended up being a great motivator for me. It pushed me to write down my thoughts on harnesses, why I think they matter, and why data professionals in particular should pay attention to this line of work.

There were 40 papers accepted to the workshop. Going through all of them is worthwhile, but for anyone looking for a shorter starting point, these are a few that stood out to me.

AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve uses AlphaEvolve and Vizier to evolve compiler code-layout heuristics. What I found especially interesting is the evaluation harness: a low-noise rebuild, run, and hardware-measurement setup that turns tiny warehouse-scale performance deltas into a useful optimization signal.

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems feels particularly relevant for data professionals. It proposes typed contracts, logical operator DAGs, physical system skills, and runtime attribution loops as building blocks for agents that can compose deployable data-system backends.

Meta-Harness: Harness Search for Agents Under Expensive Evaluation takes the idea of a harness one level higher. Instead of treating retrieval, memory, prompts, orchestration, and context policy as fixed parts of an agent, it treats the harness itself as executable code that can be searched and improved using prior scores and raw traces.

Autonomous Agent Learning in Production describes a production-agent optimizer that freezes recent traffic into evaluation sessions and searches over whole-agent edits, including prompts, tools, scaffolds, memory and context policy, and model routing. I think this is a compelling direction because it connects agent improvement directly to production behavior rather than isolated benchmarks.

What I’m taking away

I’m excited about the potential of AI in the data world, but the workshop reinforced something more specific for me: progress will not come only from better models. It will also come from better harnesses, better evaluations, and better ways of connecting agent behavior to real production outcomes.

That is the area I want to keep exploring. How do we make AI systems useful for messy, real-world data work? How do we evaluate them against workflows that actually matter to data teams? And how do we move from impressive demos to systems that people can trust in production?

I’d love to talk with more people thinking about these questions. Connect with me on X @KostasPardalis, LinkedIn, or send me an email at kostas@typedef.ai, and let me know what you would like to hear more about.