Data Is Not Software: What Anthropic's Analytics Infrastructure Reveals About AI and Data

Anthropic recently published an important article on how they built their internal data infrastructure so that Claude can support self-service analytics.

The article is worth reading for the results alone. Getting Claude to the point where it can handle a large share of internal analytics questions with very high accuracy is a big deal, and it offers a glimpse of how AI will transform data infrastructure.

But the reason I found the post especially important is that Anthropic says very clearly something I have been saying for a while: data is not software.

That sounds simple, but it is the most important starting point for thinking about AI and data infrastructure.

The reason LLMs are so useful in software does not transfer directly to data. You can’t just point Claude Code at your warehouse and expect the same experience you get when you point it at a software repo.

If you are on a data team experimenting with AI agents, this is probably the wall you are going to hit: the model can write SQL before it understands your data.

Software has a lot of properties that make the agent loop work well. There are many valid ways to solve a problem. There are tests, types, compilers, linters, runtime errors, CI, diffs, and a pretty clear edit/validate loop. Even when the model makes mistakes, the environment gives it many ways to recover.

Data work behaves differently.

In analytics, there is often one correct answer. That answer depends on the correct source, the correct metric definition, the correct grain, the correct filters, and the correct interpretation of the business question.

A SQL query can run successfully and still be wrong. A dashboard can render and still be wrong. A number can look plausible and still be wrong.
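To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` and `shipments` tables and their values are made up for illustration. Both queries run without error, but the second one silently double counts an order because the join changes the grain.

```python
import sqlite3

# Hypothetical toy data: two orders, one of which shipped in two parcels.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Revenue at the correct grain: one row per order.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Joining to shipments moves the grain to one row per shipment,
# so order 1 is counted twice. The query still succeeds.
joined_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

print(true_revenue)    # 150.0
print(joined_revenue)  # 250.0
```

Nothing in the database engine flags the second number. It looks plausible, and only knowledge of the grain tells you it is wrong.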

Anthropic phrases this very well in their post. For analytics, they say, there is often “only a single correct answer using a single correct source,” and there may be no simple deterministic way to prove correctness.

I mostly agree with that, but I think there is an important nuance.

It is often impossible to prove that the final business answer is correct without a blessed oracle or a human who owns the definition. But there are many useful things we can check about the derivation.

In many cases, we can check whether the query used the canonical metric, whether the source was governed, and whether the join was safe at the grain being used. We can show whether a column actually propagates to the model being queried. We can distinguish whether a semantic fact was declared, inferred, or verified.

And finally, we can trace the answer back to the data assets that support it.
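As a rough sketch of what a derivation check can look like, the snippet below walks a column-level lineage graph upstream and asks whether every root it reaches is a governed source. The `LINEAGE` mapping, the table names, and the `GOVERNED_SOURCES` set are all hypothetical; a real system would load them from its catalog or transformation metadata.

```python
from collections import deque

# Hypothetical column-level lineage: (model, column) -> upstream columns.
LINEAGE = {
    ("fct_revenue", "net_amount"): [("stg_orders", "amount"), ("stg_refunds", "amount")],
    ("stg_orders", "amount"): [("raw.orders", "amount")],
    ("stg_refunds", "amount"): [("raw.refunds", "amount")],
    ("adhoc_rev", "net_amount"): [("tmp.export_2023", "amt")],
}
# Hypothetical set of sources that a human has marked as governed.
GOVERNED_SOURCES = {"raw.orders", "raw.refunds"}


def trace_to_sources(column):
    """Breadth-first walk upstream; return the columns with no recorded parents."""
    roots, seen, queue = set(), {column}, deque([column])
    while queue:
        node = queue.popleft()
        parents = LINEAGE.get(node, [])
        if not parents:
            roots.add(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


def provenance_is_governed(column):
    """True only if every root column lives in a governed source table."""
    return all(table in GOVERNED_SOURCES for table, _ in trace_to_sources(column))


print(provenance_is_governed(("fct_revenue", "net_amount")))  # True
print(provenance_is_governed(("adhoc_rev", "net_amount")))    # False
```

This says nothing about whether the business answer is right. It only establishes that the number can be traced to assets someone has vouched for, which is a check a machine can run every time.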

This distinction matters a lot. A meaningful part of AI analytics can move from “the model made a judgment call” to “the system checked whether this derivation is structurally valid.” Once that happens, both agents and humans can reason about the answer more effectively.

That is where the next generation of data infrastructure is going.

What Anthropic had to build

The most interesting part of Anthropic’s post is the amount of infrastructure they had to build around Claude to make it work for analytics.

They did not describe a world where the model gets warehouse access, writes SQL, and everything just works. They describe a full data-specific harness.

They created canonical datasets. They made the semantic layer central to the agent workflow. They routed agents through governed metrics before lower-level assets. They maintained skills that encode business rules and workflows. They treated metadata with the same rigor as transformation code. They built evals, added online validation, and exposed provenance so humans and agents can inspect where answers came from.

That collection is the real story. It is a lot of work, but based on what they report, it was worth the investment.

Claude becomes much more useful when the environment around it has the right context, the right constraints, and the right validation loop. The model still matters, of course, but the system around the model matters just as much.

This is especially clear in the failure modes Anthropic describes.

The first one is ambiguity. A business term like “active users” or “revenue” may map to multiple tables, metrics, grains, sources, filters, and historical definitions. The agent has to know which one is authoritative.

The second one is staleness. Data platforms change constantly. Models change, metrics change, business rules change, and the context available to the agent can become outdated very quickly.

The third one is retrieval failure. The right information may exist somewhere, but the agent may fail to find it or fail to use it correctly.

That last one is especially important. Anthropic says that simply giving the agent access to more raw SQL barely helped.

This strongly matches my experience.

Old queries are not a clean source of truth. They are full of one-off analyses, stale definitions, migrations, exceptions, experiments, and all the historical ways a data team has answered questions over time. There is a lot of signal there, but there is also a lot of ambiguity.

You can’t automatically mine a clean semantic layer out of a messy warehouse just by pointing an LLM at the mess.

If the organization has ambiguous definitions, the model will usually preserve those ambiguities, or worse, make them look more coherent than they actually are.

That is why I think one of the most important parts of Anthropic’s post is their statement that humans still had to own metric definitions.

The model can help, but someone has to decide what is canonical.
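One way to picture that division of labor is a lookup that refuses to guess. In the sketch below, a term resolves only when exactly one candidate has been marked canonical by a named owner; otherwise the ambiguity is raised instead of hidden. The `MetricCandidate` class, the registry contents, and the owner name are invented for illustration and do not describe Anthropic's setup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricCandidate:
    name: str
    table: str
    owner: Optional[str] = None  # human or team that signed off, if any
    canonical: bool = False


class AmbiguousMetricError(Exception):
    """Raised when a business term has no single human-blessed definition."""


def resolve(term, registry):
    candidates = registry.get(term.lower(), [])
    blessed = [c for c in candidates if c.canonical and c.owner]
    if len(blessed) == 1:
        return blessed[0]
    names = ", ".join(c.name for c in candidates) or "none"
    raise AmbiguousMetricError(
        f"{term!r} has {len(blessed)} canonical definitions; candidates: {names}"
    )


# Hypothetical registry: "active users" has an owner-approved definition,
# "revenue" has two plausible candidates and no decision yet.
REGISTRY = {
    "active users": [
        MetricCandidate("dau_product", "fct_user_activity", owner="growth-analytics", canonical=True),
        MetricCandidate("dau_legacy", "old_sessions"),
        MetricCandidate("active_accounts", "billing_accounts"),
    ],
    "revenue": [
        MetricCandidate("gross_revenue", "fct_invoices"),
        MetricCandidate("net_revenue", "fct_invoices_net"),
    ],
}

metric = resolve("Active Users", REGISTRY)
print(metric.name)  # dau_product

try:
    resolve("revenue", REGISTRY)
except AmbiguousMetricError as exc:
    revenue_error = str(exc)
    print(revenue_error)
```

The code cannot decide which revenue definition is right. All it can do is make the missing decision visible so that a person makes it.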

The harness is the product

This is the part that feels most aligned with what we are building at Typedef: the harness is not a sidecar around the product. In many ways, the harness is the product.

I think the right way to think about AI in data is as an agent operating inside a data-specific harness.

That harness needs to understand the concepts that matter in data work.

It needs awareness of metrics, grain, lineage, provenance, and which sources are governed.

It also needs to distinguish whether a semantic fact is declared, inferred, or verified, and know when there is enough evidence to proceed versus when uncertainty should be exposed.
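A small sketch of that distinction: each semantic fact carries an authority level, and the harness proceeds only when every fact it relies on meets a minimum. The ordering used here (inferred below declared below verified) and the default threshold are my assumptions for the example, not a standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Authority(IntEnum):
    # Assumed ordering for this sketch: higher means more trustworthy.
    INFERRED = 1   # guessed by a model or heuristic
    DECLARED = 2   # stated by a human owner
    VERIFIED = 3   # declared and checked against the data


@dataclass(frozen=True)
class SemanticFact:
    statement: str
    authority: Authority


def decide(facts, minimum=Authority.DECLARED):
    """Proceed only if every fact meets the minimum; otherwise surface the weak ones."""
    weak = [f for f in facts if f.authority < minimum]
    if weak:
        return "expose_uncertainty", weak
    return "proceed", []


facts = [
    SemanticFact("orders.customer_id references customers.customer_id", Authority.VERIFIED),
    SemanticFact("fct_sessions grain is (user_id, session_id)", Authority.INFERRED),
]

action, weak_facts = decide(facts)
print(action)                              # expose_uncertainty
print([f.statement for f in weak_facts])   # the inferred grain fact
```

The useful property is that an inferred relationship can never silently pass as declared truth; it has to be promoted by someone or by a check.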

Today, a lot of that context lives in docs, dbt models, BI tools, catalogs, Slack threads, query history, dashboards, and people’s heads. The agent can retrieve pieces of it, but retrieval alone does not make something authoritative.

That is the core issue.

When a user asks a question about “active users,” the hard part is usually not producing a SQL query that counts users.

The hard part is knowing what “active users” actually means inside that organization: the canonical definition, the table that expresses it, the grain, required filters, freshness signals, metric certification, traceability to governed assets, and whether multiple plausible interpretations exist.

A general-purpose agent can guess. A better retrieval system can surface candidates.

But a data-native system should be able to represent the authoritative context directly and make the agent reason over that.

That is the bet we are making at Typedef.

We want agents to operate over a governed substrate where business concepts, metrics, models, grains, tests, lineage, and provenance are represented explicitly. Not as loose prose the agent has to reinterpret every time, but as structured context that can be inspected, checked, and used during the workflow.
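To illustrate the difference between prose and structured context, here is a hypothetical metric specification that a harness could check a query against. The `MetricSpec` fields, the table name, the filter strings, and the 24-hour freshness budget are all invented for the example; this is a sketch of the idea, not Typedef's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MetricSpec:
    name: str
    source_table: str
    grain: tuple                  # columns that identify one row of the metric
    required_filters: frozenset   # predicates every query must apply
    max_staleness: timedelta      # how old the source may be
    certified: bool               # has an owner signed off on this definition


def check_usage(spec, applied_filters, last_loaded_at, now):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    missing = spec.required_filters - set(applied_filters)
    if missing:
        problems.append(f"missing required filters: {sorted(missing)}")
    if now - last_loaded_at > spec.max_staleness:
        problems.append("source is staler than the metric allows")
    if not spec.certified:
        problems.append("metric is not certified")
    return problems


ACTIVE_USERS = MetricSpec(
    name="active_users",
    source_table="analytics.fct_user_activity",
    grain=("user_id", "activity_date"),
    required_filters=frozenset({"is_internal = false", "is_bot = false"}),
    max_staleness=timedelta(hours=24),
    certified=True,
)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
problems = check_usage(
    ACTIVE_USERS,
    applied_filters={"is_bot = false"},
    last_loaded_at=now - timedelta(hours=30),
    now=now,
)
print(problems)
```

A paragraph in a wiki can say the same things, but it cannot be evaluated against a query. A record like this can.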

This changes the behavior you can expect from the agent.

Instead of trusting the model to pick the right context, the system can verify governance. It can check grain before joining, trace whether a column reaches a model, track the authority of inferred relationships, and degrade explicitly when the evidence is weak.

That means surfacing ambiguity when the grain is unclear, warning when a join may fan out, flagging non-canonical metrics, showing when provenance does not reach a governed source, and routing uncertain answers to human review.
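The fan-out warning in particular is cheap to sketch once grain is declared. Assuming each model's grain is a set of columns that uniquely identifies a row, a join keeps the left side's row count only when the join keys cover the right side's grain. The `Model` class and the model names below are hypothetical, and the check is deliberately conservative: it reports "may fan out" rather than claiming it will.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    grain: frozenset  # declared columns that uniquely identify one row


def check_join(left, right, join_keys):
    """Classify a left-to-right join as safe, may_fan_out, or unknown."""
    keys = frozenset(join_keys)
    if not right.grain:
        # No declared grain: degrade explicitly instead of guessing.
        return "unknown", f"{right.name} has no declared grain; route to human review"
    if right.grain <= keys:
        # Join keys identify at most one right-hand row per left-hand row.
        return "safe", f"join keys cover the grain of {right.name}"
    return "may_fan_out", (
        f"joining {left.name} to {right.name} on {sorted(keys)} "
        f"does not cover grain {sorted(right.grain)}"
    )


orders = Model("orders", frozenset({"order_id"}))
customers = Model("customers", frozenset({"customer_id"}))
shipments = Model("shipments", frozenset({"shipment_id"}))
events = Model("raw_events", frozenset())

print(check_join(orders, customers, ["customer_id"])[0])  # safe
print(check_join(orders, shipments, ["order_id"])[0])     # may_fan_out
print(check_join(orders, events, ["order_id"])[0])        # unknown
```

The three outcomes map onto the behaviors above: proceed, warn, or hand the question to a human.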

This is the kind of behavior data agents need.

Correctness is not one thing

One thing that often gets lost in conversations about AI analytics is that correctness is not a single binary property.

There is the correctness of the final business answer. That is the hardest thing to prove, because it depends on whether the organization agrees on the definition and whether the question was interpreted correctly.

But there is also correctness of the derivation: whether the agent used the right source, respected the metric definition, joined data safely, aggregated at the right grain, avoided stale models, treated inferred facts with the right level of authority, and routed through governed semantic views instead of bypassing them for lower-level tables.

Those questions are much more tractable.

We should not pretend they solve the entire problem. They don’t. There will always be questions that require human judgment, especially when the business concept itself is ambiguous.

But we should also not throw everything into the “LLMs are probabilistic, humans must review everything” bucket.

A large class of analytics failures is structural. And structural failures should be represented and checked by the infrastructure.

This is where the data world has an opportunity to build something very different from the current stack.

For years, metadata has often been treated as documentation. Useful, but separate from the actual workflow. Something governance teams care about. Something that gets stale. Something that humans read when they need context.

In an AI-native data platform, metadata becomes operational.

The agent needs it to do the work.

The validation system needs it to check the work.

The human reviewer needs it to understand the work.

The eval system needs it to generate and diagnose failures.

The semantic layer, lineage graph, catalog, transformation code, tests, and agent harness start becoming parts of the same loop.

That is a big shift.

What is still hard

Anthropic is candid about the hard parts that remain, which I appreciated.

The scariest failures in analytics are silent failures.

The query runs. The answer looks reasonable. The agent sounds confident. The stakeholder may not notice the issue until much later, if ever.

Anthropic talks about provenance, reviewer agents, stakeholder correction loops, online validation, evals, and human sign-off for important work. All of these are important.

At Typedef, we are trying to push part of that problem lower in the stack.

Silent wrongness often starts before the final answer exists. It starts when the agent selects the wrong context: no encoded canonical metric, multiple plausible sources without clear authority, implicit grain, lineage the agent cannot reason over, or inferred facts treated as declared truth.

So the answer cannot only live at the review layer. Review matters, but the substrate matters too.

The data platform itself needs to expose authority, uncertainty, provenance, and legality (which joins and aggregations are structurally valid) in a way that agents can use.

This does not remove humans from the loop.

Humans should own business definitions, review genuinely ambiguous cases, and approve important changes. Humans still have to decide what the business means when the business itself has not decided yet.

But they should not have to catch every fan-out join by reading SQL. They should not have to manually inspect every answer to see whether the agent used a non-canonical metric. They should not have to guess whether the column in an answer actually traces back to the right source.

The system should catch as much of that as possible before the human ever sees it.

That is how we shrink the surface area of silent wrongness.

We do not make every analytics answer magically provable. We make more of the failure modes visible, typed, and checkable.

Why I am excited

Seeing Anthropic make many of these claims is validating.

It is especially meaningful coming from a team that has pushed coding agents so far. If the people building one of the best coding-agent experiences are saying that data needs a different setup, I think the data world should pay attention.

The deeper reason I am excited is that I think AI gives us a chance to rethink data infrastructure from first principles.

Data infrastructure has had enormous investment. Very smart people have built warehouses, transformation frameworks, catalogs, BI tools, semantic layers, orchestration systems, lineage tools, and governance platforms.

We have made a lot of progress.

And still, data infrastructure remains largely unsolved. AI agents are now making the cracks impossible to ignore.

I think this is a good thing.

AI is forcing us to make the implicit parts of data work explicit, and that is the opportunity.

The future of AI in data will come from building data infrastructure that agents can actually reason over.

That is what we are building at Typedef.

If you are on a data team trying to apply AI to analytics, transformation, governance, or data quality, I think you are going to run into the same problems Anthropic describes.

And if you are thinking about how to build the harness that makes AI actually work for data, I would love to talk.