Data Catalogs Were Never Going to Fix Lineage

Why a decade of prettier graph layouts was the wrong problem, and what AI agents finally change.

The Hairball Problem

Have you ever had to build a lineage graph just to make sure you could safely build or migrate something in your data infrastructure? You do it, as you should, and you get back a graph showing 2,500+ upstream assets. How do you debug that? Forget debugging: how do you even navigate it visually without missing context?

You can't. Zoom out and it looks like a piece of generative art; zoom in and you have lost the plot. And that is just one dimension: table lineage, or column-level lineage if you are lucky. Add how the columns are transformed, or row-level lineage, and the picture is not just unreadable, it is impossible to even render.

A Decade of Wrong Diagnosis

We hit this wall building typedef, and the first thing we had to accept is that everyone hits it. The data catalog vendors never solved this. Not because they didn't try, but because at real scale it is genuinely hard given the technology and the HCI methods we had. We treated it as a visualization problem. It is not. It is a human-context problem, and we spent a decade pretending a better layout would fix it.

A real lineage graph at platform scale: zoom out and it's abstract art, zoom in and you've lost the plot.

What Agents Actually Change

Last week the AI world spent enormous energy arguing HTML versus Markdown for agents. Tariq from Anthropic's Claude Code team kicked it off, it did over 4 million views in a day, Karpathy and Simon Willison weighed in, and everyone picked a side. Honestly, there is no real argument here: you need both. The machine reads Markdown, the human reads HTML, and Markdown was only ever a way to write text that becomes HTML anyway, itself a human-context problem that we actually solved well.

The interesting part is what an agent can render now that a static tool never could. Not prettier output, but a personalized view that builds the right context and presentation, helping a person reach a level of comprehension that would otherwise cost the time a senior engineer needs to build that context by hand.

The agent can know what you are actually trying to figure out, summarize in seconds the slice of those 2,500 nodes that matters, and render the view around your question instead of dumping the whole thing in front of you with a zoom control. Instead of building generic tools for navigating hierarchical data and hoping no edge case breaks them, embrace that each user and each question IS an edge case, and use the agent to JIT the tool and the presentation so the person consumes the information in the minimum possible time.
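To make "the slice that matters" concrete, here is a minimal sketch of question-driven slicing. It is an illustration under stated assumptions, not typedef's implementation: the `parents` map, the `tags` dictionary, and the `upstream_slice` helper are all hypothetical, and in a real system the relevance predicate would come from the platform context rather than a hand-written tag lookup.

```python
from collections import deque


def upstream_slice(parents, target, keep, max_depth=None):
    """Walk upstream from `target`, following only parents that pass `keep`.

    parents:   dict mapping a node to the list of nodes it reads from.
    keep:      predicate deciding whether a node is relevant to the question.
    max_depth: optional limit on how many hops upstream to walk.
    Note: a parent that fails `keep` is pruned together with its ancestors.
    """
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in parents.get(node, []):
            if parent not in seen and keep(parent):
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


# Hypothetical toy lineage: each asset lists the assets it reads from.
parents = {
    "revenue_dashboard": ["fct_orders", "dim_customers"],
    "fct_orders": ["stg_orders", "stg_payments"],
    "dim_customers": ["stg_customers"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
    "stg_customers": ["raw_customers"],
}

# Hypothetical annotation: which assets carry the "revenue" concept.
tags = {
    "fct_orders": {"revenue"},
    "stg_payments": {"revenue"},
    "raw_payments": {"revenue"},
}

# Question: "where does the revenue on this dashboard come from?"
relevant = upstream_slice(
    parents,
    "revenue_dashboard",
    keep=lambda node: "revenue" in tags.get(node, set()),
)
print(sorted(relevant))
# ['fct_orders', 'raw_payments', 'revenue_dashboard', 'stg_payments']
```

Seven assets shrink to four, and the three customer tables never reach the screen. The pruning rule here is deliberately crude; the point is only that the question, not the layout engine, decides what gets rendered.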

Building This at typedef

Compress lineage into semantically meaningful stages

This is what gets me excited about what we are building at @typedef_ai. We take the platform context we generate and use it with LLMs to communicate the genuinely hard things: the lifecycle of a business concept across the whole platform, how grain propagates and changes relative to meaning, and how metrics form families whose meaning drifts depending on the path through the platform. Not as a wall of text from a chatbot, but as something a person can follow, using visual modalities that have been optimized for exactly this for decades. And guess what, that includes HTML.

The same pipeline resolved into stages by the AI Agent
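As a rough picture of what compressing lineage into stages means mechanically, the sketch below collapses asset-level edges into stage-level edges. Everything in it is an assumption for illustration: the `stage_of` mapping is hard-coded here, whereas the stage assignment is the genuinely hard, semantic part that this example does not attempt.

```python
def compress_to_stages(edges, stage_of):
    """Collapse asset-level edges into a set of stage-level edges.

    edges:    iterable of (source, destination) asset pairs.
    stage_of: dict mapping each asset to its stage name.
    Edges whose endpoints share a stage vanish from the compressed view.
    """
    stage_edges = set()
    for src, dst in edges:
        src_stage, dst_stage = stage_of[src], stage_of[dst]
        if src_stage != dst_stage:
            stage_edges.add((src_stage, dst_stage))
    return stage_edges


# Hypothetical asset-level lineage.
edges = [
    ("raw_orders", "stg_orders"),
    ("raw_payments", "stg_payments"),
    ("stg_orders", "stg_orders_dedup"),  # stays inside the staging stage
    ("stg_orders_dedup", "fct_orders"),
    ("stg_payments", "fct_orders"),
    ("fct_orders", "revenue_dashboard"),
]

# Hypothetical stage assignment (hard-coded for the sketch).
stage_of = {
    "raw_orders": "ingest",
    "raw_payments": "ingest",
    "stg_orders": "staging",
    "stg_orders_dedup": "staging",
    "stg_payments": "staging",
    "fct_orders": "marts",
    "revenue_dashboard": "reporting",
}

stage_edges = compress_to_stages(edges, stage_of)
print(sorted(stage_edges))
# [('ingest', 'staging'), ('marts', 'reporting'), ('staging', 'marts')]
```

Six edges become three. At 2,500 assets the same move is what turns a hairball into something a person can read from left to right.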

Won't the Agent Just Make Things Up?

A Semantic Graph gives you a grounded, deterministic, focused and progressive view of assets

There is an obvious objection here and it is a fair one. If the agent decides what slice of the graph you see and how it gets drawn, and annotates it to guide you to a conclusion faster, how do you know it is not just making up a convincing picture? LLMs love to be loved; they want to make you happy, and a pretty diagram that is subtly wrong is worse than the 2,500-node piece of art, because that one never lied to you.

This is the part we care about most. The agent gets to decide what matters for your question and how to present it, but it does not get to invent the graph. The facts underneath, how the grain is seeded, how it propagates, which join silently fans out your row count, are derived deterministically from the platform context we generate, not narrated by the model. The agent isn't parsing your SQL live and hoping it got the semantics right. It reasons over context that was already built, validated, and traced. That difference is the whole game.
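The fan-out case is a good example of a fact that can be computed rather than narrated. A minimal sketch, assuming equi-joins and assuming the declared unique keys are accurate (the `join_fans_out` helper and the table shapes are hypothetical, not typedef's API): a join can multiply left-side rows unless the join columns cover some unique key of the right-side table.

```python
def join_fans_out(join_keys, right_unique_keys):
    """Return True if the join may multiply rows from the left side.

    join_keys:         right-table columns used in the equi-join condition.
    right_unique_keys: list of column tuples, each declared unique on the
                       right table.
    If the join columns include every column of some unique key, each left
    row matches at most one right row, so the left grain is preserved.
    With no declared unique key, the answer is conservatively True.
    """
    keys = set(join_keys)
    return not any(set(unique_key) <= keys for unique_key in right_unique_keys)


# orders JOIN customers ON customer_id: customers is unique per customer_id.
print(join_fans_out(["customer_id"], [("customer_id",)]))
# False

# orders JOIN order_items ON order_id: order_items is unique per
# (order_id, line_no), so one order can match many rows.
print(join_fans_out(["order_id"], [("order_id", "line_no")]))
# True
```

The rule is the same every time it runs, which is the property that matters: the model can choose to highlight the fan-out, but it does not get to decide whether one exists.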

The result? You get something shaped around your question that you can still trust, and that you can trace back when something goes wrong, because every node and edge in the platform context graph is backed by evidence and not by vibes.
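One way to hold an agent to that standard is to check whatever it wants to draw against the context graph before rendering. The sketch below is a hypothetical illustration, not a description of typedef's internals: `evidence` maps each known edge to the records that justify it, and the evidence strings are made-up placeholders.

```python
def ungrounded_edges(view_edges, evidence):
    """Return the edges in the agent's view that lack recorded evidence.

    view_edges: list of (source, destination) pairs the agent wants to draw.
    evidence:   dict mapping each known edge to a list of evidence records.
    An edge missing from `evidence`, or mapped to an empty list, is flagged.
    """
    return [edge for edge in view_edges if not evidence.get(edge)]


# Hypothetical context graph: every edge carries the reason it exists.
evidence = {
    ("stg_payments", "fct_orders"): ["placeholder: model reads stg_payments"],
    ("fct_orders", "revenue_dashboard"): ["placeholder: dashboard query"],
}

# The agent proposes a view containing one edge that has no backing evidence.
proposed_view = [
    ("stg_payments", "fct_orders"),
    ("fct_orders", "revenue_dashboard"),
    ("stg_refunds", "fct_orders"),
]

rejected = ungrounded_edges(proposed_view, evidence)
print(rejected)
# [('stg_refunds', 'fct_orders')]
```

A view renders only when that list is empty. The agent still picks the slice and the presentation, but anything it cannot trace back to evidence never makes it onto the page.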

What We're Building

Solving the lineage view problem is just one of the things this enables. The bigger thing we are building at typedef is a way to ask a hard question about your data platform and get back the one slice that answers it, grounded in context that does not let the agent lie to you.

If you have ever closed the tab on a hairball, I would like to hear about it.