Convert PDFs into a Queryable, Agent-Ready Catalog with fenic
A practical workflow that takes parsed PDFs, builds a usable sections index, and exposes it as clean MCP tools—so agents (and humans) can ask real questions instead of skimming page by page.
TL;DR
- Turn parsed PDFs into a clean sections table (H1–H3 chunks with paths) and a topics index (e.g., training, SOC, governance).
- Publish your curated tables as minimal MCP tools (`list_whitepapers`, `sections_by_topic`, and `search_sections`) so any MCP-capable host can query them.
- Keep it small, deterministic, and agent-ready—add vectors/LLM ranking later as an optional layer.
- Try it: Open the demo notebook and point it at your own PDFs.
Introduction
Most teams eventually inherit a pile of PDF whitepapers (e.g., governance frameworks, training standards, SOC roadmaps) scattered across folders. You know the answers are "in there somewhere," but searching raw PDFs is slow, token-hungry, and brittle for agents.
This post shows how to build a compact, production-shaped pipeline with fenic that:
- Splits parsed PDFs into structured sections
- Indexes headings by topics of interest
- Exposes the curated dataset as three MCP tools
The goal is to prove the value of curation—simple tools over clean tables.
What you'll learn:
- How to split Markdown PDFs into H1–H3 sections with stable paths
- How to build a topics index (e.g., training, SOC, governance) from LLM classifications
- How to persist tables and expose them as MCP tools in a few lines
- How to extend with vectors or LLM ranking later
Why not "just let the agent read my PDFs"?
Short answer: you'll get inconsistent answers, higher latency/cost, and little control over what the model reads. A curated dataset fixes that:
- Deterministic scope. You decide the rows and columns—no accidental 200-page token floods.
- Human-legible structure. Sections (heading, level, path) are explainable, auditable, and testable.
- Reusability. The same tables support search, analytics, and agents without re-ingesting PDFs.
Agents still have a place—just on top of a small, intentional surface area.
What we're building
Data products:
- `whitepaper_sections` — one row per section (H1–H3): `id`, `name`, `heading`, `level`, `content`, `full_path`
- `whitepaper_topics` — one row per (doc, heading, topic): `doc_id`, `name`, `topic`, `heading`
MCP tools:
| Tool | Description |
|---|---|
| `list_whitepapers()` | PDF names + section counts |
| `sections_by_topic(topic)` | Rows from the curated topic index |
| `search_sections(query)` | Case-insensitive substring match |
That's it. Three tools, three mental models, and everything flows from the tables.
Step 1: Start from parsed Markdown
We start from a normalized fenic DataFrame. In the notebook, this is `deduped_docs_pdf_content_final`—the result of downloading PDFs, converting them to Markdown via AI, and deduplicating. The key columns:
| Column | Description |
|---|---|
| `id` | Stable identifier per document |
| `name` | Filename or title |
| `markdown_content` | Full body as Markdown |
| `toc` | Extracted table of contents (optional) |
| `content_categorization` | LLM classification output (optional) |
If you're evaluating your own document estate, add `source_uri` and `ingested_at` for provenance.
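A minimal sketch of what that could look like, assuming fenic exposes a `with_column` method and using `fc.lit` for the values; the URI and timestamp are purely illustrative:

```python
# Hedged sketch: stamp provenance columns before sectioning.
# `with_column` is assumed from fenic's DataFrame API; the values are illustrative.
docs_with_provenance = (
    deduped_docs_pdf_content_final
    .with_column("source_uri", fc.lit("s3://whitepapers/vendor-a/governance.pdf"))
    .with_column("ingested_at", fc.lit("2025-01-15"))
)
```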
Step 2: Split Markdown into sections (H1–H3)
The core trick is to chunk each whitepaper by its headings. With fenic's `markdown.extract_header_chunks` operator, you extract header-scoped chunks at specific levels, then union them into a single column of section structs:
| Field | Description |
|---|---|
| `heading` | The section title |
| `level` | 1, 2, or 3 |
| `content` | The body of that section |
| `full_path` | Breadcrumb string like `H1 > H2 > H3` |
```python
# Extract header chunks at each level
pdf_sections_1 = (
    deduped_docs_pdf_content_final
    .select(
        "id",
        fc.markdown.extract_header_chunks(fc.col("markdown_content"), 1).alias("h1_sections"),
    )
)
pdf_sections_2 = (
    deduped_docs_pdf_content_final
    .select(
        "id",
        fc.markdown.extract_header_chunks(fc.col("markdown_content"), 2).alias("h2_sections"),
    )
)
pdf_sections_3 = (
    deduped_docs_pdf_content_final
    .select(
        "id",
        fc.markdown.extract_header_chunks(fc.col("markdown_content"), 3).alias("h3_sections"),
    )
)
```
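The notebook then flattens these into the one-row-per-section DataFrame (`sections_long`) that gets persisted in Step 4. A minimal sketch of that step, assuming `explode` and `unnest` behave as their names suggest; the notebook's version also carries the document `name` and derives the `full_path` breadcrumb:

```python
# Union the per-level chunk columns into one column of section structs,
# then flatten to one row per section (heading, level, content, ...).
sections_long = (
    pdf_sections_1.select("id", fc.col("h1_sections").alias("section"))
    .union_all(pdf_sections_2.select("id", fc.col("h2_sections").alias("section")))
    .union_all(pdf_sections_3.select("id", fc.col("h3_sections").alias("section")))
    .explode("section")   # one row per section struct
    .unnest("section")    # lift the struct fields into top-level columns
)
```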
Why H1–H3? In practice, this is the sweet spot—it preserves structure without exploding into paragraphs. You can always tighten or loosen later.
Quality checks (a couple of quick queries are sketched below):
- Count of sections per doc
- Distribution by `level`
- Sample of `full_path` strings (valid breadcrumbs?)
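Here's a sketch of the first two checks, reusing the same aggregation pattern as the `list_whitepapers` tool later on and assuming the `sections_long` DataFrame sketched above:

```python
# Sections per document (catches PDFs that parsed into one giant block)
(
    sections_long
    .group_by("id")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("section_count")
    .show()
)

# Distribution by heading level (H1-H3 should all appear in sensible proportions)
(
    sections_long
    .group_by("level")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("level")
    .show()
)
```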
Step 3: Build a topics index from LLM classification
Earlier in the notebook, you produced a `content_categorization` object per document using `semantic.extract`. This contains lists like `sections_about_model_training` or `sections_about_soc_compliance`. Turn those into a long-form topics table with one row per `(doc_id, topic, heading)`:
```python
# Build topics DataFrame from LLM classification
topics_long = (
    pdf_filtered_details
    .select(
        "id",
        "name",
        udf_get_training(fc.col("content_categorization")).alias("training_headings"),
        udf_get_soc(fc.col("content_categorization")).alias("soc_headings"),
        udf_get_governance(fc.col("content_categorization")).alias("governance_headings"),
    )
)

# Explode into normalized rows: (doc_id, name, topic, heading)
topics_norm = (
    _explode_topic(topics_long, "training_headings", "training")
    .union_all(_explode_topic(topics_long, "soc_headings", "soc"))
    .union_all(_explode_topic(topics_long, "governance_headings", "governance"))
)
```
The result: `whitepaper_topics` can answer "show me every Training heading across the catalog" with a simple group-by or join.
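For instance, listing every Training heading is a short query over the normalized table (a sketch against the `topics_norm` DataFrame built above):

```python
# Every Training heading across the catalog, ordered for easy scanning
(
    topics_norm
    .filter(fc.col("topic") == fc.lit("training"))
    .select("name", "heading")
    .order_by("name", "heading")
    .show()
)
```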
Step 4: Persist the tables (idempotent)
Save both DataFrames as tables so your MCP tools can reference them by name:
```python
topics_norm.write.save_as_table("whitepaper_topics", mode="overwrite")
sections_long.write.save_as_table("whitepaper_sections", mode="overwrite")
```
Two rules keep things robust:
- Idempotent writes — use `mode="overwrite"` for dev runs so re-execution is safe.
- No UDFs inside tool definitions — keep tool queries to built-in expressions so they serialize cleanly.
This avoids the common "UDFExpr cannot be serialized" error when saving catalog tools.
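When you do need UDF-derived columns (like the contact-signal extraction mentioned later), materialize them into the table first and keep the tool plan itself UDF-free. A sketch of the pattern, with `extract_emails_udf` as a hypothetical UDF:

```python
# Run the (hypothetical) UDF once, up front, and persist the result...
enriched_sections = sections_long.select(
    "id", "name", "heading", "level", "content", "full_path",
    extract_emails_udf(fc.col("content")).alias("emails"),  # hypothetical regex UDF
)
enriched_sections.write.save_as_table("whitepaper_sections", mode="overwrite")
# ...so any tool query over the saved table uses only built-in expressions and serializes cleanly.
```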
Step 5: Publish MCP tools
With the tables saved, register three MCP tools:
```python
# `sections_tbl` and `topics_tbl` reference the tables saved in Step 4.

# Tool 1: list_whitepapers()
wp_counts = (
    sections_tbl
    .group_by("id", "name")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("name")
)
session.catalog.register_tool(
    "list_whitepapers", wp_counts, "List all whitepapers with section counts"
)

# Tool 2: sections_by_topic(topic)
topic_param = fc.tool_param("topic", fc.StringType)
by_topic = (
    topics_tbl
    .filter(fc.text.lower(fc.col("topic")) == fc.text.lower(topic_param))
    .join(
        sections_tbl,
        (fc.col("doc_id") == fc.col("id")) & (fc.col("heading") == fc.col("heading")),
    )
    .select("name", "topic", "heading", "level", "full_path", "content")
    .order_by("name", "full_path", "level")
)
session.catalog.register_tool("sections_by_topic", by_topic, "Get sections for a given topic")

# Tool 3: search_sections(query)
query_param = fc.tool_param("query", fc.StringType)
search_plan = (
    sections_tbl
    .filter(
        fc.text.lower(fc.col("heading")).contains(fc.text.lower(query_param))
        | fc.text.lower(fc.col("content")).contains(fc.text.lower(query_param))
    )
    .select("name", "heading", "level", "full_path")
    .order_by("name", "full_path")
)
session.catalog.register_tool("search_sections", search_plan, "Search sections by keyword")
```
Why each tool matters:
| Tool | Purpose |
|---|---|
| `list_whitepapers()` | Gives users a friendly index—builds confidence your catalog is concrete |
| `sections_by_topic(topic)` | Demonstrates your curated taxonomy from Step 3 |
| `search_sections(query)` | Simplest possible search—predictable, explainable, fast |
Step 6: Smoke-test in a few lines
Use the simple test client from the notebook:
```python
# Connect to the MCP server
async with Client(f"http://{HOST}:{PORT}/mcp") as client:
    # Test each tool
    print(await client.call_tool("list_whitepapers", {}))
    print(await client.call_tool("sections_by_topic", {"topic": "training"}))
    print(await client.call_tool("search_sections", {"query": "latency"}))
```
Each returns a short table (15 rows max) with sensible columns. That's enough to understand "what's in the box" and wire the tools into any MCP-capable host or UI.
What "done" looks like
- A printed line like `✅ MCP HTTP server ready at http://127.0.0.1:54217/mcp`
- `whitepaper_sections` with thousands of rows (H1–H3 blocks) and consistent `full_path`s
- `whitepaper_topics` with a handful of topics and the headings they touch
- Three tools you can demo interactively and wire into an MCP host
From here, product and compliance folks can ask concrete questions—"Show me all Governance sections across vendor A and B" or "Where do we talk about training data retention?"—without opening a single PDF.
Extending the demo (when you're ready)
These are intentionally out of scope for the core demo but easy to layer on:
- Vector ranking (`similar_sections`): Precompute an embedding for `heading + content` with `semantic.embed` and rank by cosine similarity. Great for "find me similar" retrieval. Keep it capped (`result_limit`) so payloads stay small. A precompute sketch follows this list.
- LLM scoring (`qa_sections_llm`): Have the model "vote" on the best section for a question. Map a short instruction over candidates and parse a numeric score. Pitfall: ensure your `semantic.map` instruction is a string literal, and materialize UDF outputs before tool registration.
- Contact signals: Regex UDFs for emails, URLs, and phone numbers can help compliance and outreach. Do the extraction in the table; keep tool definitions UDF-free.
- Richer taxonomies: Expand from three topics to your real taxonomy. The `sections_by_topic` join pattern stays the same.
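As a concrete example of the first item, the embedding precompute might look like the sketch below. This is not the notebook's implementation: it assumes `semantic.embed` accepts a string column, embeds `content` only (concatenating the heading is left out), and defers the actual cosine-similarity ranking to whatever mechanism you add on top.

```python
# Hedged sketch: precompute and persist section embeddings once,
# so a similar_sections tool only has to rank, not embed.
section_embeddings = sections_long.select(
    "id",
    "name",
    "heading",
    "full_path",
    fc.semantic.embed(fc.col("content")).alias("embedding"),  # assumed call shape
)
section_embeddings.write.save_as_table("whitepaper_section_embeddings", mode="overwrite")
```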
Performance, cost, and safety notes
- Keep rows short: Sections (vs. full PDFs) dramatically reduce tokens if you add LLM steps later.
- Cache deliberately: When computing embeddings or heavy transforms, cache the intermediate DataFrame you'll reuse.
- Idempotency matters: Overwrite in dev; use versioned table names in CI.
- No UDFs in tools: Build tool queries from built-in expressions so they serialize cleanly.
- Deterministic joins: Link topics to sections via `(doc_id, heading)` and sort by `(name, full_path, level)` for predictable outputs.
How teams can use this
Security and compliance search
Your security team asks, "Where do we promise data retention limits?" Instead of skimming PDFs, they call `search_sections("retention")` or `sections_by_topic("governance")`.
Audit prep / due diligence
During vendor reviews, export slices to CSV and hand them to legal. Same query every time—no more "someone missed a paragraph on page 63."
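One way to produce that slice, assuming fenic's `to_polars()` conversion for the CSV handoff (any writer you already use works just as well):

```python
# Export the governance slice from the curated topics table for legal review
governance_slice = (
    topics_norm
    .filter(fc.col("topic") == fc.lit("governance"))
    .select("name", "topic", "heading")
    .order_by("name", "heading")
)
governance_slice.to_polars().write_csv("governance_sections.csv")
```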
Agent surfaces
Whether you use a CLI host or a UI, the agent doesn't freestyle answers. It calls deterministic tools and renders results—no hallucinated policy claims.
Docs Ops / editorial
Basic counts (list_whitepapers) often reveal "we have 3 policy docs with 300+ sections—we should consolidate" or "governance sections are thin compared to training."
Adopting this in your stack
- Open the Colab notebook and point it at your PDFs
- Keep header levels small at first (H1–H3 usually suffices)
- Save the two tables and wire up the MCP tools
- Share the MCP endpoints with your security, docs, and support teams
- Iterate on the topics index with real stakeholders—that's where value clicks
- Only then consider vector or LLM ranking as optional upgrades
Conclusion
In this demo, you take parsed PDFs, impose a small set of consistent structures, and publish them behind three tiny tools. That's a shape agents, analysts, and engineers can all agree on.
With fenic, the whole pipeline stays in a single, readable DataFrame flow—no glue code, no batching, no mystery prompts hiding in strings.
When you're ready, add vectors or LLM scoring as optional layers. But ship the curated tables first.
Try the demo, then iterate
- Open the notebook — run it and replace the sample PDFs with your own
- Docs — browse fenic's Markdown, text, and semantic operators
- Examples — explore other fenic demos for inspiration
