
Convert PDFs into a Queryable, Agent-Ready Catalog with fenic

Typedef Team


A practical workflow that takes parsed PDFs, builds a usable sections index, and exposes it as clean MCP tools—so agents (and humans) can ask real questions instead of skimming page by page.


TL;DR

  • Turn parsed PDFs into a clean sections table (H1–H3 chunks with paths) and a topics index (e.g., training, SOC, governance).
  • Publish your curated tables as minimal MCP tools (list_whitepapers, sections_by_topic, and search_sections) so any MCP-capable host can query them.
  • Keep it small, deterministic, and agent-ready—add vectors/LLM ranking later as an optional layer.
  • Try it: Open the demo notebook and point it at your own PDFs.

Introduction

Most teams eventually inherit a pile of PDF whitepapers (e.g., governance frameworks, training standards, SOC roadmaps) scattered across folders. You know the answers are "in there somewhere," but searching raw PDFs is slow, token-hungry, and brittle for agents.

This post shows how to build a compact, production-shaped pipeline with fenic that:

  1. Splits parsed PDFs into structured sections
  2. Indexes headings by topics of interest
  3. Exposes the curated dataset as three MCP tools

The goal is to prove the value of curation—simple tools over clean tables.

What you'll learn:

  • How to split Markdown PDFs into H1–H3 sections with stable paths
  • How to build a topics index (e.g., training, SOC, governance) from LLM classifications
  • How to persist tables and expose them as MCP tools in a few lines
  • How to extend with vectors or LLM ranking later

Why not "just let the agent read my PDFs"?

Short answer: you'll get inconsistent answers, higher latency and cost, and little control over what the model actually reads. A curated dataset fixes that:

  • Deterministic scope. You decide the rows and columns—no accidental 200-page token floods.
  • Human-legible structure. Sections (heading, level, path) are explainable, auditable, and testable.
  • Reusability. The same tables support search, analytics, and agents without re-ingesting PDFs.

Agents still have a place—just on top of a small, intentional surface area.

What we're building

Data products:

  • whitepaper_sections — one row per section (H1–H3): id, name, heading, level, content, full_path
  • whitepaper_topics — one row per (doc, heading, topic): doc_id, name, topic, heading

MCP tools:

Tool                        Description
list_whitepapers()          PDF names + section counts
sections_by_topic(topic)    Rows from the curated topic index
search_sections(query)      Case-insensitive substring match

That's it. Three tools, three mental models, and everything flows from the tables.

Step 1: Start from parsed Markdown

We start from a normalized fenic DataFrame. In the notebook, this is deduped_docs_pdf_content_final—the result of downloading PDFs, converting them to Markdown via AI, and deduplicating. The key columns:

Column                    Description
id                        Stable identifier per document
name                      Filename or title
markdown_content          Full body as Markdown
toc                       Extracted table of contents (optional)
content_categorization    LLM classification output (optional)

If you're evaluating your own document estate, add source_uri and ingested_at for provenance.
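
If you do add provenance, a minimal sketch looks like this (assuming the parsed DataFrame is named deduped_docs_pdf_content_final as in the notebook; the URI and timestamp literals are placeholders):

python
# Hypothetical provenance columns; replace the literals with values from your ingestion job
docs_with_provenance = deduped_docs_pdf_content_final.select(
    "id", "name", "markdown_content", "toc", "content_categorization",
    fc.lit("s3://corp-docs/whitepapers/").alias("source_uri"),
    fc.lit("2025-01-15").alias("ingested_at"),
)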

Step 2: Split Markdown into sections (H1–H3)

The core trick is to chunk each whitepaper by its headings. With fenic's markdown.extract_header_chunks operator, you extract header-scoped chunks at each level, then union the results into one long-form table of section structs with these fields:

Field        Description
heading      The section title
level        1, 2, or 3
content      The body of that section
full_path    Breadcrumb string like H1 > H2 > H3

python
# Extract header chunks at each level
pdf_sections_1 = (
    deduped_docs_pdf_content_final
    .select("id", fc.markdown.extract_header_chunks(fc.col("markdown_content"), 1).alias("h1_sections"))
)
pdf_sections_2 = (
    deduped_docs_pdf_content_final
    .select("id", fc.markdown.extract_header_chunks(fc.col("markdown_content"), 2).alias("h2_sections"))
)
pdf_sections_3 = (
    deduped_docs_pdf_content_final
    .select("id", fc.markdown.extract_header_chunks(fc.col("markdown_content"), 3).alias("h3_sections"))
)

Why H1–H3? In practice, this is the sweet spot—it preserves structure without exploding into paragraphs. You can always tighten or loosen later.
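
If you do loosen it later, the same operator call simply takes a deeper level; a minimal sketch:

python
# Optional: also capture H4 chunks if your documents nest that deep
pdf_sections_4 = (
    deduped_docs_pdf_content_final
    .select("id", fc.markdown.extract_header_chunks(fc.col("markdown_content"), 4).alias("h4_sections"))
)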

Quality checks (see the sketch after this list):

  • Count of sections per doc
  • Distribution by level
  • Sample of full_path strings (valid breadcrumbs?)
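
A minimal sketch of those checks, assuming the per-level chunks have already been exploded into a long-form DataFrame named sections_long (one row per section, as in the notebook):

python
# Sections per document
per_doc_counts = (
    sections_long
    .group_by("id", "name")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("name")
)

# Distribution by heading level
per_level_counts = (
    sections_long
    .group_by("level")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("level")
)

# Spot-check breadcrumbs
path_sample = sections_long.select("name", "level", "full_path")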

Step 3: Build a topics index from LLM classification

Earlier in the notebook, you produced a content_categorization object per document using semantic.extract. This contains lists like sections_about_model_training or sections_about_soc_compliance. Turn those into a long-form topics table with one row per (doc_id, topic, heading):

python
# Build topics DataFrame from LLM classification.
# udf_get_training / udf_get_soc / udf_get_governance are small helper UDFs from the
# notebook that pull each heading list out of the content_categorization struct.
topics_long = (
    pdf_filtered_details
    .select(
        "id", "name",
        udf_get_training(fc.col("content_categorization")).alias("training_headings"),
        udf_get_soc(fc.col("content_categorization")).alias("soc_headings"),
        udf_get_governance(fc.col("content_categorization")).alias("governance_headings"),
    )
)

# Explode into normalized rows: (doc_id, name, topic, heading).
# _explode_topic is a notebook helper that flattens one heading-list column into
# one row per heading, tagged with the given topic label.
topics_norm = _explode_topic(topics_long, "training_headings", "training").union_all(
    _explode_topic(topics_long, "soc_headings", "soc")
).union_all(
    _explode_topic(topics_long, "governance_headings", "governance")
)

The result: whitepaper_topics can answer "show me every Training heading across the catalog" with a simple group-by or join.
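
For example, a minimal sketch of that query (assuming the topics DataFrame is topics_norm from above):

python
# Every training heading across the catalog, per whitepaper
training_headings = (
    topics_norm
    .filter(fc.col("topic") == fc.lit("training"))
    .select("name", "heading")
    .order_by("name", "heading")
)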

Step 4: Persist the tables (idempotent)

Save both DataFrames as tables so your MCP tools can reference them by name (topics_norm from Step 3, and sections_long, the long-form one-row-per-section DataFrame assembled from the Step 2 chunks in the notebook):

python
topics_norm.write.save_as_table("whitepaper_topics", mode="overwrite")
sections_long.write.save_as_table("whitepaper_sections", mode="overwrite")

Two rules keep things robust:

  1. Idempotent writes — use mode="overwrite" for dev runs so re-execution is safe.
  2. No UDFs inside tool definitions — keep tool queries to built-in expressions so they serialize cleanly.

This avoids the common "UDFExpr cannot be serialized" error when saving catalog tools.
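
If you want the CI-friendly variant mentioned in the notes near the end (versioned table names per run), the same call works with a suffixed name; a minimal sketch, with an illustrative run_tag:

python
# CI/versioned variant of the Step 4 writes (run_tag is a placeholder)
run_tag = "2025_01_15"
topics_norm.write.save_as_table(f"whitepaper_topics_{run_tag}", mode="overwrite")
sections_long.write.save_as_table(f"whitepaper_sections_{run_tag}", mode="overwrite")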

Step 5: Publish MCP tools

With the tables saved, register three MCP tools:

python
# sections_tbl and topics_tbl are the tables persisted in Step 4, read back from the
# session catalog in the notebook.

# Tool 1: list_whitepapers()
wp_counts = (
    sections_tbl
    .group_by("id", "name")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("name")
)
session.catalog.register_tool("list_whitepapers", wp_counts, "List all whitepapers with section counts")

# Tool 2: sections_by_topic(topic)
topic_param = fc.tool_param("topic", fc.StringType)

# Alias the topic table's heading so the join key doesn't collide with sections_tbl's columns
topic_rows = (
    topics_tbl
    .filter(fc.text.lower(fc.col("topic")) == fc.text.lower(topic_param))
    .select("doc_id", "topic", fc.col("heading").alias("topic_heading"))
)
by_topic = (
    topic_rows
    .join(sections_tbl, (fc.col("doc_id") == fc.col("id")) & (fc.col("topic_heading") == fc.col("heading")))
    .select("name", "topic", "heading", "level", "full_path", "content")
    .order_by("name", "full_path", "level")
)
session.catalog.register_tool("sections_by_topic", by_topic, "Get sections for a given topic")

# Tool 3: search_sections(query)
query_param = fc.tool_param("query", fc.StringType)
search_plan = (
    sections_tbl
    .filter(
        fc.text.lower(fc.col("heading")).contains(fc.text.lower(query_param)) |
        fc.text.lower(fc.col("content")).contains(fc.text.lower(query_param))
    )
    .select("name", "heading", "level", "full_path")
    .order_by("name", "full_path")
)
session.catalog.register_tool("search_sections", search_plan, "Search sections by keyword")

Why each tool matters:

Tool                        Purpose
list_whitepapers()          Gives users a friendly index; builds confidence your catalog is concrete
sections_by_topic(topic)    Demonstrates your curated taxonomy from Step 3
search_sections(query)      Simplest possible search: predictable, explainable, fast

Step 6: Smoke-test in a few lines

Use the simple test client from the notebook:

python
# Connect to the MCP server (Client is the MCP test client used in the notebook;
# HOST and PORT come from the cell that starts the MCP HTTP server)
async with Client(f"http://{HOST}:{PORT}/mcp") as client:
    # Call each tool once to confirm it returns sensible rows
    print(await client.call_tool("list_whitepapers", {}))
    print(await client.call_tool("sections_by_topic", {"topic": "training"}))
    print(await client.call_tool("search_sections", {"query": "latency"}))

Each returns a short table (15 rows max) with sensible columns. That's enough to understand "what's in the box" and wire the tools into any MCP-capable host or UI.

What "done" looks like

  • A printed line like: ✅ MCP HTTP server ready at http://127.0.0.1:54217/mcp
  • whitepaper_sections with thousands of rows (H1–H3 blocks) and consistent full_paths
  • whitepaper_topics with a handful of topics and the headings they touch
  • Three tools you can demo interactively and wire into an MCP host

From here, product and compliance folks can ask concrete questions—"Show me all Governance sections across vendor A and B" or "Where do we talk about training data retention?"—without opening a single PDF.

Extending the demo (when you're ready)

These are intentionally out of scope for the core demo but easy to layer on:

  1. Vector ranking (similar_sections): Precompute an embedding for heading + content with semantic.embed and rank by cosine similarity. Great for "find me similar" retrieval. Keep it capped (result_limit) so payloads stay small. See the sketch after this list.

  2. LLM scoring (qa_sections_llm): Have the model "vote" on the best section for a question. Map a short instruction over candidates and parse a numeric score. Pitfall: ensure your semantic.map instruction is a string literal, and materialize UDF outputs before tool registration.

  3. Contact signals: Regex UDFs for emails, URLs, and phone numbers can help compliance and outreach. Do the extraction in the table; keep tool definitions UDF-free.

  4. Richer taxonomies: Expand from three topics to your real taxonomy. The sections_by_topic join pattern stays the same.
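
For the first extension, a minimal sketch of the precompute step, assuming semantic.embed is exposed under the fc.semantic namespace and embedding only the section content (the cosine-similarity ranking itself depends on the vector helpers in your fenic version):

python
# Precompute embeddings once and persist them alongside the sections (assumed API shape)
sections_embedded = sections_tbl.select(
    "id", "name", "heading", "level", "full_path", "content",
    fc.semantic.embed(fc.col("content")).alias("content_embedding"),
)
sections_embedded.write.save_as_table("whitepaper_section_embeddings", mode="overwrite")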

Performance, cost, and safety notes

  • Keep rows short: Sections (vs. full PDFs) dramatically reduce tokens if you add LLM steps later.
  • Cache deliberately: When computing embeddings or heavy transforms, cache the intermediate DataFrame you'll reuse.
  • Idempotency matters: Overwrite in dev; use versioned table names in CI.
  • No UDFs in tools: Build tool queries from built-in expressions so they serialize cleanly.
  • Deterministic joins: Link topics to sections via (doc_id, heading) and sort by (name, full_path, level) for predictable outputs.

How teams can use this

Security and compliance search: Your security team asks, "Where do we promise data retention limits?" Instead of skimming PDFs, they call search_sections("retention") or sections_by_topic("governance").

Audit prep / due diligence: During vendor reviews, export slices to CSV and hand them to legal. Same query every time—no more "someone missed a paragraph on page 63."

Agent surfaces: Whether you use a CLI host or a UI, the agent doesn't freestyle answers. It calls deterministic tools and renders results—no hallucinated policy claims.

Docs Ops / editorial: Basic counts (list_whitepapers) often reveal "we have 3 policy docs with 300+ sections—we should consolidate" or "governance sections are thin compared to training."

Adopting this in your stack

  1. Open the Colab notebook and point it at your PDFs
  2. Keep header levels small at first (H1–H3 usually suffices)
  3. Save the two tables and wire up the MCP tools
  4. Share the MCP endpoints with your security, docs, and support teams
  5. Iterate on the topics index with real stakeholders—that's where value clicks
  6. Only then consider vector or LLM ranking as optional upgrades

Conclusion

In this demo, you take parsed PDFs, impose a small set of consistent structures, and publish them behind three tiny tools. That's a shape agents, analysts, and engineers can all agree on.

With fenic, the whole pipeline stays in a single, readable DataFrame flow—no glue code, no manual batching, no mystery prompts hiding in strings.

When you're ready, add vectors or LLM scoring as optional layers. But ship the curated tables first.


Try the demo, then iterate

  • Open the notebook — run it and replace the sample PDFs with your own
  • Docs — browse fenic's Markdown, text, and semantic operators
  • Examples — explore other fenic demos for inspiration