Convert PDFs into a Queryable, Agent-Ready Catalog with fenic
A practical workflow that takes parsed PDFs, builds a usable sections index, and exposes it as clean MCP tools—so agents (and humans) can ask real questions instead of skimming page by page.
TL;DR
- Turn parsed PDFs into a clean sections table (H1–H3 chunks with paths) and a topics index (e.g., training, SOC, governance).
- Publish your curated tables as minimal MCP tools (`list_whitepapers`, `sections_by_topic`, and `search_sections`) so any MCP-capable host can query them.
- Keep it small, deterministic, and agent-ready—add vectors/LLM ranking later as an optional layer.
- Try it: Open the demo notebook and point it at your own PDFs.
Introduction
Most teams eventually inherit a pile of PDF whitepapers (e.g., governance frameworks, training standards, SOC roadmaps) scattered across folders. You know the answers are "in there somewhere," but searching raw PDFs is slow, token-hungry, and brittle for agents.
This post shows how to build a compact, production-shaped pipeline with fenic that:
- Splits parsed PDFs into structured sections
- Indexes headings by topics of interest
- Exposes the curated dataset as three MCP tools
The goal is to prove the value of curation—simple tools over clean tables.
What you'll learn:
- How to split Markdown PDFs into H1–H3 sections with stable paths
- How to build a topics index (e.g., training, SOC, governance) from LLM classifications
- How to persist tables and expose them as MCP tools in a few lines
- How to extend with vectors or LLM ranking later
Why not "just let the agent read my PDFs"?
Short answer: you'll get inconsistent answers, higher latency/cost, and little control over what the model reads. A curated dataset fixes that:
- Deterministic scope. You decide the rows and columns—no accidental 200-page token floods.
- Human-legible structure. Sections (heading, level, path) are explainable, auditable, and testable.
- Reusability. The same tables support search, analytics, and agents without re-ingesting PDFs.
Agents still have a place—just on top of a small, intentional surface area.
What we're building
Data products:
- `whitepaper_sections` — one row per section (H1–H3): `id`, `name`, `heading`, `level`, `content`, `full_path`
- `whitepaper_topics` — one row per (doc, heading, topic): `doc_id`, `name`, `topic`, `heading`
MCP tools:
| Tool | Description |
|---|---|
| `list_whitepapers()` | PDF names + section counts |
| `sections_by_topic(topic)` | Rows from the curated topic index |
| `search_sections(query)` | Case-insensitive substring match |
That's it. Three tools, three mental models, and everything flows from the tables.
Step 1: Start from parsed Markdown
We start from a normalized fenic DataFrame. In the notebook, this is `deduped_docs_pdf_content_final`—the result of downloading PDFs, converting them to Markdown via AI, and deduplicating. The key columns:
| Column | Description |
|---|---|
| `id` | Stable identifier per document |
| `name` | Filename or title |
| `markdown_content` | Full body as Markdown |
| `toc` | Extracted table of contents (optional) |
| `content_categorization` | LLM classification output (optional) |
If you're evaluating your own document estate, add `source_uri` and `ingested_at` for provenance.
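A minimal sketch of what that could look like, assuming fenic exposes a `with_column` method and using `fc.lit` for the values; the URI and timestamp are purely illustrative:

```python
# Hedged sketch: stamp provenance columns before sectioning.
# `with_column` is assumed from fenic's DataFrame API; the values are illustrative.
docs_with_provenance = (
    deduped_docs_pdf_content_final
    .with_column("source_uri", fc.lit("s3://whitepapers/vendor-a/governance.pdf"))
    .with_column("ingested_at", fc.lit("2025-01-15"))
)
```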
Step 2: Split Markdown into sections (H1–H3)
The core trick is to chunk each whitepaper by its headings. With fenic's `markdown.extract_header_chunks` operator, you extract header-scoped chunks at specific levels, then union them into a single column of section structs:
| Field | Description |
|---|---|
| `heading` | The section title |
| `level` | 1, 2, or 3 |
| `content` | The body of that section |
| `full_path` | Breadcrumb string like `H1 > H2 > H3` |
```python
# Extract header chunks at each level
pdf_sections_1 = (
    deduped_docs_pdf_content_final
    .select(
        "id",
        fc.markdown.extract_header_chunks(fc.col("markdown_content"), 1).alias("h1_sections"),
    )
)
pdf_sections_2 = (
    deduped_docs_pdf_content_final
    .select(
        "id",
        fc.markdown.extract_header_chunks(fc.col("markdown_content"), 2).alias("h2_sections"),
    )
)
pdf_sections_3 = (
    deduped_docs_pdf_content_final
    .select(
        "id",
        fc.markdown.extract_header_chunks(fc.col("markdown_content"), 3).alias("h3_sections"),
    )
)
```
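The notebook then flattens these into the one-row-per-section DataFrame (`sections_long`) that gets persisted in Step 4. A minimal sketch of that step, assuming `explode` and `unnest` behave as their names suggest; the notebook's version also carries the document `name` and derives the `full_path` breadcrumb:

```python
# Union the per-level chunk columns into one column of section structs,
# then flatten to one row per section (heading, level, content, ...).
sections_long = (
    pdf_sections_1.select("id", fc.col("h1_sections").alias("section"))
    .union_all(pdf_sections_2.select("id", fc.col("h2_sections").alias("section")))
    .union_all(pdf_sections_3.select("id", fc.col("h3_sections").alias("section")))
    .explode("section")   # one row per section struct
    .unnest("section")    # lift the struct fields into top-level columns
)
```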
Why H1–H3? In practice, this is the sweet spot—it preserves structure without exploding into paragraphs. You can always tighten or loosen later.
Quality checks (a couple of quick queries are sketched below):
- Count of sections per doc
- Distribution by `level`
- Sample of `full_path` strings (valid breadcrumbs?)
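Here's a sketch of the first two checks, reusing the same aggregation pattern as the `list_whitepapers` tool later on and assuming the `sections_long` DataFrame sketched above:

```python
# Sections per document (catches PDFs that parsed into one giant block)
(
    sections_long
    .group_by("id")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("section_count")
    .show()
)

# Distribution by heading level (H1-H3 should all appear in sensible proportions)
(
    sections_long
    .group_by("level")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("level")
    .show()
)
```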
Step 3: Build a topics index from LLM classification
Earlier in the notebook, you produced a `content_categorization` object per document using `semantic.extract`. This contains lists like `sections_about_model_training` or `sections_about_soc_compliance`. Turn those into a long-form topics table with one row per `(doc_id, topic, heading)`:
```python
# Build topics DataFrame from LLM classification
topics_long = (
    pdf_filtered_details
    .select(
        "id",
        "name",
        udf_get_training(fc.col("content_categorization")).alias("training_headings"),
        udf_get_soc(fc.col("content_categorization")).alias("soc_headings"),
        udf_get_governance(fc.col("content_categorization")).alias("governance_headings"),
    )
)

# Explode into normalized rows: (doc_id, name, topic, heading)
topics_norm = (
    _explode_topic(topics_long, "training_headings", "training")
    .union_all(_explode_topic(topics_long, "soc_headings", "soc"))
    .union_all(_explode_topic(topics_long, "governance_headings", "governance"))
)
```
The result: `whitepaper_topics` can answer "show me every Training heading across the catalog" with a simple group-by or join.
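For instance, listing every Training heading is a short query over the normalized table (a sketch against the `topics_norm` DataFrame built above):

```python
# Every Training heading across the catalog, ordered for easy scanning
(
    topics_norm
    .filter(fc.col("topic") == fc.lit("training"))
    .select("name", "heading")
    .order_by("name", "heading")
    .show()
)
```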
Step 4: Persist the tables (idempotent)
Save both DataFrames as tables so your MCP tools can reference them by name:
```python
topics_norm.write.save_as_table("whitepaper_topics", mode="overwrite")
sections_long.write.save_as_table("whitepaper_sections", mode="overwrite")
```
Two rules keep things robust:
- Idempotent writes — use `mode="overwrite"` for dev runs so re-execution is safe.
- No UDFs inside tool definitions — keep tool queries to built-in expressions so they serialize cleanly.
This avoids the common "UDFExpr cannot be serialized" error when saving catalog tools.
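When you do need UDF-derived columns (like the contact-signal extraction mentioned later), materialize them into the table first and keep the tool plan itself UDF-free. A sketch of the pattern, with `extract_emails_udf` as a hypothetical UDF:

```python
# Run the (hypothetical) UDF once, up front, and persist the result...
enriched_sections = sections_long.select(
    "id", "name", "heading", "level", "content", "full_path",
    extract_emails_udf(fc.col("content")).alias("emails"),  # hypothetical regex UDF
)
enriched_sections.write.save_as_table("whitepaper_sections", mode="overwrite")
# ...so any tool query over the saved table uses only built-in expressions and serializes cleanly.
```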
Step 5: Publish MCP tools
With the tables saved, register three MCP tools:
```python
# `sections_tbl` and `topics_tbl` reference the tables saved in Step 4.

# Tool 1: list_whitepapers()
wp_counts = (
    sections_tbl
    .group_by("id", "name")
    .agg(fc.count(fc.lit(1)).alias("section_count"))
    .order_by("name")
)
session.catalog.register_tool(
    "list_whitepapers", wp_counts, "List all whitepapers with section counts"
)

# Tool 2: sections_by_topic(topic)
topic_param = fc.tool_param("topic", fc.StringType)
by_topic = (
    topics_tbl
    .filter(fc.text.lower(fc.col("topic")) == fc.text.lower(topic_param))
    .join(
        sections_tbl,
        (fc.col("doc_id") == fc.col("id")) & (fc.col("heading") == fc.col("heading")),
    )
    .select("name", "topic", "heading", "level", "full_path", "content")
    .order_by("name", "full_path", "level")
)
session.catalog.register_tool("sections_by_topic", by_topic, "Get sections for a given topic")

# Tool 3: search_sections(query)
query_param = fc.tool_param("query", fc.StringType)
search_plan = (
    sections_tbl
    .filter(
        fc.text.lower(fc.col("heading")).contains(fc.text.lower(query_param))
        | fc.text.lower(fc.col("content")).contains(fc.text.lower(query_param))
    )
    .select("name", "heading", "level", "full_path")
    .order_by("name", "full_path")
)
session.catalog.register_tool("search_sections", search_plan, "Search sections by keyword")
```
Why each tool matters:
| Tool | Purpose |
|---|---|
| `list_whitepapers()` | Gives users a friendly index—builds confidence your catalog is concrete |
| `sections_by_topic(topic)` | Demonstrates your curated taxonomy from Step 3 |
| `search_sections(query)` | Simplest possible search—predictable, explainable, fast |
Step 6: Smoke-test in a few lines
Use the simple test client from the notebook:
```python
# Connect to the MCP server
async with Client(f"http://{HOST}:{PORT}/mcp") as client:
    # Test each tool
    print(await client.call_tool("list_whitepapers", {}))
    print(await client.call_tool("sections_by_topic", {"topic": "training"}))
    print(await client.call_tool("search_sections", {"query": "latency"}))
```
Each returns a short table (15 rows max) with sensible columns. That's enough to understand "what's in the box" and wire the tools into any MCP-capable host or UI.
What "done" looks like
- A printed line like `✅ MCP HTTP server ready at http://127.0.0.1:54217/mcp`
- `whitepaper_sections` with thousands of rows (H1–H3 blocks) and consistent `full_path`s
- `whitepaper_topics` with a handful of topics and the headings they touch
- Three tools you can demo interactively and wire into an MCP host
From here, product and compliance folks can ask concrete questions—"Show me all Governance sections across vendor A and B" or "Where do we talk about training data retention?"—without opening a single PDF.
Extending the demo (when you're ready)
These are intentionally out of scope for the core demo but easy to layer on:
- Vector ranking (`similar_sections`): Precompute an embedding for `heading + content` with `semantic.embed` and rank by cosine similarity. Great for "find me similar" retrieval. Keep it capped (`result_limit`) so payloads stay small. A precompute sketch follows this list.
- LLM scoring (`qa_sections_llm`): Have the model "vote" on the best section for a question. Map a short instruction over candidates and parse a numeric score. Pitfall: ensure your `semantic.map` instruction is a string literal, and materialize UDF outputs before tool registration.
- Contact signals: Regex UDFs for emails, URLs, and phone numbers can help compliance and outreach. Do the extraction in the table; keep tool definitions UDF-free.
- Richer taxonomies: Expand from three topics to your real taxonomy. The `sections_by_topic` join pattern stays the same.
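As a concrete example of the first item, the embedding precompute might look like the sketch below. This is not the notebook's implementation: it assumes `semantic.embed` accepts a string column, embeds `content` only (concatenating the heading is left out), and defers the actual cosine-similarity ranking to whatever mechanism you add on top.

```python
# Hedged sketch: precompute and persist section embeddings once,
# so a similar_sections tool only has to rank, not embed.
section_embeddings = sections_long.select(
    "id",
    "name",
    "heading",
    "full_path",
    fc.semantic.embed(fc.col("content")).alias("embedding"),  # assumed call shape
)
section_embeddings.write.save_as_table("whitepaper_section_embeddings", mode="overwrite")
```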
Performance, cost, and safety notes
- Keep rows short: Sections (vs. full PDFs) dramatically reduce tokens if you add LLM steps later.
- Cache deliberately: When computing embeddings or heavy transforms, cache the intermediate DataFrame you'll reuse.
- Idempotency matters: Overwrite in dev; use versioned table names in CI.
- No UDFs in tools: Build tool queries from built-in expressions so they serialize cleanly.
- Deterministic joins: Link topics to sections via `(doc_id, heading)` and sort by `(name, full_path, level)` for predictable outputs.
How teams can use this
Security and compliance search
Your security team asks, "Where do we promise data retention limits?" Instead of skimming PDFs, they call `search_sections("retention")` or `sections_by_topic("governance")`.
Audit prep / due diligence
During vendor reviews, export slices to CSV and hand them to legal. Same query every time—no more "someone missed a paragraph on page 63."
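One way to produce that slice, assuming fenic's `to_polars()` conversion for the CSV handoff (any writer you already use works just as well):

```python
# Export the governance slice from the curated topics table for legal review
governance_slice = (
    topics_norm
    .filter(fc.col("topic") == fc.lit("governance"))
    .select("name", "topic", "heading")
    .order_by("name", "heading")
)
governance_slice.to_polars().write_csv("governance_sections.csv")
```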
Agent surfaces
Whether you use a CLI host or a UI, the agent doesn't freestyle answers. It calls deterministic tools and renders results—no hallucinated policy claims.
Docs Ops / editorial
Basic counts (list_whitepapers) often reveal "we have 3 policy docs with 300+ sections—we should consolidate" or "governance sections are thin compared to training."
Adopting this in your stack
- Open the Colab notebook and point it at your PDFs
- Keep header levels small at first (H1–H3 usually suffices)
- Save the two tables and wire up the MCP tools
- Share the MCP endpoints with your security, docs, and support teams
- Iterate on the topics index with real stakeholders—that's where value clicks
- Only then consider vector or LLM ranking as optional upgrades
Conclusion
In this demo, you take parsed PDFs, impose a small set of consistent structures, and publish them behind three tiny tools. That's a shape agents, analysts, and engineers can all agree on.
With fenic, the whole pipeline stays in a single, readable DataFrame flow—no glue code, no batching, no mystery prompts hiding in strings.
When you're ready, add vectors or LLM scoring as optional layers. But ship the curated tables first.
Try the demo, then iterate
- Open the notebook — run it and replace the sample PDFs with your own
- Docs — browse fenic's Markdown, text, and semantic operators
- Examples — explore other fenic demos for inspiration
