## The Problem: PDFs Are Black Boxes for AI
You've got 50 PDF whitepapers—compliance docs, vendor assessments, training materials. Someone asks: "What do our policies say about data retention?"
Your options today:
| Approach | Problem |
|---|---|
| Dump PDFs into an LLM | Token limits, inconsistent answers, expensive |
| Keyword search (Ctrl+F) | Misses semantic matches, manual, slow |
| Vector search / RAG | Chunks lack context, retrieval is fuzzy |
| Let the agent "figure it out" | Hallucinations, no auditability |
The real issue: PDFs aren't structured for machines. Headers, sections, and topics exist visually but aren't accessible programmatically.
## The Solution: Structure First, Search Second
Instead of throwing raw PDFs at an AI, build a curated knowledge layer:
1. Parse PDFs into Markdown (preserving headings)
2. Split the Markdown into sections (H1, H2, H3 chunks)
3. Index sections by topic (training, compliance, governance)
4. Store sections and topics in a queryable format (SQLite, Postgres, Parquet)
5. Expose them as query tools your agent can call
The result: deterministic, auditable answers. No token floods. No hallucinations about policy claims.
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    PDFs     │ ──▶ │  Markdown   │ ──▶ │  Sections   │ ──▶ │ Query Tools │
│             │     │  (parsed)   │     │    Table    │     │  (API/MCP)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
```
## Step 1: Convert PDFs to Markdown
First, get your PDFs into a structured text format. Markdown preserves heading hierarchy, which is critical for the next step.
Options:
- PyMuPDF — fast, local, good for simple layouts
- pdfplumber — better table extraction
- Marker — ML-based, handles complex layouts
- Vision LLMs (GPT-4V, Gemini) — best quality for messy PDFs, but slower/costlier
```python
# Example with marker (ML-based conversion)
# Note: marker's API has changed across releases; this matches the 0.x interface
from marker.convert import convert_single_pdf
from marker.models import load_all_models

model_lst = load_all_models()
markdown_content, images, metadata = convert_single_pdf("whitepaper.pdf", model_lst)
```
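If your layouts are simple, PyMuPDF is the lighter-weight option. A sketch using its pymupdf4llm helper package (assumed installed via `pip install pymupdf4llm`):

```python
import pymupdf4llm

# Returns a single Markdown string with inferred #/##/### headers
markdown_content = pymupdf4llm.to_markdown("whitepaper.pdf")
```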
**Goal:** one Markdown string per document with intact `#`, `##`, `###` headers.
## Step 2: Split Markdown into Sections
Now chunk each document by its headings. This preserves semantic boundaries—far better than arbitrary 500-token splits.
```python
import re

def extract_sections(markdown: str, max_level: int = 3) -> list[dict]:
    """Extract sections from Markdown by header level."""
    sections = []
    # Match headers H1-H3
    pattern = r'^(#{1,' + str(max_level) + r'})\s+(.+)$'
    lines = markdown.split('\n')
    current_section = None
    content_lines = []

    for line in lines:
        match = re.match(pattern, line)
        if match:
            # Save the previous section before starting a new one
            if current_section:
                current_section['content'] = '\n'.join(content_lines).strip()
                sections.append(current_section)
            level = len(match.group(1))
            heading = match.group(2).strip()
            current_section = {'heading': heading, 'level': level, 'content': ''}
            content_lines = []
        else:
            content_lines.append(line)

    # Don't forget the last section
    if current_section:
        current_section['content'] = '\n'.join(content_lines).strip()
        sections.append(current_section)

    return sections
```
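A quick check on a toy document shows the output shape:

```python
doc = "# Intro\nWelcome.\n\n## Security\nWe encrypt data at rest."
for s in extract_sections(doc):
    print(s['level'], '|', s['heading'], '|', s['content'])
# 1 | Intro | Welcome.
# 2 | Security | We encrypt data at rest.
```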
**Pro tip:** add a `full_path` field like `"Introduction > Security > Data Retention"` for context. This helps when sections have generic names like "Overview."
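One way to build that breadcrumb, as a sketch that assumes the sections arrive in document order (the `add_full_paths` helper is ours, not a library function):

```python
def add_full_paths(sections: list[dict]) -> list[dict]:
    """Attach a 'full_path' breadcrumb like 'Intro > Security > Data Retention'."""
    stack = []  # the heading currently open at each level
    for section in sections:
        stack = stack[:section['level'] - 1]   # close same-or-deeper levels
        while len(stack) < section['level'] - 1:
            stack.append('')                   # tolerate skipped levels (H1 -> H3)
        stack.append(section['heading'])
        section['full_path'] = ' > '.join(h for h in stack if h)
    return sections
```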
## Step 3: Build a Topics Index
Sections are great, but you also want to query by topic—"show me everything about training" across all documents.
Two approaches:
A) Keyword-based (fast, no LLM):
```python
TOPIC_KEYWORDS = {
    'training': ['training', 'model training', 'fine-tuning', 'dataset'],
    'compliance': ['soc', 'gdpr', 'hipaa', 'compliance', 'audit'],
    'security': ['encryption', 'authentication', 'access control'],
}

def classify_section(section: dict) -> list[str]:
    text = (section['heading'] + ' ' + section['content']).lower()
    return [topic for topic, keywords in TOPIC_KEYWORDS.items()
            if any(kw in text for kw in keywords)]
```
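A quick sanity check (the section dict is a made-up example):

```python
print(classify_section({
    'heading': 'Model Training Pipeline',
    'content': 'We fine-tune on a curated dataset.',
}))
# ['training']
```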
B) LLM-based (more accurate):
```python
# Use an LLM to classify sections into topics.
# Returns structured output like:
# {"topics": ["training", "compliance"], "confidence": 0.92}
```
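A minimal sketch of the LLM approach using the OpenAI SDK; the model name and prompt are illustrative, not prescriptive:

```python
import json
from openai import OpenAI

client = OpenAI()

def classify_section_llm(section: dict, topics: list[str]) -> dict:
    """Ask an LLM which of the known topics a section covers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any JSON-mode-capable model works
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                f"Classify this section into zero or more of these topics: {topics}. "
                'Reply as JSON: {"topics": [...], "confidence": 0.0}\n\n'
                f"## {section['heading']}\n{section['content'][:2000]}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```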
Store the result in a topics table with one row per section/topic pair: `(doc_name, heading, topic)`.
## Step 4: Store in a Queryable Format
Now persist your sections and topics so they're queryable. Options:
| Storage | Best For |
|---|---|
| SQLite | Local dev, small datasets |
| PostgreSQL | Production, full-text search |
| DuckDB | Analytics, larger datasets |
| Parquet files | Data pipelines, versioning |
```python
import sqlite3

conn = sqlite3.connect('knowledge_base.db')

# Create sections table
conn.execute('''
    CREATE TABLE IF NOT EXISTS sections (
        id INTEGER PRIMARY KEY,
        doc_name TEXT,
        heading TEXT,
        level INTEGER,
        content TEXT,
        full_path TEXT
    )
''')

# Create topics index
conn.execute('''
    CREATE TABLE IF NOT EXISTS topics (
        doc_name TEXT,
        heading TEXT,
        topic TEXT
    )
''')

# Insert your data...
```
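Populating the tables is then a short loop; a sketch assuming the `sections` list from Step 2 (with `full_path` attached) and a `doc_name` for the source file:

```python
def insert_sections(conn, doc_name: str, sections: list[dict]) -> None:
    """Persist one document's parsed sections."""
    conn.executemany(
        'INSERT INTO sections (doc_name, heading, level, content, full_path) '
        'VALUES (?, ?, ?, ?, ?)',
        [(doc_name, s['heading'], s['level'], s['content'], s.get('full_path', ''))
         for s in sections],
    )
    conn.commit()
```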
## Step 5: Expose as Query Tools
The final step: make your knowledge base accessible to AI agents (or humans) via simple query tools.
Three essential tools:
```python
conn.row_factory = sqlite3.Row  # rows support dict-style access (row['doc_name'])

def list_documents() -> list[dict]:
    """Return all documents with section counts."""
    return conn.execute('''
        SELECT doc_name, COUNT(*) AS section_count
        FROM sections
        GROUP BY doc_name
    ''').fetchall()

def search_sections(query: str) -> list[dict]:
    """Search sections by keyword (case-insensitive)."""
    return conn.execute('''
        SELECT doc_name, heading, level, full_path, content
        FROM sections
        WHERE heading LIKE ? OR content LIKE ?
        ORDER BY doc_name, full_path
        LIMIT 20
    ''', (f'%{query}%', f'%{query}%')).fetchall()

def sections_by_topic(topic: str) -> list[dict]:
    """Get all sections tagged with a specific topic."""
    return conn.execute('''
        SELECT s.doc_name, s.heading, s.level, s.full_path, s.content
        FROM sections s
        JOIN topics t
          ON s.doc_name = t.doc_name AND s.heading = t.heading
        WHERE LOWER(t.topic) = LOWER(?)
        ORDER BY s.doc_name, s.full_path
    ''', (topic,)).fetchall()
```
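Calling them looks like this:

```python
for row in search_sections("data retention"):
    print(row['doc_name'], '->', row['full_path'])
```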
For AI agents, expose these as:
- Function calling (OpenAI, Anthropic)
- MCP tools (Model Context Protocol)
- REST API endpoints
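For the function-calling route, each tool is described by a JSON schema over the same function. An OpenAI-style definition for `search_sections` might look like this (the descriptions are illustrative):

```python
SEARCH_SECTIONS_TOOL = {
    "type": "function",
    "function": {
        "name": "search_sections",
        "description": "Keyword search over indexed document sections. "
                       "Returns doc name, heading, full path, and content.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Keyword or phrase to search for",
                },
            },
            "required": ["query"],
        },
    },
}
```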
## Why This Beats RAG for Structured Documents
| Factor | Vector RAG | Structured Sections |
|---|---|---|
| Retrieval accuracy | Fuzzy, depends on embedding quality | Exact, deterministic |
| Auditability | Hard to explain why a chunk was retrieved | Clear: "matched heading X in doc Y" |
| Token efficiency | Often retrieves redundant chunks | Returns only relevant sections |
| Maintenance | Re-embed on any change | Just update the table |
RAG still has a place—for semantic "find similar" queries. But for structured documents with clear headings and topics, a curated sections index is faster, cheaper, and more reliable.
## Real-World Use Cases
### Compliance search
"Where do we mention data retention?" → `search_sections("retention")` returns the exact sections, with document names and paths.

### Vendor due diligence
"Compare security sections across all vendor docs" → `sections_by_topic("security")` returns a structured comparison.

### Agent tool use
Your AI agent calls `search_sections("encryption")` instead of guessing. Deterministic results, no hallucinations.

### Editorial/docs ops
"Which docs have thin governance sections?" → query section counts by topic to find gaps.
## Optional: Add Vector Search Later
Once your structured foundation is solid, you can layer on embeddings for semantic search:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def add_embeddings(sections):
    for section in sections:
        text = f"{section['heading']}\n\n{section['content']}"
        section['embedding'] = model.encode(text)
    return sections

def similar_sections(sections, query: str, top_k: int = 5):
    """Return the top_k sections by cosine similarity to the query."""
    q = model.encode(query)
    return sorted(
        sections,
        key=lambda s: float(np.dot(s['embedding'], q)
                            / (np.linalg.norm(s['embedding']) * np.linalg.norm(q))),
        reverse=True,
    )[:top_k]
```
But ship the structured search first. You'll be surprised how far keyword + topic indexing gets you.
## FAQ
### How do I make a PDF searchable for AI?
Convert the PDF to Markdown (preserving headers), split it into sections by heading level (H1-H3), and store the sections in a database. Then expose simple query functions your AI agent can call—like `search_sections(query)` or `sections_by_topic(topic)`.
### What's the best way to chunk PDFs for LLMs?
Chunk by semantic boundaries (headings), not arbitrary token counts. Split at H1, H2, and H3 headers to preserve document structure. Include a "full path" breadcrumb (e.g., "Security > Authentication > OAuth") so chunks have context.
### Should I use RAG or structured search for PDFs?
Use structured search (sections + topics) for documents with clear hierarchies like whitepapers, policies, and technical docs. Use RAG/vector search for unstructured content or "find similar" queries. Often, combining both works best.
### How do I expose my PDF knowledge base to AI agents?
Create simple query functions (`list_documents`, `search_sections`, `sections_by_topic`) and expose them via function calling (OpenAI/Anthropic), MCP tools, or REST APIs. The agent calls these tools instead of reading raw PDFs.
### What tools can convert PDFs to Markdown?
- PyMuPDF — fast, local, good for simple layouts
- pdfplumber — better for tables
- Marker — ML-based, handles complex layouts
- Vision LLMs (GPT-4V, Gemini) — highest quality but slower
## Next Steps
- Start small — try this with 5-10 PDFs first
- Validate section quality — spot-check that headers are extracted correctly
- Iterate on topics — start with 3-5 topics, expand based on real queries
- Add vector search later — only if keyword + topic search isn't enough
## Tools and Libraries
For a production-ready implementation of this pattern, check out fenic—a Python framework for building semantic data pipelines. It handles PDF parsing, section extraction, topic classification, and MCP tool generation in a single DataFrame flow.
Related tutorial: Convert PDFs into a Queryable, Agent-Ready Catalog with fenic — a detailed walkthrough using fenic's built-in operators.
Have questions? Found a better approach? Open an issue or check out the fenic docs.
