How to Make PDFs Searchable for AI Agents: A Python Guide

Typedef Team

The Problem: PDFs Are Black Boxes for AI

You've got 50 PDF whitepapers—compliance docs, vendor assessments, training materials. Someone asks: "What do our policies say about data retention?"

Your options today:

| Approach | Problem |
|---|---|
| Dump PDFs into an LLM | Token limits, inconsistent answers, expensive |
| Keyword search (Ctrl+F) | Misses semantic matches, manual, slow |
| Vector search / RAG | Chunks lack context, retrieval is fuzzy |
| Let the agent "figure it out" | Hallucinations, no auditability |

The real issue: PDFs aren't structured for machines. Headers, sections, and topics exist visually but aren't accessible programmatically.

The Solution: Structure First, Search Second

Instead of throwing raw PDFs at an AI, build a curated knowledge layer:

  1. Parse PDFs into Markdown (preserving headings)
  2. Split into sections (H1, H2, H3 chunks)
  3. Index sections by topic (training, compliance, governance)
  4. Expose as queryable tools your agent can call

The result: deterministic, auditable answers. No token floods. No hallucinations about policy claims.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    PDFs     │ ──▶ │  Markdown   │ ──▶ │  Sections   │ ──▶ │ Query Tools │
│             │     │  (parsed)   │     │   Table     │     │  (API/MCP)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

Step 1: Convert PDFs to Markdown

First, get your PDFs into a structured text format. Markdown preserves heading hierarchy, which is critical for the next step.

Options:

  • PyMuPDF — fast, local, good for simple layouts
  • pdfplumber — better table extraction
  • Marker — ML-based, handles complex layouts
  • Vision LLMs (GPT-4V, Gemini) — best quality for messy PDFs, but slower/costlier
python
# Example with marker (ML-based conversion).
# Note: marker's API differs between releases; depending on your version you
# may need to load its models first and pass them into the convert call.
from marker.convert import convert_single_pdf

markdown_content, images, metadata = convert_single_pdf("whitepaper.pdf")

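If you want a fast, fully local alternative, the pymupdf4llm helper built on PyMuPDF emits Markdown directly (a minimal sketch; assumes the pymupdf4llm package is installed):

python
# Alternative: PyMuPDF via the pymupdf4llm helper (fast, local)
import pymupdf4llm

markdown_content = pymupdf4llm.to_markdown("whitepaper.pdf")
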
Goal: One Markdown string per document with intact #, ##, ### headers.

Step 2: Split Markdown into Sections

Now chunk each document by its headings. This preserves semantic boundaries—far better than arbitrary 500-token splits.

python
import re

def extract_sections(markdown: str, max_level: int = 3) -> list[dict]:
    """Extract sections from Markdown by header level."""
    sections = []
    # Match headers H1-H3
    pattern = r'^(#{1,' + str(max_level) + r'})\s+(.+)$'

    lines = markdown.split('\n')
    current_section = None
    content_lines = []

    for line in lines:
        match = re.match(pattern, line)
        if match:
            # Save previous section
            if current_section:
                current_section['content'] = '\n'.join(content_lines).strip()
                sections.append(current_section)

            level = len(match.group(1))
            heading = match.group(2).strip()
            current_section = {
                'heading': heading,
                'level': level,
                'content': ''
            }
            content_lines = []
        else:
            content_lines.append(line)

    # Don't forget the last section
    if current_section:
        current_section['content'] = '\n'.join(content_lines).strip()
        sections.append(current_section)

    return sections

Pro tip: Add a full_path field like "Introduction > Security > Data Retention" for context. This helps when sections have generic names like "Overview."

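One way to build that breadcrumb is to keep a stack of the most recent heading seen at each level while walking the section list (a minimal sketch; add_full_paths is an illustrative helper that fills the full_path field suggested above):

python
def add_full_paths(sections: list[dict]) -> list[dict]:
    """Attach a 'Heading > Subheading > ...' breadcrumb to each section."""
    stack = []  # most recent heading seen at each level
    for section in sections:
        level = section['level']
        # Drop anything at this level or deeper, then push the current heading
        stack = stack[:level - 1] + [section['heading']]
        section['full_path'] = ' > '.join(stack)
    return sections
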
Step 3: Build a Topics Index

Sections are great, but you also want to query by topic—"show me everything about training" across all documents.

Two approaches:

A) Keyword-based (fast, no LLM):

python
TOPIC_KEYWORDS = {
    'training': ['training', 'model training', 'fine-tuning', 'dataset'],
    'compliance': ['soc', 'gdpr', 'hipaa', 'compliance', 'audit'],
    'security': ['encryption', 'authentication', 'access control'],
}

def classify_section(section: dict) -> list[str]:
    text = (section['heading'] + ' ' + section['content']).lower()
    return [topic for topic, keywords in TOPIC_KEYWORDS.items()
            if any(kw in text for kw in keywords)]

B) LLM-based (more accurate):

python
# Use an LLM to classify sections into topics
# Returns structured output like:
# {"topics": ["training", "compliance"], "confidence": 0.92}

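For example, one way to get that structured output is the OpenAI SDK in JSON mode (a sketch, not the only option; the gpt-4o-mini model name and the classify_section_llm helper are illustrative, and an OPENAI_API_KEY is assumed to be set):

python
import json
from openai import OpenAI

client = OpenAI()

def classify_section_llm(section: dict, topics: list[str]) -> dict:
    """Ask the model to tag one section with topics from a fixed list."""
    prompt = (
        f"Classify this section into zero or more of these topics: {topics}.\n"
        f"Heading: {section['heading']}\n"
        f"Content: {section['content'][:2000]}\n"
        'Respond as JSON: {"topics": [...], "confidence": <0-1>}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
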
Store the result as a topics table: (doc_name, heading, topic), matching the schema created in Step 4.

Step 4: Store in a Queryable Format

Now persist your sections and topics so they're queryable. Options:

| Storage | Best For |
|---|---|
| SQLite | Local dev, small datasets |
| PostgreSQL | Production, full-text search |
| DuckDB | Analytics, larger datasets |
| Parquet files | Data pipelines, versioning |
python
import sqlite3

conn = sqlite3.connect('knowledge_base.db')

# Create sections table
conn.execute('''
    CREATE TABLE IF NOT EXISTS sections (
        id INTEGER PRIMARY KEY,
        doc_name TEXT,
        heading TEXT,
        level INTEGER,
        content TEXT,
        full_path TEXT
    )
''')

# Create topics index
conn.execute('''
    CREATE TABLE IF NOT EXISTS topics (
        doc_name TEXT,
        heading TEXT,
        topic TEXT
    )
''')

# Insert your data...

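Filling those tables is a couple of executemany calls that reuse extract_sections and classify_section from the earlier steps (a sketch; store_document is an illustrative helper, and full_path falls back to the heading if you skipped the breadcrumb step):

python
def store_document(doc_name: str, sections: list[dict]) -> None:
    """Insert one document's sections and keyword-derived topic tags."""
    conn.executemany(
        'INSERT INTO sections (doc_name, heading, level, content, full_path) '
        'VALUES (?, ?, ?, ?, ?)',
        [(doc_name, s['heading'], s['level'], s['content'],
          s.get('full_path', s['heading'])) for s in sections],
    )
    conn.executemany(
        'INSERT INTO topics (doc_name, heading, topic) VALUES (?, ?, ?)',
        [(doc_name, s['heading'], topic)
         for s in sections for topic in classify_section(s)],
    )
    conn.commit()
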
Step 5: Expose as Query Tools

The final step: make your knowledge base accessible to AI agents (or humans) via simple query tools.

Three essential tools:

python
# Reuse the SQLite connection from Step 4. sqlite3.Row gives name-based row
# access; wrap rows in dict(...) if an agent needs plain JSON.
conn.row_factory = sqlite3.Row

def list_documents() -> list[dict]:
    """Return all documents with section counts."""
    return conn.execute('''
        SELECT doc_name, COUNT(*) as section_count
        FROM sections
        GROUP BY doc_name
    ''').fetchall()

def search_sections(query: str) -> list[dict]:
    """Search sections by keyword (case-insensitive)."""
    return conn.execute('''
        SELECT doc_name, heading, level, full_path, content
        FROM sections
        WHERE heading LIKE ? OR content LIKE ?
        ORDER BY doc_name, full_path
        LIMIT 20
    ''', (f'%{query}%', f'%{query}%')).fetchall()

def sections_by_topic(topic: str) -> list[dict]:
    """Get all sections tagged with a specific topic."""
    return conn.execute('''
        SELECT s.doc_name, s.heading, s.level, s.full_path, s.content
        FROM sections s
        JOIN topics t ON s.doc_name = t.doc_name AND s.heading = t.heading
        WHERE LOWER(t.topic) = LOWER(?)
        ORDER BY s.doc_name, s.full_path
    ''', (topic,)).fetchall()

For AI agents, expose these as:

  • Function calling (OpenAI, Anthropic), sketched below
  • MCP tools (Model Context Protocol)
  • REST API endpoints

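For instance, a function-calling declaration for search_sections could look like the following (a sketch of the OpenAI-style tool schema; the description strings are illustrative). When the model emits a call to this tool, run search_sections(query) and return the rows as the tool result.

python
# Tool schema an agent can call; the agent only supplies the query string.
SEARCH_SECTIONS_TOOL = {
    "type": "function",
    "function": {
        "name": "search_sections",
        "description": "Search the PDF knowledge base by keyword and return matching sections.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Keyword or phrase to search for.",
                },
            },
            "required": ["query"],
        },
    },
}
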
Why This Beats RAG for Structured Documents

| Factor | Vector RAG | Structured Sections |
|---|---|---|
| Retrieval accuracy | Fuzzy, depends on embedding quality | Exact, deterministic |
| Auditability | Hard to explain why a chunk was retrieved | Clear: "matched heading X in doc Y" |
| Token efficiency | Often retrieves redundant chunks | Returns only relevant sections |
| Maintenance | Re-embed on any change | Just update the table |

RAG still has a place—for semantic "find similar" queries. But for structured documents with clear headings and topics, a curated sections index is faster, cheaper, and more reliable.

Real-World Use Cases

Compliance search "Where do we mention data retention?"search_sections("retention") returns exact sections with document names and paths.

Vendor due diligence "Compare security sections across all vendor docs"sections_by_topic("security") returns a structured comparison.

Agent tool use Your AI agent calls search_sections("encryption") instead of guessing. Deterministic results, no hallucinations.

Editorial/docs ops "Which docs have thin governance sections?" → Query section counts by topic to find gaps.

Optional: Add Vector Search Later

Once your structured foundation is solid, you can layer on embeddings for semantic search:

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def add_embeddings(sections):
    for section in sections:
        text = f"{section['heading']}\n\n{section['content']}"
        section['embedding'] = model.encode(text)
    return sections

def similar_sections(sections: list[dict], query: str, top_k: int = 5) -> list[dict]:
    """Rank embedded sections by cosine similarity to the query."""
    query_embedding = model.encode(query)
    scored = sorted(
        sections,
        key=lambda s: float(util.cos_sim(query_embedding, s['embedding'])),
        reverse=True,
    )
    return scored[:top_k]

But ship the structured search first. You'll be surprised how far keyword + topic indexing gets you.


FAQ

How do I make a PDF searchable for AI?

Convert the PDF to Markdown (preserving headers), split it into sections by heading level (H1-H3), and store the sections in a database. Then expose simple query functions your AI agent can call—like search_sections(query) or sections_by_topic(topic).

What's the best way to chunk PDFs for LLMs?

Chunk by semantic boundaries (headings), not arbitrary token counts. Split at H1, H2, and H3 headers to preserve document structure. Include a "full path" breadcrumb (e.g., "Security > Authentication > OAuth") so chunks have context.

Should I use RAG or structured search for PDFs?

Use structured search (sections + topics) for documents with clear hierarchies like whitepapers, policies, and technical docs. Use RAG/vector search for unstructured content or "find similar" queries. Often, combining both works best.

How do I expose my PDF knowledge base to AI agents?

Create simple query functions (list_documents, search_sections, sections_by_topic) and expose them via function calling (OpenAI/Anthropic), MCP tools, or REST APIs. The agent calls these tools instead of reading raw PDFs.

What tools can convert PDFs to Markdown?

  • PyMuPDF — fast, local, good for simple layouts
  • pdfplumber — better for tables
  • Marker — ML-based, handles complex layouts
  • Vision LLMs (GPT-4V, Gemini) — highest quality but slower

Next Steps

  1. Start small — try this with 5-10 PDFs first
  2. Validate section quality — spot-check that headers are extracted correctly
  3. Iterate on topics — start with 3-5 topics, expand based on real queries
  4. Add vector search later — only if keyword + topic search isn't enough

Tools and Libraries

For a production-ready implementation of this pattern, check out fenic—a Python framework for building semantic data pipelines. It handles PDF parsing, section extraction, topic classification, and MCP tool generation in a single DataFrame flow.

Related tutorial: Convert PDFs into a Queryable, Agent-Ready Catalog with fenic — a detailed walkthrough using fenic's built-in operators.


Have questions? Found a better approach? Open an issue or check out the fenic docs.