How to Handle Token Limits and Rate Limits in Large-Scale LLM Inference

Typedef Team

Large-scale LLM inference operations fail when they hit provider limits. Building production systems requires precise control over token consumption and request throughput. Fenic, an opinionated DataFrame framework from Typedef.ai, provides built-in mechanisms to handle these constraints while maximizing inference throughput.

Defining the Challenge

When processing thousands of documents through LLMs, two bottlenecks emerge:

Token limits constrain how much text can be processed in a single request and how many tokens the model generates in its response.

Rate limits restrict the number of requests and tokens processed per minute across your entire application.

Both constraints force you to choose between reliability and throughput. Push too hard and requests fail. Throttle too conservatively and pipeline latency explodes.

Configuring Rate Limits in Fenic

Fenic handles rate limiting at the session level through explicit model configuration. Each model provider configuration accepts rate limit parameters that control request and token throughput.

Setting Per-Model Rate Limits

Configure rate limits when defining your session's semantic models:

python
from fenic.api.session import SessionConfig, SemanticConfig
from fenic.api.session.config import OpenAILanguageModel

config = SessionConfig(
    app_name="production_pipeline",
    semantic=SemanticConfig(
        language_models={
            "gpt4": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=100,  # Requests per minute
                tpm=100000  # Tokens per minute
            )
        }
    )
)

The framework automatically throttles requests to stay within these limits. No manual retry logic or exponential backoff required.
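
The sketch below shows how a rate-limited configuration like this is typically used: create a session from the config, then run a semantic operation that inherits the throttling. The documents.csv path and the content column are illustrative assumptions.

python
import fenic as fc
from fenic.api.session import Session

# Create the session from the rate-limited config above
session = Session.get_or_create(config)

# Any semantic operation now runs under the configured rpm/tpm throttling
df = session.read.csv("documents.csv")  # hypothetical input file
result = df.with_column(
    "summary",
    fc.semantic.map(
        "Summarize this in one sentence: {{ content }}",
        content=fc.col("content")
    )
).collect()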

Provider-Specific Rate Limit Controls

Different providers require different rate limit configurations:

Anthropic models separate input and output token limits:

python
from fenic.api.session.config import AnthropicLanguageModel

config = SemanticConfig(
    language_models={
        "claude": AnthropicLanguageModel(
            model_name="claude-3-5-haiku-latest",
            rpm=100,
            input_tpm=50000,   # Input token limit
            output_tpm=10000   # Output token limit
        )
    }
)

This granular control matters when processing high-volume inference where input tokens far exceed output tokens or vice versa.
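
A quick back-of-the-envelope calculation helps size the two limits separately. The document counts and token estimates below are illustrative, not measurements:

python
# Rough sizing of separate input/output token limits (illustrative numbers)
docs_per_minute = 5
avg_input_tokens_per_doc = 10_000   # long source documents
avg_output_tokens_per_doc = 200     # short summaries

required_input_tpm = docs_per_minute * avg_input_tokens_per_doc    # 50,000
required_output_tpm = docs_per_minute * avg_output_tokens_per_doc  # 1,000

print(f"input_tpm needed:  {required_input_tpm}")
print(f"output_tpm needed: {required_output_tpm}")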

Google models use combined token limits:

python
from fenic.api.session.config import GoogleDeveloperLanguageModel

config = SemanticConfig(
    language_models={
        "gemini": GoogleDeveloperLanguageModel(
            model_name="gemini-2.0-flash",
            rpm=100,
            tpm=1000000
        )
    }
)

Multi-Model Rate Limit Strategies

Production systems often route requests across multiple models for cost optimization or capability requirements. Configure multiple models with independent rate limits:

python
config = SemanticConfig(
    language_models={
        "fast_model": OpenAILanguageModel(
            model_name="gpt-4.1-nano",
            rpm=500,
            tpm=500000
        ),
        "complex_model": AnthropicLanguageModel(
            model_name="claude-opus-4-0",
            rpm=50,
            input_tpm=25000,
            output_tpm=5000
        )
    },
    default_language_model="fast_model"
)

Route simple tasks to the fast model and advanced reasoning to the more constrained model:

python
# Uses fast_model (default)
df.with_column(
    "summary",
    fc.semantic.map(
        "Summarize this in one sentence: {{ content }}",
        content=fc.col("content")
    )
)

# Uses complex_model
df.with_column(
    "analysis",
    fc.semantic.map(
        "Analyze the logical structure: {{ text }}",
        text=fc.col("content"),
        model_alias="complex_model"
    )
)

Controlling Token Consumption

Rate limits control request throughput, but individual operations still need token-level control. Fenic provides multiple mechanisms for constraining token usage per inference call.

Output Token Limits

The max_output_tokens parameter caps generation length:

python
df.with_column(
    "summary",
    fc.semantic.map(
        "Generate a brief summary: {{ text }}",
        text=fc.col("text"),
        max_output_tokens=256
    )
)

This prevents runaway generation costs and ensures predictable latency. The default is 512 tokens, but production workloads should set explicit limits based on expected output length.

Structured Extraction with Token Control

When extracting structured data, combine Pydantic schemas with output token limits:

python
from pydantic import BaseModel, Field

class KeyPoints(BaseModel):
    main_idea: str = Field(description="Primary concept in 10 words")
    supporting_facts: list[str] = Field(description="3-5 key supporting points")

df.with_column(
    "extracted",
    fc.semantic.extract(
        fc.col("article"),
        KeyPoints,
        max_output_tokens=512
    )
)

The schema constrains output structure while the token limit prevents excessive elaboration.

Reasoning Token Budgets

Extended reasoning models like Claude Opus 4 and OpenAI o-series consume tokens during their internal reasoning process. Control this overhead through model profiles.

For Anthropic models:

python
config = SessionConfig(
    semantic=SemanticConfig(
        language_models={
            "claude": AnthropicLanguageModel(
                model_name="claude-opus-4-0",
                rpm=100,
                input_tpm=100000,
                output_tpm=20000,
                profiles={
                    "quick": AnthropicLanguageModel.Profile(
                        thinking_token_budget=1024
                    ),
                    "thorough": AnthropicLanguageModel.Profile(
                        thinking_token_budget=8192
                    )
                },
                default_profile="quick"
            )
        }
    )
)

For OpenAI models:

python
config = SessionConfig(
    semantic=SemanticConfig(
        language_models={
            "o4": OpenAILanguageModel(
                model_name="o4-mini",
                rpm=1000,
                tpm=1000000,
                profiles={
                    "fast": OpenAILanguageModel.Profile(
                        reasoning_effort="low"
                    ),
                    "deep": OpenAILanguageModel.Profile(
                        reasoning_effort="high"
                    )
                },
                default_profile="fast"
            )
        }
    )
)

Select profiles at inference time:

python
from fenic.core.types import ModelAlias

# Uses the default "quick" profile
df.with_column(
    "analysis",
    fc.semantic.map(
        "Analyze: {{ content }}",
        content=fc.col("content")
    )
)

# Select the "thorough" profile for complex tasks
df.with_column(
    "proof",
    fc.semantic.map(
        "Construct a formal proof: {{ claim }}",
        claim=fc.col("claim"),
        model_alias=ModelAlias(name="claude", profile="thorough")
    )
)

This approach lets you dynamically allocate reasoning budgets based on task complexity while maintaining rate limit compliance.

Managing Concurrent Inference Operations

Rate limits and token limits intersect with concurrency. Running more requests in parallel hits rate limits faster but reduces overall pipeline latency.

Async UDFs for Concurrent I/O

Fenic's async UDF system provides controlled parallelism for I/O-bound operations:

python
import fenic as fc
from fenic.core.types import StringType
import aiohttp

@fc.async_udf(
    return_type=StringType,
    max_concurrency=20,  # Control parallel request count
    timeout_seconds=10,   # Per-request timeout
    num_retries=3        # Automatic retry on failure
)
async def call_external_api(doc_id: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.example.com/{doc_id}") as resp:
            return await resp.text()

df.select(
    fc.col("document_id"),
    call_external_api(fc.col("document_id")).alias("enriched_data")
)

The max_concurrency parameter prevents overwhelming downstream systems while the retry logic handles transient failures. Fenic maintains input row order and handles resource cleanup automatically.

MCP Server Concurrency Control

When deploying inference tools through Model Context Protocol servers, set concurrency limits at the server level:

python
from fenic.api.mcp import create_mcp_server, run_mcp_server_sync
from fenic.api.session import Session

session = Session.get_or_create()
tools = session.catalog.list_tools()

server = create_mcp_server(
    session,
    "InferenceServer",
    tools=tools,
    concurrency_limit=8  # Limit concurrent tool executions
)

run_mcp_server_sync(
    server,
    transport="http",
    stateless_http=True,
    port=8000
)

This prevents your inference server from launching hundreds of simultaneous LLM calls that would immediately hit rate limits.

Tracking Token Usage and Costs

Production systems require visibility into token consumption and inference costs. Fenic tracks comprehensive metrics for every query execution.

Query-Level Metrics

Every executed query returns execution metrics alongside its results:

python
result = df.with_column(
    "summary",
    fc.semantic.map("Summarize: {{ content }}", content=fc.col("content"))
).collect()

# Execution metrics are attached to the collected result
metrics = result.metrics
print(f"Output rows: {metrics.num_output_rows}")
print(f"Total LM cost: ${metrics.total_lm_metrics.cost}")
print(f"Input tokens: {metrics.total_lm_metrics.num_uncached_input_tokens}")
print(f"Output tokens: {metrics.total_lm_metrics.num_output_tokens}")

Persistent Metrics Storage

Fenic automatically logs execution metrics to a local table for historical analysis:

python
# Query all metrics
metrics_df = session.table("fenic_system.query_metrics")

# Find expensive queries
expensive_queries = session.sql("""
    SELECT
        execution_id,
        total_lm_cost,
        total_lm_requests,
        execution_time_ms
    FROM {df}
    WHERE total_lm_cost > 1.0
    ORDER BY total_lm_cost DESC
""", df=metrics_df)

expensive_queries.show()

Analyze token consumption patterns:

python
# Aggregate token usage by time window
token_usage = session.sql("""
    SELECT
        DATE(CAST(end_ts AS TIMESTAMP)) as date,
        SUM(total_lm_input_tokens) as total_input,
        SUM(total_lm_output_tokens) as total_output,
        SUM(total_lm_cost) as daily_cost
    FROM {df}
    WHERE CAST(end_ts AS TIMESTAMP) >= CURRENT_DATE - INTERVAL 7 DAYS
    GROUP BY date
    ORDER BY date
""", df=metrics_df)

token_usage.show()

This data informs rate limit tuning and cost optimization decisions.
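
One useful tuning check is to compare observed request and token throughput against the rpm/tpm values you configured. A sketch over the same metrics table and columns as above; the per-minute bucketing expression assumes DuckDB-style SQL:

python
# Estimate peak per-minute throughput to compare against configured rpm/tpm
throughput = session.sql("""
    SELECT
        DATE_TRUNC('minute', CAST(end_ts AS TIMESTAMP)) AS minute,
        SUM(total_lm_requests) AS requests_per_minute,
        SUM(total_lm_input_tokens + total_lm_output_tokens) AS tokens_per_minute
    FROM {df}
    GROUP BY minute
    ORDER BY tokens_per_minute DESC
    LIMIT 10
""", df=metrics_df)

throughput.show()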

Practical Rate Limit Strategies

Effective rate limit management requires balancing multiple constraints.

Tiered Rate Limit Configuration

Match rate limits to provider tier pricing:

python
config = SemanticConfig(
    language_models={
        "budget_tier": OpenAILanguageModel(
            model_name="gpt-4.1-nano",
            rpm=100,      # Free tier limit
            tpm=40000
        ),
        "production_tier": OpenAILanguageModel(
            model_name="gpt-4.1-nano",
            rpm=5000,     # Enterprise tier limit
            tpm=2000000
        )
    }
)

Route development work to budget tier and production traffic to the higher-capacity model.
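
A simple way to implement this split is to pick the model alias from the environment, so the same pipeline code runs against either tier. A minimal sketch; the APP_ENV variable name is an assumption:

python
import os

# Choose the model alias based on deployment environment (APP_ENV is hypothetical)
tier = "production_tier" if os.environ.get("APP_ENV") == "production" else "budget_tier"

df = df.with_column(
    "summary",
    fc.semantic.map(
        "Summarize: {{ content }}",
        content=fc.col("content"),
        model_alias=tier
    )
)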

Dynamic Model Selection

Select models based on rate limit availability:

python
def get_available_model(task_complexity: str) -> str:
    # Consult fenic_system.query_metrics here if you want to factor in
    # recent usage; this sketch routes purely on task complexity.
    if task_complexity == "simple":
        return "fast_model"
    return "complex_model"

# Apply in pipeline: tag each row with a complexity label for routing
df = df.with_column(
    "complexity",
    fc.when(fc.col("doc_length") < 1000, "simple")
     .otherwise("complex")
)
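
To actually apply the complexity flag, one option is to split the DataFrame, run each slice against its own model alias, and recombine. A sketch under the assumption that a union method is available on Fenic DataFrames:

python
# Route each slice to its model, then recombine (union is assumed available)
simple_df = df.filter(fc.col("complexity") == "simple").with_column(
    "result",
    fc.semantic.map(
        "Process: {{ content }}",
        content=fc.col("content"),
        model_alias="fast_model"
    )
)

complex_df = df.filter(fc.col("complexity") == "complex").with_column(
    "result",
    fc.semantic.map(
        "Process: {{ content }}",
        content=fc.col("content"),
        model_alias="complex_model"
    )
)

routed = simple_df.union(complex_df)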

Batch Size Optimization

When processing large datasets, group operations to maximize throughput under rate limits:

python
# Process in chunks that respect rate limits
chunk_size = 100  # Adjust based on rpm limits

for chunk_df in df.iter_chunks(chunk_size):
    result = chunk_df.with_column(
        "processed",
        fc.semantic.map("Process: {{ text }}", text=fc.col("content"))
    ).collect()

    # Process results
    process_results(result.data)

Error Handling and Recovery

Rate limit violations and token limit errors require explicit handling.

Provider Key Validation

Fenic validates API keys at session creation to fail fast:

python
from fenic.api.session import Session

try:
    session = Session.get_or_create(config)
except Exception as e:
    print(f"Configuration error: {e}")
    # Handle missing or invalid keys

This eliminates runtime failures from misconfigured credentials.
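
You can also check the relevant environment variables yourself before building the config, so the error message points at the missing key rather than a failed session. A minimal sketch using the standard provider environment variable names:

python
import os

# Fail fast with a clear message if a provider key is missing
required_keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = [key for key in required_keys if not os.environ.get(key)]
if missing:
    raise RuntimeError(f"Missing provider credentials: {', '.join(missing)}")

session = Session.get_or_create(config)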

Automatic Retry Logic

Async UDFs handle transient failures automatically:

python
@fc.async_udf(
    return_type=StringType,
    num_retries=5,           # Retry up to 5 times
    timeout_seconds=30,      # 30-second timeout per attempt
    max_concurrency=10
)
async def robust_inference(text: str) -> str:
    # Fenic handles exponential backoff automatically
    return await call_llm(text)

Failed requests after all retries return null rather than failing the entire batch, maintaining pipeline resilience.
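
Because exhausted retries come back as null, you can isolate those rows for a later pass instead of re-running everything. A sketch, assuming the Column API exposes standard is_null / is_not_null checks and using hypothetical document_id and text columns:

python
# Separate rows whose retries were exhausted (null results) for a later pass
responses = df.select(
    fc.col("document_id"),
    robust_inference(fc.col("text")).alias("response")
)

failed_rows = responses.filter(fc.col("response").is_null())
succeeded_rows = responses.filter(fc.col("response").is_not_null())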

Rate Limit Backoff

Fenic implements exponential backoff internally when hitting rate limits. The framework automatically throttles requests and resumes when capacity becomes available.

Monitor this behavior through metrics:

python
metrics = result.metrics
print(f"Retry count: {metrics.total_retry_count}")
print(f"Throttled duration: {metrics.throttled_duration_ms}ms")

Best Practices

  1. Set explicit rate limits matching your provider tier. Don't rely on defaults.
  2. Configure token budgets based on expected output length. Start conservative and increase based on metrics.
  3. Use model profiles to separate fast inference from deep reasoning workloads.
  4. Monitor metrics continuously. Query the metrics table regularly to identify bottlenecks.
  5. Implement tiered models. Route simple tasks to high-throughput models and advanced tasks to more capable but rate-limited models.
  6. Control concurrency at both the UDF level and server level to prevent rate limit breaches.
  7. Fail fast with validation. Let Fenic validate provider keys at session creation rather than discovering errors at runtime.

Implementation Example

Putting these patterns together:

python
from fenic.api.session import Session, SessionConfig, SemanticConfig
from fenic.api.session.config import OpenAILanguageModel, AnthropicLanguageModel
import fenic as fc

# Configure session with rate limits
config = SessionConfig(
    app_name="production_pipeline",
    semantic=SemanticConfig(
        language_models={
            "fast": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=1000,
                tpm=500000
            ),
            "reasoning": AnthropicLanguageModel(
                model_name="claude-opus-4-0",
                rpm=100,
                input_tpm=50000,
                output_tpm=10000,
                profiles={
                    "quick": AnthropicLanguageModel.Profile(
                        thinking_token_budget=2048
                    ),
                    "thorough": AnthropicLanguageModel.Profile(
                        thinking_token_budget=8192
                    )
                },
                default_profile="quick"
            )
        },
        default_language_model="fast"
    )
)

session = Session.get_or_create(config)

# Load data
df = session.read.csv("documents.csv")

# Simple tasks use the fast model (the default)
df = df.with_column(
    "summary",
    fc.semantic.map(
        "Summarize in one sentence: {{ text }}",
        text=fc.col("text")
    )
)

# Complex tasks are routed to the reasoning model
df = df.with_column(
    "analysis",
    fc.semantic.map(
        "Analyze the argument structure: {{ text }}",
        text=fc.col("text"),
        model_alias="reasoning"
    )
)

# Execute and get metrics
result = df.collect("polars")

# Review token usage
print(f"Total cost: ${result.metrics.total_lm_metrics.cost}")
print(f"Input tokens: {result.metrics.total_lm_metrics.num_uncached_input_tokens}")
print(f"Output tokens: {result.metrics.total_lm_metrics.num_output_tokens}")

# Query historical metrics
metrics_df = session.table("fenic_system.query_metrics")
cost_analysis = session.sql("""
    SELECT
        SUM(total_lm_cost) as total_spend,
        AVG(execution_time_ms) as avg_latency_ms
    FROM {df}
""", df=metrics_df)
cost_analysis.show()

Conclusion

Handling token limits and rate limits in large-scale LLM inference requires explicit configuration and continuous monitoring. Fenic provides the primitives needed to build reliable, cost-effective inference pipelines:

  • Declarative rate limit configuration per model
  • Fine-grained token budget control
  • Automatic throttling and retry logic
  • Comprehensive metrics tracking
  • Concurrent execution with bounded parallelism

These mechanisms transform ad-hoc LLM scripts into production-grade data pipelines. Configure your constraints explicitly, monitor actual usage through metrics, and let the framework handle the complexity of reliable inference at scale.
