Large-scale LLM inference operations fail when they hit provider limits. Building production systems requires precise control over token consumption and request throughput. Fenic, an opinionated DataFrame framework from Typedef.ai, provides built-in mechanisms to handle these constraints while maximizing inference throughput.
Defining the Challenge
When processing thousands of documents through LLMs, two bottlenecks emerge:
Token limits constrain how much text can be processed in a single request and how many tokens the model generates in its response.
Rate limits restrict the number of requests and tokens processed per minute across your entire application.
Both constraints force you to choose between reliability and throughput. Push too hard and requests fail. Throttle too conservatively and pipeline latency explodes.
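To see how the two constraints interact, it helps to estimate the throughput a pipeline can actually sustain. The following back-of-envelope sketch uses illustrative numbers, not any specific provider's quotas:

```python
# Rough throughput estimate under both constraints (illustrative numbers only).
rpm_limit = 100                  # provider requests-per-minute quota
tpm_limit = 100_000              # provider tokens-per-minute quota
avg_tokens_per_request = 1_500   # prompt + expected completion, estimated from your data

# The token quota caps how many average-sized requests fit in a minute.
requests_allowed_by_tpm = tpm_limit // avg_tokens_per_request  # 66

# The effective ceiling is whichever limit binds first.
effective_rpm = min(rpm_limit, requests_allowed_by_tpm)
print(f"Sustainable throughput: ~{effective_rpm} requests/minute")
print(f"Time to process 10,000 documents: ~{10_000 / effective_rpm:.0f} minutes")
```

Whichever limit binds first sets the ceiling; requests beyond it only add queueing delay.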
Configuring Rate Limits in Fenic
Fenic handles rate limiting at the session level through explicit model configuration. Each model provider configuration accepts rate limit parameters that control request and token throughput.
Setting Per-Model Rate Limits
Configure rate limits when defining your session's semantic models:
```python
from fenic.api.session import SessionConfig, SemanticConfig
from fenic.api.session.config import OpenAILanguageModel

config = SessionConfig(
    app_name="production_pipeline",
    semantic=SemanticConfig(
        language_models={
            "gpt4": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=100,     # Requests per minute
                tpm=100000   # Tokens per minute
            )
        }
    )
)
```
The framework automatically throttles requests to stay within these limits. No manual retry logic or exponential backoff required.
Provider-Specific Rate Limit Controls
Different providers require different rate limit configurations:
Anthropic models separate input and output token limits:
```python
from fenic.api.session.config import AnthropicLanguageModel

config = SemanticConfig(
    language_models={
        "claude": AnthropicLanguageModel(
            model_name="claude-3-5-haiku-latest",
            rpm=100,
            input_tpm=50000,   # Input token limit
            output_tpm=10000   # Output token limit
        )
    }
)
```
This granular control matters for high-volume workloads where input tokens far exceed output tokens, or vice versa.
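As an illustration, a summarization workload is heavily input-dominated, and the split can be estimated directly from your documents. The numbers and the 4-characters-per-token heuristic below are rough assumptions:

```python
# Estimate separate input/output budgets for a summarization workload.
# Assumes ~4 characters per token, a common rough heuristic.
avg_doc_chars = 12_000
avg_input_tokens = avg_doc_chars // 4          # ~3,000 input tokens per document
avg_output_tokens = 150                        # one-paragraph summary

docs_per_minute = 20                           # target throughput
required_input_tpm = docs_per_minute * avg_input_tokens    # 60,000
required_output_tpm = docs_per_minute * avg_output_tokens  # 3,000

# Input needs ~20x more headroom than output, so a single combined tpm
# figure would over- or under-provision one side.
print(required_input_tpm, required_output_tpm)
```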
Google models use combined token limits:
```python
from fenic.api.session.config import GoogleDeveloperLanguageModel

config = SemanticConfig(
    language_models={
        "gemini": GoogleDeveloperLanguageModel(
            model_name="gemini-2.0-flash",
            rpm=100,
            tpm=1000000
        )
    }
)
```
Multi-Model Rate Limit Strategies
Production systems often route requests across multiple models for cost optimization or capability requirements. Configure multiple models with independent rate limits:
```python
config = SemanticConfig(
    language_models={
        "fast_model": OpenAILanguageModel(
            model_name="gpt-4.1-nano",
            rpm=500,
            tpm=500000
        ),
        "complex_model": AnthropicLanguageModel(
            model_name="claude-opus-4-0",
            rpm=50,
            input_tpm=25000,
            output_tpm=5000
        )
    },
    default_language_model="fast_model"
)
```
Route simple tasks to the fast model and advanced reasoning to the more constrained model:
```python
# Uses fast_model (default)
df.with_column(
    "summary",
    fc.semantic.map(
        "Summarize this in one sentence: {{ content }}",
        content=fc.col("content")
    )
)

# Uses complex_model
df.with_column(
    "analysis",
    fc.semantic.map(
        "Analyze the logical structure: {{ text }}",
        text=fc.col("content"),
        model_alias="complex_model"
    )
)
```
Controlling Token Consumption
Rate limits control request throughput, but individual operations still need token-level control. Fenic provides multiple mechanisms for constraining token usage per inference call.
Output Token Limits
The max_output_tokens parameter caps generation length:
```python
df.with_column(
    "summary",
    fc.semantic.map(
        "Generate a brief summary: {{ text }}",
        text=fc.col("text"),
        max_output_tokens=256  # Cap the length of each generated summary
    )
)
```
This prevents runaway generation costs and ensures predictable latency. The default is 512 tokens, but production workloads should set explicit limits based on expected output length.
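A reasonable starting budget can be derived from the longest output you expect. The words-to-tokens factor below is a rough approximation for English text, not an exact constant:

```python
# Derive a max_output_tokens starting point from expected output length.
max_expected_words = 120    # e.g. a three-sentence summary
tokens_per_word = 1.3       # rough approximation for English text

budget = int(max_expected_words * tokens_per_word * 1.25)  # 25% safety margin
print(f"Suggested max_output_tokens: {budget}")            # ~195
```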
Structured Extraction with Token Control
When extracting structured data, combine Pydantic schemas with output token limits:
```python
from pydantic import BaseModel, Field

class KeyPoints(BaseModel):
    main_idea: str = Field(description="Primary concept in 10 words")
    supporting_facts: list[str] = Field(description="3-5 key supporting points")

df.with_column(
    "extracted",
    fc.semantic.extract(
        fc.col("article"),
        KeyPoints,
        max_output_tokens=512  # Bound the structured output as well
    )
)
```
The schema constrains output structure while the token limit prevents excessive elaboration.
Reasoning Token Budgets
Extended reasoning models like Claude Opus 4 and OpenAI o-series consume tokens during their internal reasoning process. Control this overhead through model profiles.
For Anthropic models:
```python
config = SessionConfig(
    semantic=SemanticConfig(
        language_models={
            "claude": AnthropicLanguageModel(
                model_name="claude-opus-4-0",
                rpm=100,
                input_tpm=100000,
                output_tpm=20000,
                profiles={
                    "quick": AnthropicLanguageModel.Profile(
                        thinking_token_budget=1024
                    ),
                    "thorough": AnthropicLanguageModel.Profile(
                        thinking_token_budget=8192
                    )
                },
                default_profile="quick"
            )
        }
    )
)
```
For OpenAI models:
```python
config = SessionConfig(
    semantic=SemanticConfig(
        language_models={
            "o4": OpenAILanguageModel(
                model_name="o4-mini",
                rpm=1000,
                tpm=1000000,
                profiles={
                    "fast": OpenAILanguageModel.Profile(
                        reasoning_effort="low"
                    ),
                    "deep": OpenAILanguageModel.Profile(
                        reasoning_effort="high"
                    )
                },
                default_profile="fast"
            )
        }
    )
)
```
Select profiles at inference time:
```python
from fenic.core.types import ModelAlias

# Uses the model's default profile ("quick")
df.with_column(
    "analysis",
    fc.semantic.map("Analyze: {{ content }}", content=fc.col("content"))
)

# Explicitly select the "thorough" profile for complex tasks
df.with_column(
    "proof",
    fc.semantic.map(
        "Construct a formal proof: {{ claim }}",
        claim=fc.col("claim"),
        model_alias=ModelAlias(name="claude", profile="thorough")
    )
)
```
This approach lets you dynamically allocate reasoning budgets based on task complexity while maintaining rate limit compliance.
Managing Concurrent Inference Operations
Rate limits and token limits intersect with concurrency. Running more requests in parallel hits rate limits faster but reduces overall pipeline latency.
Async UDFs for Concurrent I/O
Fenic's async UDF system provides controlled parallelism for I/O-bound operations:
```python
import fenic as fc
from fenic.core.types import StringType
import aiohttp

@fc.async_udf(
    return_type=StringType,
    max_concurrency=20,   # Control parallel request count
    timeout_seconds=10,   # Per-request timeout
    num_retries=3         # Automatic retry on failure
)
async def call_external_api(doc_id: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.example.com/{doc_id}") as resp:
            return await resp.text()

df.select(
    fc.col("document_id"),
    call_external_api(fc.col("document_id")).alias("enriched_data")
)
```
The max_concurrency parameter prevents overwhelming downstream systems while the retry logic handles transient failures. Fenic maintains input row order and handles resource cleanup automatically.
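A practical rule of thumb for sizing max_concurrency is Little's law: concurrency ≈ request rate × average latency. The sketch below uses illustrative numbers:

```python
# Pick a max_concurrency value that roughly saturates the rate limit
# without piling up requests behind it (illustrative numbers).
target_rpm = 100            # configured requests-per-minute limit
avg_latency_seconds = 8     # typical end-to-end LLM call latency

requests_per_second = target_rpm / 60
suggested_concurrency = max(1, round(requests_per_second * avg_latency_seconds))
print(f"Suggested max_concurrency: {suggested_concurrency}")  # ~13
```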
MCP Server Concurrency Control
When deploying inference tools through Model Context Protocol servers, set concurrency limits at the server level:
```python
from fenic.api.mcp import create_mcp_server, run_mcp_server_sync
from fenic.api.session import Session

session = Session.get_or_create()
tools = session.catalog.list_tools()

server = create_mcp_server(
    session,
    "InferenceServer",
    tools=tools,
    concurrency_limit=8  # Limit concurrent tool executions
)

run_mcp_server_sync(
    server,
    transport="http",
    stateless_http=True,
    port=8000
)
```
This prevents your inference server from launching hundreds of simultaneous LLM calls that would immediately hit rate limits.
Tracking Token Usage and Costs
Production systems require visibility into token consumption and inference costs. Fenic tracks comprehensive metrics for every query execution.
Query-Level Metrics
Every query execution returns metrics alongside the collected data:
```python
result = df.with_column(
    "summary",
    fc.semantic.map("Summarize: {{ content }}", content=fc.col("content"))
).collect()

# Execution metrics come back on the collected result
metrics = result.metrics
print(f"Output rows: {metrics.num_output_rows}")
print(f"Total LM cost: ${metrics.total_lm_metrics.cost}")
print(f"Input tokens: {metrics.total_lm_metrics.num_uncached_input_tokens}")
print(f"Output tokens: {metrics.total_lm_metrics.num_output_tokens}")
```
Persistent Metrics Storage
Fenic automatically logs execution metrics to a local table for historical analysis:
```python
# Query all metrics
metrics_df = session.table("fenic_system.query_metrics")

# Find expensive queries
expensive_queries = session.sql("""
    SELECT execution_id, total_lm_cost, total_lm_requests, execution_time_ms
    FROM {df}
    WHERE total_lm_cost > 1.0
    ORDER BY total_lm_cost DESC
""", df=metrics_df)

expensive_queries.show()
```
Analyze token consumption patterns:
```python
# Aggregate token usage by time window
token_usage = session.sql("""
    SELECT
        DATE(CAST(end_ts AS TIMESTAMP)) AS date,
        SUM(total_lm_input_tokens) AS total_input,
        SUM(total_lm_output_tokens) AS total_output,
        SUM(total_lm_cost) AS daily_cost
    FROM {df}
    WHERE CAST(end_ts AS TIMESTAMP) >= CURRENT_DATE - INTERVAL 7 DAYS
    GROUP BY date
    ORDER BY date
""", df=metrics_df)

token_usage.show()
```
This data informs rate limit tuning and cost optimization decisions.
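As one example, the same table can show how close a pipeline actually runs to its configured request limit. The column names match the queries above; the interpretation assumes queries execute roughly serially:

```python
# Observed request rate vs. the configured rpm limit.
observed = session.sql("""
    SELECT
        SUM(total_lm_requests) AS requests,
        SUM(execution_time_ms) / 60000.0 AS approx_minutes,
        SUM(total_lm_requests) / (SUM(execution_time_ms) / 60000.0) AS requests_per_minute
    FROM {df}
""", df=metrics_df)

observed.show()
# A requests_per_minute figure far below the configured rpm points to a
# concurrency or batching bottleneck; one that hugs the limit suggests the
# provider tier itself is the constraint.
```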
Practical Rate Limit Strategies
Effective rate limit management requires balancing multiple constraints.
Tiered Rate Limit Configuration
Match rate limits to provider tier pricing:
```python
config = SemanticConfig(
    language_models={
        "budget_tier": OpenAILanguageModel(
            model_name="gpt-4.1-nano",
            rpm=100,      # Free tier limit
            tpm=40000
        ),
        "production_tier": OpenAILanguageModel(
            model_name="gpt-4.1-nano",
            rpm=5000,     # Enterprise tier limit
            tpm=2000000
        )
    }
)
```
Route development work to budget tier and production traffic to the higher-capacity model.
Dynamic Model Selection
Select models based on rate limit availability:
```python
def get_available_model(task_complexity: str) -> str:
    # Placeholder: recent usage from the metrics table could inform this choice
    recent_usage = session.table("fenic_system.query_metrics")

    # Simple heuristic: route by task complexity
    if task_complexity == "simple":
        return "fast_model"
    return "complex_model"

# Tag each row with a complexity label in the pipeline
df = df.with_column(
    "complexity",
    fc.when(fc.col("doc_length") < 1000, fc.lit("simple")).otherwise(fc.lit("complex"))
)
```
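Because model_alias is fixed per semantic.map call rather than per row, one way to apply the complexity label is to split the DataFrame and process each subset with its own model. This is a minimal sketch, assuming fenic's PySpark-style filter and the model aliases configured earlier:

```python
# Split rows by the complexity label and send each subset to a different model.
simple_docs = df.filter(fc.col("complexity") == "simple")
complex_docs = df.filter(fc.col("complexity") == "complex")

simple_result = simple_docs.with_column(
    "summary",
    fc.semantic.map("Summarize: {{ text }}", text=fc.col("content"))  # fast_model (default)
).collect()

complex_result = complex_docs.with_column(
    "summary",
    fc.semantic.map(
        "Summarize: {{ text }}",
        text=fc.col("content"),
        model_alias="complex_model"   # lower-throughput but more capable
    )
).collect()
```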
Batch Size Optimization
When processing large datasets, group operations to maximize throughput under rate limits:
```python
# Process in chunks that respect rate limits
chunk_size = 100  # Adjust based on rpm limits

for chunk_df in df.iter_chunks(chunk_size):
    result = chunk_df.with_column(
        "processed",
        fc.semantic.map("Process: {{ text }}", text=fc.col("content"))
    ).collect()

    # Handle each completed chunk before starting the next
    process_results(result.data)
```
Error Handling and Recovery
Rate limit violations and token limit errors require explicit handling.
Provider Key Validation
Fenic validates API keys at session creation to fail fast:
```python
from fenic.api.session import Session

try:
    session = Session.get_or_create(config)
except Exception as e:
    print(f"Configuration error: {e}")
    # Handle missing or invalid keys
```
This eliminates runtime failures from misconfigured credentials.
Automatic Retry Logic
Async UDFs handle transient failures automatically:
```python
@fc.async_udf(
    return_type=StringType,
    num_retries=5,        # Retry up to 5 times
    timeout_seconds=30,   # 30-second timeout per attempt
    max_concurrency=10
)
async def robust_inference(text: str) -> str:
    # Fenic handles exponential backoff automatically
    return await call_llm(text)
```
Requests that still fail after exhausting their retries return null rather than failing the entire batch, keeping the pipeline resilient.
Rate Limit Backoff
Fenic implements exponential backoff internally when hitting rate limits. The framework automatically throttles requests and resumes when capacity becomes available.
Monitor this behavior through metrics:
```python
metrics = result.metrics
print(f"Retry count: {metrics.total_retry_count}")
print(f"Throttled duration: {metrics.throttled_duration_ms}ms")
```
Best Practices
- Set explicit rate limits matching your provider tier. Don't rely on defaults.
- Configure token budgets based on expected output length. Start conservative and increase based on metrics.
- Use model profiles to separate fast inference from deep reasoning workloads.
- Monitor metrics continuously. Query the metrics table regularly to identify bottlenecks.
- Implement tiered models. Route simple tasks to high-throughput models and advanced tasks to more capable but rate-limited models.
- Control concurrency at both the UDF level and server level to prevent rate limit breaches.
- Fail fast with validation. Let Fenic validate provider keys at session creation rather than discovering errors at runtime.
Implementation Example
Putting these patterns together:
```python
from fenic.api.session import Session, SessionConfig, SemanticConfig
from fenic.api.session.config import OpenAILanguageModel, AnthropicLanguageModel
import fenic as fc

# Configure session with rate limits
config = SessionConfig(
    app_name="production_pipeline",
    semantic=SemanticConfig(
        language_models={
            "fast": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=1000,
                tpm=500000
            ),
            "reasoning": AnthropicLanguageModel(
                model_name="claude-opus-4-0",
                rpm=100,
                input_tpm=50000,
                output_tpm=10000,
                profiles={
                    "quick": AnthropicLanguageModel.Profile(
                        thinking_token_budget=2048
                    ),
                    "thorough": AnthropicLanguageModel.Profile(
                        thinking_token_budget=8192
                    )
                },
                default_profile="quick"
            )
        },
        default_language_model="fast"
    )
)

session = Session.get_or_create(config)

# Load data
df = session.read.csv("documents.csv")

# Simple tasks use the fast model (default)
df = df.with_column(
    "summary",
    fc.semantic.map("Summarize in one sentence: {{ text }}", text=fc.col("text"))
)

# Complex tasks use the reasoning model
df = df.with_column(
    "analysis",
    fc.semantic.map(
        "Analyze the argument structure: {{ text }}",
        text=fc.col("text"),
        model_alias="reasoning"
    )
)

# Execute and get metrics
result = df.collect("polars")

# Review token usage
print(f"Total cost: ${result.metrics.total_lm_metrics.cost}")
print(f"Input tokens: {result.metrics.total_lm_metrics.num_uncached_input_tokens}")
print(f"Output tokens: {result.metrics.total_lm_metrics.num_output_tokens}")

# Query historical metrics across all recorded executions
metrics_df = session.table("fenic_system.query_metrics")
cost_analysis = session.sql("""
    SELECT
        SUM(total_lm_cost) AS total_spend,
        AVG(execution_time_ms) AS avg_latency_ms
    FROM {df}
""", df=metrics_df)

cost_analysis.show()
```
Conclusion
Handling token limits and rate limits in large-scale LLM inference requires explicit configuration and continuous monitoring. Fenic provides the primitives needed to build reliable, cost-effective inference pipelines:
- Declarative rate limit configuration per model
- Fine-grained token budget control
- Automatic throttling and retry logic
- Comprehensive metrics tracking
- Concurrent execution with bounded parallelism
These mechanisms transform ad-hoc LLM scripts into production-grade data pipelines. Configure your constraints explicitly, monitor actual usage through metrics, and let the framework handle the complexity of reliable inference at scale.