
Cost-Optimized Routing

WorldFlow AI's cost optimizer automatically classifies incoming queries and routes them to the cheapest LLM provider that meets your quality requirements. Simple queries go to economy models; complex queries go to premium models. You get lower costs without sacrificing quality where it matters.

How It Works

Every request flows through a three-step pipeline:

  1. Classify -- WorldFlow AI analyzes the query's task type (generation, classification, extraction, summarization, conversation, code generation) and complexity tier (simple, moderate, complex, frontier).
  2. Score -- Each available provider is scored on five dimensions: cost, quality, latency, health, and cache affinity. The scoring weights depend on your routing strategy.
  3. Select -- The highest-scoring provider handles the request.
Info: The entire pipeline runs in-process with zero I/O. Classification uses keyword heuristics on the system prompt and last user message. No external calls, no added latency.
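As a rough sketch, the classify/score/select pipeline behaves like this. All names, heuristics, and provider data below are illustrative, not WorldFlow AI's actual internals:

```python
# Illustrative classify -> score -> select pipeline (hypothetical names).
SIMPLE_HINTS = {"what is", "define", "classify", "extract", "summarize"}

def classify(query: str) -> str:
    """Keyword-heuristic complexity tier (collapsed to simple vs. complex)."""
    q = query.lower()
    return "simple" if any(h in q for h in SIMPLE_HINTS) else "complex"

def score(provider: dict, weights: dict) -> float:
    """Composite score: weighted sum over the five scoring dimensions."""
    return sum(weights[dim] * provider[dim]
               for dim in ("cost", "quality", "latency", "health", "cache_affinity"))

def select(providers: list[dict], weights: dict) -> dict:
    """The highest-scoring provider handles the request."""
    return max(providers, key=lambda p: score(p, weights))

# cost_optimized weights from the table below; toy normalized provider scores.
weights = {"cost": 0.50, "quality": 0.20, "latency": 0.15,
           "health": 0.10, "cache_affinity": 0.05}
providers = [
    {"name": "economy", "cost": 0.9, "quality": 0.6, "latency": 0.8,
     "health": 1.0, "cache_affinity": 0.5},
    {"name": "premium", "cost": 0.2, "quality": 0.95, "latency": 0.6,
     "health": 1.0, "cache_affinity": 0.5},
]
print(select(providers, weights)["name"])  # economy
```

Under the cost_optimized weighting, the economy provider's cheapness outweighs the premium provider's quality edge.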

Routing Strategies

Configure your workspace's default strategy via the Routing API.

| Strategy | Cost Weight | Quality Weight | Best For |
|---|---|---|---|
| cost_optimized | 0.50 | 0.20 | High-volume workloads where cost is the primary concern |
| quality_first | 0.10 | 0.50 | Customer-facing applications where quality is paramount |
| balanced | 0.30 | 0.30 | General-purpose workloads needing a mix of cost and quality |

Setting Your Strategy

curl -X PUT https://api.worldflowai.com/api/v1/routing/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "cost_optimized",
    "rules": []
  }'

Strategy Scoring Weights

Each strategy applies different weights to the composite scoring formula:

composite = w_cost * cost + w_quality * quality + w_latency * latency
            + w_health * health + w_cache * cache_affinity

| Strategy | Cost | Quality | Latency | Health | Cache Affinity |
|---|---|---|---|---|---|
| cost_optimized | 0.50 | 0.20 | 0.15 | 0.10 | 0.05 |
| quality_first | 0.10 | 0.50 | 0.15 | 0.15 | 0.10 |
| balanced | 0.30 | 0.30 | 0.20 | 0.10 | 0.10 |

You can also provide custom weights:

curl -X PUT https://api.worldflowai.com/api/v1/routing/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "balanced",
    "rules": [],
    "weights": {
      "cost": 0.40,
      "quality": 0.25,
      "latency": 0.20,
      "health": 0.10,
      "cacheAffinity": 0.05
    }
  }'
Warning: Weights must sum to 1.0.
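A client-side check can catch invalid weights before the config endpoint rejects them. This is a sketch; the function name and floating-point tolerance are assumptions:

```python
# Validate custom routing weights locally (the API requires they sum to 1.0).
import math

def validate_weights(weights: dict[str, float]) -> None:
    total = sum(weights.values())
    # abs_tol is an assumption; the API's exact tolerance is not documented here.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"routing weights sum to {total}, expected 1.0")

validate_weights({"cost": 0.40, "quality": 0.25, "latency": 0.20,
                  "health": 0.10, "cacheAffinity": 0.05})  # passes silently
```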

Per-Request Overrides

Override the workspace strategy on any individual request using the X-WorldFlow-Routing header.

| Header Value | Behavior |
|---|---|
| auto | Use the workspace's configured strategy |
| cheapest | Pick the cheapest available model (quality threshold = 0) |
| fastest | Pick the lowest-latency model |
| fixed:<model_id> | Pin to a specific model, bypassing the optimizer |
| fixed:chain:<chain_id> | Execute a multi-model chain |

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.worldflowai.com/v1",
    api_key="your-worldflow-api-key",
)

# Let the optimizer choose the cheapest capable model
response = client.chat.completions.create(
    model="auto",  # model field is ignored when routing is active
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_headers={"X-WorldFlow-Routing": "cheapest"},
)

# Pin to a specific model for this request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a recursive Fibonacci in Rust"}],
    extra_headers={"X-WorldFlow-Routing": "fixed:gpt-4o"},
)

TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.worldflowai.com/v1",
  apiKey: "your-worldflow-api-key",
});

const response = await client.chat.completions.create(
  {
    model: "auto",
    messages: [{ role: "user", content: "Classify this email as spam or not" }],
  },
  {
    headers: { "X-WorldFlow-Routing": "auto" },
  }
);

curl

curl https://api.worldflowai.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-WorldFlow-Routing: cheapest" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Summarize this article"}]
  }'

Reading Cost Headers

Every response includes cost optimizer headers so you can track exactly what happened:

| Header | Example Value | Description |
|---|---|---|
| x-worldflow-provider | openai | Which provider handled the request |
| x-worldflow-model | gpt-4o-mini | Which model was used |
| x-worldflow-cost | 0.000450 | Actual cost in USD for this request |
| x-worldflow-cost-saved | 0.008550 | Savings vs. the most expensive alternative |
| x-worldflow-routing-reason | auto_cost_optimized | Why this model was selected |

Reading Headers in Python

# Use the SDK's raw-response wrapper so HTTP headers are accessible
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"X-WorldFlow-Routing": "auto"},
)

completion = raw.parse()  # the usual ChatCompletion object

print(f"Provider: {raw.headers.get('x-worldflow-provider')}")
print(f"Model: {raw.headers.get('x-worldflow-model')}")
print(f"Cost: ${raw.headers.get('x-worldflow-cost')}")
print(f"Saved: ${raw.headers.get('x-worldflow-cost-saved')}")
print(f"Reason: {raw.headers.get('x-worldflow-routing-reason')}")

Routing Reasons

| Reason | Meaning |
|---|---|
| auto_cost_optimized | Optimizer selected cheapest model meeting quality threshold |
| cheapest_available | Cheapest mode selected the absolute cheapest model |
| fastest_available | Fastest mode selected the lowest-latency model |
| fixed_model | Request pinned to a specific model |
| chain_routing | Request delegated to a multi-model chain |
| fallback | No model met the quality threshold; fallback model used |
| cheapest_capable | Routing engine selected cheapest capable model |
| quality_preferred | Quality-first strategy preferred a higher-quality model |
| budget_constrained | Budget pressure forced a cheaper model |
| cache_affinity | A model with warm cache was preferred |
| policy_override | A static routing rule or force-provider hint matched |

Monitoring Savings

The analytics API provides aggregate savings data broken down by time period, complexity tier, and provider.

Get Savings Summary

# Monthly savings (default)
curl https://api.worldflowai.com/api/v1/routing/analytics/savings \
  -H "Authorization: Bearer $TOKEN"

# Weekly savings
curl "https://api.worldflowai.com/api/v1/routing/analytics/savings?period=week" \
  -H "Authorization: Bearer $TOKEN"

Response:

{
  "period": "month",
  "totalRequests": 45000,
  "totalActualCostCents": 12500,
  "totalCounterfactualCostCents": 34000,
  "totalSavingsCents": 21500,
  "savingsPercent": 63.2,
  "byComplexity": [
    {
      "complexity": "simple",
      "requestCount": 30000,
      "actualCostCents": 4500,
      "counterfactualCostCents": 22000,
      "savingsCents": 17500
    }
  ],
  "byProvider": [
    {
      "providerId": "...",
      "providerName": "openai",
      "requestCount": 25000,
      "actualCostCents": 7500
    }
  ]
}
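The savings fields are related arithmetically. Assuming savingsPercent is savings divided by the counterfactual cost, which matches the example numbers, you can recompute them:

```python
# Recompute the savings fields from the example response above.
actual = 12500          # totalActualCostCents
counterfactual = 34000  # totalCounterfactualCostCents

savings = counterfactual - actual
percent = round(100 * savings / counterfactual, 1)

print(savings)  # 21500  (totalSavingsCents)
print(percent)  # 63.2   (savingsPercent)
```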

List Routing Decisions

Review individual routing decisions for debugging:

curl "https://api.worldflowai.com/api/v1/routing/decisions?limit=10&complexity=simple" \
-H "Authorization: Bearer $TOKEN"

Simulate a Routing Decision

Test how the optimizer would route a query without sending it to a provider:

curl -X POST https://api.worldflowai.com/api/v1/routing/simulate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Extract all email addresses from this document",
    "strategy": "auto"
  }'

Response:

{
  "modelId": "gemini-1.5-flash",
  "reason": "auto_cost_optimized",
  "estimatedCostCents": 0,
  "complexity": "simple",
  "domain": "extraction",
  "alternatives": [
    { "modelId": "gpt-4o-mini", "score": 0.70 },
    { "modelId": "claude-3-haiku-20240307", "score": 0.68 }
  ]
}

Quality Feedback Loop

The optimizer improves over time through quality feedback. You can submit feedback explicitly or let WorldFlow AI collect implicit signals from response metadata.

Explicit Feedback

curl -X POST https://api.worldflowai.com/api/v1/routing/quality/feedback \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "routingDecisionId": "550e8400-e29b-41d4-a716-446655440000",
    "score": 0.85,
    "confidence": 0.9,
    "source": "user"
  }'

Implicit Signals

After receiving an LLM response, submit response metadata for automatic quality scoring:

curl -X POST https://api.worldflowai.com/api/v1/routing/quality/signals \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "routingDecisionId": "550e8400-e29b-41d4-a716-446655440000",
    "latencyMs": 450,
    "inputTokens": 100,
    "outputTokens": 200,
    "completed": true,
    "truncated": false,
    "error": false
  }'

WorldFlow AI computes an implicit quality score from these signals:

  • Completion: +0.4
  • No error: +0.3
  • Latency under 2s: +0.15
  • Reasonable output length (10-2000 tokens): +0.15
  • Truncation penalty: -0.2
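Under those point values, the implicit score can be sketched as follows (the function name and rounding are illustrative):

```python
# Implicit quality score from response signals, using the point values above.
def implicit_quality_score(completed: bool, error: bool, latency_ms: int,
                           output_tokens: int, truncated: bool) -> float:
    score = 0.0
    if completed:
        score += 0.4   # Completion
    if not error:
        score += 0.3   # No error
    if latency_ms < 2000:
        score += 0.15  # Latency under 2s
    if 10 <= output_tokens <= 2000:
        score += 0.15  # Reasonable output length
    if truncated:
        score -= 0.2   # Truncation penalty
    return round(score, 2)

# The signals payload from the example: completed, no error, 450 ms, 200 tokens.
print(implicit_quality_score(True, False, 450, 200, False))  # 1.0
```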

View Quality Profiles

Quality profiles track per-model, per-domain quality using Bayesian estimation:

curl "https://api.worldflowai.com/api/v1/routing/quality/profiles?limit=10" \
  -H "Authorization: Bearer $TOKEN"

Deterministic Routing Rules

For predictable routing, define rules that override the optimizer for specific complexity/domain combinations:

curl -X PUT https://api.worldflowai.com/api/v1/routing/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "cost_optimized",
    "rules": [
      {
        "complexity": "simple",
        "routeTo": "gpt-4o-mini",
        "priority": 1
      },
      {
        "domain": "code_generation",
        "routeTo": "claude-3-5-sonnet-20241022",
        "priority": 2
      }
    ]
  }'

Rules are evaluated by priority (higher first). The first matching rule wins. If no rule matches, the optimizer's scoring pipeline runs.

Extended Routing Rules

Beyond complexity and domain matching, rules support pluggable conditions for fine-grained control over routing decisions.

Pluggable Conditions

Each rule can include one or more conditions:

| Condition | Type | Description |
|---|---|---|
| headerMatch | object | Match on HTTP request headers. Keys are header names (case-insensitive), values are exact matches. Example: {"x-user-tier": "premium"} |
| modelPattern | string | Glob pattern matched against the requested model name. Examples: "gpt-4*", "claude-*", "gemini-1.5-*" |
| contentMatch | string[] | Keywords searched in the query text (case-insensitive). If any keyword is found, the condition matches. Example: ["urgent", "critical"] |
| metadataMatch | object | Match on tenant metadata supplied via the X-Tenant-Metadata header (JSON-encoded). Keys are metadata field names, values are exact matches. |
| timeWindow | object | Restrict the rule to specific times. Fields: startHour (0-23), endHour (0-23), daysOfWeek (array of "mon" through "sun"). |

Condition Operators

When a rule has multiple conditions, the conditionOperator field controls how they combine:

| Operator | Behavior |
|---|---|
| "and" | All conditions must match (default) |
| "or" | Any single condition matching is sufficient |
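For illustration, here is how headerMatch and contentMatch might combine under conditionOperator. This is a sketch of the documented semantics, not the actual evaluator:

```python
# Evaluate headerMatch / contentMatch conditions under conditionOperator.
def check_rule(rule: dict, headers: dict, query: str) -> bool:
    results = []
    if "headerMatch" in rule:
        # Header names are case-insensitive; values are exact matches.
        lower = {k.lower(): v for k, v in headers.items()}
        results.append(all(lower.get(k.lower()) == v
                           for k, v in rule["headerMatch"].items()))
    if "contentMatch" in rule:
        # Any keyword found in the query text (case-insensitive) matches.
        q = query.lower()
        results.append(any(kw.lower() in q for kw in rule["contentMatch"]))
    op = rule.get("conditionOperator", "and")  # "and" is the default
    return all(results) if op == "and" else any(results)

rule = {"headerMatch": {"x-user-tier": "premium"},
        "contentMatch": ["urgent", "critical"],
        "conditionOperator": "and"}
print(check_rule(rule, {"X-User-Tier": "premium"}, "This is URGENT"))  # True
print(check_rule(rule, {"X-User-Tier": "free"}, "This is urgent"))     # False
```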

A/B Testing

Rules support weighted traffic splitting for A/B experiments. Add weight, experimentId, and variant fields to create experiment groups:

curl -X PUT https://api.worldflowai.com/api/v1/routing/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "balanced",
    "rules": [
      {
        "experimentId": "exp-2026-q1-model-eval",
        "variant": "control",
        "routeTo": "gpt-4o",
        "weight": 80,
        "priority": 10
      },
      {
        "experimentId": "exp-2026-q1-model-eval",
        "variant": "treatment",
        "routeTo": "claude-3-5-sonnet-20241022",
        "weight": 20,
        "priority": 10
      }
    ]
  }'

Requests are assigned to a variant using consistent hashing on the tenant ID and experiment ID, so a given tenant always sees the same variant. Weights are relative within the same experimentId and priority level.
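A consistent-hashing assignment along these lines would give the documented behavior. The exact hash scheme is not specified here; this sketch assumes SHA-256 over the tenant and experiment IDs:

```python
# Deterministic variant assignment: hash(tenantId, experimentId) -> bucket
# 0-99, then walk the cumulative weights. The same tenant always lands in
# the same bucket, hence the same variant.
import hashlib

def assign_variant(tenant_id: str, experiment_id: str,
                   variants: list[tuple[str, int]]) -> str:
    digest = hashlib.sha256(f"{tenant_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    cumulative = 0
    for name, weight in variants:
        cumulative += weight
        if bucket < cumulative:
            return name
    return variants[-1][0]  # weights that under-sum fall to the last variant

variants = [("control", 80), ("treatment", 20)]
v1 = assign_variant("tenant-42", "exp-2026-q1-model-eval", variants)
v2 = assign_variant("tenant-42", "exp-2026-q1-model-eval", variants)
print(v1 == v2)  # True: same tenant, same variant every time
```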

Comprehensive Example

These rules match premium-tier users requesting a GPT-4 model with urgent content and enroll them in an experiment: 80% are fast-tracked to gpt-4o, and 20% are routed to gpt-4o-mini:

curl -X PUT https://api.worldflowai.com/api/v1/routing/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "cost_optimized",
    "rules": [
      {
        "headerMatch": {"x-user-tier": "premium"},
        "modelPattern": "gpt-4*",
        "contentMatch": ["urgent", "critical"],
        "conditionOperator": "and",
        "routeTo": "gpt-4o",
        "experimentId": "exp-premium-routing",
        "variant": "fast-track",
        "weight": 80,
        "priority": 5
      },
      {
        "headerMatch": {"x-user-tier": "premium"},
        "modelPattern": "gpt-4*",
        "contentMatch": ["urgent", "critical"],
        "conditionOperator": "and",
        "routeTo": "gpt-4o-mini",
        "experimentId": "exp-premium-routing",
        "variant": "economy",
        "weight": 20,
        "priority": 5
      }
    ]
  }'

Extended Routing Reasons

These routing reasons appear in the x-worldflow-routing-reason header when extended rules match:

| Reason | Meaning |
|---|---|
| experiment_split | A/B traffic split rule selected the variant |
| header_match | A header-matching rule determined the route |
| content_match | Content keyword matching determined the route |

Budget-Aware Routing

When budget limits are configured, the optimizer automatically adjusts:

| Budget Zone | Remaining | Behavior |
|---|---|---|
| Healthy | >50% | Route per strategy |
| Warning | 20-50% | Bias toward cheaper models (cost weight multiplier increases) |
| Critical | <20% | Force economy tier unless query is Frontier complexity |
| Exceeded | 0% | Block, warn, or auto-failover to cheapest (configurable) |
Caution: When a provider's budget enters the Critical zone, only Frontier-complexity queries will be routed to premium models. All other traffic is forced to economy tier to preserve budget for high-priority requests.
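The zone thresholds can be sketched as a simple lookup. Behavior at exactly 20% and 50% remaining is an assumption here; the table does not pin down the boundaries:

```python
# Map remaining budget fraction to a budget zone (boundary handling assumed).
def budget_zone(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "exceeded"   # block, warn, or auto-failover (configurable)
    if remaining_fraction < 0.20:
        return "critical"   # economy tier unless Frontier complexity
    if remaining_fraction <= 0.50:
        return "warning"    # bias toward cheaper models
    return "healthy"        # route per strategy

print(budget_zone(0.75))  # healthy
print(budget_zone(0.35))  # warning
print(budget_zone(0.10))  # critical
print(budget_zone(0.0))   # exceeded
```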