Skip to main content

Cache Threshold Tuning

WorldFlow AI matches incoming queries against cached responses using vector similarity. The cache hit threshold controls how similar a query must be to a cached entry before the cached response is returned. This guide covers threshold configuration, simulation, the playground, request coalescing, and cache invalidation strategies.

Global Threshold Settings

Get Current Settings

GET /api/v1/thresholds

Requires the ViewMetrics permission.

curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/thresholds"

The response includes the global cache hit threshold, partial hit threshold, and TTL settings for both L1 (in-memory) and L2 (persistent) cache tiers.

Update Settings

PUT /api/v1/thresholds

Requires the ManageConfig permission. Only provided fields are updated; omitted fields retain their current values.

curl -X PUT -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
"https://gateway.example.com/api/v1/thresholds" \
-d '{
"cacheHitThreshold": 0.85,
"partialHitThreshold": 0.70,
"l1TtlSecs": 7200,
"l2TtlSecs": 86400
}'

Key fields

FieldTypeDescription
cacheHitThresholdfloat (0-1)Minimum cosine similarity for a full cache hit. Higher values are stricter (fewer hits, higher precision).
partialHitThresholdfloat (0-1)Minimum similarity for a partial hit (returned with a lower-confidence flag).
l1TtlSecsintegerTime-to-live for L1 (in-memory) cache entries, in seconds.
l2TtlSecsintegerTime-to-live for L2 (persistent) cache entries, in seconds.

Choosing a Threshold

ThresholdHit RatePrecisionBest For
0.90 - 0.95LowVery highRegulated industries, medical/legal content
0.80 - 0.89MediumHighGeneral production workloads (recommended starting point)
0.70 - 0.79HighMediumHigh-volume FAQ or support bots where approximate answers are acceptable
tip

Start at 0.85 and use the simulation endpoint to measure the impact before lowering.

Tenant Overrides

Override global thresholds for specific tenants. Overrides are matched by tenant ID and take precedence over global settings.

List Overrides

GET /api/v1/thresholds/tenants
ParameterTypeDescription
tenantIdstringFilter by tenant ID
pageintegerPage number
pageSizeintegerItems per page
curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/thresholds/tenants"

Create or Update an Override

PUT /api/v1/thresholds/tenants/{tenant_id}

Requires the ManageConfig permission. Creates the override if it does not exist, or updates it.

curl -X PUT -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
"https://gateway.example.com/api/v1/thresholds/tenants/enterprise-acme" \
-d '{
"cacheHitThreshold": 0.92,
"partialHitThreshold": 0.80,
"l1TtlSecs": 3600
}'

Delete an Override

DELETE /api/v1/thresholds/tenants/{tenant_id}

The tenant falls back to global settings after deletion.

curl -X DELETE -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/thresholds/tenants/enterprise-acme"

Simulation

Before changing thresholds in production, simulate the impact against historical data.

GET /api/v1/thresholds/simulate

Requires the ViewMetrics permission.

Query parameters

ParameterTypeDescription
cacheHitThresholdfloatProposed cache hit threshold
partialHitThresholdfloatProposed partial hit threshold
tenantIdstringScope simulation to a specific tenant
periodstringHistorical window: day, week, or month
curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/thresholds/simulate?cacheHitThreshold=0.80&period=week"

The response includes the projected hit rate, estimated cost savings delta, and a breakdown of how many historical queries would shift between hit, partial hit, and miss at the proposed threshold.

Example workflow:

  1. Get current settings: GET /api/v1/thresholds
  2. Simulate a lower threshold: GET /api/v1/thresholds/simulate?cacheHitThreshold=0.80&period=week
  3. Review projected hit rate and savings.
  4. If acceptable, apply: PUT /api/v1/thresholds with {"cacheHitThreshold": 0.80}
  5. Monitor /api/v1/analytics/savings for the next 24 hours.

Playground

The playground lets you test how specific queries interact with the cache before they reach production.

Preview Cache Behavior

POST /api/v1/playground/preview

Tests a query against the live cache without making an LLM call. Returns the similarity score, predicted cache tier, and latency estimate.

curl -X POST -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-H "X-Synapse-Preview-Cache: true" \
"https://gateway.example.com/api/v1/playground/preview" \
-d '{
"query": "What is the weather today?",
"model": "gpt-4o"
}'

Generate Query Variations

POST /api/v1/playground/variations

Generates semantically similar rephrasing of a query using an LLM, then predicts cache behavior for each variation. Variations are grouped by which cache entry they would share.

Request body

FieldTypeRequiredDescription
querystringYesOriginal query (max 10,000 characters)
modelstringNoModel for variation generation
thresholdfloat (0-1)NoSimilarity threshold for grouping
curl -X POST -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
"https://gateway.example.com/api/v1/playground/variations" \
-d '{
"query": "What is machine learning?",
"model": "gpt-4o",
"threshold": 0.85
}'
{
"originalQuery": "What is machine learning?",
"variations": [
{
"query": "Can you explain machine learning?",
"similarity": 0.94,
"prediction": {
"willHitCache": true,
"tier": "L1",
"confidence": 0.97
}
},
{
"query": "Define ML for me",
"similarity": 0.82,
"prediction": {
"willHitCache": false,
"tier": "MISS",
"confidence": 0.88
}
}
],
"cacheGroups": [
{
"representative": "What is machine learning?",
"members": [
"What is machine learning?",
"Can you explain machine learning?"
],
"similarity": 0.94
}
],
"processingTimeMs": 1250
}

This helps you understand which natural-language variations of a query will share a cache entry and which will miss.

Request Coalescing

When multiple identical (or near-identical) requests arrive concurrently, WorldFlow AI can coalesce them into a single upstream LLM call. The first request ("original") is forwarded; subsequent requests ("piggybacked") wait for the original's response.

Response headers indicate coalescing status:

  • X-Coalesced: original -- This request was the one forwarded to the LLM.
  • X-Coalesced: piggybacked -- This request reused the original's response.

Get Coalescer Statistics

GET /api/v1/coalescer/stats

Requires the ViewMetrics permission. Returns counters for total coalesced requests, pending groups, and hit rates.

curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/coalescer/stats"

Flush the Coalescer

POST /api/v1/coalescer/flush

Requires the ManageConfig permission. Clears all in-flight coalescing entries. Use this after deploying a configuration change that affects cache keys.

curl -X POST -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/coalescer/flush"

Savings Analytics

Track the cost impact of your cache configuration over time.

GET /api/v1/analytics/savings

Query parameters

ParameterTypeDescription
granularitystringTime bucket: hour, day, week, month, quarter, or year (default day)
startDatestring (date)Start of the range (ISO 8601)
endDatestring (date)End of the range (ISO 8601)
tenantIdstringScope to a specific tenant
modelstringScope to a specific model
curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/analytics/savings?granularity=day&startDate=2025-06-01&endDate=2025-06-15"
{
"data": [
{
"period": "2025-06-01T00:00:00Z",
"totalRequests": 15000,
"cacheHits": 10500,
"cacheMisses": 4500,
"hitRate": 0.70,
"tokensSaved": 3200000,
"estimatedSavings": 48.00,
"actualCost": 20.50
}
],
"summary": {
"totalRequests": 225000,
"totalCacheHits": 157500,
"overallHitRate": 0.70,
"totalTokensSaved": 48000000,
"totalEstimatedSavings": 720.00,
"totalActualCost": 307.50
},
"metadata": {
"granularity": "day",
"executionTimeMs": 45
}
}

Request Logs

Browse and search individual request logs to debug cache behavior.

List Logs

GET /api/v1/logs
ParameterTypeDescription
startDatestring (datetime)Filter from this timestamp
endDatestring (datetime)Filter until this timestamp
statusstringFilter by response status
tierstringFilter by cache tier (L1, L2, MISS, BYPASS)
modelstringFilter by model
searchstringFree-text search in query content
pageintegerPage number
pageSizeintegerItems per page
curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/logs?tier=MISS&model=gpt-4o&pageSize=5"

Get Log Detail

GET /api/v1/logs/{id}

Returns the full request and response bodies, similarity scores, and timing breakdown.

Find Similar Logs

GET /api/v1/logs/{id}/similar

Performs a vector similarity search to find other log entries whose queries are semantically close to the given entry.

ParameterTypeDescription
limitintegerMaximum results (default 10)
curl -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/logs/log-123.../similar?limit=5"

Cache Invalidation Strategies

TTL-Based Expiration

Set l1TtlSecs and l2TtlSecs in the threshold settings. Entries expire automatically after the configured duration. Shorter TTLs keep the cache fresh but reduce hit rates.

Cache Bypass Header

Any request can bypass the cache entirely by including the X-Synapse-Skip-Cache: true header. This is useful for:

  • Testing new model versions against live traffic.
  • Forcing a fresh response for a specific query.
  • Debugging unexpected cache behavior.

The bypassed response includes source: "bypass" in the synapse_metadata and an X-Synapse-Cache-Status: BYPASS response header.

Tenant-Level Threshold Overrides

Use tenant overrides to effectively reduce caching for specific customers. Setting a tenant's cacheHitThreshold to 1.0 disables semantic matching for that tenant (only exact duplicates hit).

Coalescer Flush

After a major configuration change (threshold adjustment, model swap, or guardrail update), flush the coalescer to ensure in-flight requests are not matched against stale groupings:

curl -X POST -H "Authorization: Bearer $API_KEY" \
"https://gateway.example.com/api/v1/coalescer/flush"

Tuning Workflow

  1. Baseline. Record current hit rate and savings from /api/v1/analytics/savings?granularity=day&period=7d.
  2. Simulate. Try a new threshold via /api/v1/thresholds/simulate?cacheHitThreshold=0.80&period=week.
  3. Playground. Test representative queries with /api/v1/playground/variations to verify grouping behavior.
  4. Apply. Update global settings or add a tenant override.
  5. Flush. Clear the coalescer with POST /api/v1/coalescer/flush.
  6. Monitor. Watch /api/v1/analytics/savings and /api/v1/logs for the next 24-48 hours. Roll back if precision drops below your quality bar.