Skip to main content

OpenAI-Compatible Proxy API

WorldFlow AI provides a drop-in replacement for the OpenAI API. Point your existing OpenAI SDK at the WorldFlow AI base URL and get transparent semantic caching with zero code changes.

Chat Completions

POST /v1/chat/completions

Fully compatible with the OpenAI Chat Completions API. WorldFlow AI checks the semantic cache before forwarding to OpenAI.

Request

Same schema as OpenAI. Key fields:

FieldTypeRequiredDescription
modelstringyesModel name (e.g., "gpt-4o", "gpt-4o-mini")
messagesarrayyesConversation messages
streambooleannoEnable streaming (default: false)
temperaturenumbernoSampling temperature
max_tokensintegernoMaximum response tokens
toolsarraynoTool/function definitions
tool_choicestring/objectnoTool selection strategy

Example

curl -X POST https://api.worldflowai.com/v1/chat/completions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain semantic caching in 2 sentences."}
],
"temperature": 0.7,
"max_tokens": 200
}'

Response

Same schema as OpenAI, with additional headers:

{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1700000000,
"model": "gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Semantic caching stores LLM responses indexed by the meaning of queries rather than exact text matches. When a new query is semantically similar to a cached one, the stored response is returned instantly, saving both cost and latency."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 42,
"total_tokens": 67
}
}

Response Headers

HeaderValuesDescription
X-Cache-StatusHIT, MISSWhether the response was served from cache
X-Request-IDUUIDRequest identifier for debugging

Streaming

Set "stream": true to receive Server-Sent Events (SSE). WorldFlow AI handles streaming for both cache hits and misses:

  • Cache miss: Streams from OpenAI while caching the full response
  • Cache hit: Reconstructs the SSE stream from the cached response
curl -X POST https://api.worldflowai.com/v1/chat/completions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'

Tool Calls

Tool calls (function calling) are supported. The request and response formats match OpenAI exactly.

Python SDK

from openai import OpenAI

client = OpenAI(
base_url="https://api.worldflowai.com/v1",
api_key="YOUR_SYNAPSE_JWT_TOKEN",
)

response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is a REST API?"}],
)
print(response.choices[0].message.content)

TypeScript SDK

import OpenAI from "openai";

const client = new OpenAI({
baseURL: "https://api.worldflowai.com/v1",
apiKey: "YOUR_SYNAPSE_JWT_TOKEN",
});

const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "What is a REST API?" }],
});
console.log(response.choices[0].message.content);

List Models

GET /v1/models

Returns the list of available models.

curl https://api.worldflowai.com/v1/models \
-H "Authorization: Bearer $TOKEN"

Response

{
"object": "list",
"data": [
{
"id": "gpt-4o",
"object": "model",
"created": 1700000000,
"owned_by": "openai"
}
]
}

Cache Behavior

What Gets Cached

The cache key is derived from the semantic embedding of the user's messages. Identical or semantically similar queries produce cache hits.

What Bypasses Cache

  • Requests with "stream": true and unique system prompts may have lower hit rates
  • Tool call responses are cached based on the query, not the tool output
  • Set X-Synapse-Skip-Cache: true header to bypass caching for a specific request

Cache-Control Headers

HeaderValuesDescription
X-Synapse-Skip-CachetrueBypass cache for this request
X-Synapse-Workspace-ContextstringWorkspace context for cache scoping
X-Synapse-Code-ContextJSONDetailed code context for cache invalidation

Multi-Turn Context Caching

WorldFlow AI implements the ContextCache paper's two-stage retrieval architecture for intelligent multi-turn conversation caching. This allows cache hits for conversations that are semantically similar (not just identical).

How It Works

  1. Turn Embeddings: Each message in the conversation is embedded independently
  2. Context Fusion: Turn embeddings are fused via multi-head self-attention into a single context embedding
  3. Stage 1 (Coarse): HNSW search finds candidates by last query embedding similarity (>=0.85)
  4. Stage 2 (Fine): Candidates are scored using weighted combination:
    final_score = 0.3 * query_similarity + 0.7 * context_similarity
    Cache hit requires final_score >= 0.92

Enabling Context Caching

Context caching is enabled by default. For optimal cache reuse, clients should provide a consistent session_id:

curl -X POST https://api.worldflowai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"model": "gpt-4o-mini",
"session_id": "user-123-session-abc",
"messages": [
{"role": "user", "content": "What is Python?"},
{"role": "assistant", "content": "Python is a programming language."},
{"role": "user", "content": "Tell me more"}
]
}'

Important: Without session_id, a new UUID is generated per request, which prevents session-level turn embedding reuse.

Context Cache Response

The response includes a synapse_metadata object with cache status details:

{
"choices": [...],
"synapse_metadata": {
"cache_hit": true,
"cache_tier": "l2_context",
"similarity": 1.0,
"context_similarity": 0.95,
"session_id": "user-123-session-abc"
}
}
FieldTypeDescription
cache_hitbooleanWhether the response was served from cache
cache_tierstringCache tier that produced the hit (e.g., "l2_context")
similaritynumberQuery similarity score
context_similaritynumberContext similarity score from Stage 2
session_idstringSession identifier used for this request

Configuration

VariableDescriptionDefault
SYNAPSE_CONTEXT_CACHE__ENABLEDEnable context-aware cachingtrue
SYNAPSE_CONTEXT_CACHE__STAGE1_THRESHOLDStage 1 query similarity threshold0.85
SYNAPSE_CONTEXT_CACHE__CONTEXT_HIT_THRESHOLDFinal score threshold for cache hit0.92
SYNAPSE_CONTEXT_CACHE__QUERY_WEIGHTWeight for query similarity0.3
SYNAPSE_CONTEXT_CACHE__CONTEXT_WEIGHTWeight for context similarity0.7
SYNAPSE_CONTEXT_CACHE__MAX_TURNSMaximum turns to consider20