
Core Concepts

Three-Tier Cache Architecture

WorldFlow AI uses a three-tier cache hierarchy to serve queries at the lowest possible latency:

Query ──embed──> vector ──search L0/L1/L2──> match found?
                                ├── yes ──> return cached response
                                └── no ───> forward to LLM, cache response
| Tier | Backend | Latency | Description |
| --- | --- | --- | --- |
| L0 (GPU) | NVIDIA CAGRA on-device index | Sub-1ms | GPU-resident graph-based nearest-neighbor search. Checked first for KV-cache acceleration workloads. |
| L1 (Redis) | In-memory HNSW index | Sub-5ms | CPU-side in-memory vector index. First tier checked for semantic cache lookups. |
| L2 (Milvus) | Persistent vector database | 10-50ms | Durable vector store. Checked on L1 miss. Handles cold-start and long-tail queries. |
Info: L0 is used exclusively by the KV-cache inference acceleration path (SemBlend). The semantic response cache uses L1 and L2. All three tiers share the same embedding model for consistency.
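The tier order above amounts to a fall-through lookup: try each tier in sequence and return the first hit. A minimal sketch (the tier search functions here are hypothetical stand-ins, not the actual WorldFlow AI internals):

```python
from typing import Callable, List, Optional

def tiered_lookup(
    query_vec: List[float],
    tiers: List[Callable[[List[float]], Optional[str]]],
) -> Optional[str]:
    """Check each tier in order; return the first hit, or None on a full miss."""
    for search in tiers:
        hit = search(query_vec)
        if hit is not None:
            return hit
    return None

# Toy tiers: L1 misses, L2 hits, so the lookup falls through to L2.
l1 = lambda v: None
l2 = lambda v: "cached response"
print(tiered_lookup([0.1, 0.2], [l1, l2]))
```

On a full miss the caller forwards the query to the LLM and writes the response back into the cache, as in the diagram above.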

KV-Cache Inference Acceleration (SemBlend)

SemBlend is WorldFlow AI's KV-cache reuse engine. When a new prompt is semantically similar to a previously processed prompt, SemBlend injects the donor's precomputed key-value cache into the GPU, skipping redundant prefill computation entirely.

New prompt ──embed──> L0 search ──> donor found?
                         ├── yes ──> inject KV cache + RoPE correction ──> skip prefill
                         └── no ───> normal prefill (cold path)

Key properties:

  • TTFT speedup: Up to 12x at 32K tokens, 8x at 16K, 4x at 8K context length
  • Quality preservation: Perplexity ratio between 1.00 and 1.07 relative to the cold (full-prefill) baseline across datasets
  • RoPE correction: Positional encodings are corrected when donor and recipient token boundaries differ, ensuring mathematical equivalence
  • Transparent: Works with any vLLM-served model. No model fine-tuning required.
Tip: SemBlend is most effective for workloads with repeated long-context patterns: RAG pipelines, multi-turn conversations, document summarization, and customer support. For short prompts (<2K tokens), the semantic response cache (L1/L2) is more efficient.
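The donor-lookup step in the diagram above can be sketched with plain cosine similarity over an L0 index. This is a toy illustration; the index layout and the 0.85 threshold default are taken from the configuration table below, but the actual CAGRA search is GPU-resident:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_donor(prompt_vec, l0_index, threshold=0.85):
    """Return the KV handle of the most similar donor above the threshold,
    or None, which sends the prompt down the cold (normal prefill) path."""
    best, best_sim = None, threshold
    for entry_vec, kv_handle in l0_index:
        sim = cosine(prompt_vec, entry_vec)
        if sim >= best_sim:
            best, best_sim = kv_handle, sim
    return best
```

When a donor is found, its precomputed KV cache is injected and RoPE positional encodings are corrected for any token-boundary mismatch; otherwise the prompt is prefetched normally.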

L0 GPU Cache Configuration

The L0 GPU cache uses NVIDIA's cuVS CAGRA algorithm running directly in GPU HBM for sub-millisecond ANN search. It is optional and disabled by default.

| Variable | Description | Default |
| --- | --- | --- |
| SYNAPSE_L0_CACHE_ENABLED | Enable L0 GPU cache | false |
| SYNAPSE_L0_CACHE_DEVICE_ID | CUDA device index | 0 |
| SYNAPSE_L0_CACHE_MAX_MEMORY_MB | HBM budget for L0 cache (MB) | 2500 |
| SYNAPSE_L0_CACHE_EMBEDDING_DIM | Embedding dimension | 1024 |
| SYNAPSE_L0_CACHE_MAX_ENTRIES | Maximum cached vectors | 524288 |
| SYNAPSE_L0_CACHE_SIMILARITY_THRESHOLD | Cosine similarity threshold | 0.85 |
| SYNAPSE_L0_CACHE_BATCH_SIZE | Queries per CAGRA batch | 16 |
| SYNAPSE_L0_CACHE_BATCH_TIMEOUT_US | Max wait for batch fill (µs) | 500 |

Key design features:

  • Index pool pattern: Multiple pre-built indexes for concurrent search without blocking on rebuilds
  • L1 auto-promotion: L1 cache hits are automatically promoted into L0, keeping the GPU cache warm
  • Adaptive memory management: 3-tier pressure system (Normal/Pressure/Critical) yields GPU memory to inference workloads when needed
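The 3-tier pressure system can be pictured as a state function of HBM utilization. A minimal sketch; the 80% and 95% cut-offs below are illustrative assumptions, not the product's actual thresholds:

```python
def pressure_state(hbm_used_mb: float, hbm_budget_mb: float = 2500.0) -> str:
    """Map HBM utilization to a pressure tier. Higher tiers progressively
    yield GPU memory back to inference workloads.
    Thresholds (80% / 95%) are illustrative, not WorldFlow AI's actual values."""
    ratio = hbm_used_mb / hbm_budget_mb
    if ratio < 0.80:
        return "normal"
    if ratio < 0.95:
        return "pressure"   # e.g. shrink the index pool, pause promotions
    return "critical"       # e.g. evict aggressively, yield memory to inference

print(pressure_state(1000))  # normal
```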

Semantic Cache

WorldFlow AI embeds every LLM query into a vector and searches for semantically similar past queries. If a match exceeds the similarity threshold (default 0.85), the cached response is returned instantly instead of calling the LLM provider.

The proxy is a drop-in replacement for OpenAI and Anthropic APIs. Point your SDK at WorldFlow AI's base URL, and caching happens transparently.
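Because the proxy speaks the OpenAI wire format, clients only need a different base URL; the request body itself is unchanged. A small sketch of the payload an SDK would send (the host is the one used in the curl example later in this page; sending it is a single HTTP POST):

```python
import json

# OpenAI-compatible base URL for the WorldFlow AI proxy.
BASE_URL = "https://api.worldflowai.com/v1"

# Standard chat-completions body; caching happens transparently server-side.
payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is Python?"}],
}
body = json.dumps(payload).encode()
print(f"POST {BASE_URL}/chat/completions ({len(body)} bytes)")
```

With the official OpenAI or Anthropic SDKs, the same effect is achieved by overriding the client's base URL setting.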

Multi-Turn Context Caching

WorldFlow AI implements the ContextCache paper's two-stage retrieval architecture for intelligent multi-turn conversation caching. This allows cache hits for conversations that are semantically similar, not just identical.

How it works:

  1. Turn Embeddings: Each message in the conversation is embedded independently
  2. Context Fusion: Turn embeddings are fused via multi-head self-attention into a single context embedding
  3. Stage 1 (Coarse): HNSW search finds candidates by last query embedding similarity (≥0.85)
  4. Stage 2 (Fine): Candidates are scored using a weighted combination:
    final_score = 0.3 × query_similarity + 0.7 × context_similarity
    Cache hit requires final_score ≥ 0.92
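The two stages above can be written out directly as a scoring function, using the default weights and thresholds from the steps (which match the configuration table below):

```python
def context_cache_decision(query_sim: float, context_sim: float,
                           stage1_threshold: float = 0.85,
                           hit_threshold: float = 0.92):
    """Two-stage check: stage 1 gates on last-query similarity, stage 2 on
    the weighted final score. Returns (final_score, is_hit)."""
    if query_sim < stage1_threshold:        # Stage 1: coarse HNSW gate
        return 0.0, False
    # Stage 2: fine scoring over the fused context embedding
    final_score = 0.3 * query_sim + 0.7 * context_sim
    return final_score, final_score >= hit_threshold

score, hit = context_cache_decision(0.90, 0.95)
print(round(score, 3), hit)  # 0.935 True
```

Note that a candidate with a strong last-query match (0.90) still needs high context similarity to clear the 0.92 bar, which is what makes the cache conversation-aware rather than query-only.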

For optimal cache reuse, clients should provide a consistent session_id in requests:

curl https://api.worldflowai.com/v1/chat/completions \
  -H "Authorization: Bearer $SYNAPSE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "session_id": "user-123-session-abc",
    "messages": [
      {"role": "user", "content": "What is Python?"},
      {"role": "assistant", "content": "Python is a programming language."},
      {"role": "user", "content": "Tell me more"}
    ]
  }'
Info: Without session_id, a new UUID is generated per request, which prevents session-level turn-embedding reuse. Always pass a consistent session_id for multi-turn conversations.

Context cache responses include a synapse_metadata object:

{
  "choices": [...],
  "synapse_metadata": {
    "cache_hit": true,
    "cache_tier": "l2_context",
    "similarity": 1.0,
    "context_similarity": 0.95,
    "session_id": "user-123-session-abc"
  }
}

Configuration:

| Variable | Description | Default |
| --- | --- | --- |
| SYNAPSE_CONTEXT_CACHE__ENABLED | Enable context-aware caching | true |
| SYNAPSE_CONTEXT_CACHE__STAGE1_THRESHOLD | Stage 1 query similarity threshold | 0.85 |
| SYNAPSE_CONTEXT_CACHE__CONTEXT_HIT_THRESHOLD | Final score threshold for cache hit | 0.92 |
| SYNAPSE_CONTEXT_CACHE__QUERY_WEIGHT | Weight for query similarity | 0.3 |
| SYNAPSE_CONTEXT_CACHE__CONTEXT_WEIGHT | Weight for context similarity | 0.7 |
| SYNAPSE_CONTEXT_CACHE__MAX_TURNS | Maximum turns to consider | 20 |

Memory Layer (GCC Model)

The memory system is based on the GCC (Git-Context-Controller) architecture. It uses git-like primitives to give agents persistent knowledge:

| GCC Operation | HTTP Endpoint | Analogy |
| --- | --- | --- |
| COMMIT (Store) | POST /projects/{id}/store | git commit --- save a milestone |
| CONTEXT (Recall) | GET /projects/{id}/recall | git log --- retrieve history |
| BRANCH | POST /projects/{id}/branches | git branch --- parallel workstreams |
| MERGE | POST /projects/{id}/merge | git merge --- combine branches |

Projects

A project is the top-level container. It has a name, a living roadmap document, and contains branches with milestones.

Project
├── roadmap (living document)
├── main (default branch)
│   ├── milestone-1
│   ├── milestone-2
│   └── milestone-3
└── feature-x (parallel branch)
    ├── milestone-1
    └── milestone-2

Milestones

A milestone is a snapshot of progress on a branch. Every milestone has three fields from the GCC paper:

| Field | Description | Example |
| --- | --- | --- |
| branchPurpose | Why this branch exists | "Implement user authentication" |
| cumulativeProgress | What's been done so far | "Created login form, added JWT validation, wrote tests" |
| thisContribution | What this specific milestone adds | "Added password reset flow with email verification" |

Milestones are immutable and append-only. Each has a monotonic sequence number and a content hash for deduplication.
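The content hash makes deduplication cheap: two milestones with identical fields hash to the same key and the duplicate is dropped. One way such a hash could be derived (the exact scheme is an implementation detail; SHA-256 over the canonical JSON of the three fields is an assumption):

```python
import hashlib
import json

def milestone_hash(branch_purpose: str, cumulative_progress: str,
                   this_contribution: str) -> str:
    """Content hash over the three GCC milestone fields (illustrative scheme)."""
    canonical = json.dumps(
        {
            "branchPurpose": branch_purpose,
            "cumulativeProgress": cumulative_progress,
            "thisContribution": this_contribution,
        },
        sort_keys=True,  # canonical key order so equal content hashes equally
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

a = milestone_hash("Implement user authentication",
                   "Created login form", "Added JWT validation")
b = milestone_hash("Implement user authentication",
                   "Created login form", "Added JWT validation")
assert a == b  # identical content -> identical hash -> deduplicated
```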

Branches

Branches represent parallel workstreams within a project. They start as forks of a parent branch (usually main) and can be merged back or abandoned.

Branch states:

  • active --- work in progress
  • merged --- milestones synthesized into target branch
  • abandoned --- dead end, preserved for history

Recall Views

The recall endpoint supports five granularity levels:

| View | Returns | Use Case |
| --- | --- | --- |
| overview | Project summary + branch list | Session start, agent orientation |
| branch | Milestones on a branch (paginated) | Deep dive into a workstream |
| milestone | Single milestone detail | Inspect a specific checkpoint |
| log | OTA reasoning trace entries | Debug agent behavior |
| metadata | Metadata segment | Custom metadata retrieval |
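A recall call is a plain GET against /projects/{id}/recall with the granularity selected per request. A sketch of building such a URL; the `view` and `page` query-parameter names (and the host) are assumptions for illustration, based on the table above:

```python
from typing import Optional
from urllib.parse import urlencode

def recall_url(base: str, project_id: str, view: str = "overview",
               page: Optional[int] = None) -> str:
    """Build a recall URL. `view`/`page` as query params are assumed names."""
    params = {"view": view}
    if page is not None:
        params["page"] = page  # the branch view is paginated
    return f"{base}/projects/{project_id}/recall?{urlencode(params)}"

# Placeholder host for illustration only.
print(recall_url("https://memory.example", "proj-42", view="branch", page=2))
```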

OTA Log

The log endpoint captures continuous Observe-Think-Act (OTA) reasoning traces. Unlike milestones, which are curated snapshots, log entries are high-frequency and capture the agent's real-time thought process.

Log phases:

  • observation --- what the agent noticed
  • thought --- reasoning about the observation
  • action --- what the agent decided to do

Contributors

Contributors map human identities to agent IDs. A single developer might use Claude Code, Cursor, and a custom agent --- the contributor model links all their agent IDs to one persona.

Contributor: "Alice"
├── agent_id: "claude-code-alice-macbook"
├── agent_id: "cursor-alice-work"
└── agent_id: "custom-agent-alice"

This enables per-person activity views across all projects and agent types.
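The identity mapping is essentially a reverse index from agent ID to persona. A minimal sketch using the example data above:

```python
# Map each contributor to their agent IDs (example data from above).
contributors = {
    "Alice": [
        "claude-code-alice-macbook",
        "cursor-alice-work",
        "custom-agent-alice",
    ],
}

# Invert into an agent_id -> contributor lookup, so activity recorded under
# any agent ID can be attributed to one person across projects.
agent_to_person = {
    agent_id: person
    for person, agent_ids in contributors.items()
    for agent_id in agent_ids
}

print(agent_to_person["cursor-alice-work"])  # Alice
```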

External Sources

External sources ingest context from tools your team already uses:

| Source Type | What It Ingests |
| --- | --- |
| slack | Channel messages, thread discussions |
| jira | Ticket descriptions, comments, status changes |
| confluence | Page content, updates |
| github | PR descriptions, review comments |

Ingested content becomes searchable alongside agent milestones, providing a unified view of project knowledge.

Intelligence Layer

The intelligence layer sits on top of the memory graph. It answers natural language questions by searching across projects, milestones, contributors, and external sources.

POST /api/v1/memory/intelligence/query
{
  "question": "What did Alice work on last week?",
  "timeRange": "7d"
}

Response includes a synthesized answer with source citations pointing back to specific milestones.

The action endpoint can execute follow-up actions based on intelligence results (e.g., create a JIRA ticket, post to Slack).

Promote (Cache to Memory)

When a cached LLM response proves valuable (high reuse score), it can be promoted from the ephemeral cache to long-term memory. This bridges the semantic cache and the memory layer.

Cache entry (high reuse) ──promote──> Long-term knowledge entry