Custom Providers & Chains
This guide covers registering self-hosted or third-party OpenAI-compatible endpoints as custom providers, and defining multi-model chains that execute sequential model calls through a single API request.
Registering Custom Providers
Any endpoint that implements the OpenAI chat completions API can be registered as a WorldFlow AI provider. This includes:
- vLLM -- self-hosted open-source model inference
- NVIDIA Triton -- production inference server
- TGI (Text Generation Inference) -- HuggingFace inference server
- Ollama -- local model serving
- Any OpenAI-compatible proxy
Step 1: Create a Custom Integration
Register the endpoint as an integration with provider type CUSTOM:
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/integrations \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Internal vLLM Cluster",
"providerType": "CUSTOM",
"apiKey": "internal-api-key",
"baseUrl": "https://vllm.internal.example.com/v1",
"isActive": true
}'
For endpoints that do not require authentication, supply a placeholder value for the required apiKey field.
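As a sketch, the same registration can be done programmatically with the Python standard library. The endpoint path and payload fields are taken from the curl call above; the "Local Ollama" name and URL are illustrative assumptions:

```python
import json
import urllib.request

def build_integration_payload(name, base_url, api_key=None):
    """Build the integration registration body. Endpoints without
    authentication get a placeholder value, since apiKey is required."""
    return {
        "name": name,
        "providerType": "CUSTOM",
        "apiKey": api_key or "placeholder",  # placeholder for unauthenticated endpoints
        "baseUrl": base_url,
        "isActive": True,
    }

def register_integration(token, payload):
    """POST the payload to the model-catalog integrations endpoint."""
    req = urllib.request.Request(
        "https://api.worldflowai.com/api/v1/model-catalog/integrations",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example: a local Ollama server with no auth (hypothetical values)
payload = build_integration_payload("Local Ollama", "http://localhost:11434/v1")
```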
Step 2: Create the Provider
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/providers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "vLLM Internal",
"integrationId": "<integration-id>",
"providerType": "CUSTOM",
"isEnabled": true,
"priority": 1
}'
Step 3: Define Models
Register each model served by the endpoint:
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/models \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"modelId": "llama-3.1-70b",
"providerId": "<provider-id>",
"displayName": "LLaMA 3.1 70B (Internal)",
"inputCostPer1kTokens": 0.0,
"outputCostPer1kTokens": 0.0,
"maxContextLength": 131072,
"capabilities": ["chat", "tools"],
"qualityScore": 0.85,
"isActive": true
}'
For self-hosted models, set token costs to 0 (or your internal chargeback rate). The optimizer will strongly prefer these models when cost-optimized routing is active.
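To make the per-1k-token convention concrete, the cost of a single request works out as below. This only illustrates the rate arithmetic; the optimizer's internal scoring formula is not specified here, and the chargeback rates are hypothetical:

```python
def request_cost(input_tokens, output_tokens, input_per_1k, output_per_1k):
    """Cost of one request given per-1k-token rates."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k

# Self-hosted model registered with zero rates: free regardless of volume.
free = request_cost(5000, 1200, 0.0, 0.0)

# Internal chargeback example: $0.10 in / $0.30 out per 1k tokens (hypothetical).
charged = request_cost(5000, 1200, 0.10, 0.30)  # 0.5 + 0.36 ≈ 0.86
```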
Step 4: Verify
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/integrations/{id}/verify \
-H "Authorization: Bearer $TOKEN"
Dynamic Registration
All catalog operations take effect immediately. You do not need to restart WorldFlow AI. Register, update, or remove providers at any time through the API.
# Update the base URL of an existing integration (e.g., after a migration)
curl -X PUT https://api.worldflowai.com/api/v1/model-catalog/integrations/{id} \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"baseUrl": "https://vllm-v2.internal.example.com/v1"
}'
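The note above also mentions removal. Assuming the catalog follows REST conventions (a DELETE on the same integrations path — an assumption, since the guide only shows POST and PUT), removal would look like:

```python
import urllib.request

def delete_integration_request(token, integration_id):
    """Build a DELETE request for an integration.
    REST-style removal on the integrations path is assumed."""
    return urllib.request.Request(
        f"https://api.worldflowai.com/api/v1/model-catalog/integrations/{integration_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )

# Hypothetical token and integration ID
req = delete_integration_request("my-token", "abc123")
```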
Example: Registering a Triton Endpoint
# Integration
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/integrations \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Triton Vision Models",
"providerType": "CUSTOM",
"apiKey": "triton-internal",
"baseUrl": "https://triton.internal.example.com/v2/models",
"isActive": true
}'
# Provider
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/providers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Triton Vision",
"integrationId": "<triton-integration-id>",
"providerType": "CUSTOM",
"isEnabled": true,
"priority": 1
}'
# Model
curl -X POST https://api.worldflowai.com/api/v1/model-catalog/models \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"modelId": "pixtral-12b",
"providerId": "<triton-provider-id>",
"displayName": "Pixtral 12B (Vision)",
"inputCostPer1kTokens": 0.0,
"outputCostPer1kTokens": 0.0,
"maxContextLength": 32768,
"capabilities": ["chat", "vision"],
"qualityScore": 0.75,
"isActive": true
}'
Multi-Model Chains
Chains let you define a sequence of model calls where each step's output feeds into the next. This is useful for pipelines like:
- Vision model to extract text from an image, then a language model to analyze it
- A fast model to draft, then a premium model to refine
- Classification, then specialized processing based on the class
Defining a Chain
Register a chain as a model with a chain: prefix in the model ID. The chain definition is stored in the model's metadata.
Currently, chains are executed through the routing system using fixed:chain:<chain_id> routing. Define the chain steps in your application and route the initial request through WorldFlow AI.
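A chain registration payload might be built as follows. The chain: prefix in the model ID is what the router keys on; the metadata layout (an ordered "steps" list) is a hypothetical schema, since the guide does not specify one:

```python
def build_chain_model_payload(chain_id, provider_id, steps):
    """Register a chain as a catalog model. The chain: prefix is what
    the router matches on; the metadata shape is an assumed schema."""
    return {
        "modelId": f"chain:{chain_id}",
        "providerId": provider_id,
        "displayName": f"Chain: {chain_id}",
        "isActive": True,
        "metadata": {"steps": steps},  # hypothetical: ordered list of model IDs
    }

payload = build_chain_model_payload(
    "vision-analysis",
    "<provider-id>",
    ["pixtral-12b", "llama-3.1-70b", "gpt-4o-mini"],
)
```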
Chain-Aware Routing
Use the X-WorldFlow-Routing header with fixed:chain:<chain_id> to invoke a chain:
curl https://api.worldflowai.com/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-H "X-WorldFlow-Routing: fixed:chain:vision-analysis" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Analyze this image"}]
}'
When the router sees chain:<id>, it delegates to the chain execution engine, which runs each step sequentially and returns the final result.
End-to-End Example: 3-Step Vision Pipeline
This example shows a pipeline that processes an image through three models:
- Vision model (Pixtral 12B) -- extracts text and objects from the image
- Language model (LLaMA 3.1 70B) -- analyzes and structures the extracted content
- Summarization model (GPT-4o-mini) -- produces a concise summary
Register the Models
First, ensure all three models are registered in the catalog (see steps above for custom providers).
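One way to check this in code, given a catalog listing fetched however your deployment exposes it (the model objects are assumed to carry the "modelId" and "isActive" fields used during registration above):

```python
REQUIRED = ["pixtral-12b", "llama-3.1-70b", "gpt-4o-mini"]

def missing_models(catalog_models, required=REQUIRED):
    """Return the required model IDs that are absent or inactive
    in the catalog listing."""
    active = {m["modelId"] for m in catalog_models if m.get("isActive")}
    return [model_id for model_id in required if model_id not in active]

# Example: gpt-4o-mini not yet registered
catalog = [
    {"modelId": "pixtral-12b", "isActive": True},
    {"modelId": "llama-3.1-70b", "isActive": True},
]
```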
Execute the Pipeline
Using the OpenAI SDK with per-step routing:
from openai import OpenAI
client = OpenAI(
base_url="https://api.worldflowai.com/v1",
api_key="your-worldflow-api-key",
)
# Step 1: Vision extraction
vision_response = client.chat.completions.create(
model="pixtral-12b",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Extract all text and describe objects in this image"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
],
}
],
extra_headers={"X-WorldFlow-Routing": "fixed:pixtral-12b"},
)
extracted = vision_response.choices[0].message.content
# Step 2: Analysis
analysis_response = client.chat.completions.create(
model="llama-3.1-70b",
messages=[
{"role": "system", "content": "You are an expert analyst. Structure the following extracted content into categories."},
{"role": "user", "content": extracted},
],
extra_headers={"X-WorldFlow-Routing": "fixed:llama-3.1-70b"},
)
analysis = analysis_response.choices[0].message.content
# Step 3: Summarization
summary_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Produce a 2-sentence executive summary."},
{"role": "user", "content": analysis},
],
extra_headers={"X-WorldFlow-Routing": "fixed:gpt-4o-mini"},
)
summary = summary_response.choices[0].message.content
print(summary)
Each step is independently routed and individually cached by WorldFlow AI. If the same image is processed again, the vision extraction step may be served from cache.
Monitoring Chain Costs
Since each step is a separate request, cost headers are returned per-step. Aggregate costs using the analytics API:
curl "https://api.worldflowai.com/api/v1/routing/usage?limit=10" \
-H "Authorization: Bearer $TOKEN"
Mixing Custom and Cloud Providers
Custom providers participate in the same routing logic as cloud providers: the optimizer scores every enabled model with the same criteria, regardless of where it is hosted. To control the mix:
- Set quality scores accurately -- self-hosted models with lower quality scores will only receive simple queries in cost_optimized mode
- Use provider allowlists -- restrict routing to specific providers for certain query types via routing rules
- Set cost to zero -- self-hosted models with zero cost dominate in cost-optimized routing
- Pin critical queries -- use fixed:<model> routing for queries that must go to a specific model