Get started
Multi-Model Routing

A/B test LLMs on real user traffic

Route users to different models with one CEL rule, keep assignments hash-sticky per user for clean A/B math, and measure what ships in production instead of relying on offline evals.
Experiment
Config
variant_a · openai/gpt-5.4
variant_b · anthropic/claude-sonnet-4-6

50/50 · sticky by user

Result
action flip weight to 100% B

Day 7 · B wins CSAT 4.6 vs 4.1

Powered by
Router

Real traffic, real math, real winner.

Sticky per-user routing, CEL-defined splits, and per-variant metrics you can correlate to CSAT and conversion. Promote the winner by flipping a weight.
Start a test in a minute
Works with
Router

No new DSL to learn, no ticket to file.

Write one rule in the portal and the split goes live. Route by user hash, tier, or any metadata, and your app code doesn't move.
Define the split in one rule
CEL
// Route A (50%) vs Route B (50%), sticky per user
user_hash(user_id) % 2 == 0
  ? 'openai/gpt-5.4'
  : 'anthropic/claude-sonnet-4-6'
// same user, same variant, every request
Sticky per user
Works with
Router

Same user, same variant, every request.

Router hashes the user ID so variant assignment stays stable for the experiment. Real A/B math, not random sampling noise.
Sticky per user
user_a421
A · gpt-5.4
user_b918
B · claude-sonnet-4-6
user_c204
A · gpt-5.4
user_a421
A · gpt-5.4
same as before
Per-request coin flips break your A/B math. Sticky routing makes it real.
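The sticky-assignment idea can be sketched in a few lines of Python. Note that `user_hash` here is an illustrative stand-in (SHA-256 of the user ID), not Router's internal hash; Router does this server-side.

```python
import hashlib

def user_hash(user_id: str) -> int:
    """Deterministic hash of a user ID. Stable across processes,
    unlike Python's built-in hash(). Illustrative only."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16)

def assign_variant(user_id: str) -> str:
    # Mirrors the CEL rule: user_hash(user_id) % 2 == 0 -> variant A
    if user_hash(user_id) % 2 == 0:
        return "openai/gpt-5.4"
    return "anthropic/claude-sonnet-4-6"

# Same user, same variant, every request:
assert assign_variant("user_a421") == assign_variant("user_a421")
```

Because the hash is deterministic, the assignment survives restarts and scales across replicas with no shared state, which is what keeps the A/B populations clean.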
Variants side by side
Works with
Router

Per-variant latency, cost, and quality in one view.

Watch each variant's latency, cost, and error rate in one view, with 7-day trends. Decide on your workload, not a benchmark leaderboard.
Variants side by side
last 7 days
A · gpt-5.4
B · claude-sonnet-4-6
latency p50
820ms
1120ms
cost per call
$0.004
$0.012
csat (1-5)
4.1
4.6
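If you export per-request logs, the per-variant numbers in a view like this reduce to simple aggregates. A minimal sketch with hypothetical log records (the field names are illustrative, not Router's export schema):

```python
from statistics import median

# Hypothetical per-request records joined on variant.
logs = [
    {"variant": "A", "latency_ms": 790, "cost_usd": 0.004},
    {"variant": "A", "latency_ms": 820, "cost_usd": 0.004},
    {"variant": "A", "latency_ms": 900, "cost_usd": 0.005},
    {"variant": "B", "latency_ms": 1050, "cost_usd": 0.012},
    {"variant": "B", "latency_ms": 1120, "cost_usd": 0.011},
    {"variant": "B", "latency_ms": 1300, "cost_usd": 0.013},
]

def p50_latency(variant: str) -> float:
    """Median latency for one variant's requests."""
    return median(r["latency_ms"] for r in logs if r["variant"] == variant)
```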
Ship the winner Friday afternoon
Works with
Router

Promote by flipping a weight.

Move a slider to 100% and the winning model goes live. No redeploy, no rolling upgrade, no one waiting on you to merge a PR.
Swap winner without redeploying
Winner goes live without a deploy.
Flip the weight to 100% in the portal. No redeploy, no rebuild, no rolling upgrade. Your app code doesn't know which variant is in production.
No eval framework to install
Works with
Router

Real traffic beats benchmark traffic.

Offline evals score curated test sets. Router runs the experiment on your actual users with your actual prompts and business metrics.
No eval framework to install
Braintrust, Langfuse, and PromptLayer are great for offline eval.
Router runs the test on your live users. Real prompts, real sessions, real business metrics.
Graduate to production
Works with
Router

Same config, different weights.

Staging at 10%. Shadow at 50%. Production at 100%. Every stage is the same config with a different traffic weight. Rollback is a weight change.
Promote by ramping the weight
Staging
10%
10% of users
Shadow test
50%
50% sticky split
Production
100%
100% traffic
Every stage is the same config with a different weight. Rollback is a weight change.
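The ramp pattern above can be sketched as consistent-hash bucketing: each user maps to a stable bucket in [0, 100), and the weight is just a threshold. This is an illustrative model of weight-based ramping, not Router's internal implementation.

```python
import hashlib

def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def serves_new_variant(user_id: str, weight_pct: int) -> bool:
    # A user is on the new variant when their bucket falls under the weight.
    return bucket(user_id) < weight_pct

# Ramping 10% -> 50% -> 100% only adds users; nobody flips back,
# which is also why rollback is just lowering the weight again.
users = [f"user_{i}" for i in range(1000)]
at_10 = {u for u in users if serves_new_variant(u, 10)}
at_50 = {u for u in users if serves_new_variant(u, 50)}
assert at_10 <= at_50
```

The useful property: raising the weight never reassigns a user who is already on the new variant, so every stage of the ramp keeps the experiment's sticky guarantee intact.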

FAQ

Why run the test on real traffic instead of benchmarks?
Because the winning model for your workload depends on your prompts, your user distribution, and your business metrics, not a benchmark. Benchmarks tell you something; real traffic tells you the truth.
What is sticky routing, and why does it matter?
Sticky routing hashes the user ID so the same user always lands on the same variant for the duration of the experiment. Per-request coin flips break A/B math: the same user sees both variants across turns, and the signal becomes noise. Sticky routing makes the math real.
What is CEL?
CEL (Common Expression Language) is the rule language Router uses for conditional routing. Write expressions like `user_hash(user_id) % 2 == 0` or `user.tier == 'premium'`. Human-readable, no custom DSL.
What metrics can I measure?
Router tracks latency, cost, tokens, and errors natively per variant. For business metrics (CSAT, conversion, session length), pass them through the `user` field on requests and correlate in your own analytics. The winner is whatever your business cares about, not a benchmark.
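The correlation step is an ordinary group-by in your own analytics. A minimal sketch, assuming you have already joined Router's per-request variant with your CSAT events (field names here are hypothetical):

```python
from collections import defaultdict

# Hypothetical joined records: Router request logs (variant)
# merged with your analytics (csat score per session).
events = [
    {"variant": "A", "csat": 4.0}, {"variant": "A", "csat": 4.2},
    {"variant": "B", "csat": 4.5}, {"variant": "B", "csat": 4.7},
]

def mean_csat_by_variant(rows):
    """Average CSAT per variant."""
    scores = defaultdict(list)
    for r in rows:
        scores[r["variant"]].append(r["csat"])
    return {v: sum(s) / len(s) for v, s in scores.items()}
```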
How do I promote the winner?
Change the weight to 100% in the portal. No redeploy, no rebuild. Rollback is changing the weight back. Your app code never references the variant directly.
What does it cost?
Router is free during Research Preview. Zero markup on underlying model costs during the experiment. Pay the pass-through rate for whichever variant served the request.
How is this different from offline eval tools?
Offline eval (Braintrust, Langfuse, PromptLayer) is great for curated test sets, prompt regression, tool-use correctness, and structured-output schema checks. Live experimentation tells you what happens on your actual users with your actual prompts. Use both; they answer different questions.
Does it work with my existing code?
Yes. Router is OpenAI SDK compatible. The experiment lives in the portal; your code just calls the Router endpoint with a user ID field. No new client library, no gateway-specific API.
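From the app's side, the request is an ordinary chat-completions payload carrying a user ID. A sketch of what that payload might look like; the base URL is a placeholder and exact field names may differ from Router's actual API:

```python
import json

# Placeholder endpoint; not a real Router URL.
ROUTER_BASE_URL = "https://router.example.com/v1/chat/completions"

def build_request(user_id: str, prompt: str) -> str:
    """Serialize a chat request. No model field: the experiment
    config in the portal decides which variant serves it."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        # The user ID is what gets hashed for sticky assignment.
        "user": user_id,
    }
    return json.dumps(payload)
```

Because the variant choice lives entirely server-side, flipping the winning weight to 100% changes nothing in this code.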

Test on real traffic. Ship the winner.

Sticky per-user routing, CEL splits, per-variant analytics, and one-weight promotion. No eval framework required.
Copyright © 2021-2026 Inworld AI