Get started
Multi-Model Routing

A/B test LLMs on real user traffic

Route users to different models with one CEL rule, keep assignments hash-sticky per user for clean A/B math, and measure what ships in production instead of relying on offline evals.
Experiment
Config
variant_a · openai/gpt-5.4
variant_b · anthropic/claude-sonnet-4-6

50/50 · sticky by user

Result
action flip weight to 100% B

Day 7 · B wins CSAT 4.6 vs 4.1

Powered by
Router

Real traffic, real math, real winner.

Sticky per-user routing, CEL-defined splits, and per-variant metrics you can correlate to CSAT and conversion. Promote the winner by flipping a weight.
Start a test in a minute
Works with
Router

No new DSL to learn, no ticket to file.

Write one rule in the portal and the split goes live. Route by user hash, tier, or any metadata, and your app code doesn't move.
Define the split in one rule
CEL
// Route A (50%) vs Route B (50%), sticky per user
user_hash(user_id) % 2 == 0
  ? 'openai/gpt-5.4'
  : 'anthropic/claude-sonnet-4-6'
// same user, same variant, every request
Sticky per user
Works with
Router

Same user, same variant, every request.

Router hashes the user ID so variant assignment stays stable for the experiment. Real A/B math, not random sampling noise.
Sticky per user
user_a421
A · gpt-5.4
user_b918
B · claude-sonnet-4-6
user_c204
A · gpt-5.4
user_a421
A · gpt-5.4
same as before
Per-request coin flips break your A/B math. Sticky routing makes it real.
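The sticky-assignment idea can be sketched in a few lines of Python. Note that `user_hash` here is an illustrative stand-in (SHA-256 of the user ID), not Router's internal hash; Router does this server-side.

```python
import hashlib

def user_hash(user_id: str) -> int:
    """Deterministic hash of a user ID. Stable across processes,
    unlike Python's built-in hash(). Illustrative only."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16)

def assign_variant(user_id: str) -> str:
    # Mirrors the CEL rule: user_hash(user_id) % 2 == 0 -> variant A
    if user_hash(user_id) % 2 == 0:
        return "openai/gpt-5.4"
    return "anthropic/claude-sonnet-4-6"

# Same user, same variant, every request:
assert assign_variant("user_a421") == assign_variant("user_a421")
```

Because the hash is deterministic, the assignment survives restarts and scales across replicas with no shared state, which is what keeps the A/B populations clean.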
Variants side by side
Works with
Router

Per-variant latency, cost, and quality in one view.

Watch each variant's latency, cost, and error rate in one view, with 7-day trends. Decide on your workload, not a benchmark leaderboard.
Variants side by side
last 7 days
A · gpt-5.4
B · claude-sonnet-4-6
latency p50
820ms
1120ms
cost per call
$0.004
$0.012
csat (1-5)
4.1
4.6
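If you export per-request logs, the per-variant numbers in a view like this reduce to simple aggregates. A minimal sketch with hypothetical log records (the field names are illustrative, not Router's export schema):

```python
from statistics import median

# Hypothetical per-request records joined on variant.
logs = [
    {"variant": "A", "latency_ms": 790, "cost_usd": 0.004},
    {"variant": "A", "latency_ms": 820, "cost_usd": 0.004},
    {"variant": "A", "latency_ms": 900, "cost_usd": 0.005},
    {"variant": "B", "latency_ms": 1050, "cost_usd": 0.012},
    {"variant": "B", "latency_ms": 1120, "cost_usd": 0.011},
    {"variant": "B", "latency_ms": 1300, "cost_usd": 0.013},
]

def p50_latency(variant: str) -> float:
    """Median latency for one variant's requests."""
    return median(r["latency_ms"] for r in logs if r["variant"] == variant)
```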
Ship the winner Friday afternoon
Works with
Router

Promote by flipping a weight.

Move a slider to 100% and the winning model goes live. No redeploy, no rolling upgrade, no one waiting on you to merge a PR.
Swap winner without redeploying
Winner goes live without a deploy.
Flip the weight to 100% in the portal. No redeploy, no rebuild, no rolling upgrade. Your app code doesn't know which variant is in production.
No eval framework to install
Works with
Router

Real traffic beats benchmark traffic.

Offline evals score curated test sets. Router runs the experiment on your actual users with your actual prompts and business metrics.
No eval framework to install
Braintrust, Langfuse, and PromptLayer are great for offline eval.
Router runs the test on your live users. Real prompts, real sessions, real business metrics.
Graduate to production
Works with
Router

Same config, different weights.

Staging at 10%. Shadow at 50%. Production at 100%. Every stage is the same config with a different traffic weight. Rollback is a weight change.
Promote by ramping the weight
Staging
10%
10% of users
Shadow test
50%
50% sticky split
Production
100%
100% traffic
Every stage is the same config with a different weight. Rollback is a weight change.
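The ramp pattern above can be sketched as consistent-hash bucketing: each user maps to a stable bucket in [0, 100), and the weight is just a threshold. This is an illustrative model of weight-based ramping, not Router's internal implementation.

```python
import hashlib

def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def serves_new_variant(user_id: str, weight_pct: int) -> bool:
    # A user is on the new variant when their bucket falls under the weight.
    return bucket(user_id) < weight_pct

# Ramping 10% -> 50% -> 100% only adds users; nobody flips back,
# which is also why rollback is just lowering the weight again.
users = [f"user_{i}" for i in range(1000)]
at_10 = {u for u in users if serves_new_variant(u, 10)}
at_50 = {u for u in users if serves_new_variant(u, 50)}
assert at_10 <= at_50
```

The useful property: raising the weight never reassigns a user who is already on the new variant, so every stage of the ramp keeps the experiment's sticky guarantee intact.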

FAQ

Why run the test on real traffic instead of benchmarks?
Because the winning model for your workload depends on your prompts, your user distribution, and your business metrics, not a benchmark. Benchmarks tell you something; real traffic tells you the truth.
What is sticky routing, and why does it matter?
Sticky routing hashes the user ID so the same user always lands on the same variant for the duration of the experiment. Per-request coin flips break A/B math: the same user sees both variants across turns, and the signal becomes noise. Sticky routing makes the math real.
What is CEL?
CEL (Common Expression Language) is the rule language Router uses for conditional routing. Write expressions like `user_hash(user_id) % 2 == 0` or `user.tier == 'premium'`. Human-readable, no custom DSL.
What metrics can I measure?
Router tracks latency, cost, tokens, and errors natively per variant. For business metrics (CSAT, conversion, session length), pass them through the `user` field on requests and correlate in your own analytics. The winner is whatever your business cares about, not a benchmark.
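The correlation step is an ordinary group-by in your own analytics. A minimal sketch, assuming you have already joined Router's per-request variant with your CSAT events (field names here are hypothetical):

```python
from collections import defaultdict

# Hypothetical joined records: Router request logs (variant)
# merged with your analytics (csat score per session).
events = [
    {"variant": "A", "csat": 4.0}, {"variant": "A", "csat": 4.2},
    {"variant": "B", "csat": 4.5}, {"variant": "B", "csat": 4.7},
]

def mean_csat_by_variant(rows):
    """Average CSAT per variant."""
    scores = defaultdict(list)
    for r in rows:
        scores[r["variant"]].append(r["csat"])
    return {v: sum(s) / len(s) for v, s in scores.items()}
```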
How do I promote the winner?
Change the weight to 100% in the portal. No redeploy, no rebuild. Rollback is changing the weight back. Your app code never references the variant directly.
What does it cost?
Router is free during Research Preview. Zero markup on underlying model costs during the experiment. Pay the pass-through rate for whichever variant served the request.
How is this different from offline eval tools?
Offline eval (Braintrust, Langfuse, PromptLayer) is great for curated test sets, prompt regression, tool-use correctness, and structured-output schema checks. Live experimentation tells you what happens on your actual users with your actual prompts. Use both; they answer different questions.
Does it work with my existing code?
Yes. Router is OpenAI SDK compatible. The experiment lives in the portal; your code just calls the Router endpoint with a user ID field. No new client library, no gateway-specific API.
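From the app's side, the request is an ordinary chat-completions payload carrying a user ID. A sketch of what that payload might look like; the base URL is a placeholder and exact field names may differ from Router's actual API:

```python
import json

# Placeholder endpoint; not a real Router URL.
ROUTER_BASE_URL = "https://router.example.com/v1/chat/completions"

def build_request(user_id: str, prompt: str) -> str:
    """Serialize a chat request. No model field: the experiment
    config in the portal decides which variant serves it."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        # The user ID is what gets hashed for sticky assignment.
        "user": user_id,
    }
    return json.dumps(payload)
```

Because the variant choice lives entirely server-side, flipping the winning weight to 100% changes nothing in this code.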

Test on real traffic. Ship the winner.

Sticky per-user routing, CEL splits, per-variant analytics, and one-weight promotion. No eval framework required.
Copyright © 2021-2026 Inworld AI