Realtime Inference

Top open models for up to half the cost

Run the top open models at up to 50% below the public third-party rate, served with the same inference expertise behind our realtime voice models. If the model you need is not in our lineup, we optimize it for you.

Get Started View Docs

Up to 50%

Below Public Rate

From $5/hr

Dedicated GPU

Routing Markup

We applied the same inference expertise that makes our realtime voice models fast and cheap to the top open models. The result: faster, more reliable, and more cost-effective than the public alternatives.

Faster

The serving stack tuned for sub-second realtime voice now serves your LLM tokens.

More reliable

Production-grade serving built for always-on consumer apps, not batch jobs.

More cost-effective

Up to 50% below the public third-party rate, with no markup on top.

Up to half the cost for your LLMs

Run the top open models at up to 50% below the public third-party rate. Three of the ten highest-volume consumer apps we work with are already moving their LLM workloads onto it, including one processing more than 600 billion tokens a day (self-reported).

Up to 50% below the public third-party rate
Better latency and reliability than public APIs
No markup. You pay only for what you use

View pricing

The same top open models, up to 50% below the public third-party rate

Up to half the cost for your LLMs

You payYou save

Inworld realtime inference serves the top open models up to 50% below what you would pay elsewhere, with better latency and reliability.

Up to half the cost for your LLMs

Up to 50% below the public third-party rate
Better latency and reliability than public APIs
No markup. You pay only for what you use

View pricing

The same top open models, up to 50% below the public third-party rate

Up to half the cost for your LLMs

You payYou save

Inworld realtime inference serves the top open models up to 50% below what you would pay elsewhere, with better latency and reliability.

Any model, optimized for you

If the model you need is not in our existing options, we optimize it for you. The same inference team that tunes our hosted open models brings your model onto the same serving stack, with the same latency, reliability, and cost profile.

Top open models served and tuned out of the box
Custom optimization for any model you bring
OpenAI-compatible API surface, one-line model swaps

Talk to our team

inference_request.json

{
"model": "your-model",
"messages": [...],
"max_tokens": 2048
}

inference_request.json

{
"model": "your-model",
"messages": [...],
"max_tokens": 2048
}

Any model, optimized for you

Top open models served and tuned out of the box
Custom optimization for any model you bring
OpenAI-compatible API surface, one-line model swaps

Talk to our team

Works through the Router

Realtime inference is served through the same Router endpoint as every third-party model. Mix our first-party hosted models with third-party providers in one config, route each request to whichever wins on cost or quality, and keep one bill.

Mix first-party and third-party models for maximum flexibility
Route per request, per user, or per task
One endpoint, one bill, unified analytics

Explore the Router

router_config.json

{
"routes": [
"inworld/models/gemma-4-31b-it",
"anthropic/claude-sonnet-4-6"
],
"optimize_for": "cost"
}

Works through the Router

Mix first-party and third-party models for maximum flexibility
Route per request, per user, or per task
One endpoint, one bill, unified analytics

Explore the Router

router_config.json

{
"routes": [
"inworld/models/gemma-4-31b-it",
"anthropic/claude-sonnet-4-6"
],
"optimize_for": "cost"
}

When fixed compute wins

Dedicated GPUs from $5 per GPU-hour, less than half a hyperscaler's on-demand rate. Run unlimited inference on capacity you've provisioned, and switch from per-token to fixed once volume justifies it.

Dedicated GPUs from $5 per GPU-hour
Unlimited throughput within provisioned capacity
Move to fixed compute when the economics make sense for you

Explore dedicated compute

Variable token cost vs dedicated GPUs as volume grows (Illustrative)

When fixed compute wins

Per-token (variable)Dedicated GPU (fixed, from $5/GPU-hr)

Source: Illustrative; Inworld $5/GPU-hr, GCP H100 ~$11/GPU-hr.

Variable token cost vs dedicated GPUs as volume grows (Illustrative)

When fixed compute wins

Per-token (variable)Dedicated GPU (fixed, from $5/GPU-hr)

Source: Illustrative; Inworld $5/GPU-hr, GCP H100 ~$11/GPU-hr.

When fixed compute wins

Dedicated GPUs from $5 per GPU-hour
Unlimited throughput within provisioned capacity
Move to fixed compute when the economics make sense for you

Explore dedicated compute

FAQ

Inworld serves the top open models at up to 50% below the public third-party rate, with better latency and reliability. See pricing for the current matrix.

No. The efficiency comes from owning the layer and optimizing the inference, not from cutting corners. The same team that optimized our voice models tuned the serving, so it runs up to 50% below the public third-party rate while improving latency and reliability.

The leading open models, served through the same Router endpoint, with new models added as they prove out. The current list is on the Router page.

Per-token inference is variable and scales with usage; dedicated GPUs are a fixed cost that wins once your volume crosses the break-even point. Dedicated capacity starts at $5 per GPU-hour, about half a hyperscaler's on-demand rate. The compute page covers dedicated NVIDIA GPU deployment.

No. Routed models pass through at provider rates with no markup, while a typical gateway adds about 5%. You pay only for what you use.

Yes. Pricing falls as your total spend grows, and it falls per layer. Spend on one layer lowers the others, on one combined commit. Pair realtime inference with Realtime TTS-2 and STT through the Realtime API for one bill across the stack.

Explore pricing or talk to our team about dedicated capacity and volume pricing.

Scale inference without scaling costs

Optimized open models up to 50% below the public third-party rate, or dedicated GPUs from $5 per GPU-hour. Today's prices are the ceiling, not the floor.

Explore Pricing Contact Sales

Products

Developers

Socials