Get started
Realtime Inference

Top open models for up to half the cost

Run the top open models at up to 50% below the public third-party rate, served with the same inference expertise behind our realtime voice models. If the model you need is not in our lineup, we optimize it for you.
Up to 50%
Below Public Rate
From $5/hr
Dedicated GPU
0%
Routing Markup
We applied the same inference expertise that makes our realtime voice models fast and cheap to the top open models. The result: faster, more reliable, and more cost-effective than the public alternatives.

Faster

The serving stack tuned for sub-second realtime voice now serves your LLM tokens.

More reliable

Production-grade serving built for always-on consumer apps, not batch jobs.

More cost-effective

Up to 50% below the public third-party rate, with no markup on top.

Up to half the cost for your LLMs

Run the top open models at up to 50% below the public third-party rate. Three of the ten highest-volume consumer apps we work with are already moving their LLM workloads onto it, including one processing more than 600 billion tokens a day (self-reported).

  • Up to 50% below the public third-party rate
  • Better latency and reliability than public APIs
  • No markup. You pay only for what you use
View pricing
The same top open models, up to 50% below the public third-party rate

Up to half the cost for your LLMs

You payYou save
Inworld realtime inference serves the top open models up to 50% below what you would pay elsewhere, with better latency and reliability.

Any model, optimized for you

If the model you need is not in our existing options, we optimize it for you. The same inference team that tunes our hosted open models brings your model onto the same serving stack, with the same latency, reliability, and cost profile.

  • Top open models served and tuned out of the box
  • Custom optimization for any model you bring
  • OpenAI-compatible API surface, one-line model swaps
Talk to our team
inference_request.json
{
"model": "your-model",
"messages": [...],
"max_tokens": 2048
}

Works through the Router

Realtime inference is served through the same Router endpoint as every third-party model. Mix our first-party hosted models with third-party providers in one config, route each request to whichever wins on cost or quality, and keep one bill.

  • Mix first-party and third-party models for maximum flexibility
  • Route per request, per user, or per task
  • One endpoint, one bill, unified analytics
Explore the Router
router_config.json
{
"routes": [
"inworld/models/gemma-4-31b-it",
"anthropic/claude-sonnet-4-6"
],
"optimize_for": "cost"
}

When fixed compute wins

Dedicated GPUs from $5 per GPU-hour, less than half a hyperscaler's on-demand rate. Run unlimited inference on capacity you've provisioned, and switch from per-token to fixed once volume justifies it.

  • Dedicated GPUs from $5 per GPU-hour
  • Unlimited throughput within provisioned capacity
  • Move to fixed compute when the economics make sense for you
Explore dedicated compute
Variable token cost vs dedicated GPUs as volume grows (Illustrative)

When fixed compute wins

Per-token (variable)Dedicated GPU (fixed, from $5/GPU-hr)
Source: Illustrative; Inworld $5/GPU-hr, GCP H100 ~$11/GPU-hr.

FAQ

Inworld serves the top open models at up to 50% below the public third-party rate, with better latency and reliability. See pricing for the current matrix.
No. The efficiency comes from owning the layer and optimizing the inference, not from cutting corners. The same team that optimized our voice models tuned the serving, so it runs up to 50% below the public third-party rate while improving latency and reliability.
The leading open models, served through the same Router endpoint, with new models added as they prove out. The current list is on the Router page.
Per-token inference is variable and scales with usage; dedicated GPUs are a fixed cost that wins once your volume crosses the break-even point. Dedicated capacity starts at $5 per GPU-hour, about half a hyperscaler's on-demand rate. The compute page covers dedicated NVIDIA GPU deployment.
No. Routed models pass through at provider rates with no markup, while a typical gateway adds about 5%. You pay only for what you use.
Yes. Pricing falls as your total spend grows, and it falls per layer. Spend on one layer lowers the others, on one combined commit. Pair realtime inference with Realtime TTS-2 and STT through the Realtime API for one bill across the stack.
Explore pricing or talk to our team about dedicated capacity and volume pricing.

Scale inference without scaling costs

Optimized open models up to 50% below the public third-party rate, or dedicated GPUs from $5 per GPU-hour. Today's prices are the ceiling, not the floor.
Copyright © 2021-2026 Inworld AI
Realtime Inference: Top Open Models, Up to 50% Below Public Rates