By Michael Ermolenko, CTO and Co-founder, Inworld AI
Last updated: May 2026
Hosting open-source LLMs in production in 2026 means picking between three approaches: self-hosting on cloud GPUs, managed inference providers, and routing layers with first-party optimized open-source models. Inworld AI's Realtime Router lets builders pick the right model for each user, scenario, and price point across 200+ LLMs in a single API. Its 1P track, Realtime Inference, runs Inworld-optimized Gemma 4, DeepSeek V3.2/V4, and MiniMax-M2.5 on NVIDIA B200 GPUs, built to run open-source LLMs at consumer-scale cost with realtime latency. The 3P track aggregates the same providers most teams compare against: OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks, and DeepInfra (gpt-oss-120b is routed via DeepInfra here). The full inference stack is detailed in the "How does Inworld optimize hosted open-source models?" section below.
This guide compares the three paths against the open-source LLMs that are production-ready in 2026, then shows the code to call optimized open-source models via an OpenAI-compatible endpoint.
What is the best way to host open-source LLMs in production?
Three options, ranked by ops load:
- Self-host on cloud GPUs. Rent H100, H200, or B200 instances from AWS, GCP, Azure, Oracle, CoreWeave, Lambda, or a neocloud. Run vLLM, SGLang, or TensorRT-LLM yourself. Maximum control, maximum engineering ownership.
- Managed hosting. Use providers like Fireworks, Cerebras, Together.ai, DeepInfra, Modal, or SambaNova. They run the inference stack; you call an API. Per-token or per-second billing.
- Routing layer with 1P-optimized open-source models. Use a router that exposes both 3P providers and a first-party track of self-optimized open-source models. Inworld Realtime Router is the example. One API, OpenAI SDK compatible, with the 1P track tuned for cache-friendly production workloads.
The right answer is workload-dependent. The rest of this page makes the trade-offs concrete.
Why are developers moving off pure self-hosting?
Self-hosting a frontier open-source LLM in 2026 means owning a deep stack: GPU procurement, driver and CUDA management, vLLM or SGLang serving, quantization, speculative decoders, KV cache tuning, request batching, load balancing, autoscaling, observability, and on-call. Teams that ship voice products, consumer apps, or agents do not want all of that on their roadmap.
The trade-offs that push teams off DIY:
- GPU supply. B200 capacity is tight in 2026. Multi-month commitments are common.
- Optimization gap. Out-of-box vLLM throughput is often 3 to 4x lower than that of a tuned deployment. FlashInfer kernels, speculative decoding, and NVFP4 quantization are where the wins live.
- Variable traffic. Consumer apps spike. Reserved capacity priced for peak burns money at trough.
- Tail latency. Sub-second TTFT on cache-friendly workloads requires ongoing tuning, not a one-time deploy.
For teams that have steady, predictable load and a dedicated inference team, self-hosting still wins. For everyone else, managed or routed inference removes that baseline of infrastructure work.
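The variable-traffic point can be made concrete with a little arithmetic. The sketch below is illustrative only: the hourly GPU price and the peak throughput are hypothetical placeholders, not quotes from any provider, and the function name is made up for this example. It shows how the effective cost per million tokens on reserved capacity rises as average utilization falls.

```python
def reserved_cost_per_million_tokens(
    gpu_hourly_usd: float,
    peak_tokens_per_sec: float,
    utilization: float,
) -> float:
    """Effective USD per 1M tokens on a reserved GPU at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually served in one hour at this utilization level.
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: $4.00 per GPU-hour, 2,000 tokens/sec at full load.
cost_full = reserved_cost_per_million_tokens(4.0, 2000, 1.0)
cost_quarter = reserved_cost_per_million_tokens(4.0, 2000, 0.25)
print(f"100% utilization: ${cost_full:.2f} per 1M tokens")
print(f" 25% utilization: ${cost_quarter:.2f} per 1M tokens")
```

The reserved bill is fixed, so serving a quarter of the tokens makes each token four times as expensive. Comparing that figure against a provider's per-token price is one way to judge whether reserved capacity pays off for a given traffic shape.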
How do hosted, serverless, and dedicated GPU options compare?
Which open-source LLMs are production-ready in 2026?
The current frontier of open-weight models suitable for production:
Inworld Realtime Inference (the 1P track of the Router) hosts Gemma 4, DeepSeek V3.2/V4, and MiniMax-M2.5 directly. gpt-oss-120b is available through the 3P track via DeepInfra. Most of these models are also hosted on Fireworks, Together, DeepInfra, and SambaNova; Cerebras offers gpt-oss-120b and select Gemma variants on Wafer-Scale Engine.
How does managed hosting break down across providers?
The managed hosting space has consolidated around five active providers plus the routing layer. Each has a different bet.
The honest framing: each of these is genuinely good at its bet. Fireworks ships real speculative decoding. Cerebras silicon is fast on the models it supports. Together's ATLAS and GPU clusters are real. DeepInfra is a cost leader. Modal's developer experience is best-in-class for Python teams. The Inworld Realtime Router differentiator is the combination of 1P-optimized open-source models, 3P aggregation, and a voice stack on the same auth fabric.
How does Inworld optimize hosted open-source models?
Realtime Inference (the 1P track of Realtime Router) runs Gemma 4, DeepSeek V3.2/V4, and MiniMax-M2.5 on a tuned inference stack:
- vLLM as the serving engine, with a custom FlashInfer kernel patch (flashinfer-ai PR #2959).
- Speculative decoding with model-specific draft pairings.
- KV cache optimization for cache-friendly input-heavy workloads.
- NVFP4 quantization on NVIDIA B200 GPUs.
- Custom kernels for the high-throughput paths.
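To give a feel for the speculative decoding item in the list above, here is a toy greedy variant. It is a sketch of the general technique, not Inworld's implementation: `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model, and the "models" are simple integer functions so the example runs anywhere.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One round: the draft proposes k tokens, the target verifies them greedily."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2) The target checks each position. A real engine scores all k positions
    #    in one batched forward pass; this toy calls the target per position.
    ctx = list(prefix)
    accepted = []
    for token in proposed:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3) Every proposal matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def speculative_decode(prefix, draft_next, target_next, n, k=4):
    out: List[int] = []
    while len(out) < n:
        out.extend(speculative_step(list(prefix) + out, draft_next, target_next, k))
    return out[:n]


def greedy_decode(prefix, next_token, n):
    ctx = list(prefix)
    for _ in range(n):
        ctx.append(next_token(ctx))
    return ctx[len(prefix):]


# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def target_next(ctx): return (ctx[-1] + 1) % 10
def draft_next(ctx): return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


spec = speculative_decode([1], draft_next, target_next, n=12)
plain = greedy_decode([1], target_next, n=12)
print(spec == plain)
```

With greedy verification the output matches what the target model would have produced on its own. The gain comes from accepting several draft tokens per target pass when the draft is usually right.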
On Gemma 4 31B Dense, the FlashInfer patch yielded approximately 27K tokens per second at p50 TTFT of 1.7s, roughly 4x the throughput of the pre-patch baseline (which was approximately 6.5K tokens per second at p50 TTFT around 3s). MiniMax-M2.5 reaches approximately 22K tokens per second on B200 in the same stack.
These numbers are workload-dependent. They apply to long-input, cache-friendly traffic that benefits from KV reuse and aggressive prefill batching. Short-input, low-context workloads see less of the optimization headroom.
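The "roughly 4x" figure follows directly from the approximate numbers quoted above. A quick check, using only those rounded values (so the results are approximate too):

```python
# Approximate figures quoted above for Gemma 4 31B Dense.
baseline_tps = 6_500      # tokens/sec before the FlashInfer patch
patched_tps = 27_000      # tokens/sec after the patch
baseline_ttft_s = 3.0     # p50 TTFT before ("around 3s")
patched_ttft_s = 1.7      # p50 TTFT after

speedup = patched_tps / baseline_tps
ttft_reduction = 1 - patched_ttft_s / baseline_ttft_s

print(f"throughput speedup: {speedup:.2f}x")
print(f"p50 TTFT reduction: {ttft_reduction:.0%}")
```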
Where does first-party hosting beat aggregation?
For three workload shapes, 1P hosting on the same stack as TTS and STT outperforms 3P aggregation:
- Voice agents. When the LLM call sits between STT and TTS in a single conversation, a shared auth and inference fabric removes a network hop and the tail latency that comes from mismatched providers. Inworld Realtime API does this with the Router as the LLM layer.
- High cache-hit-rate workloads. Character chat and roleplay apps reuse system prompts and persona context across requests. Janitor AI runs approximately 600B tokens per day on Inworld-hosted fine-tuned Gemma with cache-hit-rate as a core operational metric.
- Long-session consumer apps. Latitude runs DeepSeek V3.2 on a dedicated Inworld cluster for AI Game Master sessions. Yonder runs the same V3.2 cluster for its workload. AI Roguelite migrated Gemma 4 31B traffic and reports it outperforming DeepSeek-reasoning for their case.
For sparse traffic, generic chat, or pure aggregation across providers, 3P routing is fine on its own.
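Cache-hit-rate, mentioned above as an operational metric, can be approximated offline. The sketch below assumes a token-level definition (reused prompt tokens divided by total prompt tokens) and uses whitespace-split words as stand-in tokens. Both are simplifying assumptions, and the function name is hypothetical. Production engines typically cache at block granularity and evict under memory pressure, so real numbers will differ.

```python
from typing import List, Sequence


def prefix_cache_hit_rate(requests: Sequence[List[str]]) -> float:
    """Fraction of prompt tokens covered by a prefix seen in an earlier request."""
    cached = set()  # every token prefix seen so far, stored as a tuple
    hit = total = 0
    for tokens in requests:
        # Find the longest prefix of this prompt that is already cached.
        reuse = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in cached:
                reuse = n
                break
        hit += reuse
        total += len(tokens)
        # Cache all prefixes of this prompt for later requests (no eviction).
        for n in range(1, len(tokens) + 1):
            cached.add(tuple(tokens[:n]))
    return hit / total if total else 0.0


# A shared persona prompt followed by different user turns.
persona = "you are a terse ship computer named mira".split()
traffic = [
    persona + ["hello"],
    persona + ["status", "report"],
    persona + ["hello", "again"],
]
rate = prefix_cache_hit_rate(traffic)
print(f"prefix cache hit rate: {rate:.1%}")
```

The longer the shared system prompt is relative to each user turn, the closer this ratio gets to 1, which is why persona-heavy chat workloads benefit most from KV reuse.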
How do I call Inworld 1P open-source models via the Realtime Router?
The Realtime Router is OpenAI Chat Completions compatible. Drop in the OpenAI SDK, change the base URL, and pass a 1P or 3P routed model ID.
import base64
import os

import requests

api_key = os.environ["INWORLD_API_KEY"]
auth = "Basic " + base64.b64encode(api_key.encode()).decode()

# Inworld Realtime Router, OpenAI Chat Completions compatible
# 3P open-source via DeepInfra; for 1P Realtime Inference use inworld/ prefix (e.g. inworld/gemma-4-26b)
response = requests.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers={
        "Authorization": auth,
        "Content-Type": "application/json",
    },
    json={
        "model": "deepinfra/openai/gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize hosted vLLM in one sentence."},
        ],
    },
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors before parsing the body
data = response.json()
print(data["choices"][0]["message"]["content"])
Switching to DeepSeek V4-Pro is a one-line change:
    json={
        "model": "deepseek/deepseek-v4-pro",
        "messages": [...],
    }
Same auth, same endpoint, same request shape. The OpenAI SDK also works directly:
import base64
import os

from openai import OpenAI

# Same Basic auth header as in the requests example
auth = "Basic " + base64.b64encode(os.environ["INWORLD_API_KEY"].encode()).decode()

client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key=os.environ["INWORLD_API_KEY"],
    default_headers={"Authorization": auth},  # Basic auth, NOT Bearer
)
completion = client.chat.completions.create(
    model="deepseek/deepseek-v4-pro",
    messages=[{"role": "user", "content": "Explain speculative decoding briefly."}],
)
print(completion.choices[0].message.content)
Failover and live A/B testing across 1P and 3P models are configured per-route, not per-request.
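One way to keep the model switch to a single argument in application code is a small request builder. This is a sketch, not part of any Inworld SDK: `build_chat_request` is a hypothetical helper name, and the endpoint, auth scheme, and model IDs are the ones used in the examples above.

```python
import base64

ROUTER_URL = "https://api.inworld.ai/v1/chat/completions"


def build_chat_request(api_key: str, model: str, messages: list) -> dict:
    """Return keyword arguments for requests.post() targeting the Router."""
    auth = "Basic " + base64.b64encode(api_key.encode()).decode()
    return {
        "url": ROUTER_URL,
        "headers": {"Authorization": auth, "Content-Type": "application/json"},
        "json": {"model": model, "messages": messages},
    }


messages = [{"role": "user", "content": "Explain speculative decoding briefly."}]

# Only the model ID differs between the 3P and DeepSeek examples.
req_gpt_oss = build_chat_request("demo-key", "deepinfra/openai/gpt-oss-120b", messages)
req_deepseek = build_chat_request("demo-key", "deepseek/deepseek-v4-pro", messages)
print(req_gpt_oss["json"]["model"], "->", req_deepseek["json"]["model"])
```

Sending it is then `requests.post(**req_deepseek, timeout=60)`, with the real key read from the `INWORLD_API_KEY` environment variable instead of the placeholder.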
What about voice in the same stack?
For voice apps the LLM is usually one third of the latency budget. Pairing optimized open-source LLM inference with Realtime TTS and Realtime STT in the same auth fabric removes integration overhead versus assembling a separate inference provider, a separate TTS vendor, and a separate STT vendor. The Realtime API wraps STT plus LLM plus TTS into one WebSocket session and consumes the Router under the hood.
This combination is the practical reason consumer voice apps run their LLM on the same stack as their voice models.
How should I decide between the three paths?
A short decision tree:
- Pick self-hosting when sovereignty rules out third-party inference, when load is steady and reserved capacity makes sense, and when you already have an inference engineering team.
- Pick managed hosting when you want to start in a day, when traffic is variable, when you are A/B testing across many models, or when you need cost-focused per-token billing on a broad catalog. Fireworks for fine-tune-plus-serve, Cerebras for raw token-per-second on supported models, Together for mixed serverless and clusters, DeepInfra for cost, Modal for Python-native custom stacks.
- Pick Realtime Inference (the 1P track of Realtime Router) when you want optimized hosted Gemma 4, DeepSeek V3.2/V4, or MiniMax-M2.5 in the same API that already routes across OpenAI, Anthropic, Google, Mistral, Groq, Fireworks, and DeepInfra (with gpt-oss-120b available via DeepInfra on the 3P track), especially when voice is in the pipeline. Capacity on the 1P track is allocated to production workloads. Contact Inworld for current availability.
FAQ
What is the best way to host open-source LLMs in production?
Three paths exist. Self-host on cloud GPUs (full control, highest engineering load). Use managed hosting via providers like Fireworks, Cerebras, Together, DeepInfra, and Modal (zero ops, per-token billing, shared capacity). Or use Inworld AI's Realtime Router with Realtime Inference, the 1P track of Inworld-optimized open-source models (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5), built to run open-source LLMs at consumer-scale cost with realtime latency. See the "How does Inworld optimize hosted open-source models?" section above for full detail. Pick by ops capacity, latency budget, and whether you also need voice in the same stack.
Which open-source LLMs are production-ready in 2026?
Gemma 4 26B (A4B MoE) and Gemma 4 31B Dense from Google for general workloads. gpt-oss-120b for the strongest open-weight reasoning. DeepSeek V4-Pro (1.6T MoE, 49B active, 1M context) and DeepSeek V4-Flash (284B MoE) for cost-sensitive frontier reasoning. MiniMax-M2.5 (~456B MoE) for agentic workloads. Inworld Realtime Inference hosts Gemma 4, DeepSeek V3.2/V4, and MiniMax-M2.5 on the 1P track; gpt-oss-120b is available on the 3P track via DeepInfra. All four families are also hosted at competing managed providers.
How does Inworld optimize hosted open-source models?
Inworld AI runs Realtime Inference (the 1P track of the Router) with hosted Gemma 4, DeepSeek V3.2/V4, and MiniMax-M2.5 on a custom inference stack: vLLM as the serving engine, a custom FlashInfer kernel patch (flashinfer-ai PR #2959), speculative decoding, KV cache optimization, and NVFP4 quantization on NVIDIA B200 GPUs. On Gemma 4 31B Dense the FlashInfer patch yielded approximately 27K tokens/sec at p50 TTFT of 1.7s, roughly 4x the throughput of the pre-patch baseline.
When should I self-host instead of using a managed provider?
Self-host when data sovereignty rules out third-party inference, when you have steady predictable load that benefits from reserved GPU capacity, or when you need a fine-tuning loop tightly coupled to inference. Managed hosting wins for variable traffic, multi-model A/B testing, and teams without a dedicated inference engineering function.
Can I use the OpenAI SDK with Inworld Realtime Router?
Yes. The Realtime Router is OpenAI Chat Completions compatible. Change the base URL to https://api.inworld.ai/v1, use Basic auth, and pass the routed model ID (for example deepinfra/openai/gpt-oss-120b on the 3P track or deepseek/deepseek-v4-pro). The same code targets 3P providers (OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks, DeepInfra) and Realtime Inference: Inworld-optimized 1P open-source models (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5).
How does Inworld Realtime Router compare to Fireworks, Cerebras, Together, DeepInfra, and Modal?
Fireworks has adaptive speculation and a strong fine-tune-plus-serve workflow. Cerebras runs on custom Wafer-Scale Engine silicon and is exceptionally fast on small and mid-size models. Together ships ATLAS for runtime-learned acceleration plus full GPU clusters. DeepInfra is cost-focused with transparent per-token pricing on a broad multimodal catalog. Modal is serverless Python-native GPU with sub-second cold starts. Inworld Realtime Router routes across 200+ models in one API and adds Realtime Inference, the 1P track of Inworld-optimized open-source models. Pick by which combination of routing, ops model, and latency profile matches your workload.