How to Host Open-Source LLMs (Gemma, gpt-oss, DeepSeek) for Production Traffic in 2026

By Michael Ermolenko, CTO and Co-founder, Inworld AI
Last updated: May 2026

Hosting open-source LLMs in production in 2026 means picking between three approaches: self-hosting on cloud GPUs, managed inference providers, and routing layers with first-party optimized open-source models. Inworld AI's Realtime Router lets builders pick the right model for each user, scenario, and price point across 220+ LLMs in a single API. Its 1P track, Realtime Inference, runs Inworld-optimized Gemma 4, DeepSeek V3.2/V4, and GLM-5.1/5.2 on NVIDIA B200 GPUs, built to run open-source LLMs at consumer-scale cost with realtime latency. The 3P track aggregates the same providers most teams compare against: OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks, and DeepInfra (gpt-oss-120b is routed via DeepInfra here). The full inference stack is detailed in the "How does Inworld optimize hosted open-source models?" section below.

This guide compares the three paths against the open-source LLMs that are production-ready in 2026, then shows the code to call optimized open-source models via an OpenAI-compatible endpoint.

What is the best way to host open-source LLMs in production?

Three options, ordered from the heaviest ops load to the lightest:

Self-host on cloud GPUs. Rent H100, H200, or B200 instances from AWS, GCP, Azure, Oracle, CoreWeave, Lambda, or a neocloud. Run vLLM, SGLang, or TensorRT-LLM yourself. Maximum control, maximum engineering ownership.
Managed hosting. Use providers like Fireworks, Cerebras, Together.ai, DeepInfra, Modal, or SambaNova. They run the inference stack; you call an API. Per-token or per-second billing.
Routing layer with 1P-optimized open-source models. Use a router that exposes both 3P providers and a first-party track of self-optimized open-source models. Inworld Realtime Router is the example. One API, OpenAI SDK compatible, with the 1P track tuned for cache-friendly production workloads.

The right answer is workload-dependent. The rest of this page makes the trade-offs concrete.

Why are developers moving off pure self-hosting?

Self-hosting a frontier open-source LLM in 2026 means owning a deep stack: GPU procurement, driver and CUDA management, vLLM or SGLang serving, quantization, speculative decoding, KV cache tuning, request batching, load balancing, autoscaling, observability, and on-call. Teams that ship voice products, consumer apps, or agents do not want all of that on their roadmap.

The trade-offs that push teams off DIY:

GPU supply. B200 capacity is tight in 2026. Multi-month commitments are common.
Optimization gap. Out-of-the-box vLLM throughput is often 3 to 4x below a tuned deployment. FlashInfer kernels, speculative decoding, and NVFP4 quantization are where the wins live.
Variable traffic. Consumer apps spike. Reserved capacity priced for peak burns money at trough.
Tail latency. Sub-second TTFT on cache-friendly workloads requires ongoing tuning, not a one-time deploy.

For teams that have steady, predictable load and a dedicated inference team, self-hosting still wins. For everyone else, managed or routed inference removes that baseline of operational work.
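The variable-traffic trade-off above can be made concrete with a little arithmetic. The sketch below compares the effective cost per million tokens of reserved capacity sized for the peak hour, for a spiky and a steady traffic profile with the same daily volume. Every input (GPU hourly price, per-GPU throughput, both traffic profiles) is a hypothetical placeholder, not a quote from any provider, and the function names are illustrative.

```python
def gpus_for_peak(hourly_tokens, tokens_per_gpu_hour):
    """GPUs needed to serve the busiest hour (ceiling division)."""
    peak = max(hourly_tokens)
    return -(-peak // tokens_per_gpu_hour)


def reserved_cost_per_million(hourly_tokens, tokens_per_gpu_hour, gpu_hourly_usd):
    """Effective $/1M tokens when peak-sized capacity is paid for every hour."""
    gpus = gpus_for_peak(hourly_tokens, tokens_per_gpu_hour)
    total_cost = gpus * gpu_hourly_usd * len(hourly_tokens)
    return total_cost / sum(hourly_tokens) * 1_000_000


# Hypothetical inputs, chosen only to illustrate the shape of the problem.
GPU_HOURLY_USD = 4.0
TOKENS_PER_GPU_HOUR = 20_000_000

# Same daily volume (570M tokens), different shape.
spiky = [5_000_000] * 18 + [80_000_000] * 6   # 18 quiet hours, 6 peak hours
steady = [23_750_000] * 24                    # flat load all day

spiky_cost = reserved_cost_per_million(spiky, TOKENS_PER_GPU_HOUR, GPU_HOURLY_USD)
steady_cost = reserved_cost_per_million(steady, TOKENS_PER_GPU_HOUR, GPU_HOURLY_USD)

print(f"spiky:  ${spiky_cost:.2f} per 1M tokens")
print(f"steady: ${steady_cost:.2f} per 1M tokens")
```

With these placeholder inputs the spiky profile needs twice the GPUs for the same daily token count, so it pays twice as much per token. Swap in your own traffic histogram and prices before drawing conclusions.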

How do hosted, serverless, and dedicated GPU options compare?

Which open-source LLMs are production-ready in 2026?

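The current frontier of open-weight models suitable for production:

Gemma 4 26B (A4B MoE) and Gemma 4 31B Dense. Google's models for general workloads.
gpt-oss-120b. The strongest open-weight reasoning.
DeepSeek V4-Pro (1.6T MoE, 49B active, 1M context) and DeepSeek V4-Flash (284B MoE). Cost-sensitive frontier reasoning.
MiniMax-M2.5 (~456B MoE). Agentic workloads.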

Inworld Realtime Inference (the 1P track of the Router) hosts Gemma 4, DeepSeek V3.2/V4, and GLM-5.1/5.2 directly. gpt-oss-120b and MiniMax-M2.5 are available through the 3P track via DeepInfra. Most of these models are also hosted on Fireworks, Together, DeepInfra, and SambaNova; Cerebras offers gpt-oss-120b and select Gemma variants on Wafer-Scale Engine.

How does managed hosting break down across providers?

The managed hosting space has consolidated around five active providers plus the routing layer. Each has a different bet.

The honest framing: each of these is genuinely good at its bet. Fireworks ships real speculative decoding. Cerebras silicon is fast on the models it supports. Together's ATLAS and GPU clusters are real. DeepInfra is a cost leader. Modal's developer experience is best-in-class for Python teams. The Inworld Realtime Router differentiator is the combination of 1P-optimized open-source models, 3P aggregation, and a voice stack on the same auth fabric.

How does Inworld optimize hosted open-source models?

Realtime Inference (the 1P track of Realtime Router) runs Gemma 4, DeepSeek V3.2/V4, and GLM-5.1/5.2 on a tuned inference stack:

vLLM as the serving engine, with a custom FlashInfer kernel patch (flashinfer-ai PR #2959).
Speculative decoding with model-specific draft pairings.
KV cache optimization for cache-friendly input-heavy workloads.
NVFP4 quantization on NVIDIA B200 GPUs.
Custom kernels for the high-throughput paths.

On Gemma 4 31B Dense, the FlashInfer patch yielded approximately 27K tokens per second at p50 TTFT of 1.7s, roughly 4x the throughput of the pre-patch baseline (which was approximately 6.5K tokens per second at p50 TTFT around 3s).

These numbers are workload-dependent. They apply to long-input, cache-friendly traffic that benefits from KV reuse and aggressive prefill batching. Short-input, low-context workloads see less of the optimization headroom.
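The "roughly 4x" figure follows directly from the two throughput numbers quoted above. The short calculation below reproduces it and also derives the TTFT reduction. The inputs are the approximate values from this section, so the outputs are approximate too.

```python
# Approximate figures quoted above for Gemma 4 31B Dense.
baseline_tps = 6_500      # tokens/sec before the FlashInfer patch
patched_tps = 27_000      # tokens/sec after the patch
baseline_ttft_s = 3.0     # p50 TTFT before, in seconds
patched_ttft_s = 1.7      # p50 TTFT after, in seconds

throughput_speedup = patched_tps / baseline_tps
ttft_reduction = 1 - patched_ttft_s / baseline_ttft_s

print(f"throughput speedup: {throughput_speedup:.1f}x")   # 4.2x
print(f"p50 TTFT reduction: {ttft_reduction:.0%}")        # 43%
```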

Where does first-party hosting beat aggregation?

For three workload shapes, 1P hosting on the same stack as TTS and STT outperforms 3P aggregation:

Voice agents. When the LLM call sits between STT and TTS in a single conversation, a shared auth and inference fabric removes a network hop and removes provider-mismatch tail latency. Inworld Realtime API does this with the Router as the LLM layer.
High cache-hit-rate workloads. Character chat and roleplay apps reuse system prompts and persona context across requests, making cache-hit-rate a core operational metric where shared-stack KV reuse pays off.
Long-session consumer apps. Production roleplay apps run sustained high-volume traffic on Inworld-hosted open-source models. AI Roguelite migrated Gemma 4 31B traffic and reports it outperforming DeepSeek-reasoning for their case.

For sparse traffic, generic chat, or pure aggregation across providers, 3P routing is fine on its own.
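To see why persona-heavy workloads are cache-friendly, it helps to estimate how much of each prompt repeats an earlier one. The sketch below is a rough offline estimate, not how any serving engine measures cache hits: it assumes whitespace splitting as a stand-in for real tokenization and counts the longest prefix shared with any earlier prompt, whereas production prefix caches typically work on fixed-size token blocks. The function names and the sample persona are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading items that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(prompts):
    """Fraction of prompt tokens that repeat a prefix of an earlier prompt."""
    seen = []
    hit_tokens = 0
    total_tokens = 0
    for prompt in prompts:
        tokens = prompt.split()  # crude stand-in for a real tokenizer
        best = max((common_prefix_len(tokens, s) for s in seen), default=0)
        hit_tokens += best
        total_tokens += len(tokens)
        seen.append(tokens)
    return hit_tokens / total_tokens if total_tokens else 0.0


persona = "You are Mira, a cheerful ship navigator."
prompts = [
    persona + " User: hello",
    persona + " User: where are we?",
    persona + " User: set course home",
]
print(f"estimated prefix hit rate: {prefix_hit_rate(prompts):.0%}")
```

The longer the shared system prompt and persona context relative to the per-turn user text, the closer this estimate gets to 100%, which is the regime where KV reuse pays off.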

How do I call Inworld 1P open-source models via the Realtime Router?

The Realtime Router is OpenAI Chat Completions compatible. Drop in the OpenAI SDK, change the base URL, send the API key as a Basic credential, and pass a 1P or 3P routed model ID.

import os
import requests

api_key = os.environ["INWORLD_API_KEY"]
# The Inworld API key is already a base64 Basic credential, so use it as-is.
auth = "Basic " + api_key

# Inworld Realtime Router, OpenAI Chat Completions compatible
# 3P open-source via DeepInfra; for 1P Realtime Inference use the inworld/models/ prefix (e.g. inworld/models/gemma-4-26b-a4b-it)
response = requests.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers={
        "Authorization": auth,
        "Content-Type": "application/json",
    },
    json={
        "model": "deepinfra/openai/gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize hosted vLLM in one sentence."},
        ],
    },
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors (bad key, unknown model ID) before parsing
data = response.json()
print(data["choices"][0]["message"]["content"])

Switching to DeepSeek V4-Pro is a one-line change:

json={
    "model": "deepseek/deepseek-v4-pro",
    "messages": [...],
}

Same auth, same endpoint, same request shape. The OpenAI SDK also works directly:

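import os

from openai import OpenAI

# The Inworld API key is already a base64 Basic credential, so use it as-is.
auth = "Basic " + os.environ["INWORLD_API_KEY"]

client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key=os.environ["INWORLD_API_KEY"],
    default_headers={"Authorization": auth},  # Basic auth, NOT Bearer
)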

completion = client.chat.completions.create(
    model="deepseek/deepseek-v4-pro",
    messages=[{"role": "user", "content": "Explain speculative decoding briefly."}],
)
print(completion.choices[0].message.content)
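Because only the model ID changes between calls, the request construction can be factored into a small helper. The sketch below is one way to do that. The helper name build_chat_request is hypothetical, and the block only builds the request without sending it, so it runs without a key or network access. The endpoint, headers, and model IDs are the ones used in the examples above.

```python
import os

ROUTER_URL = "https://api.inworld.ai/v1/chat/completions"


def build_chat_request(model, messages, api_key):
    """Return (url, headers, body) for a Chat Completions call to the Router."""
    headers = {
        # The key is already a base64 Basic credential, so it is used as-is.
        "Authorization": "Basic " + api_key,
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": messages}
    return ROUTER_URL, headers, body


messages = [{"role": "user", "content": "Explain speculative decoding briefly."}]
# Placeholder fallback so the sketch runs without a real key.
api_key = os.environ.get("INWORLD_API_KEY", "<your-key>")

# Switching models is just a different first argument.
for model in ("deepinfra/openai/gpt-oss-120b", "deepseek/deepseek-v4-pro"):
    url, headers, body = build_chat_request(model, messages, api_key)
    print(url, "->", body["model"])
```

To actually send a request, pass the three values to requests.post(url, headers=headers, json=body, timeout=60), as in the first example.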

Failover and live A/B testing across 1P and 3P models are configured per-route, not per-request.

What about voice in the same stack?

For voice apps the LLM is usually one third of the latency budget. Pairing optimized open-source LLM inference with Realtime TTS and Realtime STT in the same auth fabric removes integration overhead versus assembling a separate inference provider, a separate TTS vendor, and a separate STT vendor. The Realtime API wraps STT plus LLM plus TTS into one WebSocket session and consumes the Router under the hood.

This combination is the practical reason consumer voice apps run their LLM on the same stack as their voice models.

How should I decide between the three paths?

A short decision tree:

Pick self-hosting when sovereignty rules out third-party inference, when load is steady and reserved capacity makes sense, and when you already have an inference engineering team.
Pick managed hosting when you want to start in a day, when traffic is variable, when you are A/B testing across many models, or when you need cost-focused per-token billing on a broad catalog. Fireworks for fine-tune-plus-serve, Cerebras for raw token-per-second on supported models, Together for mixed serverless and clusters, DeepInfra for cost, Modal for Python-native custom stacks.
Pick Realtime Inference (the 1P track of Realtime Router) when you want optimized hosted Gemma 4, DeepSeek V3.2/V4, or GLM-5.1/5.2 in the same API that already routes across OpenAI, Anthropic, Google, Mistral, Groq, Fireworks, and DeepInfra (with gpt-oss-120b available via DeepInfra on the 3P track), especially when voice is in the pipeline. Capacity on the 1P track is allocated to production workloads; contact Inworld for current availability.
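The decision tree above can be written down as a small function, which is handy for making the criteria explicit in a design review. This is one possible reading of the three rules, not an official selection tool. The field names are hypothetical, and the ordering of the checks (sovereignty first, then steady load with an existing team, then 1P models or voice) is an assumption about priority that the text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of the criteria named in the decision tree."""
    sovereignty_required: bool = False  # third-party inference is ruled out
    steady_load: bool = False           # reserved capacity makes sense
    has_inference_team: bool = False    # inference engineering already staffed
    wants_1p_models: bool = False       # Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2
    voice_in_pipeline: bool = False     # STT and TTS sit around the LLM call


def pick_path(w):
    """Return one of the three hosting paths for a workload."""
    if w.sovereignty_required:
        return "self-host"
    if w.steady_load and w.has_inference_team:
        return "self-host"
    if w.wants_1p_models or w.voice_in_pipeline:
        return "routing layer with 1P models"
    return "managed hosting"


print(pick_path(Workload(steady_load=True, has_inference_team=True)))
print(pick_path(Workload(voice_in_pipeline=True)))
print(pick_path(Workload()))
```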

About Inworld AI

Inworld is a research lab and inference provider focused on realtime AI models for consumer-facing applications. We build first-party voice models (Realtime TTS and Realtime STT), serve optimized open-source LLMs on our own Realtime Inference engine, and expose them as modular APIs, alongside an LLM Router that routes to 220+ models and a Realtime API for full speech-to-text-to-LLM-to-speech pipelines. We focus on serving developers of realtime, high-volume conversational products across domains such as health, fitness, education, companions, social, and games, with an emphasis on quality, low latency, and low cost at scale.

FAQ

What is the best way to host open-source LLMs in production?

Three paths exist. Self-host on cloud GPUs (full control, highest engineering load). Managed hosting via providers like Fireworks, Cerebras, Together, DeepInfra, and Modal (zero ops, per-token billing, shared capacity). Inworld AI's Realtime Router with Realtime Inference, the 1P track of Inworld-optimized open-source models (Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2) built to run open-source LLMs at consumer-scale cost with realtime latency. See the dedicated inference-stack section above for full detail. Pick by ops capacity, latency budget, and whether you also need voice in the same stack.

Which open-source LLMs are production-ready in 2026?

Gemma 4 26B (A4B MoE) and Gemma 4 31B Dense from Google for general workloads. gpt-oss-120b for the strongest open-weight reasoning. DeepSeek V4-Pro (1.6T MoE, 49B active, 1M context) and DeepSeek V4-Flash (284B MoE) for cost-sensitive frontier reasoning. MiniMax-M2.5 (~456B MoE) for agentic workloads. Inworld Realtime Inference hosts Gemma 4, DeepSeek V3.2/V4, and GLM-5.1/5.2 on the 1P track; gpt-oss-120b and MiniMax-M2.5 are available on the 3P track via DeepInfra. All four families are also hosted at competing managed providers.

How does Inworld optimize hosted open-source models?

Inworld AI runs Realtime Inference (the 1P track of the Router) with hosted Gemma 4, DeepSeek V3.2/V4, and GLM-5.1/5.2 on a custom inference stack: vLLM as the serving engine, a custom FlashInfer kernel patch (flashinfer-ai PR #2959), speculative decoding, KV cache optimization, and NVFP4 quantization on NVIDIA B200 GPUs. On Gemma 4 31B Dense the FlashInfer patch yielded approximately 27K tokens/sec at p50 TTFT of 1.7s, roughly 4x the throughput of the pre-patch baseline.

When should I self-host instead of using a managed provider?

Self-host when data sovereignty rules out third-party inference, when you have steady predictable load that benefits from reserved GPU capacity, or when you need a fine-tuning loop tightly coupled to inference. Managed hosting wins for variable traffic, multi-model A/B testing, and teams without a dedicated inference engineering function.

Can I use the OpenAI SDK with Inworld Realtime Router?

Yes. The Realtime Router is OpenAI Chat Completions compatible. Change the base URL to https://api.inworld.ai/v1, use Basic auth, and pass the routed model ID (for example deepinfra/openai/gpt-oss-120b on the 3P track or deepseek/deepseek-v4-pro). The same code targets 3P providers (OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks) and Realtime Inference: Inworld-optimized 1P open-source models (Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2).

How does Inworld Realtime Router compare to Fireworks, Cerebras, Together, DeepInfra, and Modal?

Fireworks has adaptive speculation and a strong fine-tune-plus-serve workflow. Cerebras runs on custom Wafer-Scale Engine silicon and is exceptionally fast on small and mid-size models. Together ships ATLAS for runtime-learned acceleration plus full GPU clusters. DeepInfra is cost-focused with transparent per-token pricing on a broad multimodal catalog. Modal is serverless Python-native GPU with sub-second cold starts. Inworld Realtime Router routes across 220+ models in one API and adds Realtime Inference, the 1P track of Inworld-optimized open-source models. Pick by which combination of routing, ops model, and latency profile matches your workload.