Inworld AI is a research lab focused on realtime voice AI. Its Realtime Router lets builders pick the right model for each user, scenario, and price point across 200+ LLMs in two tracks. The third-party track routes to OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks, and DeepInfra. The first-party track (Realtime Inference) runs Inworld-optimized open-source weights like Gemma 4 and DeepSeek, and is built to serve open-source LLMs at consumer-scale cost with realtime latency.
There is no single "fastest LLM inference API" for every workload in 2026. Cerebras leads on raw tokens per second for small and mid-sized open-source models on its Wafer-Scale Engine. Groq leads on consistent low-latency Llama and gpt-oss inference via custom LPU silicon. SambaNova publishes 435 tokens per second on MiniMax M2.7. Fireworks ships adaptive speculative decoding for frontier open-source weights. This guide compares each provider on the three numbers that actually matter (TTFT, inter-token latency, throughput) and explains which workloads each one wins.
What is the fastest LLM inference API for realtime apps?
The honest answer depends on the model and the workload. The developer-facing summary that follows reflects the landscape as of May 2026.
Inworld does not claim to be the absolute fastest provider on every metric. Cerebras, Groq, and SambaNova all publish higher peak tokens-per-second on the right model. The Inworld edge is narrower and more honest: first-party hosting of optimized open-source weights, paired with metadata-driven routing, a co-located voice pipeline, and pricing that rewards cache-friendly repeated prompts. For consumer voice apps where the LLM sits between speech-to-text and text-to-speech, that combination matters more than peak tokens-per-second on a synthetic benchmark.
How is LLM TTFT measured?
There are three latency axes any serious inference comparison must report.
- TTFT (time to first token). Wall-clock from request sent to first streamed output token. Dominated by network round-trip, queue, prefill compute, and the first decode step. Voice agents live or die on TTFT because the TTS stage cannot start until the first LLM token arrives.
- ITL (inter-token latency, also TPOT). The average gap between consecutive output tokens after the first. ITL determines whether long responses feel fluid or laggy.
- Throughput. Tokens per second per request, or aggregate tokens per second across a tenant. For batch generation, throughput is what you optimize. For interactive UX, throughput matters only after TTFT is acceptable.
A provider with 2,000 tok/s throughput but 4 seconds of TTFT will feel slow in chat. A provider with 200 ms TTFT but 30 tok/s throughput will feel snappy at first and stall on long answers. Measure both, then map to your product's conversational budget.
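The three axes above can be computed from nothing more than the time the request was sent and the arrival time of each streamed token. The following is a minimal sketch, assuming you record a monotonic timestamp (for example `time.perf_counter()`) per token; the names `LatencyReport` and `summarize_stream` are illustrative and not part of any provider SDK. Note that a streamed chunk can carry more than one token, so per-chunk timestamps only approximate per-token gaps.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyReport:
    ttft_s: float        # request sent -> first token
    mean_itl_s: float    # average gap between consecutive tokens
    tokens_per_s: float  # per-request decode rate after the first token


def summarize_stream(request_sent: float, token_times: Sequence[float]) -> LatencyReport:
    """Compute TTFT, mean ITL, and per-request throughput from timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    decode_window = token_times[-1] - token_times[0]
    tokens_per_s = len(gaps) / decode_window if decode_window > 0 else 0.0
    return LatencyReport(ttft, mean_itl, tokens_per_s)


# Hypothetical trace: first token at 200 ms, then one token every 25 ms.
trace = [0.2 + 0.025 * i for i in range(41)]
report = summarize_stream(0.0, trace)
print(report)
```

Reporting all three fields per request, and then percentiles across requests, avoids the trap described above of quoting throughput without TTFT or the reverse.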
For a voice agent with a one-second turn budget, the math is unforgiving: STT endpoint detection (~150 ms) + LLM TTFT (variable) + TTS TTFT (~150 ms) + audio playback start. If LLM TTFT exceeds ~600 ms, the user perceives the agent as slow regardless of model quality.
Which providers optimize for sub-second TTFT?
Sub-second TTFT on open-source LLMs is achievable in 2026 but requires deliberate stack work. The leaders fall into three families.
Custom silicon. Cerebras Wafer-Scale Engine, Groq LPU, and SambaNova RDU bypass GPU memory bandwidth limits with purpose-built dataflow architectures. Cerebras publishes 2,000 tok/s on Llama Scout; SambaNova publishes 435 tok/s on MiniMax M2.7; Groq's case studies report 7x chat speedups over baseline. These wins are real, and they are the headline-grabbing numbers in 2026 LLM inference.
Speculative decoding on GPU stacks. Fireworks ships adaptive speculation, where a smaller draft model proposes tokens and the target model verifies them in parallel. Together.ai's ATLAS is a runtime-learning accelerator with comparable effects. Inworld Realtime Router applies speculative decoding to Gemma 4 dense and DeepSeek on its first-party track. Speculative decoding shines when draft-target agreement is high (code, structured output, stable consumer dialogue).
First-party optimized hosting. Inworld Realtime Router's first-party track runs vLLM with a custom FlashInfer patch (flashinfer-ai PR #2959), speculative decoding, NVFP4 quantization, and tuned KV cache reuse. On Gemma 4 31B Dense on NVIDIA B200, that stack measured roughly 27,000 aggregate tokens per second with p50 TTFT around 1.7 seconds, an approximately 4x throughput improvement over the pre-patch baseline. Workloads with high cache hit rates (companion apps, character chat, recurring prompts) see additional wins because repeated prefill work is reused.
The right choice depends on your model and workload, not the headline number.
Why does optimized open-source inference beat frontier API latency?
Frontier closed-source APIs (OpenAI, Anthropic, Google) are optimized for capability and broad availability. They are not co-designed for any one customer's workload. Open-source inference providers can specialize.
Three optimizations matter most:
- Quantization. FP8 cuts memory bandwidth and prefill compute roughly in half versus FP16, and NVFP4 (4-bit values with per-block scale factors) shrinks weight traffic further, with negligible quality loss on Gemma 4 and DeepSeek classes.
- KV cache reuse. Consumer workloads with stable system prompts (companion apps, character chat) get 60-90% prefill cache hit rates, slashing the most expensive part of TTFT.
- Speculative decoding. Draft models predict several tokens; the target model verifies in parallel. When draft-target agreement is high (predictable persona output, structured JSON, code completion), realized speedups range from 1.5x to 4x.
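To see why draft-target agreement controls the payoff, here is a small sketch of the standard expected-speedup estimate from the speculative decoding literature. It assumes each drafted token is accepted independently with probability `alpha`, that the draft proposes `gamma` tokens per pass, and that one draft step costs `draft_cost` relative to one target step. These are modeling assumptions, not a description of how any provider named here tunes its stack.

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    alpha: probability the target accepts a drafted token (assumed i.i.d.).
    gamma: number of tokens the draft model proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    draft_cost: time of one draft step relative to one target step.
    """
    return expected_tokens_per_target_pass(alpha, gamma) / (gamma * draft_cost + 1.0)


high_agreement = expected_speedup(alpha=0.8, gamma=4, draft_cost=0.05)
low_agreement = expected_speedup(alpha=0.1, gamma=4, draft_cost=0.05)
print(f"{high_agreement:.2f}x vs {low_agreement:.2f}x")
```

Under these assumptions, 80% acceptance gives roughly a 2.8x speedup, while 10% acceptance drops below 1x because the draft work is mostly wasted.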
Inworld's first-party track combines all three on a vLLM + FlashInfer base. For Janitor, a character-chat app processing 600 billion tokens per day, cache hit rate is treated as a primary engineering metric because it dominates marginal cost and TTFT. Latitude (the team behind AI Dungeon) ran a three-way A/B between OpenAI, a third-party DeepSeek provider, and Inworld-hosted DeepSeek V3.2; the Inworld-hosted variant beat the OpenAI variant by one quality point at lower per-token cost. For AI Roguelite, Gemma 4 31B on the Inworld 1P track outperformed a DeepSeek reasoning model for their specific workload.
When should I pick speculative-decoding providers?
Speculative decoding is not free. It adds draft-model compute and only pays back when draft-target agreement is high. Pick speculative-decoding providers (Fireworks, Together ATLAS, Inworld 1P) when your workload has any of these properties:
- Predictable distributions (consumer companion dialogue with stable persona)
- Structured output (JSON, function calls, code completion)
- Long stable system prompts that benefit from cache reuse
- Sustained throughput requirements that mask draft overhead
Pick raw-silicon providers (Cerebras, Groq, SambaNova) when your prompts are short, your workload is bursty, and you want the headline tokens-per-second number for short-context inference on supported open-source models.
Pick general-purpose hosters (DeepInfra, Together serverless) when cost dominates and your latency budget is loose. Pick Inworld Realtime Router when you also need TTS, STT, and Realtime API on the same auth and the same network, or when you want to route between first-party and third-party tracks based on metadata (language, country, user tier, intent).
How does Inworld Realtime Router fit into this landscape?
Inworld Realtime Router routes to 200+ LLMs through a single OpenAI-compatible endpoint. It has two tracks.
The third-party track passes through to external providers: OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks, DeepInfra, and others. Pricing follows the underlying provider; the value is unified auth, automatic failover, A/B testing on live traffic, and metadata routing.
The first-party track is Realtime Inference: Inworld-optimized open-source weights on Inworld-hosted GPUs. Current 1P models include Gemma 4 26B (A4B MoE, NVFP4), Gemma 4 31B Dense (NVFP4), DeepSeek V3.2 and V4 series, MiniMax M2.5 (~456B MoE, NVFP4), and custom enterprise fine-tunes. (gpt-oss-120b is available on the 3P track via deepinfra/openai/gpt-oss-120b, not 1P-hosted.) The stack is vLLM + FlashInfer + speculative decoding + NVFP4 + tuned KV cache.
Metadata-driven routing is the differentiator. Pass language=ja, country=DE, user_tier=growth, intent=companion, or emotion=sad as request metadata, and Realtime Router selects the right model and the right track for the request. Workloads that pair the LLM with Realtime TTS-2 or Realtime STT benefit further because all three sit on the same network with shared auth.
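As a conceptual illustration of metadata-driven routing, the sketch below picks a model from an ordered rule table. It is a toy, local stand-in: the rules, the default, and the function name `pick_model` are hypothetical, and it does not represent how Realtime Router is configured or implemented. Only the model strings and metadata keys are taken from this guide.

```python
from typing import Mapping

# Hypothetical rules, checked in order: the first rule whose conditions all
# match the request metadata wins. Model strings are the ones used in this guide.
RULES: list[tuple[dict[str, str], str]] = [
    ({"intent": "companion"}, "inworld/gemma-4-31b"),
    ({"user_tier": "growth", "language": "ja"}, "anthropic/claude-sonnet-4-6"),
]
DEFAULT_MODEL = "deepinfra/openai/gpt-oss-120b"  # hypothetical default


def pick_model(metadata: Mapping[str, str]) -> str:
    """Return the model for the first rule fully matched by the metadata."""
    for conditions, model in RULES:
        if all(metadata.get(key) == value for key, value in conditions.items()):
            return model
    return DEFAULT_MODEL


print(pick_model({"intent": "companion", "emotion": "sad"}))
print(pick_model({"country": "DE"}))
```

Ordered first-match rules keep the behavior easy to audit: extra metadata keys never change the outcome unless a rule names them.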
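import os
from openai import OpenAI

# The OpenAI SDK sends "Authorization: Bearer <api_key>" by default. This guide
# specifies Basic auth, so the header is overridden here (assumes
# INWORLD_API_KEY already holds the base64-encoded key).
client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key=os.environ["INWORLD_API_KEY"],
    default_headers={"Authorization": f"Basic {os.environ['INWORLD_API_KEY']}"},
)

# 1P track: Inworld-optimized Gemma 4 on B200
response = client.chat.completions.create(
    model="inworld/gemma-4-31b",  # routes to optimized 1P weights
    messages=[
        {"role": "system", "content": "You are a warm, patient companion."},
        {"role": "user", "content": "I had a rough day. Can we just talk?"},
    ],
    stream=True,
    extra_body={
        "user": "user_abc_42",  # sticky routing for cache reuse
    },
)

# Print tokens as they stream in; skip chunks that carry no choices.
for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)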
Authorization is Basic with a base64-encoded API key. Streaming is OpenAI-standard server-sent events. The user field enables sticky routing so repeated requests from the same user hit the same KV cache shard, which is how companion apps achieve their cache hit rates.
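The idea behind sticky routing can be sketched as a deterministic mapping from a user id to a cache shard. This is only an illustration of the principle: the shard count, the hash choice, and the function name `shard_for_user` are assumptions, and the source does not describe the scheme Inworld actually uses.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard deterministically, stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


first = shard_for_user("user_abc_42")
second = shard_for_user("user_abc_42")
print(first == second)
```

A cryptographic digest is used instead of Python's built-in `hash()` because string hashing is randomized per process by default, which would send the same user to different shards after a restart. Plain modulo reshuffles most users when the shard count changes; consistent hashing is the usual remedy if that matters.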
What about competitors who claim "fastest" by category?
Every serious provider has a fair claim to "fastest" on some axis. Here is an honest framing of each.
- Fireworks ships adaptive speculative decoding and has the strongest open-source contribution record in this category. Their Notion case study (2s to 350ms) and Sentient deployment (sub-2s with 50% throughput-per-GPU gains) are real. Inworld differs by combining the first-party LLM track with first-party voice models on the same network, but Fireworks is the closest first-party competitor.
- Cerebras Wafer-Scale Engine wins on raw tokens-per-second for models that fit on-chip. On Llama Scout and Qwen3 Instruct, Cerebras is faster than any GPU stack, period. They also ship on-prem and a generous developer tier.
- Groq LPU consistency is genuinely strong on Llama and the open-source gpt-oss family. Per-token pricing is linear and predictable.
- SambaNova publishes 435 tok/s on MiniMax M2.7 and over 600 tok/s on gpt-oss-120b. For frontier-size open-source weights, those are class-leading throughput numbers.
- Together.ai ATLAS is a runtime-learning accelerator with up to 4x speedup claims, plus FlashAttention-4 and a full-stack platform (serverless, dedicated, fine-tuning, clusters). Their April 2026 Deepgram partnership adds native STT and TTS.
- DeepInfra leads on cost transparency and shows a live TTFT counter on its homepage. DeepCluster pricing on B300 is notably low.
- Modal publishes sub-second cold starts and sub-10ms infra overhead. For bring-your-own-model serverless, Modal is the developer-experience leader.
- Inception Labs Mercury 2 is a diffusion LLM (dLLM) that generates tokens in parallel rather than sequentially. For code completion and latency-sensitive structured output, Mercury 2 is genuinely novel.
Inworld's competitive frame is not "fastest period." It is "fastest for the workloads we target": consumer voice agents, character chat, companion apps, and any product that pairs an LLM with realtime TTS and STT on the same network. For pure throughput benchmarks on synthetic prompts, point a developer at Cerebras or SambaNova. For sub-second TTFT on Inworld-optimized Gemma 4 alongside the Realtime API voice stack, point them at the Inworld Realtime Router.
How do real customers measure inference speed in production?
Three production data points show how the picture changes outside of synthetic benchmarks.
Janitor (character chat, 600B tokens/day). Cache hit rate is treated as a primary metric. Inworld's first-party track delivers approximately 27,000 aggregate tokens per second on Gemma 4 31B Dense with p50 TTFT around 1.7 seconds on B200, roughly a 4x improvement over the pre-patch baseline. Cache reuse from stable system prompts drives the realized cost-per-session number far below the headline per-token rate.
Latitude (AI Dungeon). Ran a three-way A/B between OpenAI, a third-party DeepSeek host, and Inworld-hosted DeepSeek V3.2. The Inworld variant beat the OpenAI variant by one quality point in their evaluation. Latitude is now the heaviest single tenant on the Inworld first-party DeepSeek cluster.
AI Roguelite. Migrated Gemma 4 31B traffic to the Inworld first-party track; the model outperformed a DeepSeek reasoning model for their specific roleplay workload.
For consumer apps, the relevant number is not synthetic peak tokens-per-second. It is realized cost per active session, where TTFT, ITL, throughput, and cache hit rate all compose. That is the axis Inworld Realtime Router is built for.
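To make the composition concrete, the sketch below shows how cache hit rate changes the input-token cost of a session. It assumes cached prompt tokens are billed at a lower rate than uncached ones; the prices in the example are placeholders chosen for easy arithmetic and are not any provider's actual pricing.

```python
def session_input_cost(
    prompt_tokens: int,
    cache_hit_rate: float,
    price_per_mtok: float,
    cached_price_per_mtok: float,
) -> float:
    """Input-token cost when a fraction of the prompt is served from cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    cached = prompt_tokens * cache_hit_rate
    uncached = prompt_tokens - cached
    return (uncached * price_per_mtok + cached * cached_price_per_mtok) / 1_000_000


# Placeholder prices: 1.00 per million uncached tokens, 0.10 per million cached.
cold = session_input_cost(1_000_000, 0.0, 1.00, 0.10)
warm = session_input_cost(1_000_000, 0.8, 1.00, 0.10)
print(cold, warm)
```

With these placeholder numbers, an 80% hit rate brings the input cost from 1.00 down to about 0.28, which is why the headline per-token rate alone says little about realized session cost.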
Migration: switching providers without rewriting your app
Most inference providers covered here expose an OpenAI-compatible chat completions endpoint, so switching mostly comes down to changing the client configuration (base URL and API key) and the model string.
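import os
from openai import OpenAI

# Before: OpenAI direct
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# After: Inworld Realtime Router (1P optimized Gemma + 3P passthrough)
# The SDK defaults to Bearer auth; this guide specifies Basic, so the header is
# overridden (assumes INWORLD_API_KEY already holds the base64-encoded key).
client = OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key=os.environ["INWORLD_API_KEY"],
    default_headers={"Authorization": f"Basic {os.environ['INWORLD_API_KEY']}"},
)

# 3P passthrough, same call, different host
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
)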
For staged rollouts, set extra_body={"models": ["inworld/gemma-4-31b", "openai/gpt-5.5"]} to define a fallback pool. If the primary model fails, Realtime Router transparently fails over to the next entry without raising an exception. For A/B testing, run two model strings on a percentage split via metadata routing and compare quality on real traffic instead of relying on offline evals.
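The failover semantics described above happen inside the router, but they can be illustrated with a client-side equivalent. The sketch below tries each model in order and returns the first successful reply. `complete_with_fallback` and `fake_call` are hypothetical names, and the simulated outage stands in for a real network failure.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    models: Sequence[str],
    call_model: Callable[[str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    errors: list[Exception] = []
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(models)} models failed: {errors}")


def fake_call(model: str) -> str:
    """Stand-in for a network call: the primary model is simulated as down."""
    if model == "inworld/gemma-4-31b":
        raise TimeoutError("simulated outage")
    return f"reply from {model}"


used, reply = complete_with_fallback(
    ["inworld/gemma-4-31b", "openai/gpt-5.5"], fake_call
)
print(used, reply)
```

Returning the model that actually answered matters for evaluation: when a fallback fires, quality and latency metrics should be attributed to the model that served the request, not the one that was asked for.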
Authorization is Basic (not Bearer) for server-side calls. Browser-based realtime sessions use a JWT minted from the server. Field names follow OpenAI conventions: model, messages, temperature, stream, max_tokens. There is no Inworld-specific request shape to learn.
How does inference latency interact with the rest of the voice stack?
LLM TTFT is one component of a full voice agent's response time. The other components are STT endpoint detection, TTS TTFT, and audio playback start. A voice agent with a one-second turn budget has roughly this allocation:
- ~150 ms STT endpoint detection
- variable LLM TTFT (target sub-600 ms)
- ~150 ms TTS TTFT
- ~50 ms playback start
If LLM TTFT consumes more than 600 ms, the user perceives the agent as slow regardless of model quality. This is why Inworld Realtime Router and Realtime API are designed to share the same network and the same authentication: the LLM, TTS, and STT calls do not pay multi-region round-trip costs to assemble a single conversational turn.
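The budget arithmetic can be written down directly. This sketch uses the approximate stage costs listed above and a one-second turn budget; the function names are illustrative, and real stage latencies vary by provider, region, and configuration.

```python
TURN_BUDGET_MS = 1000

# Approximate per-stage costs from the allocation above.
FIXED_STAGES_MS = {
    "stt_endpoint_detection": 150,
    "tts_ttft": 150,
    "playback_start": 50,
}


def llm_ttft_headroom_ms(turn_budget_ms: int = TURN_BUDGET_MS) -> int:
    """Milliseconds left for LLM TTFT after the fixed stages are paid."""
    return turn_budget_ms - sum(FIXED_STAGES_MS.values())


def fits_turn_budget(llm_ttft_ms: float, turn_budget_ms: int = TURN_BUDGET_MS) -> bool:
    """True if the given LLM TTFT leaves the whole turn within budget."""
    return llm_ttft_ms <= llm_ttft_headroom_ms(turn_budget_ms)


print(llm_ttft_headroom_ms())
print(fits_turn_budget(600), fits_turn_budget(900))
```

The fixed stages consume 350 ms, leaving 650 ms for the LLM. Targeting sub-600 ms therefore keeps about 50 ms of slack for network jitter.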
Honest constraint: in at least one documented customer pipeline (Microvoz, May 2026), Inworld Realtime API latency measured higher than ElevenLabs in their specific configuration. The Inworld stack is not the absolute lowest-latency voice pipeline on every workload. The advantage is consistency on consumer workloads where the LLM dominates the budget, not headline speed on TTS-only synthetic tests.
FAQ
What is the fastest LLM inference API for realtime apps in 2026?
There is no single fastest provider for every workload. Cerebras leads on raw tokens per second for small and mid-sized open-source models running on its Wafer-Scale Engine. Groq leads on consistent low-latency Llama and gpt-oss inference via its LPU silicon. SambaNova claims 435 tokens per second on MiniMax M2.7. Fireworks leads on adaptive speculative decoding for frontier open-source models. Inworld Realtime Router targets a different axis: sub-second TTFT on Inworld-optimized open-source models like Gemma 4 and DeepSeek, co-located with the Realtime API voice stack and routed through one OpenAI-compatible endpoint.
How is LLM TTFT measured?
Time to first token (TTFT) is the wall-clock time from when a request is sent to when the first output token streams back. It is dominated by network round-trip, queue time, prefill compute (encoding the prompt into the KV cache), and the first decode step. TTFT matters most for streaming UX, voice agents, and interactive coding. Inter-token latency (ITL, also called TPOT) measures the time between consecutive output tokens after the first. Throughput, measured in tokens per second per request or aggregate tokens per second across a tenant, captures sustained generation rate.
Why does open-source model inference often beat frontier API latency?
Frontier APIs are optimized for capability and broad availability, not raw latency. Open-source inference providers (Inworld 1P, Fireworks, Cerebras, Groq, SambaNova) can co-design serving stacks for specific weights. They apply NVFP4 or FP8 quantization, custom CUDA kernels, speculative decoding, optimized KV cache, and dedicated GPU capacity. The result is sub-second TTFT and higher throughput for Gemma 4, DeepSeek, gpt-oss, and similar weights, especially on input-heavy or cache-friendly workloads where a frontier closed-model API would need a multi-second prefill.
When should I pick speculative-decoding providers?
Speculative decoding (a smaller draft model predicts several tokens at once, verified by the target model) helps most when the draft and target agree often, which is common for code, structured output, and repetitive consumer dialogue. Fireworks ships adaptive speculation. Together.ai ships ATLAS (a runtime-learning accelerator with similar effects). Inworld Realtime Router applies speculative decoding to Gemma 4 dense and DeepSeek on its 1P track. Pick a speculative-decoding provider when your workload has predictable distributions or stable persona output; pick raw-silicon providers (Cerebras, Groq, SambaNova) when your prompts are short and you want headline tokens-per-second.
What is the difference between throughput and TTFT?
TTFT is how fast the first token arrives. Throughput is how fast tokens stream after that. A provider with great throughput but slow TTFT feels sluggish in chat UX because users wait for the first character. A provider with great TTFT but low throughput feels fast initially but stalls on long responses. Realtime voice agents need both: sub-second TTFT so the LLM does not block speech synthesis, plus enough throughput that the full reply finishes before the conversational turn budget expires. Inworld Realtime Router publishes both numbers because either alone is misleading.
Can I use the OpenAI SDK with a non-OpenAI inference provider?
Yes. Inworld Realtime Router, Fireworks, Together, Groq, Cerebras, and DeepInfra all expose OpenAI-compatible chat completions endpoints. Migration is one line: change the base URL and the model name. With Inworld Realtime Router you call api.inworld.ai/v1, pass any of 200+ supported model strings (1P Realtime Inference: Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5; or 3P from OpenAI, Anthropic, Google, Mistral, Groq, Fireworks, DeepInfra including deepinfra/openai/gpt-oss-120b), and your existing OpenAI client works unchanged.
Published by Inworld AI. Provider claims sourced from publicly available documentation, customer case studies, and pricing pages as of May 2026. Benchmark numbers (FlashInfer PR #2959, Gemma 4 31B Dense on B200) reflect Inworld first-party measurements and may vary by workload. Inworld develops Realtime TTS, Realtime STT, Realtime Router, and Realtime API.