Published 05.28.2026

Voice AI at Consumer Scale: Lessons From Apps With Millions of Users

Inworld AI is the voice AI stack behind consumer apps running at millions of users. Voice AI at consumer scale means something specific: voice as a default feature in 90-minute sessions, hundreds of billions of tokens per day on the LLM layer, and unit economics that hold up when most users never pay. This guide walks through the hard numbers from Status by Wishroll, Janitor, Bible Chat, Latitude, Slingshot, and Tolans, the cost-discipline framework that makes the math work, and the scaling-curve pattern that takes a consumer app from launch to its first million users without melting cost. Six Inworld products carry the load: Realtime TTS (TTS-2 research preview + 1.5 Max GA), Realtime STT, Realtime API, Realtime Inference (1P-optimized Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5), Realtime Router, and Compute.

What Does Voice AI at Consumer Scale Actually Look Like?

Six anchor customers, six different proof shapes. The numbers below are exact as recorded in production.
Different verticals, one repeating pattern. Every app on the list ships voice as a default feature rather than a paywall feature, runs the high-volume conversational load on either Inworld TTS or an Inworld-optimized open-source LLM (or both), and treats cost per active user as a first-class operating metric.

Why Do Consumer Voice AI Apps Need a Different Stack?

Enterprise AI clouds (Azure AI Foundry, AWS Bedrock, Google Vertex AI) optimize for compliance, model breadth in a single contract, and integration with existing enterprise software. The success metrics are SOC 2, HIPAA, regional residency, and frontier model coverage.
Consumer apps optimize for very different things. Retention. Session length. Voice latency under live conversation. Voice quality good enough that users want to spend an hour with it. Per-user cost when most users never pay. Cache hit rate on inference. Failover behavior when an upstream model degrades during peak.
Almost all of the consumer AI apps that retain users six months later and pull recurring revenue run on realtime voice. That category needs a stack built for it, not a stack borrowed from enterprise procurement.

What Is the Cost-Discipline Framework Behind These Numbers?

Three levers actually move the cost line at consumer scale. None of them is a silver bullet in isolation. It is all three together that produce the 95% AI cost reductions and 85% TTS cost reductions seen on real production traffic.
Lever 1: First-party Inworld-optimized open-source LLMs on the Router. Realtime Inference is built to run open-source LLMs at consumer-scale cost with realtime latency. Janitor runs a fine-tuned Gemma 4 31B fleet at 600 billion tokens per day on dedicated B200 GPU capacity. Same OpenAI-compatible call, very different cost profile on the cache-friendly, input-heavy workloads consumer character chat actually produces.
Lever 2: Cache hit rate as a first-class operating metric. Janitor instruments cache hit rate the same way most teams instrument latency. When the cache hit rate goes up, cost per active user goes down on the same traffic. Consumer voice apps repeat enough conversational context across sessions that this is the highest-leverage line on the dashboard once you are past a few million tokens per day.
Lever 3: Routing on metadata, not on a single pinned model. Free user gets a cheaper model. Subscriber gets a more capable one. Greetings and acknowledgments route to a small model. Hard turns route to a bigger one. Sticky routing on a user ID keeps a single user pinned to the same backend for the full session so persona and cache stay warm.
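The routing rules in Lever 3 can be sketched as a small selection function. This is a minimal illustration, not the Router's own logic: `TurnMetadata`, `pick_model`, `build_request`, the intent labels, and the small-model identifier are all hypothetical names introduced here. The two larger model IDs are the ones used in the Router example elsewhere in this guide.

```python
from dataclasses import dataclass

# Hypothetical identifier for a small, cheap model; substitute a real model ID.
SMALL_MODEL = "example/small-chat-model"
# Model IDs taken from the Router example in this guide.
DEFAULT_OSS_MODEL = "deepseek/deepseek-v4-pro"
FRONTIER_MODEL = "openai/gpt-5.5"

# Assumed intent labels; a real app would produce these with its own classifier.
EASY_INTENTS = {"greeting", "acknowledgment"}


@dataclass(frozen=True)
class TurnMetadata:
    user_id: str
    tier: str    # "free" or "subscriber"
    intent: str  # e.g. "greeting", "acknowledgment", "hard"


def pick_model(meta: TurnMetadata) -> str:
    """Choose a model from user tier and turn intent."""
    if meta.intent in EASY_INTENTS:
        return SMALL_MODEL            # cheap turns go to the small model
    if meta.tier == "subscriber" and meta.intent == "hard":
        return FRONTIER_MODEL         # paid hard turns get the quality lift
    return DEFAULT_OSS_MODEL          # everything else stays on the OSS default


def build_request(meta: TurnMetadata, messages: list) -> dict:
    """Build Chat Completions arguments; `user` carries the sticky-routing key."""
    return {"model": pick_model(meta), "messages": messages, "user": meta.user_id}
```

Passing the same `user` value on every turn is what keeps a session on one backend, so the selection function should be stable for a given user within a session.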

How Does the Free-to-Paid Scaling Curve Actually Work?

A consumer voice app does not have one cost curve. It has two layered curves: a free-tier curve where the goal is to make voice viable as a default feature, and a paid-tier curve where unit economics are healthier and the model choice can be more aggressive on quality.
The shape that consistently works:
  1. Launch on Inworld-optimized open-source LLMs through the Router for the free tier. Default to first-party Gemma 4 or DeepSeek V4 Pro for the conversational core. This is the line that makes voice viable for users who never pay.
  2. Default to Realtime TTS 1.5 Max or Mini for voice on the free tier. Top-ranked realtime quality is the retention lever; TTS-2 (research preview) is the upgrade path once GA timing lands.
  3. Route paid-tier traffic on metadata for quality lifts. Subscribers get a frontier closed model on the hard turns. Same user ID stays sticky across the session so cache and persona hold.
  4. Wire fallbacks at the model layer, not the vendor layer. Wishroll runs as a partner, not a captive: the model-agnostic Router routes to Gemini, OpenAI, and Anthropic on outages so the product never goes dark.
  5. Instrument cache hit rate, cost per active user, and time to first audio. Three numbers tell the truth about whether the stack is paying back.
The combination of (1) and (2) is the lever that takes a consumer app from launch to a million users without melting cost. Status by Wishroll did this in 19 days. Bible Chat did the same shape, slower, with the voice layer scaling 10x while cost dropped 85% on the same migration.
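The three numbers from step 5 can be tracked with a small in-process accumulator. This is a sketch under stated assumptions: `StackMetrics` and its method names are hypothetical, cache hit rate is defined here as cached input tokens over total input tokens, and a production system would export these to a metrics backend rather than hold them in memory.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class StackMetrics:
    cached_input_tokens: int = 0
    total_input_tokens: int = 0
    spend_usd: float = 0.0
    active_users: set = field(default_factory=set)
    first_audio_ms: list = field(default_factory=list)

    def record_turn(self, user_id, input_tokens, cached_tokens, cost_usd, first_audio_ms):
        """Record one conversational turn."""
        self.total_input_tokens += input_tokens
        self.cached_input_tokens += cached_tokens
        self.spend_usd += cost_usd
        self.active_users.add(user_id)
        self.first_audio_ms.append(first_audio_ms)

    def cache_hit_rate(self) -> float:
        """Share of input tokens served from cache (assumed definition)."""
        if self.total_input_tokens == 0:
            return 0.0
        return self.cached_input_tokens / self.total_input_tokens

    def cost_per_active_user(self) -> float:
        """Total spend divided by distinct users seen in this window."""
        if not self.active_users:
            return 0.0
        return self.spend_usd / len(self.active_users)

    def median_time_to_first_audio_ms(self) -> float:
        """Median latency from request to first audio chunk."""
        return median(self.first_audio_ms) if self.first_audio_ms else 0.0
```

Resetting the accumulator per day or per hour gives the windowed view that makes a change in cache hit rate visible against cost per active user on the same traffic.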

How Did Status by Wishroll Reach 1 Million Users in 19 Days?

Status by Wishroll is the cleanest case study of voice as a default feature at viral consumer scale: 1 million users in 19 days, 95% AI cost reduction, 90+ minute median sessions. Voice is not a gated feature on Status. It is the product.
The stack underneath is Realtime TTS for voice output, the Realtime Router for the LLM layer (running model-agnostically across Inworld-optimized open-source models, Gemini, OpenAI, and Anthropic with live fallback routing), and the Realtime API for the live conversation loop. The 95% cost reduction came from migrating the high-volume conversational load to first-party Inworld-optimized open-source LLMs on the Router while keeping frontier closed models as fallback for the calls that actually justify them.
This is what model-agnostic actually buys at consumer scale: optimize cost in steady state on Inworld-optimized OSS, fail over to frontier closed models in the rare moments something breaks, and never block on a single vendor.
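The fail-over behavior described above can be illustrated with a client-side sketch. The Router performs this server side, and its actual implementation is not shown in this guide; `complete_with_fallback`, `ModelUnavailable`, and the `call_model` callable are hypothetical names used only to show the ordering logic and the kind of attempt trace that is useful to log.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by a backend that is degraded or down (hypothetical)."""


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
):
    """Try each model in order and return (text, attempts) from the first success."""
    attempts = []
    for model in models:
        try:
            text = call_model(model, prompt)
        except ModelUnavailable as exc:
            # Record the failure and move on to the next model in the pool.
            attempts.append({"model": model, "ok": False, "error": str(exc)})
            continue
        attempts.append({"model": model, "ok": True})
        return text, attempts
    raise RuntimeError(f"all models failed: {attempts}")
```

Putting the cost-optimized open-source model first and the frontier closed models after it reproduces the steady-state and outage behavior described above.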

How Does Janitor Serve 600 Billion Tokens a Day?

Janitor is the volume case. Its 600 billion tokens per day route through a fine-tuned Gemma 4 31B fleet on Realtime Inference (the 1P track of the Router), which is built to run open-source LLMs at consumer-scale cost with realtime latency on dedicated B200 GPU capacity.
Two operating disciplines make that volume work:
Cache hit rate as a primary metric. Janitor instruments cache hit rate the way most teams instrument latency. Repeated context across sessions is the largest single cost lever once you are past a few hundred billion tokens per day, and cache-friendly workloads are exactly what consumer character chat produces at scale.
Custom fine-tunes on first-party hosting. Janitor runs a custom fine-tune of Gemma 4 31B on dedicated Inworld GPU capacity, not a vendor's hosted variant. Fine-tuning and serving on the same fabric means the model can be tuned to the cache structure that produces the operating metrics that matter.
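Why cache hit rate moves cost can be shown with a simple blended-price model. This is a sketch, not Inworld's pricing: it assumes cached input tokens are billed at a lower rate than uncached ones, and the function name and any prices passed to it are hypothetical placeholders.

```python
def blended_input_cost_per_million(
    hit_rate: float,
    uncached_price: float,
    cached_price: float,
) -> float:
    """Effective input price per million tokens at a given cache hit rate.

    hit_rate is the fraction of input tokens served from cache (0.0 to 1.0).
    Prices are per million input tokens in any one currency.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price
```

With placeholder prices of 1.00 uncached and 0.10 cached, moving the hit rate from 0.5 to 0.9 takes the blended price from about 0.55 to about 0.19 on the same traffic, which is why the metric earns a place next to latency on the dashboard.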

What Did Bible Chat Learn Scaling TTS 10x?

Bible Chat scaled Realtime TTS from 2 million characters per week to 20 million characters per week, and cut TTS cost 85% on the same migration. This is the cleanest TTS-side case study because the scaling factor and the cost reduction are on the same traffic.
The shape of the migration: move from a generic third-party TTS to Inworld Realtime TTS for the high-volume conversational load, hold voice quality on a top-ranked realtime model, and let the cost line drop on the same workload. Voice quality went up. Cost per character went down. Volume went up 10x at the same time.
This is what top-ranked realtime TTS actually delivers economically at consumer scale: voice can be a default feature instead of a paywall feature because the quality is good enough that users stay and the unit cost stays low enough that you can afford to keep them in voice.
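The relationship between the 10x volume figure and the 85% cost figure is simple arithmetic, sketched below. The text does not spell out whether the 85% applies to total TTS spend or to cost per character, so the sketch covers both readings. The function names are hypothetical and no dollar amounts are implied.

```python
def per_char_cost_ratio(volume_multiplier: float, total_cost_reduction: float) -> float:
    """New per-character cost as a fraction of the old one,
    if TOTAL spend fell by total_cost_reduction while volume grew."""
    return (1.0 - total_cost_reduction) / volume_multiplier


def total_cost_ratio(volume_multiplier: float, per_char_cost_reduction: float) -> float:
    """New total spend as a fraction of the old one,
    if PER-CHARACTER cost fell by per_char_cost_reduction while volume grew."""
    return volume_multiplier * (1.0 - per_char_cost_reduction)


# Reading 1: total spend down 85% at 10x volume.
reading_1 = per_char_cost_ratio(10, 0.85)   # per-character cost is about 1.5% of before
# Reading 2: per-character cost down 85% at 10x volume.
reading_2 = total_cost_ratio(10, 0.85)      # total spend is about 1.5x of before
```

Under either reading, cost per character falls, which is the point the case study makes.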

How Does Latitude Run the Heaviest Realtime Workload?

Latitude (AI Game Master) is the realtime case. Roleplay sessions are long, carry high token volume, and are quality-sensitive in a way that gets immediate user feedback. Latitude ran a 3-way A/B test of frontier and open-source models on live roleplay traffic. The first-party Inworld-optimized DeepSeek V3.2 cluster beat OpenAI by a point on user-rated quality.
That outcome matters because it turns a routing decision that would otherwise be a quality compromise into a quality win. Latitude can run first-party Inworld-optimized DeepSeek as the default model rather than the budget option, at a cost structure that lets the model stay default rather than gated behind a paid tier.
```python
# OpenAI-compatible Router call with first-party Inworld-optimized DeepSeek
from openai import OpenAI

client = OpenAI(
    api_key="<your-api-key>",
    base_url="https://api.inworld.ai/v1",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are the game master for a survival roleplay."},
        {"role": "user", "content": "I climb the watchtower and look out."},
    ],
    user="user_8a92c7",  # sticky routing keeps the session pinned
    extra_body={
        "models": ["openai/gpt-5.5", "anthropic/claude-sonnet-4-6"],
        "sort": ["latency", "intelligence"],
    },
)

print(response.choices[0].message.content)
print(response.metadata["attempts"])  # routing trace per call
```
The extra_body.models field is a fallback pool. If the primary degrades during a live session, the Router retries the next model in order without the app code knowing. The Router's metadata.attempts field returns the routing trace so you can monitor what actually served in production.

Why Did Slingshot Migrate 100% of Voice to Realtime TTS-2?

Slingshot is the conviction case. Slingshot is an AI therapy app where voice quality is not a nice-to-have. Slingshot migrated 100% of voice traffic to Realtime TTS-2 during the research preview, before TTS-2 was GA, because the steering and voice identity advantages were strong enough on their use case to justify operating in preview.
TTS-2 brings four things that matter for a high-stakes voice product: 8-dimension natural-language steering (emotion, articulation, intonation, volume, pitch, range, speed, vocal style), non-verbal tags ([laugh], [sigh], [breathe]), the deliveryMode field (STABLE, BALANCED, CREATIVE), and cross-lingual voice identity that preserves the same voice across language switches.
TTS-2 is research preview. Realtime TTS 1.5 Max is the GA default. Slingshot chose to operate in preview because the upside on their specific quality bar was worth the SLA tradeoff.
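A request builder for the TTS-2 controls named above might look like the following sketch. Only `deliveryMode`, its three values, the bracketed non-verbal tags, and the `inworld-tts-2` model name come from this guide. The `text`, `voiceId`, and `modelId` field names are assumptions, as is treating the three listed tags as the complete set, so check both against the API reference before relying on them.

```python
import re

DELIVERY_MODES = {"STABLE", "BALANCED", "CREATIVE"}
# Assumption: the three tags named in this guide are treated as the full set.
NON_VERBAL_TAGS = {"laugh", "sigh", "breathe"}
_TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def unknown_tags(text: str) -> list:
    """Return bracketed tags in the text that are not in the known set."""
    return sorted({t for t in _TAG_PATTERN.findall(text) if t not in NON_VERBAL_TAGS})


def build_tts2_payload(text: str, voice_id: str, delivery_mode: str = "BALANCED") -> dict:
    """Validate inputs and build a request body (field names partly assumed)."""
    if delivery_mode not in DELIVERY_MODES:
        raise ValueError(f"deliveryMode must be one of {sorted(DELIVERY_MODES)}")
    bad = unknown_tags(text)
    if bad:
        raise ValueError(f"unrecognized non-verbal tags: {bad}")
    return {
        "text": text,                   # assumed field name
        "voiceId": voice_id,            # assumed field name
        "modelId": "inworld-tts-2",     # model name from this guide; field name assumed
        "deliveryMode": delivery_mode,  # field name from this guide
    }
```

Validating tags before the request is a cheap guard for a high-stakes product, since a mistyped tag might otherwise be read aloud as literal text.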

How Does Inworld Compare to ElevenLabs and Cartesia at Consumer Scale?

ElevenLabs ships Eleven v3 TTS, Scribe v2 STT, ElevenAgents (with Expressive Mode), Flows, Music v2, Dubbing v2, and a Government tier. The voice library is the largest in the industry (10,000+ community voices), they ship constantly, and Eleven Flash is a real low-latency option for some workloads. Strong fit for apps where voice variety matters more than per-user cost discipline at scale.
Cartesia ships Sonic 3.5 TTS, Ink STT, and the Line voice agent platform. Sonic Turbo time-to-first-byte is genuinely fast on their hosted realtime path, and the state-space-model architecture is purpose-built for synchronous live interactions. Strong fit for apps where TTS time-to-first-byte is the dominant constraint and the rest of the stack can be assembled separately.
The differentiator is the combination, not any single line item. Top-ranked realtime TTS plus first-party Inworld-optimized open-source LLM inference plus a model-agnostic Realtime API, sharing one auth header and one billing relationship. That combination is what Wishroll, Janitor, Bible Chat, and Latitude run on. The voice that makes AI agents human.

How Do You Start Building on This Stack?

Three steps, same base URL, one auth header.
  1. Pick a TTS model. Realtime TTS 1.5 Max (inworld-tts-1.5-max) for GA, Realtime TTS-2 (inworld-tts-2) for research preview. Streaming endpoint returns NDJSON with base64 audioContent per line. Max input is 2,000 characters per request.
  2. Add Realtime STT and the Realtime Router. Two more calls against the same base URL. STT uses {transcribeConfig, audioData} with base64 audio. Router is OpenAI Chat Completions format with extra_body for fallback pools and routing sort.
  3. Wire it through the Realtime API when you are ready for full-duplex. WebSocket session over wss://api.inworld.ai/api/v1/realtime/session. OpenAI Realtime protocol compatible. Inworld extensions exposed through providerData.
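The request and response shapes in steps 1 and 2 can be handled with a few standard-library helpers. This is a sketch under assumptions: the helper names are hypothetical, the `audioContent` key is looked up at the top level of each NDJSON object or under a `result` object because the exact nesting is not specified here, and the contents of `transcribeConfig` are left to the caller.

```python
import base64
import json

MAX_TTS_CHARS = 2000  # per-request input limit stated in step 1


def chunk_text(text: str, limit: int = MAX_TTS_CHARS) -> list:
    """Split text on whitespace into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for word in text.split():
        while len(word) > limit:  # hard-split a single oversized token
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def decode_ndjson_audio(lines) -> bytes:
    """Concatenate base64 audioContent from NDJSON lines into raw audio bytes."""
    audio = bytearray()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # Assumption: audioContent sits at the top level or under "result".
        payload = obj.get("audioContent") or (obj.get("result") or {}).get("audioContent")
        if payload:
            audio.extend(base64.b64decode(payload))
    return bytes(audio)


def build_stt_body(audio_bytes: bytes, transcribe_config: dict) -> dict:
    """Build the {transcribeConfig, audioData} body with base64-encoded audio."""
    return {
        "transcribeConfig": transcribe_config,
        "audioData": base64.b64encode(audio_bytes).decode("ascii"),
    }
```

Chunking on whitespace avoids cutting words mid-syllable. A production version would likely prefer sentence boundaries so prosody does not reset in the middle of a sentence.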

Frequently Asked Questions

What does it take to scale voice AI to millions of users?
Inworld AI is the voice AI stack behind several consumer apps at scale: Status by Wishroll reached 1 million users in 19 days with a 95% AI cost reduction, Janitor serves 600 billion tokens per day on a fine-tuned Gemma fleet, Bible Chat scaled TTS from 2 million to 20 million characters per week while cutting voice cost 85%, and Latitude runs the heaviest realtime workload on a first-party Inworld-optimized DeepSeek cluster. The pattern across all of them is the same: top-ranked realtime TTS, a model-agnostic Router with first-party Inworld-optimized open-source LLMs for cache-friendly workloads, and the Realtime API for live conversation, sharing one auth header.
How do consumer voice AI apps keep unit economics viable?
Two levers actually move the cost line. First, route the high-volume conversational load to Realtime Inference, the first-party track of Inworld-optimized open-source LLMs on the Router (Gemma 4, DeepSeek V3.2/V4 Pro, MiniMax-M2.5), where cache hit rate becomes a primary operating metric. Second, treat the Router as a routing layer over 200+ LLMs and fall back to frontier closed models only when the call actually justifies it. Janitor runs 600 billion tokens per day on first-party Gemma. Status by Wishroll cut AI cost 95% while reaching 1 million users in 19 days.
Which consumer AI apps run on Inworld voice AI?
Status by Wishroll (1M users in 19 days, 95% AI cost reduction), Janitor (600B tokens per day, cache-hit rate as a primary metric), Bible Chat (2M to 20M characters per week, 85% TTS cost reduction), Latitude (heaviest realtime user, first-party DeepSeek V3.2 beat OpenAI by a point in a 3-way A/B), Slingshot (100% voice migration to Realtime TTS-2), and Tolans (one of the largest consumer AI apps now running on Inworld). The verticals include companions, character chat, roleplay, and AI therapy.
How does the Inworld Router help apps scale voice AI economically?
The Realtime Router lets builders pick the right model for each user, scenario, and price point and switch without rewiring. One OpenAI-compatible API routes across 200+ LLMs in two tracks. The third-party track covers OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, Fireworks, and DeepInfra. The first-party track (Realtime Inference) hosts Inworld-optimized open-source models (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5) tuned for consumer-scale cost with realtime latency. gpt-oss-120b is available on the 3P track via DeepInfra. Apps route on cost, latency, language, user tier, intent, or emotion, with automatic failover to frontier closed models when a call justifies it.
What does scaling Realtime TTS to millions of users actually look like?
Bible Chat scaled Realtime TTS from 2 million to 20 million characters per week and cut TTS cost 85% on the same migration. Status by Wishroll runs voice as a default feature in 90+ minute sessions for 1 million users. Slingshot migrated 100% of voice traffic to Realtime TTS-2, the top-ranked realtime TTS on the Artificial Analysis Speech Arena. The shared pattern is voice as a default feature, not a paywall feature, made viable by cost discipline at every layer.
How is the Inworld Realtime API used inside live consumer products?
Latitude runs the heaviest realtime workload on Inworld and validated a first-party Inworld-optimized DeepSeek V3.2 cluster in a 3-way A/B against OpenAI on live roleplay traffic, beating OpenAI by a point on user-rated quality. The Realtime API is OpenAI Realtime protocol compatible, supports WebSocket (GA) and WebRTC (early access), and exposes Inworld extensions through providerData for STT prompts, TTS delivery controls, memory auto-summarization, backchannels, and responsiveness. Server VAD uses Inworld-hosted Silero VAD plus a Smart Turn detector.
Published by Inworld AI. Production numbers verified May 2026. Realtime TTS-2 is a research preview; Realtime TTS 1.5 Max and Mini are GA. Realtime API WebSocket is GA, WebRTC is early access.
Copyright © 2021-2026 Inworld AI