Voice AI for Consumer Apps (2026)

Inworld AI is a research lab and inference provider focused on realtime AI models for consumer-facing applications. We build voice AI that feels as human as it sounds. Our voice stack (Realtime TTS-2, Realtime STT, the Realtime API, Realtime Inference, the Realtime Router, and Compute) powers Status by Wishroll (1M users in 19 days), Bible Chat (2M to 20M characters per week), and Talkpal (multilingual companion learning). This guide is for builders shipping consumer apps in the categories that actually retain users six months later: companions, social apps, and games. It covers what separates consumer AI infrastructure from enterprise AI clouds, the six products in the consumer stack, the production numbers behind anchor customers, and how to architect for the scale where unit economics either work or break.

Why Consumer AI Infrastructure Looks Different From Enterprise AI Cloud

Enterprise AI clouds (Azure AI Foundry, AWS Bedrock, Google Vertex AI) optimize for procurement, compliance, model catalogs, and integration into existing enterprise software. The metrics that matter are SOC 2, HIPAA, regional residency, SLA, audit logging, and breadth of frontier models in a single contract.

Consumer AI apps optimize for completely different things. Retention. Time per session. Voice latency under realtime conversation. Voice quality that users want to spend an hour with. Per-user cost when the median user pays nothing. Cache hit rate on inference. Failover behavior when an upstream model degrades during peak.

Almost all of the apps that retain users six months later and pull recurring revenue run on realtime voice. The category requires different models, different routing, different latency budgets, and different cost discipline than the enterprise stack was designed for. Inworld AI is a research lab that is all in on realtime AI for consumer-facing applications, with what we consider the best realtime and voice models on the market for engaging consumers.

The point is not that enterprise clouds are wrong. The point is that a million-user consumer voice app does not look like a Fortune 500 contact center, and the infrastructure shouldn't either.

What Does the Consumer Voice AI Stack Look Like?

Six products, one stack, one auth header.

Realtime TTS. Voices that sound human enough that users stay on the call and come back. Realtime TTS-2 (research preview, model ID inworld-tts-2) is Inworld's most expressive realtime voice model, and the #1 realtime TTS. Realtime TTS 1.5 Max and Realtime TTS 1.5 Mini are GA. TTS-2 adds natural-language steering across 8 dimensions (emotion, articulation, intonation, volume, pitch, range, speed, vocal style), non-verbal tags ([laugh], [sigh], [breathe]), a deliveryMode field (STABLE, BALANCED, CREATIVE), and cross-lingual voice identity that preserves the same voice across 15 GA languages and 90+ experimental languages. Free zero-shot voice cloning from 5 to 15 seconds of audio.

Realtime STT. Captures what users said, including how they said it, so the agent responds with context. Multiple provider options through one API: inworld/inworld-stt-1 (with voice profiling for age, pitch, emotion, vocal style, and accent), AssemblyAI Universal-3 Pro variants, Groq Whisper-Large v3, and Soniox stt-rt-v4. JSON body with transcribeConfig plus base64 audioData. Configurable turn-taking through endOfTurnConfidenceThreshold, contextual prompts, and voiceProfileConfig.

Realtime Router. Pick the right model for each user, scenario, and price point and switch without rewiring. One OpenAI-compatible API routes to 220+ models. Two tracks: a third-party track (OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, Fireworks, DeepInfra; gpt-oss-120b is routable via DeepInfra here) and a first-party track called Realtime Inference: Inworld-optimized open-source models built to run open-source LLMs at consumer-scale cost with realtime latency (Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2). Routing on cost, latency, throughput, intelligence, or custom metadata like language, country, user tier, intent, or emotion. Automatic failover. Live A/B testing on production traffic.

Realtime API. One integrated voice loop instead of stitching three vendors. Ships in days, fails in fewer places. WebSocket and WebRTC. OpenAI Realtime protocol compatible: change the base URL, swap the auth header. Inworld extensions through providerData for STT prompts, TTS delivery controls, memory auto-summarization, backchannels, and responsiveness. Server VAD uses Inworld-hosted Silero VAD plus a Smart Turn detector. Supports image content parts as of May 2026.

Realtime Inference (the 1P track of the Router). Run open-source models fast enough for live voice and cheap enough for consumer-scale free tiers. This is the lever that makes consumer economics work at scale: Inworld-optimized open-source models on dedicated B200 GPU capacity. Same OpenAI SDK call, dramatically different cost profile on input-heavy and cache-friendly workloads.

Compute. Dedicated capacity for traffic-heavy customers, for predictable latency when shared inference no longer fits. Managed GPU, layered under Realtime Inference and Realtime TTS.
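Of the six products above, Realtime STT has the most concretely described request shape, so it is worth sketching. The example below assembles a {transcribeConfig, audioData} body. The names transcribeConfig, audioData, endOfTurnConfidenceThreshold, and the model ID come from the description above; the modelId key, the nesting of the threshold under transcribeConfig, and the 0-to-1 range check are assumptions made for illustration, so treat this as a sketch rather than the documented schema.

```python
import base64
import json


def build_stt_request(audio_bytes: bytes, end_of_turn_threshold: float = 0.7) -> dict:
    """Assemble a {transcribeConfig, audioData} body for a Realtime STT call."""
    # Assumption: the threshold is a confidence value between 0 and 1.
    if not 0.0 <= end_of_turn_threshold <= 1.0:
        raise ValueError("end_of_turn_threshold must be between 0 and 1")
    return {
        "transcribeConfig": {
            # Assumed key name for selecting the STT model.
            "modelId": "inworld/inworld-stt-1",
            # Field named in the guide; its placement here is an assumption.
            "endOfTurnConfidenceThreshold": end_of_turn_threshold,
        },
        # Raw audio is base64-encoded so it can travel inside a JSON body.
        "audioData": base64.b64encode(audio_bytes).decode("ascii"),
    }


body = build_stt_request(b"\x00\x01\x02\x03", end_of_turn_threshold=0.6)
print(json.dumps(body, indent=2))
```

Keeping the body construction in one small function makes it easy to adjust once the real field layout is confirmed against the API reference.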

Which Consumer Apps Already Run on Inworld?

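Three named customers anchor the consumer stack, alongside production consumer apps across domains like companions, social, and games: Status by Wishroll (1 million users in 19 days, 95% AI cost reduction), Bible Chat (scaled from 2M to 20M characters per week with an 85% TTS cost reduction), and Talkpal (multilingual companion learning).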

Wishroll is a partner, not a captive. They maintain fallback routing to Gemini, OpenAI, and Anthropic on outages. The Router being model-agnostic is what makes that posture possible: cost-optimize on first-party Inworld-optimized open-source models in steady state, fail over to frontier closed models in the rare moments when something breaks, and never block on a single vendor.

How Do You Build For Millions of Free Users?

Consumer voice apps live and die on per-active-user cost. A companion at 100K daily active users averaging 30 minutes of voice per session burns roughly a billion characters of TTS per day (assuming one session per user per day, the agent speaking about a third of the time, and roughly 15 characters per second of speech) and a much larger volume of LLM tokens. At enterprise frontier-model pricing, that math doesn't work for a free-to-play product.

Two levers actually move the cost line.

Lever one: first-party Inworld-optimized open-source LLMs through the Router. Production consumer apps run fine-tuned Gemma 4 31B fleets at hundreds of billions of tokens per day. The serving stack is tuned for the cache-friendly, input-heavy workloads that these consumer apps actually produce. Same call as any other OpenAI-compatible API; very different unit economics.

Lever two: routing logic at the metadata layer. Don't pin one model for the whole product. Route on user tier (free users to a cheaper model, subscribers to a more capable one), on country (regional model coverage), on language (cross-lingual coverage where it matters), on intent (small model for greetings and acks, big model for the hard turns), even on emotion. Sticky routing on a user ID keeps a single user pinned to the same model for the duration of a session.
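To make the tier and intent rules concrete, here is a minimal sketch of that decision table written as application code. The Router is described as applying this kind of policy on request metadata; the client-side function below only illustrates the logic. The model IDs are reused from elsewhere in this guide, while the tier-to-model mapping, the intent labels, and the TurnContext type are hypothetical.

```python
from dataclasses import dataclass

# Model IDs taken from this guide; which one serves which tier is illustrative.
LIGHT_MODEL = "google-ai-studio/gemini-3.5-flash"
DEFAULT_MODEL = "inworld/models/deepseek-v4-pro"
PREMIUM_MODEL = "openai/gpt-5.5"


@dataclass(frozen=True)
class TurnContext:
    user_id: str
    tier: str    # "free" or "subscriber" (hypothetical labels)
    intent: str  # "greeting", "ack", or "conversation" (hypothetical labels)


def pick_model(ctx: TurnContext) -> str:
    """Choose a model for one turn from user tier and detected intent."""
    # Greetings and acknowledgements go to the light model for every tier.
    if ctx.intent in {"greeting", "ack"}:
        return LIGHT_MODEL
    # Harder turns: subscribers get the more capable model.
    if ctx.tier == "subscriber":
        return PREMIUM_MODEL
    return DEFAULT_MODEL


def build_request(ctx: TurnContext, messages: list) -> dict:
    """Chat Completions style payload; `user` carries the sticky-routing key."""
    return {"model": pick_model(ctx), "messages": messages, "user": ctx.user_id}


free_turn = TurnContext(user_id="user_8a92c7", tier="free", intent="conversation")
print(build_request(free_turn, [{"role": "user", "content": "Rough day."}]))
```

Expressing the policy as a pure function keeps it easy to test and to change as pricing or model quality shifts.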

The combination matters more than either piece alone. Production consumer apps have validated this in 3-way A/B tests of frontier and open-source models on live traffic. First-party Inworld-optimized open-source clusters can match or beat frontier closed models on user-rated quality, at a cost structure that lets a team keep running the open-source model as a default rather than a premium tier.
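A 3-way test on live traffic needs each user to land in the same arm every time, or quality ratings get mixed across models. One common way to do that is to hash the user ID into a bucket. The sketch below is an illustration of that general technique, not a description of how the Router assigns traffic; the experiment name and the choice of arms are assumptions.

```python
import hashlib
from collections import Counter

# Three arms for a frontier vs. open-source comparison; IDs reused from this guide.
ARMS = [
    "inworld/models/deepseek-v4-pro",
    "google-ai-studio/gemini-3.5-flash",
    "openai/gpt-5.5",
]


def assign_arm(user_id: str, experiment: str = "llm-3way") -> str:
    """Deterministically map a user to one arm of the experiment."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(ARMS)
    return ARMS[bucket]


# The same user always lands in the same arm.
print(assign_arm("user_8a92c7") == assign_arm("user_8a92c7"))

# Across many users the split is close to even.
counts = Counter(assign_arm(f"user_{i}") for i in range(3000))
print(counts)
```

Because the assignment depends only on the user ID and the experiment name, no lookup table is needed to keep a user in the same arm across sessions.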

# OpenAI-compatible Router call, first-party Inworld-optimized DeepSeek
from openai import OpenAI

client = OpenAI(
    api_key="<your-api-key>",
    base_url="https://api.inworld.ai/v1",
)

response = client.chat.completions.create(
    model="inworld/models/deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a thoughtful companion."},
        {"role": "user", "content": "Tell me how your day went."},
    ],
    user="user_8a92c7",  # sticky routing, same user pinned to same backend
    extra_body={
        "models": ["google-ai-studio/gemini-3.5-flash", "openai/gpt-5.5"],
        "sort": ["latency", "price"],
    },
)

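print(response.choices[0].message.content)

# metadata is a Router-specific field; the OpenAI SDK exposes unknown
# response fields as attributes, so guard against it being absent.
metadata = getattr(response, "metadata", None) or {}
print(metadata.get("attempts"))  # routing trace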

The extra_body.models list is a fallback pool. If the primary degrades, the Router retries in order without your app code knowing. The Router's metadata.attempts field returns the routing trace so you can monitor what actually served in production.
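The ordered-fallback behavior described above can be simulated locally to see what a routing trace might contain. This is a sketch of the general pattern, not the Router's implementation: call_with_fallback, UpstreamError, fake_send, and the shape of each attempt record are all hypothetical, and the real metadata.attempts schema may differ.

```python
from typing import Callable, Dict, List, Tuple


class UpstreamError(Exception):
    """Raised by the (simulated) transport when a model backend fails."""


def call_with_fallback(
    primary: str,
    fallbacks: List[str],
    send: Callable[[str], str],
) -> Tuple[str, List[Dict[str, str]]]:
    """Try the primary model, then each fallback in order; record every attempt."""
    attempts: List[Dict[str, str]] = []
    for model in [primary, *fallbacks]:
        try:
            reply = send(model)
        except UpstreamError as exc:
            attempts.append({"model": model, "status": "failed", "error": str(exc)})
            continue
        attempts.append({"model": model, "status": "ok"})
        return reply, attempts
    raise UpstreamError(f"all {len(attempts)} models failed")


def fake_send(model: str) -> str:
    # Simulate a degraded primary so the first fallback serves the request.
    if model == "inworld/models/deepseek-v4-pro":
        raise UpstreamError("degraded")
    return f"reply from {model}"


reply, attempts = call_with_fallback(
    "inworld/models/deepseek-v4-pro",
    ["google-ai-studio/gemini-3.5-flash", "openai/gpt-5.5"],
    fake_send,
)
print(reply)
for attempt in attempts:
    print(attempt)
```

Logging a trace like this per request is what lets you see how often the fallback pool is actually serving production traffic.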

What Models Power Consumer Voice AI in 2026?

Consumer AI moves faster than enterprise procurement cycles, so the model list is shorter and more current.

TTS. Realtime TTS-2 (research preview, inworld-tts-2) for the highest-quality realtime production traffic and steering-heavy use cases. Realtime TTS 1.5 Max (inworld-tts-1.5-max) for GA workloads where preview status is a non-starter. Realtime TTS 1.5 Mini (inworld-tts-1.5-mini) for the lowest-latency turn-taking on cost-sensitive paths.

STT. inworld/inworld-stt-1 with voice profiling for consumer apps that personalize on voice characteristics. AssemblyAI options for streaming use cases that need diarization or multilingual edge cases. Soniox stt-rt-v4 (WebSocket) added in May 2026.

LLM via the Router. On the first-party track (Realtime Inference): Inworld-optimized Gemma 4 (31B dense, 26B MoE), DeepSeek V3.2/V4 Pro, GLM-5.1/5.2. On the third-party track: GPT-5.5, Claude Sonnet 4.6, Claude Opus 4.8, Gemini 3.5 Flash, Gemini 3.1 Pro, Llama 4 Scout, Mistral Large 2512, Grok 4.3, plus the standard Groq, Fireworks, and DeepInfra hosted options (deepinfra/openai/gpt-oss-120b is available here).

The bar a consumer-app model has to clear is different from the enterprise bar. The questions are: does it hold persona across a 90-minute session, does it preserve voice identity across language switches, does it stay in character under user pressure, and does the cost curve stay below the LTV curve at scale.

How Does Inworld Compare to ElevenLabs, Cartesia, and OpenAI for Consumer Apps?

ElevenLabs ships Eleven v3 TTS, Scribe v2 STT, ElevenAgents (with Expressive Mode), Flows, Music v2, Dubbing v2, and a Government tier. Their voice library is the largest in the industry (10,000+ community voices), they ship constantly, and Eleven Flash is a real low-latency option for some workloads. The strongest match for a consumer app on ElevenLabs is a prototype where voice variety matters more than per-user cost discipline at scale.

Cartesia ships Sonic 3.5 TTS, Ink STT, and the Line voice agent platform. Cartesia Sonic Turbo time-to-first-byte is genuinely fast (~40ms historically claimed for their hosted realtime path), and their state-space-model architecture is purpose-built for synchronous live interactions. Strong choice for apps where TTS time-to-first-byte is the dominant constraint and the rest of the stack can be assembled separately.

OpenAI Realtime API is a solid full-duplex starting point for teams already deep in the GPT ecosystem. The tradeoff is that you're pinned to OpenAI for the LLM layer and you give up routing and fallback flexibility across providers.

The differentiator is the combination, not any single line item. Expressive realtime TTS plus first-party Inworld-optimized open-source LLM inference plus a model-agnostic Realtime API, sharing one auth header and one billing relationship. That combination is what Wishroll, Bible Chat, and production consumer apps run on. The voice that makes AI agents human.

What About TTS-2 Research Preview?

Realtime TTS-2 launched May 5, 2026 as a research preview. It isn't GA yet because checkpoints are still evolving and steering behavior is still being tuned with launch partners, and we would rather say so than promise stability early. Launch-partner companion apps have migrated 100% of voice traffic to TTS-2 during the preview period. Status by Wishroll is on it. Talkpal anchored the multilingual launch.

The pieces that are usable today: 8-dimension natural-language steering, the deliveryMode field (STABLE, BALANCED, CREATIVE), cross-lingual voice identity across 15 GA languages plus 90+ experimental, and instant voice cloning extended to 60 seconds of reference audio. The piece that is not GA today: stability guarantees.

For consumer apps that need a GA SLA today, Realtime TTS 1.5 Max is the production default. For apps that want TTS-2's expressive realtime quality and are willing to operate in preview, TTS-2 is the better choice.
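The GA-versus-preview choice above can be captured in a few lines. This is a minimal sketch under stated assumptions: the model IDs and the three deliveryMode values come from this guide, but the modelId and text key names and the top-level placement of deliveryMode are hypothetical, and the needs_ga flag is just an illustrative way to express the decision.

```python
DELIVERY_MODES = ("STABLE", "BALANCED", "CREATIVE")  # values named in this guide


def pick_tts_model(needs_ga: bool) -> str:
    """GA workloads use TTS 1.5 Max; preview-tolerant workloads use TTS-2."""
    return "inworld-tts-1.5-max" if needs_ga else "inworld-tts-2"


def build_tts_request(text: str, needs_ga: bool, delivery_mode: str = "BALANCED") -> dict:
    """Assemble a TTS payload; key names are assumptions, not the documented schema."""
    model = pick_tts_model(needs_ga)
    payload = {"modelId": model, "text": text}
    # deliveryMode is described as a TTS-2 addition, so only attach it there.
    if model == "inworld-tts-2":
        if delivery_mode not in DELIVERY_MODES:
            raise ValueError(f"delivery_mode must be one of {DELIVERY_MODES}")
        payload["deliveryMode"] = delivery_mode
    return payload


print(build_tts_request("Welcome back.", needs_ga=True))
print(build_tts_request("That is wild [laugh] tell me more.", needs_ga=False, delivery_mode="CREATIVE"))
```

Routing the decision through one function means moving a product from 1.5 Max to TTS-2 later is a single flag change rather than a payload rewrite.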

How Do You Get Started?

Three steps.

Pick a TTS model. Realtime TTS 1.5 Max for GA, Realtime TTS-2 for research preview. Same auth header, same audio formats, same streaming protocol (NDJSON with base64 audio chunks per line).
Add Realtime STT and the Realtime Router. Two more calls against the same base URL. STT is {transcribeConfig, audioData} with a base64 audio payload. Router is OpenAI Chat Completions format with extra_body for fallbacks and routing sort.
Wire it through the Realtime API when you're ready for full-duplex. WebSocket session over wss://api.inworld.ai/api/v1/realtime/session. OpenAI Realtime protocol compatible. Inworld extensions exposed through providerData.
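For the first step, the streaming format (NDJSON with a base64 audio chunk per line) is simple to consume. The sketch below decodes such a stream into raw audio bytes. The audioContent key is a hypothetical field name chosen for illustration, since the guide does not name the field; the sample lines are generated locally rather than fetched from the API.

```python
import base64
import json
from typing import Iterable

AUDIO_FIELD = "audioContent"  # hypothetical key name for the base64 chunk


def decode_ndjson_audio(lines: Iterable[str]) -> bytes:
    """Concatenate base64 audio chunks from an NDJSON stream, one JSON object per line."""
    audio = bytearray()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        record = json.loads(line)
        chunk = record.get(AUDIO_FIELD)
        if chunk:
            audio.extend(base64.b64decode(chunk))
    return bytes(audio)


# Locally built sample stream standing in for a real streaming response.
sample_stream = [
    json.dumps({AUDIO_FIELD: base64.b64encode(b"abc").decode("ascii")}),
    "",
    json.dumps({AUDIO_FIELD: base64.b64encode(b"def").decode("ascii")}),
]
audio_bytes = decode_ndjson_audio(sample_stream)
print(len(audio_bytes))
```

In a real client the same function can be fed the line iterator of a streaming HTTP response, so playback can start before the full utterance has arrived.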

See pricing and start building.

Frequently Asked Questions

What voice AI do the fastest-growing consumer apps use?

Inworld AI is the voice infrastructure behind several of the fastest-growing consumer AI apps, including Status by Wishroll (1 million users in 19 days, 95% AI cost reduction), Bible Chat (scaled from 2M to 20M characters per week with an 85% TTS cost reduction), and Talkpal (multilingual companion learning). The stack is built on Realtime TTS-2, Realtime STT, the Realtime Router (with first-party Inworld-optimized open-source models), and the Realtime API.

How is consumer AI infrastructure different from enterprise AI clouds?

Enterprise AI clouds like Azure AI Foundry, AWS Bedrock, and Google Vertex AI optimize for compliance, broad model catalogs, and integration into existing enterprise software. Consumer AI apps optimize for retention, latency under realtime conversation, voice quality that users want to spend hours with, and unit economics that work when most users never pay. Different metrics, different scale shape, different stack. Inworld AI is a research lab focused on realtime voice AI for consumer-facing applications.

Which products does Inworld offer for consumer apps?

Six products in a single stack: Realtime TTS (TTS-2 research preview, plus TTS 1.5 Max and Mini GA), Realtime STT with voice profiling, the Realtime API for the full conversational pipeline in a single call, Realtime Inference (the 1P track of first-party Inworld-optimized open-source models: Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2), the Realtime Router that routes one API across 220+ models, and Compute (managed GPU).

Can voice AI economics work for a free-to-play consumer app?

Status by Wishroll cut AI costs 95% and reached 1 million users in 19 days. Bible Chat scaled voice 10x from 2M to 20M characters per week and cut TTS costs 85%. The combination of first-party Inworld-optimized open-source models in the Router and competitive TTS pricing makes voice viable as a default feature, not a paywall feature, at the scale where consumer apps actually monetize.

How does Inworld compare to ElevenLabs and Cartesia for consumer apps?

Inworld combines an expressive realtime TTS (TTS-2 research preview, plus TTS 1.5 Max and Mini GA), first-party Inworld-optimized open-source LLM inference in the same Router, and the Realtime API in a single stack. ElevenLabs and Cartesia ship strong TTS plus voice agent platforms, and both are good choices for specific shapes of work. Consumer apps that run hundreds of millions of tokens per day on open-source LLMs also need the LLM inference layer co-located. That combination is what Wishroll and production consumer apps run on.

What model powers the Realtime API for consumer apps?

The Realtime API runs a model-agnostic pipeline: choose any LLM through the Realtime Router, either Realtime Inference (first-party Inworld-optimized Gemma 4, DeepSeek V3.2/V4, and GLM-5.1/5.2) or third-party models from OpenAI, Anthropic, Google, Mistral, DeepSeek, Groq, Fireworks, and DeepInfra (including deepinfra/openai/gpt-oss-120b), for 220+ models in total, then pair it with Realtime STT for input and Realtime TTS-2 for output. One WebSocket session, one auth header, one billing relationship. OpenAI SDK compatible by changing the base URL and swapping the auth header.

Published by Inworld AI. Production numbers verified May 2026. Realtime TTS-2 is a research preview; Realtime TTS 1.5 Max and Mini are GA. Realtime API WebSocket is GA, WebRTC is early access.
