AI Infrastructure for Companion and Roleplay Apps in 2026

Q: What AI infrastructure do companion and roleplay apps run on in 2026?

Production companion and roleplay apps run a four-layer stack: STT for turn-taking and voice profiling, an LLM with long context for persona consistency, TTS with emotion and cross-lingual identity, and a realtime orchestration layer that holds the WebSocket session, manages VAD, and handles barge-in. Inworld AI delivers all four in a single voice AI stack: Realtime STT with configurable turn-taking, the Inworld Router across 220+ LLMs (with optimized open-source models on the 1P track), Realtime TTS-2 for natural-language emotion steering, and the Realtime API as the orchestration layer. Production companion and roleplay apps like Wishroll/Status and Bible Chat run on pieces of this stack.

Q: How is companion and roleplay infrastructure different from enterprise voice agents?

Companion and roleplay sessions average 30 to 90+ minutes versus 2 to 5 minutes for enterprise CX. That changes every layer. The LLM needs long context plus cache-friendly traffic patterns (high-volume roleplay platforms run hundreds of billions of tokens per day with cache-hit-rate as a first-class metric). The TTS needs emotion range across the full session rather than neutral consistency. The STT needs voice profiling so the model recognizes the same user across sessions and configurable turn-taking for natural conversation flow. The orchestration layer needs to handle interruptions and long-running sessions without dropping context. Pricing structures also diverge: consumer apps optimize for cost per active user, not per-minute call rates.

Q: How do roleplay apps maintain personality consistency across long sessions?

Personality consistency comes from three places: a strong system prompt that defines the character, an LLM with enough context window to hold the full session history, and a TTS voice that does not drift across hours of dialogue. Production roleplay apps pick the LLM independently of the voice layer. They run optimized open-source models like DeepSeek V3.2 and fine-tuned Gemma 4 on the Inworld Router 1P track, often processing hundreds of billions of tokens per day. They keep voice identity stable with Inworld TTS, which preserves a cloned voice across the full session and (on TTS-2) across languages.

Q: What does a cost-per-active-user breakdown look like for a companion app?

Companion economics are driven by three ratios, not absolute prices. The first ratio is input-to-output tokens: companion conversations are heavily input-weighted because the system prompt and chat history repeat on every turn. Cache-friendly routing turns that repetition into a discount rather than a cost multiplier. The second ratio is voice minutes per active user per day, which scales linearly with engagement (Status averages 1 hour 36 minutes per day). The third is the paid-to-free user ratio, which determines what unit cost the business model tolerates. Inworld's premium positioning is built around quality, latency, and a full pipeline rather than dollar comparisons. See inworld.ai/pricing for current rates.

Q: Can I pick different models for the LLM, TTS, and STT layers?

Yes. The Realtime API is model-agnostic by design. The Inworld Router lets you pick the right model for each user, scenario, and price point, and switch among 220+ LLMs without rewiring. It has two tracks: a 3P track (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Meta, Groq, DeepInfra) and a 1P track called Realtime Inference (Inworld-optimized open-source models: Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2) built to run open-source LLMs at consumer-scale cost with realtime latency. TTS choices are Realtime TTS-2 (research preview, 8-dimension steering, cross-lingual identity), TTS 1.5 Max, and TTS 1.5 Mini. STT runs Inworld STT-1 with voice profiling or routes to Soniox, AssemblyAI, or Groq Whisper. Pick the best model for each component independently.

Q: How do companion apps handle interruption and turn-taking?

Turn-taking is the realtime layer's job. Inworld's Realtime API uses server_vad, which is Inworld-hosted Silero VAD plus a Smart Turn detector, not the OpenAI default. The session config exposes endOfTurnConfidenceThreshold, prompts for contextual hints, voiceProfileConfig, and inactivityTimeoutSeconds so developers can tune for natural conversational rhythm. Barge-in is handled natively: when the user starts speaking mid-response, the server cancels in-flight TTS, resets the audio buffer, and processes the new input. This collapses custom cancellation logic, and the 200 to 500 ms of delay it usually adds, into a single configuration flag.

Q: Which production companion and roleplay apps run on Inworld?

Wishroll/Status (1M users in 19 days, 90+ minute session lengths, 95% AI cost reduction after migrating to Inworld) and Bible Chat (scaled from roughly 2M to 20M characters per week with an 85% TTS cost cut) run on the Inworld stack, alongside high-volume roleplay platforms running optimized open-source models on the 1P track. Preview partners on Realtime TTS-2 include Vapi, LiveKit, Voximplant, NLX, and Voicerun.

Last updated: May 28, 2026

Inworld AI builds the AI infrastructure behind some of the largest companion and roleplay apps in 2026, including Wishroll/Status and Bible Chat. Companion and roleplay apps need a different stack from enterprise voice agents: long sessions, heavy input-token traffic, emotional voice range, and cost per active user that works when most users never pay. This page walks through the four-layer stack (STT, LLM, TTS, realtime orchestration), how production apps actually wire it together, and the cost-per-active-user math that determines whether the business model holds.

Below: the reference architecture, working code for a full Realtime API session, the cache-friendly token math that makes companion economics viable at scale, and how to choose between cascaded and full-duplex designs.

What is AI infrastructure for companion and roleplay apps?

Companion and roleplay apps are consumer applications where users have ongoing voice or text conversations with a persistent AI character. The infrastructure has four layers:

Speech-to-text (STT) with configurable turn-taking and voice profiling so the system knows when the user is finished speaking and remembers their voice across sessions.
A large language model (LLM) with enough context window to hold the full session and personality stability across hours of dialogue. For volume traffic, the LLM should run on optimized open-source models hosted close to the voice pipeline.
Text-to-speech (TTS) with emotional range, voice consistency across the session, and cross-lingual identity if the app is multilingual.
A realtime orchestration layer that holds the WebSocket session, runs VAD, manages interruptions, and stitches the other three layers together.

The two verticals this serves are Companions (Wishroll/Status, Bible Chat) and Character chat and roleplay (production roleplay platforms). Both are part of the consumer AI category, but their workload profiles diverge: companions skew toward emotional warmth and long voice sessions, while roleplay skews toward persona stability, long-context LLMs, and heavy input-token traffic.

How is companion infrastructure different from enterprise voice agents?

Most voice AI guides assume an enterprise voice agent workload: a 2 to 5 minute customer-service call, neutral delivery, a single language, and per-minute pricing that the business can pass through to the buyer. Companion and roleplay apps invert almost every assumption.

Session length. Wishroll's Status reports average sessions over an hour and a half. Enterprise voice agents target the opposite end of the curve. Long sessions mean the LLM's context window matters, the TTS voice has to stay stable across hours of dialogue, and the orchestration layer needs to hold a session without dropping context.

Token shape. Roleplay platforms run heavily input-weighted traffic: the system prompt and chat history repeat on every turn. High-volume roleplay platforms process hundreds of billions of tokens per day with cache-hit-rate as a first-class metric. The infrastructure choice that matters here is not raw throughput; it is how aggressively the inference layer caches input tokens. Realtime Inference (the 1P track of the Router) is built to run open-source LLMs at consumer-scale cost with realtime latency, which is what makes input-heavy workloads viable at scale.

Voice expectations. Enterprise voice optimizes for neutral, consistent delivery. Companions need emotional range across the same session: warmth, humor, sarcasm, concern, and excitement. Realtime TTS-2 (research preview) exposes 8 dimensions of natural-language steering (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) plus non-verbal cues, and preserves a single voice identity across more than 100 languages.

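Cost denominator. Enterprise pricing is per-minute or per-seat. Companion economics are cost per active user. A companion app with 100K daily active users at 30 minutes of voice per day generates tens of billions of characters per month (up to roughly 80 billion if all 30 minutes are synthesized speech at about 900 characters per minute, which is an assumed speaking rate that varies by voice and language). The unit cost determines whether voice ships to every user or hides behind a paywall.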

Failover expectations. Enterprise agents accept short downtime during a provider outage. Companion apps lose users when the voice goes silent. Production companion apps run automatic LLM failover across multiple providers so a single vendor outage does not break the product.

What does a reference architecture look like?

The minimal production architecture for a companion or roleplay app is one Realtime API session per user, with the LLM behind it routed through the Inworld Router. The Realtime API holds the WebSocket, runs the STT, calls the LLM via the Router, streams the response to TTS, and ships audio back to the client.

Here is a full session setup for a roleplay companion that routes the LLM through the 3P track, runs Realtime STT-1 with custom turn-taking, and uses TTS-2 with steering:

import WebSocket from 'ws';

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=companion-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

ws.on('open', () => console.log('Realtime session open'));

ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());

  if (msg.type === 'session.created') {
    // Configure the full companion stack in one event
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        instructions: 'You are Nova, a curious roleplay companion. Maintain personality and remember earlier turns. Vary delivery to match the emotional tone of the conversation.',
        model: 'deepseek/deepseek-v4-pro',
        audio: {
          input: {
            transcription: { model: 'inworld/inworld-stt-1' },
            turn_detection: {
              type: 'server_vad',
              endOfTurnConfidenceThreshold: 0.6,
              interrupt_response: true
            }
          },
          output: {
            voice: 'Sarah',
            model: 'inworld-tts-2',
            speed: 1.0
          }
        },
        providerData: {
          memory: { auto_summarize: true },
          backchannel: { enabled: true },
          responsiveness: { filler_phrase: 'short' }
        }
      }
    }));
  }

  if (msg.type === 'response.output_audio.delta') {
    playAudioChunk(msg.delta); // base64-encoded PCM16
  }
});

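// Send one base64-encoded microphone chunk to the session
function sendMic(chunkBase64) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunkBase64
  }));
}

// Placeholder: decode the chunk and hand it to your audio output
function playAudioChunk(deltaBase64) {
  const pcm = Buffer.from(deltaBase64, 'base64');
  // e.g. write pcm to a speaker stream or forward it to the client
}

ws.on('error', (err) => console.error('Realtime session error:', err));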

Three things to notice in this configuration:

Field-name discipline. Inside the Realtime API session, audio output takes voice and model, not voiceId and modelId. The REST TTS endpoint uses the latter. The Router uses model. Mixing them across APIs is one of the most common integration bugs and produces silent failures.

Turn-taking configuration. The new STT config fields on the Realtime API (endOfTurnConfidenceThreshold, prompts, voiceProfileConfig, inactivityTimeoutSeconds) tune the conversational rhythm. For roleplay, a slightly higher confidence threshold gives users more time to finish complex thoughts; for fast-paced companions, lower works better. The default server_vad is Inworld-hosted Silero VAD plus Smart Turn detection, not the OpenAI default.

Provider data extensions. providerData.memory enables session-level auto-summarization, which keeps personality consistent in long sessions without pushing the full history to every LLM call. providerData.backchannel enables brief acknowledgments while the user is still talking. providerData.responsiveness injects a low-latency filler before the main response. These are Inworld-specific extensions to the OpenAI Realtime protocol shape.
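To make the field-name and turn-taking points concrete, here is a minimal Python sketch that builds the same session.update event for two personas. It reuses only the field names from the example above; the helper name build_session_update, the 0-to-1 range check on the threshold, and the 0.7 and 0.5 values are our assumptions for illustration, not documented defaults.

```python
import json


def build_session_update(instructions, llm_model, voice, eot_threshold=0.6):
    """Build a session.update event with the Realtime API field names."""
    # Assumption: the confidence threshold is a 0-to-1 value, like vadThreshold.
    if not 0.0 <= eot_threshold <= 1.0:
        raise ValueError("eot_threshold must be between 0 and 1")
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "model": llm_model,
            "audio": {
                "input": {
                    "transcription": {"model": "inworld/inworld-stt-1"},
                    "turn_detection": {
                        "type": "server_vad",
                        "endOfTurnConfidenceThreshold": eot_threshold,
                        "interrupt_response": True,
                    },
                },
                # Realtime API uses voice/model here, not voiceId/modelId.
                "output": {"voice": voice, "model": "inworld-tts-2"},
            },
        },
    }


# Illustrative values: slightly higher threshold for roleplay, lower for fast chat.
roleplay = build_session_update(
    "You are Nova, a curious roleplay companion.",
    "deepseek/deepseek-v4-pro", "Sarah", eot_threshold=0.7)
fast_chat = build_session_update(
    "You are Nova, a quick-witted companion.",
    "deepseek/deepseek-v4-pro", "Sarah", eot_threshold=0.5)

payload = json.dumps(roleplay)  # this string is what ws.send(...) would carry
print(roleplay["session"]["audio"]["input"]["turn_detection"])
```

Keeping the payload in one builder function means the voice/model versus voiceId/modelId distinction lives in a single place rather than being retyped per character.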

How does the LLM layer hold personality across long sessions?

Personality consistency is the single hardest problem in roleplay infrastructure. It comes from three places.

A strong system prompt that defines the character, voice, and behavior. This is application work, not infrastructure.

A context window long enough to hold session history. Modern frontier models (Claude Sonnet 4.6, GPT-5.5, Gemini 3.1 Pro) all clear this bar. Optimized open-source models like Gemma 4 and DeepSeek V4 Pro hit similar context lengths and run cheaper.

Cache-friendly serving. Companion and roleplay traffic is repetitive at the prefix: the system prompt, character bible, and recent history repeat on every turn. The infrastructure that turns that into a discount is KV cache reuse. Realtime Inference (the 1P track of the Router) is built to run open-source LLMs at consumer-scale cost with realtime latency: throughput on Gemma 4 31B dense reaches roughly 27K tokens per second with a P50 TTFT around 1.7 seconds. High-volume roleplay platforms use cache-hit-rate as a primary metric for the same reason.
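A toy measurement shows why prefix reuse matters and how easily it breaks. The sketch below is our illustration, not Router internals: it treats each turn's prompt as a list of token IDs and counts how many input tokens match the previous turn's prompt from the start. The function names and the token counts are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def session_cache_hit_rate(prompts):
    """Fraction of input tokens that repeat the previous turn's prompt prefix.

    prompts: list of token-ID lists, one per turn, in order.
    """
    total = 0
    cached = 0
    prev = []
    for tokens in prompts:
        cached += common_prefix_len(prev, tokens)
        total += len(tokens)
        prev = tokens
    return cached / total if total else 0.0


# Stable prefix: a 1000-token system prompt, and history grows by 50 tokens per turn.
stable = [list(range(1000 + 50 * turn)) for turn in (1, 2, 3)]

# Volatile prefix: something different (say, a timestamp) is put first on each turn.
volatile = [[turn] + list(range(100, 200)) for turn in range(3)]

print("stable prefix hit rate:", session_cache_hit_rate(stable))
print("volatile prefix hit rate:", session_cache_hit_rate(volatile))
```

The second case is the practical lesson: anything that changes per turn belongs at the end of the prompt, because a single differing token at the front invalidates the shared prefix behind it.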

Calling the Router from a companion backend looks identical to calling OpenAI:

import os
import requests

# Calling the Router from a companion backend.
# Use the 1P track (Realtime Inference) for cost-sensitive volume traffic,
# the 3P track for personality-critical models.

response = requests.post(
    'https://api.inworld.ai/v1/chat/completions',
    headers={
        'Authorization': f"Basic {os.environ['INWORLD_API_KEY']}",
        'Content-Type': 'application/json'
    },
    json={
        # 3P track: open-weight via DeepInfra.
        # For 1P Realtime Inference use the inworld/ prefix
        # (e.g. inworld/gemma-4-26b, inworld/deepseek-v3.2, inworld/models/GLM-5.1).
        'model': 'deepinfra/openai/gpt-oss-120b',
        'messages': [
            {'role': 'system', 'content': 'You are Nova, a curious roleplay companion. Stay in character. Remember earlier turns.'},
            {'role': 'user', 'content': 'Tell me what you were thinking about while I was gone.'}
        ],
        'temperature': 0.9,
        'user': 'companion-user-42',  # sticky routing per user
        'extra_body': {
            'models': [  # automatic failover order if primary is unavailable
                'anthropic/claude-sonnet-4-6',
                'google-ai-studio/gemini-3.1-pro'
            ]
        }
    },
    timeout=30
)
response.raise_for_status()
data = response.json()
print(data['choices'][0]['message']['content'])
print('Routed via:', data.get('metadata', {}).get('attempts'))

Two things matter here for companions:

Sticky routing per user. The user field acts as a sticky routing identifier so the same user lands on the same backend across turns. That maximizes KV cache hits for that user's history. For roleplay traffic, sticky routing is what makes the cache math work.

Automatic failover. The extra_body.models array defines a fallback order. If the primary model is rate-limited or unavailable, the Router routes to the next entry without raising an error. metadata.attempts in the response shows which path was taken.
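Both behaviors can be modeled in a few lines. This is a conceptual sketch under our own assumptions, not how the Router is implemented: sticky_backend, route_with_failover, the replica names, and the is_available callback are all hypothetical.

```python
import hashlib


def sticky_backend(user_id, backends):
    """Deterministically map a user ID to one backend (hash mod N)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def route_with_failover(models, is_available):
    """Try models in order; return (chosen_model, attempts_so_far)."""
    attempts = []
    for model in models:
        attempts.append(model)
        if is_available(model):
            return model, attempts
    raise RuntimeError("no model available after: " + ", ".join(attempts))


backends = ["replica-a", "replica-b", "replica-c"]  # hypothetical names
first = sticky_backend("companion-user-42", backends)
second = sticky_backend("companion-user-42", backends)
print("same backend on both turns:", first == second)

models = [
    "deepinfra/openai/gpt-oss-120b",
    "anthropic/claude-sonnet-4-6",
    "google-ai-studio/gemini-3.1-pro",
]
down = {"deepinfra/openai/gpt-oss-120b"}  # pretend the primary is rate-limited
chosen, attempts = route_with_failover(models, lambda m: m not in down)
print("chosen:", chosen)
print("attempts:", attempts)
```

One caveat on the hash-mod-N choice: adding or removing a backend remaps most users, which throws away their cached prefixes. Consistent hashing is the usual way to limit that reshuffling.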

The Inworld Router has two tracks. The 3P track routes to external providers (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Meta, Groq, DeepInfra), which matters when personality requires a specific frontier model. Production roleplay apps run head-to-head A/B tests that compare the same prompt across multiple LLMs; that kind of test is only possible if model swapping is free. The 1P track is Realtime Inference: Inworld-hosted optimized open-source models (Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2) for cost-sensitive volume traffic. High-volume roleplay platforms run fine-tuned Gemma 4 and DeepSeek V3.2 on the 1P track.

OpenRouter offers a similar 3P aggregation surface but does not host models itself. For roleplay traffic where cache discipline is the cost lever, the 1P track is what matters.

How does the TTS layer carry emotion and identity across a session?

Flat prosody is the fastest way to lose a companion user. Three TTS capabilities matter for the workload.

Emotional range without prompt hacks. Realtime TTS-2 (research preview) accepts natural-language steering in 8 dimensions. Describe the delivery in plain English inside a bracketed tag, for example [say with quiet intensity]; a deliveryMode of STABLE, BALANCED, or CREATIVE controls the variance band. The older TTS 1.5 models use inline non-verbal tags like [laughing], [whispering], and [sigh] instead. Steering tags must not appear in TTS 1.5 requests; they would be read aloud literally.
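If one backend serves both model generations, a small guard can keep steering tags out of TTS 1.5 requests. The sketch below is an assumption-heavy illustration: prepare_text and the supports_steering flag are hypothetical, and the allowlist contains only the three non-verbal tags named above, so extend it from the actual documentation before relying on it.

```python
import re

# Assumption: only the three non-verbal tags named in the text are allowed.
NON_VERBAL_TAGS = {"laughing", "whispering", "sigh"}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def prepare_text(text, supports_steering):
    """Drop bracketed steering tags for models that would read them aloud."""
    if supports_steering:
        return text  # steering-capable model: pass the text through unchanged

    def keep_or_drop(match):
        tag = match.group(1).strip().lower()
        # Keep known non-verbal tags, remove everything else in brackets.
        return match.group(0) if tag in NON_VERBAL_TAGS else ""

    cleaned = TAG_RE.sub(keep_or_drop, text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


line = ("[say with quiet intensity] I was thinking about the door "
        "we never opened. [sigh] Maybe tomorrow.")
print(prepare_text(line, supports_steering=False))
print(prepare_text(line, supports_steering=True))
```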

Voice identity that does not drift. Cloned voices on Inworld persist across the full session and across model versions. For multilingual apps, TTS-2 preserves cross-lingual voice identity: the same cloned voice keeps its character across 15 GA languages and 90+ experimental languages. Talkpal uses this for language learning. Bible Chat uses it to scale to 20+ languages without managing per-language voice assets.

Streaming-native delivery. Long companion responses cannot wait for full generation before audio starts. The streaming endpoint returns NDJSON with base64 audio per line. Each line is parsed and played immediately.

import os
import json
import base64
import requests

# Streaming TTS-2 with natural-language steering for roleplay delivery.
# NDJSON: each line carries one base64 audio chunk.

response = requests.post(
    'https://api.inworld.ai/tts/v1/voice:stream',
    headers={
        'Authorization': f"Basic {os.environ['INWORLD_API_KEY']}",
        'Content-Type': 'application/json'
    },
    json={
        'text': '[say with quiet intensity] I was thinking about the door we never opened. Maybe tomorrow.',
        'voiceId': 'Sarah',
        'modelId': 'inworld-tts-2',
        'deliveryMode': 'BALANCED',
        'audioConfig': {
            'audioEncoding': 'MP3',
            'sampleRateHertz': 24000
        }
    },
    stream=True,
    timeout=30
)
response.raise_for_status()

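def play_audio_chunk(audio_bytes):
    # Placeholder: hand the decoded bytes to your audio player or buffer
    pass

for line in response.iter_lines():
    if not line:
        continue  # skip blank keep-alive lines
    chunk = json.loads(line)
    audio_b64 = chunk['result']['audioContent']
    play_audio_chunk(base64.b64decode(audio_b64))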

A few competitor notes for fair comparison. ElevenLabs ships Eleven v3 TTS plus the ConvAI/Agents platform (with Expressive Mode added February 2026, Flows in March), Music v2, Dubbing v2, and a Government tier. Eleven Flash claims 75ms inference latency. Cartesia ships Sonic 3.5 TTS, Ink STT, and the Line voice-agents platform. Hume EVI focuses on emotional voice intelligence and is genuinely strong on empathetic dialogue. OpenAI Realtime locks the LLM to OpenAI models but offers tight integration. OpenRouter aggregates 400+ LLMs without hosting them; it does not handle voice. Honest caveat: on at least one customer trial in May 2026, our full-pipeline latency tested higher than ElevenLabs, so we do not claim a general latency win.

Inworld's TTS-2 is a first-party realtime voice model with sub-200ms median time-to-first-audio; TTS 1.5 Max and TTS 1.5 Mini are GA. Inworld's Realtime TTS-2 is the #1 realtime TTS. Quality is best judged from the audio demos.

How does the STT layer handle turn-taking?

STT is what determines whether a companion conversation feels natural or stilted. Two capabilities matter most.

Voice profiling. Inworld STT-1 supports voice profiling for per-user identification. The model picks up on age, pitch, emotion, vocal style, and accent characteristics. For a roleplay companion that remembers its user across sessions, voice profiling closes the loop on user identity even before the LLM sees the input.

Configurable turn-taking. Standard VAD fires on silence. Realtime STT-1 exposes parameters like minEndOfTurnSilenceWhenConfident, vadThreshold (0 to 1, default 0.5), and the new endOfTurnConfidenceThreshold, which lets the model decide when a user has actually finished a thought versus paused briefly. For companion conversations with emotional pacing, this is what prevents the model from cutting users off mid-sentence.
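The difference between silence-only VAD and confidence-based endpointing is easier to see as a decision rule. The following is a toy model, not Inworld's algorithm: the parameter semantics are inferred from their names, and the 300 ms and 1500 ms values, the max_silence_ms fallback, and the is_end_of_turn function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TurnConfig:
    vad_threshold: float = 0.5               # speech probability at or above this is speech
    eot_confidence_threshold: float = 0.6    # mirrors endOfTurnConfidenceThreshold
    min_silence_when_confident_ms: int = 300 # hypothetical value
    max_silence_ms: int = 1500               # hypothetical fallback, not from the docs


def is_end_of_turn(speech_prob, silence_ms, eot_confidence, cfg):
    """Decide whether the user has finished their turn."""
    if speech_prob >= cfg.vad_threshold:
        return False  # the user is still speaking
    if eot_confidence >= cfg.eot_confidence_threshold:
        # The model believes the thought is complete: a short silence is enough.
        return silence_ms >= cfg.min_silence_when_confident_ms
    # Not confident: wait for a much longer silence before taking the turn.
    return silence_ms >= cfg.max_silence_ms


cfg = TurnConfig()
# "I was wondering if..." followed by a 600 ms pause: low confidence, keep listening.
print(is_end_of_turn(0.1, 600, 0.2, cfg))
# "What do you think?" followed by 350 ms of silence: high confidence, respond.
print(is_end_of_turn(0.1, 350, 0.9, cfg))
```

A silence-only detector with a 500 ms cutoff would have interrupted the first speaker; in this model the confidence signal is what buys the extra wait.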

We acknowledge known gaps. Realtime STT is English-strong; multilingual is improving but not at parity with the best monolingual streaming engines. Deepgram Flux's semantic endpointing has an edge we do not match in all conditions. AssemblyAI Universal-3 Pro Streaming ships strong multilingual quality. For roleplay apps that need broader language coverage, the Realtime API also routes STT to Soniox (soniox/stt-rt-v4), AssemblyAI (Universal-3 Pro Streaming, Multilingual Streaming), or Groq Whisper.

What does the cost-per-active-user math look like?

Companion economics are driven by three ratios. Absolute dollar numbers vary across providers and tiers; we keep this in ratio form so it stays accurate over time. See inworld.ai/pricing for current rates.

Input-to-output ratio. A typical roleplay turn has 6 to 20 system + history tokens for every 1 output token. With a cache-aware Router that reuses the input prefix across turns, the effective input cost drops sharply. The cache-hit-rate metric high-volume roleplay platforms track is exactly this lever.

Voice minutes per active user. Status averages over 90 minutes per day. Bible Chat scaled from roughly 2M to 20M characters per week. Voice cost is the dominant component of total compute for these apps; everything else (LLM, STT) is comparatively small per user.

Paid-to-free user ratio. Companion apps run mostly free user bases with monetization on premium tiers. The cost-per-free-user budget is what determines whether voice ships to everyone or hides behind a paywall.
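The ratios can be combined in relative units without quoting any prices. In this sketch, every number is a placeholder chosen for illustration: the 0.8 hit rate, the 0.1 cached-token price ratio, the 0.25 input-to-output price ratio, and the 5% paid share are hypothetical and are not Inworld rates. Only the 20:1 input-to-output token ratio comes from the range above.

```python
def effective_input_multiplier(cache_hit_rate, cached_price_ratio):
    """Share of the full input price actually paid.

    cached_price_ratio: price of a cached input token relative to an
    uncached one (hypothetical; depends on the provider).
    """
    return (1 - cache_hit_rate) + cache_hit_rate * cached_price_ratio


def relative_turn_cost(input_per_output, input_price_rel, multiplier):
    """LLM cost per output token, in units of the output-token price."""
    return input_per_output * input_price_rel * multiplier + 1


def cost_per_paying_user(cost_per_active_user, paid_fraction):
    """Cost each paying user must cover when free users are subsidized."""
    return cost_per_active_user / paid_fraction


multiplier = effective_input_multiplier(cache_hit_rate=0.8, cached_price_ratio=0.1)
uncached = relative_turn_cost(20, 0.25, 1.0)
cached = relative_turn_cost(20, 0.25, multiplier)

print("input multiplier:", round(multiplier, 3))
print("relative cost per output token, no cache:", round(uncached, 2))
print("relative cost per output token, with cache:", round(cached, 2))
print("cost carried per paying user:", round(cost_per_paying_user(1.0, 0.05), 2))
```

The structure matters more than the placeholder values: the cache hit rate scales the largest term, and the paid fraction divides everything, which is why those two numbers dominate the business model.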

Wishroll's 95% AI cost reduction after moving to Inworld is what made 1M users in 19 days possible at sustainable unit economics. Bible Chat reports an 85% TTS cost cut from its own move to Inworld.

The premium positioning point: we do not compete on per-character price comparisons. The cost wins come from cache-aware routing, optimized open-source inference on the 1P track, and the full pipeline sharing the same inference fabric. See the live pricing page for current numbers.

Cascaded versus full-duplex: which design wins for companions?

Two architectural patterns compete for realtime voice.

Cascaded (STT → LLM → TTS). Each layer is a separate model. Components are swappable. This is what the Inworld Realtime API runs under the hood. The advantage is model flexibility: pick the best TTS, the best STT, the best LLM, and route each independently. The disadvantage is architectural latency: each component adds its own processing time. We migrated to a duplex TTS API to preserve context across the session, and a C++ port of the streaming path cut latency a further 10 to 15 percent.

Full-duplex speech-to-speech. A single multimodal model takes audio in and emits audio out. OpenAI's Realtime API offers this for OpenAI models; Google's Gemini Live offers it for Gemini; xAI ships one too; NVIDIA's Nemotron 3 VoiceChat 12B is the open-source full-duplex contender. The advantage is end-to-end latency. The disadvantage is locked-in model choice.

For companion and roleplay apps, the trade-off usually favors cascaded. The LLM is the personality. Locking to a single vendor's full-duplex model means the personality is whatever that vendor chose. The kind of A/B test that swaps DeepSeek for OpenAI for Anthropic only happens in a cascaded architecture. Inworld benchmarks both architectures internally against production workloads (including Gemini Live and xAI for the full-duplex side) so customers can size their own cascaded vs. full-duplex decision.

How do production companion apps actually wire this together?

A few patterns from production deployments:

Wishroll/Status runs voice on the Inworld stack with fallback routing to other providers on outages, which is the partner-not-captive pattern. The 95% AI cost reduction came from moving to Inworld's full pipeline and from cache-aware routing on the LLM layer.

High-volume roleplay platforms run fine-tuned Gemma 4 31B on the Inworld Router 1P track, processing hundreds of billions of tokens per day with cache-hit-rate treated as a primary metric. The GPU scaling story is real: high-volume roleplay workloads are GPU-dense even with optimized inference, and head-to-head A/B tests across DeepSeek, OpenAI, and Anthropic only happen when model swapping is free.

Bible Chat scaled from roughly 2M to 20M characters per week with an 85% TTS cost cut after moving to Inworld. Voice is a default feature across all users, not a paywalled upgrade.

Integration partners for Realtime TTS include Vapi, LiveKit, Voximplant, NLX, Stream Vision Agents, and Ultravox.

What are the constraints to know about?

Three current limits worth flagging.

TTS-2 is research preview. Production-critical deployments should weigh the timeline. TTS 1.5 Max and TTS 1.5 Mini are fully GA.

Realtime API transports. WebSocket is GA. WebRTC and SIP are in early access.

Inference geography. STT and TTS run from US datacenters as of May 2026. For latency-sensitive deployments in the EU, this is a real constraint, and we say so directly to EU customers who ask.

How do you get started?

Sign up at platform.inworld.ai and generate an API key.
Set up a Realtime API session with the WebSocket example above. Configure STT, LLM, and TTS in a single session.update event.
Pick an LLM: start with the 3P track (anthropic/claude-sonnet-4-6 or openai/gpt-5.5) for personality work, then move volume traffic to Realtime Inference on the 1P track (Inworld-hosted Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2) or to deepinfra/openai/gpt-oss-120b on the 3P track once you know the persona is right.
Clone a voice for your character using 5 to 15 seconds of reference audio via POST /voices/v1/voices:clone. Use the returned voiceId in the session's audio.output.voice field.
Tune turn-taking via endOfTurnConfidenceThreshold and benchmark latency against your workload using your own representative traffic: the cascaded vs full-duplex trade-off is workload-dependent.

Related reading: voice AI for AI companions (TTS-focused deep dive on the same vertical), build a voice agent in 30 minutes (quickstart), and the Realtime API documentation (transport details).

Frequently Asked Questions

What AI infrastructure do companion and roleplay apps run on in 2026?

Production apps run a four-layer stack: STT for turn-taking and voice profiling, an LLM with long context for persona consistency, TTS with emotion and cross-lingual identity, and a realtime orchestration layer. Inworld delivers all four: Realtime STT, the Inworld Router across 220+ LLMs (with optimized open-source models on the 1P track), Realtime TTS-2, and the Realtime API. Wishroll/Status and Bible Chat, alongside high-volume roleplay platforms, run on pieces of this stack.

How is companion and roleplay infrastructure different from enterprise voice agents?

Sessions are 30 to 90+ minutes versus 2 to 5. Token traffic is heavily input-weighted (high-volume roleplay platforms run hundreds of billions of tokens per day with cache-hit-rate as a primary metric). TTS needs emotional range, not neutral consistency. STT needs voice profiling and configurable turn-taking. Pricing optimizes for cost per active user, not per-minute call rates.

How do roleplay apps maintain personality consistency across long sessions?

A strong system prompt, an LLM with enough context, and a TTS voice that does not drift. Production roleplay platforms run optimized open-source models like DeepSeek V3.2 and fine-tuned Gemma 4 on the Inworld Router 1P track at hundreds of billions of tokens per day. They keep voice identity stable with Inworld TTS.

What does a cost-per-active-user breakdown look like for a companion app?

Three ratios drive it: input-to-output token ratio (cache-friendly routing matters), voice minutes per active user per day (Status averages 1 hour 36 minutes), and paid-to-free user ratio. See inworld.ai/pricing for current rates.

Can I pick different models for the LLM, TTS, and STT layers?

Yes. The Realtime API is model-agnostic. The Inworld Router routes to 220+ LLMs across a 3P track (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Meta, Groq, DeepInfra) and a 1P track called Realtime Inference (Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2) built for realtime latency at consumer-scale cost. TTS choices include TTS-2 (preview, 8-dimension steering), TTS 1.5 Max, and TTS 1.5 Mini. STT runs Inworld STT-1 with voice profiling or routes to Soniox, AssemblyAI, or Groq Whisper.

How do companion apps handle interruption and turn-taking?

The Realtime API runs server_vad, which is Inworld-hosted Silero VAD plus Smart Turn detection. The session config exposes endOfTurnConfidenceThreshold, prompts, voiceProfileConfig, and inactivityTimeoutSeconds. Barge-in is handled natively: when the user starts speaking mid-response, the server cancels in-flight TTS and processes the new input.

Which production companion and roleplay apps run on Inworld?

Wishroll/Status (1M users in 19 days, 95% AI cost reduction) and Bible Chat (2M to 20M characters per week, 85% TTS cost cut), alongside high-volume roleplay platforms running optimized open-source models on the 1P track.

Published by Inworld AI. Production data from customer case studies.
