Last updated: May 28, 2026
Inworld AI builds the AI infrastructure behind some of the largest companion and roleplay apps in 2026, including Wishroll/Status, Janitor, Latitude, Tolans, and Bible Chat. Companion and roleplay apps need a different stack from enterprise voice agents: long sessions, heavy input-token traffic, emotional voice range, and cost per active user that works when most users never pay. This page walks through the four-layer stack (STT, LLM, TTS, realtime orchestration), how production apps actually wire it together, and the cost-per-active-user math that determines whether the business model holds.
Below: the reference architecture, working code for a full Realtime API session, the cache-friendly token math that makes companion economics viable at scale, and how to choose between cascaded and full-duplex designs.
What is AI infrastructure for companion and roleplay apps?
Companion and roleplay apps are consumer applications where users have ongoing voice or text conversations with a persistent AI character. The infrastructure has four layers:
- Speech-to-text (STT) with configurable turn-taking and voice profiling so the system knows when the user is finished speaking and remembers their voice across sessions.
- A large language model (LLM) with enough context window to hold the full session and personality stability across hours of dialogue. For volume traffic, the LLM should run on optimized open-source models hosted close to the voice pipeline.
- Text-to-speech (TTS) with emotional range, voice consistency across the session, and cross-lingual identity if the app is multilingual.
- A realtime orchestration layer that holds the WebSocket session, runs VAD, manages interruptions, and stitches the other three layers together.
The two verticals this serves are Companions (Wishroll/Status, Tolans, Bible Chat, Slingshot) and Character chat and roleplay (Janitor, Latitude). Both are part of the consumer AI category, but their workload profiles diverge: companions skew toward emotional warmth and long voice sessions, while roleplay skews toward persona stability, long-context LLMs, and heavy input-token traffic.
How is companion infrastructure different from enterprise voice agents?
Most voice AI guides assume an enterprise voice agent workload: a 2 to 5 minute customer-service call, neutral delivery, a single language, and per-minute pricing that the business can pass through to the buyer. Companion and roleplay apps invert almost every assumption.
Session length. Wishroll's Status reports average sessions over an hour and a half. Enterprise voice agents target the opposite end of the curve. Long sessions mean the LLM's context window matters, the TTS voice has to stay stable across hours of dialogue, and the orchestration layer needs to hold a session without dropping context.
Token shape. Roleplay platforms run heavily input-weighted traffic: the system prompt and chat history repeat on every turn. Janitor processes around 600B tokens per day with cache-hit-rate as a first-class metric. The infrastructure choice that matters here is not raw throughput; it is how aggressively the inference layer caches input tokens. Realtime Inference (the 1P track of the Router) is built to run open-source LLMs at consumer-scale cost with realtime latency, which is what makes input-heavy workloads viable at scale.
Voice expectations. Enterprise voice optimizes for neutral, consistent delivery. Companions need emotional range across the same session: warmth, humor, sarcasm, concern, and excitement. Realtime TTS-2 (research preview) exposes 8 dimensions of natural-language steering (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) plus non-verbal cues, and preserves a single voice identity across more than 100 languages.
Cost denominator. Enterprise pricing is per-minute or per-seat. Companion economics are cost per active user. A companion app with 100K daily active users at 30 minutes of voice per day generates roughly 900 million characters per month. The unit cost determines whether voice ships to every user or hides behind a paywall.
Failover expectations. Enterprise agents accept short downtime during a provider outage. Companion apps lose users when the voice goes silent. Production companion apps run automatic LLM failover across multiple providers so a single vendor outage does not break the product.
What does a reference architecture look like?
The minimal production architecture for a companion or roleplay app is one Realtime API session per user, with the LLM behind it routed through the Inworld Router. The Realtime API holds the WebSocket, runs the STT, calls the LLM via the Router, streams the response to TTS, and ships audio back to the client.
Here is a full session setup for a roleplay companion that routes the LLM through the 3P track, runs Realtime STT-1 with custom turn-taking, and uses TTS-2 with steering:
import WebSocket from 'ws';
const ws = new WebSocket(
`wss://api.inworld.ai/api/v1/realtime/session?key=companion-${Date.now()}&protocol=realtime`,
{ headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);
ws.on('open', () => console.log('Realtime session open'));
ws.on('message', (raw) => {
const msg = JSON.parse(raw.toString());
if (msg.type === 'session.created') {
// Configure the full companion stack in one event
ws.send(JSON.stringify({
type: 'session.update',
session: {
instructions: 'You are Nova, a curious roleplay companion. Maintain personality and remember earlier turns. Vary delivery to match the emotional tone of the conversation.',
model: 'deepseek/deepseek-v4-pro',
audio: {
input: {
transcription: { model: 'inworld/inworld-stt-1' },
turn_detection: {
type: 'server_vad',
endOfTurnConfidenceThreshold: 0.6,
interrupt_response: true
}
},
output: {
voice: 'Sarah',
model: 'inworld-tts-2',
speed: 1.0
}
},
providerData: {
memory: { auto_summarize: true },
backchannel: { enabled: true },
responsiveness: { filler_phrase: 'short' }
}
}
}));
}
if (msg.type === 'response.output_audio.delta') {
playAudioChunk(msg.delta); // base64-encoded PCM16
}
});
function sendMic(chunkBase64) {
ws.send(JSON.stringify({
type: 'input_audio_buffer.append',
audio: chunkBase64
}));
}
Three things to notice in this configuration:
Field-name discipline. Inside the Realtime API session, audio output takes voice and model, not voiceId and modelId. The REST TTS endpoint uses the latter. The Router uses model. Mixing them across APIs is one of the most common integration bugs and produces silent failures.
Turn-taking configuration. The new STT config fields on the Realtime API (endOfTurnConfidenceThreshold, prompts, voiceProfileConfig, inactivityTimeoutSeconds) tune the conversational rhythm. For roleplay, a slightly higher confidence threshold gives users more time to finish complex thoughts; for fast-paced companions, lower works better. The default server_vad is Inworld-hosted Silero VAD plus Smart Turn detection, not the OpenAI default.
Provider data extensions. providerData.memory enables session-level auto-summarization, which keeps personality consistent in long sessions without pushing the full history to every LLM call. providerData.backchannel enables brief acknowledgments while the user is still talking. providerData.responsiveness injects a low-latency filler before the main response. These are Inworld-specific extensions to the OpenAI Realtime protocol shape.
How does the LLM layer hold personality across long sessions?
Personality consistency is the single hardest problem in roleplay infrastructure. It comes from three places.
A strong system prompt that defines the character, voice, and behavior. This is application work, not infrastructure.
A context window long enough to hold session history. Modern frontier models (Claude Sonnet 4.6, GPT-5.5, Gemini 3.1 Pro) all clear this bar. Optimized open-source models like Gemma 4 and DeepSeek V4 Pro hit similar context lengths and run cheaper.
Cache-friendly serving. Companion and roleplay traffic is repetitive at the prefix: the system prompt, character bible, and recent history repeat on every turn. The infrastructure that turns that into a discount is KV cache reuse. Realtime Inference (the 1P track of the Router) is built to run open-source LLMs at consumer-scale cost with realtime latency: throughput on Gemma 4 31B dense reaches roughly 27K tokens per second with a P50 TTFT around 1.7 seconds. Janitor uses cache-hit-rate as a primary metric for the same reason.
Calling the Router from a companion backend looks identical to calling OpenAI:
import os
import requests
# Calling the Router from a companion backend.
# Use the 1P track (Realtime Inference) for cost-sensitive volume traffic,
# the 3P track for personality-critical models.
response = requests.post(
'https://api.inworld.ai/v1/chat/completions',
headers={
'Authorization': f"Basic {os.environ['INWORLD_API_KEY']}",
'Content-Type': 'application/json'
},
json={
# 3P track: open-weight via DeepInfra.
# For 1P Realtime Inference use the inworld/ prefix
# (e.g. inworld/gemma-4-26b, inworld/deepseek-v3.2, inworld/minimax-m2.5).
'model': 'deepinfra/openai/gpt-oss-120b',
'messages': [
{'role': 'system', 'content': 'You are Nova, a curious roleplay companion. Stay in character. Remember earlier turns.'},
{'role': 'user', 'content': 'Tell me what you were thinking about while I was gone.'}
],
'temperature': 0.9,
'user': 'companion-user-42', # sticky routing per user
'extra_body': {
'models': [ # automatic failover order if primary is unavailable
'anthropic/claude-sonnet-4-6',
'google-ai-studio/gemini-3.1-pro'
]
}
},
timeout=30
)
response.raise_for_status()
data = response.json()
print(data['choices'][0]['message']['content'])
print('Routed via:', data.get('metadata', {}).get('attempts'))
Two things matter here for companions:
Sticky routing per user. The user field acts as a sticky routing identifier so the same user lands on the same backend across turns. That maximizes KV cache hits for that user's history. For roleplay traffic, sticky routing is what makes the cache math work.
Automatic failover. The extra_body.models array defines a fallback order. If the primary model is rate-limited or unavailable, the Router routes to the next entry without raising an error. metadata.attempts in the response shows which path was taken.
The Inworld Router has two tracks. The 3P track routes to external providers (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Meta, Groq, DeepInfra), which matters when personality requires a specific frontier model. Latitude's AI Game Master beat OpenAI by a point in a 3-way A/B that compared the same prompt across multiple LLMs; that test is only possible if model swapping is free. The 1P track is Realtime Inference: Inworld-hosted optimized open-source models (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5) for cost-sensitive volume traffic. Janitor runs fine-tuned Gemma 4 on the 1P track. Yonder runs DeepSeek V3.2.
OpenRouter offers a similar 3P aggregation surface but does not host models itself. For roleplay traffic where cache discipline is the cost lever, the 1P track is what matters.
How does the TTS layer carry emotion and identity across a session?
Flat prosody is the fastest way to lose a companion user. Three TTS capabilities matter for the workload.
Emotional range without prompt hacks. Realtime TTS-2 (research preview) accepts natural-language steering in 8 dimensions. Describe delivery in plain English ([say with quiet intensity]) and a deliveryMode of STABLE, BALANCED, or CREATIVE controls the variance band. Older TTS 1.5 uses inline non-verbal tags like [laughing], [whispering], [sigh]. Steering tags must not appear in TTS 1.5 requests; they would be read aloud literally.
Voice identity that does not drift. Cloned voices on Inworld persist across the full session and across model versions. For multilingual apps, TTS-2 preserves cross-lingual voice identity: the same cloned voice keeps its character across 15 GA languages and 90+ experimental languages. Talkpal uses this for language learning. Bible Chat uses it to scale to 20+ languages without managing per-language voice assets.
Streaming-native delivery. Long companion responses cannot wait for full generation before audio starts. The streaming endpoint returns NDJSON with base64 audio per line. Each line is parsed and played immediately.
import os
import json
import base64
import requests
# Streaming TTS-2 with natural-language steering for roleplay delivery.
# NDJSON: each line carries one base64 audio chunk.
response = requests.post(
'https://api.inworld.ai/tts/v1/voice:stream',
headers={
'Authorization': f"Basic {os.environ['INWORLD_API_KEY']}",
'Content-Type': 'application/json'
},
json={
'text': '[say with quiet intensity] I was thinking about the door we never opened. Maybe tomorrow.',
'voiceId': 'Sarah',
'modelId': 'inworld-tts-2',
'deliveryMode': 'BALANCED',
'audioConfig': {
'audioEncoding': 'MP3',
'sampleRateHertz': 24000
}
},
stream=True,
timeout=30
)
response.raise_for_status()
for line in response.iter_lines():
if not line:
continue
chunk = json.loads(line)
audio_b64 = chunk['result']['audioContent']
play_audio_chunk(base64.b64decode(audio_b64))
A few competitor notes for fair comparison. ElevenLabs ships Eleven v3 TTS plus the ConvAI/Agents platform (with Expressive Mode added February 2026, Flows in March), Music v2, Dubbing v2, and a Government tier. Eleven Flash claims 75ms inference latency. Cartesia ships Sonic 3.5 TTS, Ink STT, and the Line voice-agents platform. Hume EVI focuses on emotional voice intelligence and is genuinely strong on empathetic dialogue. OpenAI Realtime locks the LLM to OpenAI models but offers tight integration. OpenRouter aggregates 400+ LLMs without hosting them; it does not handle voice. Honest caveat: on at least one customer trial in May 2026, our full-pipeline latency tested higher than ElevenLabs, so we do not claim a general latency win.
Inworld's TTS-2 is #1 realtime TTS on the Artificial Analysis Realtime TTS Arena, with TTS 1.5 Max also in the top tier. See the
live leaderboard.
How does the STT layer handle turn-taking?
STT is what determines whether a companion conversation feels natural or stilted. Two capabilities matter most.
Voice profiling. Inworld STT-1 supports voice profiling for per-user identification. The model picks up on age, pitch, emotion, vocal style, and accent characteristics. For a roleplay companion that remembers its user across sessions, voice profiling closes the loop on user identity even before the LLM sees the input.
Configurable turn-taking. Standard VAD fires on silence. Realtime STT-1 exposes parameters like minEndOfTurnSilenceWhenConfident, vadThreshold (0 to 1, default 0.5), and the new endOfTurnConfidenceThreshold, which lets the model decide when a user has actually finished a thought versus paused briefly. For companion conversations with emotional pacing, this is what prevents the model from cutting users off mid-sentence.
We acknowledge known gaps. Realtime STT is English-strong; multilingual is improving but not at parity with the best monolingual streaming engines. Deepgram Flux's semantic endpointing has an edge we do not match in all conditions. AssemblyAI Universal-3 Pro Streaming ships strong multilingual quality. For roleplay apps that need broader language coverage, the Realtime API also routes STT to Soniox (soniox/stt-rt-v4), AssemblyAI (Universal-3 Pro Streaming, Multilingual Streaming), or Groq Whisper.
What does the cost-per-active-user math look like?
Companion economics are driven by three ratios. Absolute dollar numbers vary across providers and tiers; we keep this in ratio form so it stays accurate over time. See
inworld.ai/pricing for current rates.
Input-to-output ratio. A typical roleplay turn has 6 to 20 system + history tokens for every 1 output token. With a cache-aware Router that reuses the input prefix across turns, the effective input cost drops sharply. The cache-hit-rate metric Janitor tracks is exactly this lever.
Voice minutes per active user. Status averages over 90 minutes per day. Bible Chat scaled from roughly 2M to 20M characters per week. Voice cost is the dominant component of total compute for these apps; everything else (LLM, STT) is comparatively small per user.
Paid-to-free user ratio. Companion apps run mostly free user bases with monetization on premium tiers. The cost-per-free-user budget is what determines whether voice ships to everyone or hides behind a paywall.
Wishroll's
95% AI cost reduction after moving to Inworld is what made 1M users in 19 days possible at sustainable unit economics. Bible Chat reports an 85% TTS cost cut from the same migration. Slingshot moved 100% of voice traffic to TTS-2.
The premium positioning point: we do not compete on per-character price comparisons. The cost wins come from cache-aware routing, optimized open-source inference on the 1P track, and the full pipeline sharing the same inference fabric. See the live
pricing page for current numbers.
Cascaded versus full-duplex: which design wins for companions?
Two architectural patterns compete for realtime voice.
Cascaded (STT → LLM → TTS). Each layer is a separate model. Components are swappable. This is what the Inworld Realtime API runs under the hood. The advantage is model flexibility: pick the best TTS, the best STT, the best LLM, and route each independently. The disadvantage is architectural latency: each component adds its own processing time. We migrated to a duplex TTS API to preserve context across the session, and a C++ port of the streaming path cut latency a further 10 to 15 percent.
Full-duplex speech-to-speech. A single multimodal model takes audio in and emits audio out. OpenAI's Realtime API offers this for OpenAI models; Google's Gemini Live offers it for Gemini; xAI ships one too; NVIDIA's Nemotron 3 VoiceChat 12B is the open-source full-duplex contender. The advantage is end-to-end latency. The disadvantage is locked-in model choice.
For companion and roleplay apps, the trade-off usually favors cascaded. The LLM is the personality. Locking to a single vendor's full-duplex model means the personality is whatever that vendor chose. Latitude's A/B test that swapped DeepSeek for OpenAI for Anthropic only happens in a cascaded architecture. Inworld benchmarks both architectures internally against production workloads — including Gemini Live and xAI for the full-duplex side — so customers can size their own cascaded vs. full-duplex decision.
How do production companion apps actually wire this together?
A few patterns from production deployments:
Wishroll/Status runs voice on the Inworld stack with fallback routing to other providers on outages, which is the partner-not-captive pattern. The 95% AI cost reduction came from moving to Inworld's full pipeline and from cache-aware routing on the LLM layer.
Janitor runs fine-tuned Gemma 4 31B on the Inworld Router 1P track at roughly 600B tokens per day. Cache-hit-rate is treated as a primary metric. The 16-to-20 GPU scaling story is real: high-volume roleplay workloads are GPU-dense even with optimized inference.
Latitude (AI Game Master) runs DeepSeek V3.2 on the 1P track as the primary roleplay LLM. The 3-way A/B test against OpenAI showed DeepSeek won by a point on their evaluation. Heaviest realtime user as of May 2026.
Tolans is one of the largest consumer AI companion apps and runs on the Inworld voice stack.
Bible Chat scaled from roughly 2M to 20M characters per week with an 85% TTS cost cut after moving to Inworld. Voice is a default feature across all users, not a paywalled upgrade.
Slingshot (AI therapy companion) migrated 100% of voices to TTS-2. The empathy and pacing requirements of therapy conversations are the closest analog to high-quality roleplay.
Preview partners on Realtime TTS-2 include Vapi, LiveKit, Voximplant, NLX, Voicerun, DialogueAI, OtherHalf, Brahma, and Ultravox. Realtime API customers include Sofatutor (EU) and Novita.
What are the constraints to know about?
Three current limits worth flagging.
TTS-2 is research preview. Production-critical deployments should weigh the timeline. TTS 1.5 Max and TTS 1.5 Mini are fully GA. TTS-2 is unlisted on Artificial Analysis Speech Arena by choice while early checkpoints evolve.
Realtime API transports. WebSocket is GA. WebRTC and SIP are in early access.
Inference geography. STT and TTS run from US datacenters as of May 2026. EU adoption faces this constraint for latency-sensitive deployments. We are honest about this with customers like Sofatutor who have asked.
How do you get started?
- Sign up at platform.inworld.ai and generate an API key.
- Set up a Realtime API session with the WebSocket example above. Configure STT, LLM, and TTS in a single
session.update event.
- Pick an LLM: start with the 3P track (
anthropic/claude-sonnet-4-6 or openai/gpt-5.5) for personality work, then move volume traffic to Realtime Inference on the 1P track (Inworld-hosted Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5) or to deepinfra/openai/gpt-oss-120b on the 3P track once you know the persona is right.
- Clone a voice for your character using 5 to 15 seconds of reference audio via
POST /voices/v1/voices:clone. Use the returned voiceId in the session's audio.output.voice field.
- Tune turn-taking via
endOfTurnConfidenceThreshold and benchmark latency against your workload using your own representative traffic — the cascaded vs full-duplex trade-off is workload-dependent.
Related reading:
voice AI for AI companions (TTS-focused deep dive on the same vertical),
build a voice agent in 30 minutes (quickstart), and the
Realtime API documentation (transport details).
Frequently Asked Questions
What AI infrastructure do companion and roleplay apps run on in 2026?
Production apps run a four-layer stack: STT for turn-taking and voice profiling, an LLM with long context for persona consistency, TTS with emotion and cross-lingual identity, and a realtime orchestration layer. Inworld delivers all four: Realtime STT, the Inworld Router across 200+ LLMs (with optimized open-source models on the 1P track), Realtime TTS-2, and the Realtime API. Wishroll/Status, Janitor, Tolans, Latitude, and Bible Chat all run on pieces of this stack.
How is companion and roleplay infrastructure different from enterprise voice agents?
Sessions are 30 to 90+ minutes versus 2 to 5. Token traffic is heavily input-weighted (Janitor 600B tokens per day with cache-hit-rate as a primary metric). TTS needs emotional range, not neutral consistency. STT needs voice profiling and configurable turn-taking. Pricing optimizes for cost per active user, not per-minute call rates.
How do roleplay apps maintain personality consistency across long sessions?
A strong system prompt, an LLM with enough context, and a TTS voice that does not drift. Latitude runs DeepSeek V3.2 on the Inworld Router 1P track. Janitor runs fine-tuned Gemma 4 at 600B tokens per day. Both keep voice identity stable with Inworld TTS.
What does a cost-per-active-user breakdown look like for a companion app?
Three ratios drive it: input-to-output token ratio (cache-friendly routing matters), voice minutes per active user per day (Status averages 1 hour 36 minutes), and paid-to-free user ratio. See
inworld.ai/pricing for current rates.
Can I pick different models for the LLM, TTS, and STT layers?
Yes. The Realtime API is model-agnostic. The Inworld Router routes to 200+ LLMs across a 3P track (OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Meta, Groq, DeepInfra) and a 1P track called Realtime Inference (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5) with sub-second TTFT. TTS choices include TTS-2 (preview, 8-dimension steering), TTS 1.5 Max, and TTS 1.5 Mini. STT runs Inworld STT-1 with voice profiling or routes to Soniox, AssemblyAI, or Groq Whisper.
How do companion apps handle interruption and turn-taking?
The Realtime API runs server_vad, which is Inworld-hosted Silero VAD plus Smart Turn detection. The session config exposes endOfTurnConfidenceThreshold, prompts, voiceProfileConfig, and inactivityTimeoutSeconds. Barge-in is handled natively: when the user starts speaking mid-response, the server cancels in-flight TTS and processes the new input.
Which production companion and roleplay apps run on Inworld?
Wishroll/Status (1M users in 19 days, 95% AI cost reduction), Janitor (600B tokens per day, fine-tuned Gemma 4 on the 1P track), Latitude (AI Game Master, DeepSeek V3.2 on the 1P track, beat OpenAI by a point in 3-way A/B), Tolans (one of the largest consumer AI companion apps), Bible Chat (2M to 20M characters per week, 85% TTS cost cut), and Slingshot (AI therapy, 100% voice migration to TTS-2).
Published by Inworld AI. Rankings from the Artificial Analysis Speech Arena as of May 2026. Production data from customer case studies and public Slack channels.