Last updated: May 28, 2026
Inworld AI builds the realtime voice stack behind production AI character apps including Janitor, Latitude, Tolans, and Bible Chat. The best voice AI for AI character apps in 2026 has to do four things at once: hold a distinctive persona over hours of conversation, give every character its own recognizable voice, run on a different LLM personality per character without rewriting the client, and survive free-tier economics where most users never pay. Realtime TTS-2 (research preview) is the
#1 realtime TTS on the Artificial Analysis Speech Arena, and the same fabric powers Janitor's 600B-token-per-day Router traffic and Latitude's DeepSeek V3.2 deployment.
This page is about the character-app stack specifically. Not general companionship, not enterprise voice agents. Character apps where the character is the product, often user-generated, often roleplay-heavy. Below: what character apps need from voice, how the Inworld Realtime API addresses each requirement, and working code for the persona-driven pieces of the stack.
Quality ranking from the Artificial Analysis Speech Arena, May 2026. Competitor product detail from public product pages as of May 2026.
What is a character app, and how is it different from a companion app?
The two categories overlap, but the engineering shape is different.
A companion app typically ships with a small, curated set of personas users grow attached to over time. Tolans is in this lane, as is Bible Chat. Engagement is deep, long, and emotionally consistent. The voice has to carry warmth and continuity across hundreds of sessions with the same character.
A character app is a platform where the character is the product. Often user-generated. Often hundreds or thousands of personas per app. Often roleplay-heavy. Janitor and Latitude are character apps in this sense. The engineering differences:
- Persona isolation. Character A's prompt cannot bleed into Character B's voice or behavior. System prompts, voices, and memory must be scoped tightly.
- Distinct voices at scale. Every character needs a recognizable voice without commissioning voice actors per persona. Voice cloning is the only thing that scales.
- Free-tier economics. Most users in a character app never pay. Per-character per-token transparent economics matter more than headline rates.
- Personality variance across the LLM layer. Different characters benefit from different LLMs. A noir detective sounds different on Claude than on DeepSeek. Locking to one model family flattens the personality space.
Character chat is one of the
three verticals Inworld optimizes for, alongside companions and roleplay. The Realtime API, Realtime TTS-2, and Realtime Router were built with these workloads in mind.
How do you give every character its own voice?
Voice identity is the single biggest factor in whether a character feels real over time. If the voice drifts between sessions, the persona collapses.
Zero-shot voice cloning from 5 to 15 seconds of reference audio is the standard pattern. Clone the voice once via /voices/v1/voices:clone, then reuse the returned voiceId in every subsequent TTS call. The cloned voice persists across sessions and reproduces consistently on every request.
import requests
import base64
import os
INWORLD_API_KEY = os.environ['INWORLD_API_KEY']
# Step 1: Clone a voice from 5-15 seconds of reference audio
with open('vesper_reference.wav', 'rb') as f:
audio_b64 = base64.b64encode(f.read()).decode('utf-8')
clone_response = requests.post(
'https://api.inworld.ai/voices/v1/voices:clone',
headers={
'Authorization': f'Basic {INWORLD_API_KEY}',
'Content-Type': 'application/json'
},
json={
'displayName': 'Captain Vesper',
'langCode': 'EN_US',
'voiceSamples': [{'audioData': audio_b64}]
},
timeout=60
)
clone_response.raise_for_status()
cloned_voice_id = clone_response.json()['voice']['voiceId']
# Step 2: Use the cloned voice with TTS-2 natural-language steering
response = requests.post(
'https://api.inworld.ai/tts/v1/voice',
headers={
'Authorization': f'Basic {INWORLD_API_KEY}',
'Content-Type': 'application/json'
},
json={
'text': '[say with dry amusement] So you finally got the warp coil online. Took you long enough, Captain.',
'voiceId': cloned_voice_id,
'modelId': 'inworld-tts-2',
'deliveryMode': 'BALANCED',
'audioConfig': {
'audioEncoding': 'MP3',
'sampleRateHertz': 24000
}
},
timeout=30
)
response.raise_for_status()
audio_bytes = base64.b64decode(response.json()['audioContent'])
with open('vesper_line.mp3', 'wb') as f:
f.write(audio_bytes)
Three details matter for character apps specifically:
Cross-lingual voice identity. TTS-2 preserves the same character voice across 100+ languages (15 GA, 90+ experimental). A user roleplaying a French-speaking diplomat in one session and switching to English in the next gets the same voice, not a different one for each language. The 15 GA languages have the highest quality bar; the 90+ experimental languages broaden coverage without losing identity.
Reference audio quality matters. Clean speech, minimal background noise, 5 to 15 seconds at the higher end of that range. For character apps where users upload their own audio, the cloning endpoint accepts noisy reference audio with optional audioProcessingConfig.removeBackgroundNoise, but cleaner input always produces a more faithful clone.
TTS-2 steering as the personality layer. Once the voice is cloned, TTS-2 natural-language steering across 8 dimensions (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) controls how each line is delivered. Captain Vesper can be [say with dry amusement] in one line and [say in a hushed warning tone] in the next, on the same voice, without remixing audio. The deliveryMode field (STABLE, BALANCED, CREATIVE) controls expressive variance per request.
TTS-2 is research preview. For deployments that need strict GA guarantees, TTS 1.5 Max and 1.5 Mini are fully GA and support voice cloning and inline non-verbal tags ([laugh], [breathe], [sigh]).
Which LLM should each character run on?
This is the question character-app teams underestimate. Personality lives in the LLM. Locking every character to one model flattens the persona space.
The
Realtime Router routes to 200+ LLMs in one API with two tracks:
- 3P track. External providers: OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, DeepInfra.
gpt-oss-120b is routable here via deepinfra/openai/gpt-oss-120b.
- 1P track (Realtime Inference). Inworld-hosted, optimized open-source models with sub-second TTFT. The stack is vLLM + FlashInfer + speculative decoding + KV cache. Confirmed 1P models: Gemma 4 (26B / 31B NVFP4), DeepSeek V3.2 / V4, MiniMax-M2.5.
Two production patterns from anchor character apps:
Janitor. Runs a fine-tuned Gemma on the 1P track. Roughly 600 billion tokens per day. The choice of a fine-tuned Gemma reflects the character-app pattern: a strong, controllable base model fine-tuned on roleplay distributions performs better at staying in character than a frontier general-purpose model.
Latitude (AI Game Master). Runs DeepSeek V3.2 in a dedicated cluster. In a 3-way A/B test, DeepSeek V3.2 on Inworld beat OpenAI by a point. Latitude is the heaviest realtime user on the platform.
Same Router, different personalities. Switching between them is a model field change:
import requests
import os
INWORLD_API_KEY = os.environ['INWORLD_API_KEY']
# Pick the LLM personality per character: Janitor pattern (Gemma) vs Latitude pattern (DeepSeek)
response = requests.post(
'https://api.inworld.ai/v1/chat/completions',
headers={
'Authorization': f'Basic {INWORLD_API_KEY}',
'Content-Type': 'application/json'
},
json={
'model': 'deepseek/deepseek-v4-pro',
'messages': [
{'role': 'system', 'content': 'You are Captain Vesper, a sharp-tongued starship navigator. Stay in character.'},
{'role': 'user', 'content': 'Where are we headed?'}
],
'user': 'vesper-character-id-42',
'temperature': 0.9
},
timeout=60
)
response.raise_for_status()
print(response.json()['choices'][0]['message']['content'])
The Router's user parameter provides sticky routing per character ID, which is useful for cache-hit-rate optimization at character-app scale (cache hits drop dramatically when prompts rotate across characters without identity hints).
For specific personalities:
- Claude Sonnet 4.6 or Opus 4.7 for nuanced, empathetic, long-context characters
- DeepSeek V4 Pro for cost-disciplined roleplay where character-faithful generation matters more than reasoning depth
- Gemma 4 (1P track / Realtime Inference) for fine-tuned, in-house character distributions with sub-second TTFT
- MiniMax-M2.5 (1P track / Realtime Inference) for agentic, tool-use-heavy characters
deepinfra/openai/gpt-oss-120b (3P track) when an open-weights frontier-class model is required
- GPT-5.5 for creative storytelling and improvisation
A/B test against actual retention data. Different personas tend to favor different models in ways prompt-engineering benchmarks do not predict.
How do you wire the full character voice pipeline?
The
Realtime API collapses STT, LLM, TTS, voice activity detection, turn-taking, and interruption handling into a single WebSocket or WebRTC connection. Audio in, audio out. The LLM choice flows through the same connection via the Router.
const WebSocket = require('ws');
const ws = new WebSocket(
`wss://api.inworld.ai/api/v1/realtime/session?key=character-${Date.now()}&protocol=realtime`,
{ headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);
ws.on('open', () => console.log('Connected'));
ws.on('message', (raw) => {
const msg = JSON.parse(raw.toString());
if (msg.type === 'session.created') {
// Configure a character session: persona, voice, LLM, semantic VAD
ws.send(JSON.stringify({
type: 'session.update',
session: {
instructions: 'You are Captain Vesper, a sharp-tongued starship navigator with a dry sense of humor. Stay in character. Reference past conversations when relevant.',
model: 'deepseek/deepseek-v4-pro',
audio: {
input: {
transcription: { model: 'inworld/inworld-stt-1' },
turn_detection: {
type: 'semantic_vad',
eagerness: 'medium',
interrupt_response: true
}
},
output: {
voice: 'vesper-cloned',
model: 'inworld-tts-2'
}
}
}
}));
}
if (msg.type === 'response.output_audio.delta') {
playAudio(msg.delta); // base64-encoded PCM16
}
});
function sendMicrophoneChunk(audioChunk) {
ws.send(JSON.stringify({
type: 'input_audio_buffer.append',
audio: audioChunk
}));
}
Three pieces of this configuration matter for character apps:
server_vad uses Inworld's own Silero VAD + Smart Turn detector, not the OpenAI default. The Smart Turn detector is tuned for character-app turn-taking patterns where users pause for dramatic effect or to think through a roleplay decision. Semantic VAD (semantic_vad with eagerness: 'medium') listens to what the user is saying to decide when they are done, instead of triggering on silence alone.
Interruption handling with interrupt_response: true enables barge-in. When a user interrupts a character mid-line, the system cancels in-flight TTS generation and pivots to the new input. Without barge-in, the character finishes its line before responding, which breaks roleplay flow.
Per-character session config. The instructions field carries the character system prompt. The model field selects the LLM. The voice and model fields inside session.audio.output select the TTS voice and TTS model. (Note: the Realtime WebSocket uses voice and model, while the REST TTS API uses voiceId and modelId. Different APIs, different field names.)
For new STT configurations, the Realtime API now accepts endOfTurnConfidenceThreshold, prompts, voiceProfileConfig, and inactivityTimeoutSeconds inside audio.input for finer turn-taking control.
The Realtime API is WebSocket GA, WebRTC early access. Honest caveat: at least one customer trial in May 2026 flagged that pipeline latency on the Realtime API can run higher than ElevenLabs in some configurations, so for production deployments where every millisecond matters, measure against your specific pipeline before committing.
How do you keep character memory and persona persistent?
Character apps need two kinds of memory: in-session and cross-session.
In-session memory is the conversation history inside one WebSocket session. Standard. The Realtime API maintains conversation items, and providerData.memory enables auto-summarization for long sessions.
Cross-session memory is what makes a character feel like it remembers. The pattern most character apps converge on:
- Stable character ID as the
user parameter in Router calls. Sticky routing improves cache hit rate (Janitor tracks cache hit rate as a primary metric for cost discipline).
- External memory store keyed on
(user_id, character_id). Summarize recent sessions into the system prompt at session start. Store the summary in your own database, not in the LLM context.
- Per-character voice ID so the voice identity is stable even if the LLM model changes.
- Per-character system prompt template that interpolates the user-specific memory summary at session creation.
This pattern keeps the LLM context window manageable while making characters feel persistent. It is also LLM-agnostic, so the same memory store works whether a character runs on DeepSeek today and Claude tomorrow.
For very long-running characters, log conversation summaries on session end and roll them into the next session's system prompt. The Realtime API's providerData.memory can automate part of this with auto-summarization settings.
What does voice cost in a character app?
Character apps live and die on free-tier economics. Most users never pay. The character-app stack has to be cost-disciplined by default.
Three production data points:
Janitor. Roughly 600 billion tokens per day on the Realtime Router. Cache hit rate is tracked as a primary metric. Sticky routing per character ID on the Router improves cache reuse.
Wishroll/Status. Reached 1 million users in 19 days. Cut total AI costs by 95% after switching to Inworld's infrastructure. Status users average 1 hour 36 minutes of daily engagement.
Bible Chat. Scaled voice features from roughly 2 million to 20 million characters per week with an 85% TTS cost cut after switching.
The cost levers character apps actually pull:
- Per-product per-token transparent pricing. No bundling, no markup on routed models. See inworld.ai/pricing for current rates.
- 1P track Inworld-hosted models (Realtime Inference) — optimized Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5 — for cost-disciplined character LLMs with sub-second TTFT.
- TTS 1.5 Mini for high-volume free-tier traffic, TTS-2 or 1.5 Max for premium tiers where users opt into the best voice.
- Streaming TTS so audio playback starts within sub-second time-to-first-audio regardless of total response length.
Premium positioning is the right frame here: model quality, realtime latency, full pipeline integration, and developer experience are the levers, not headline price.
How does this compare to the alternatives?
Each alternative handles one slice of the character-app problem well.
ElevenLabs ConvAI/Agents pairs Eleven v3 TTS, which is genuinely strong, with BYO LLM through their agent platform. Expressive Mode (added Feb 2026) added emotional steering on Agents. Strong on voice quality, fewer levers on LLM-side persona variance, credit-based pricing. ElevenLabs also has Flows (Mar 11) and a Government tier (Feb 11), so the platform is broader than TTS alone.
Cartesia Line pairs Sonic 3.5 TTS with Cartesia's agent platform. Sonic 3.5 is fast and competitive on quality. Line is newer; less production data on character-app workloads.
Hume EVI specializes in emotional voice intelligence (reading the user's emotional state from voice input). That is a different angle from persona generation. EVI is interruptible and BYO-LLM compatible, but the differentiation is on the input understanding side, not the character generation side.
OpenAI Realtime (gpt-realtime) locks you to OpenAI models. For character apps, that flattens the personality space, since you cannot route to Claude for a thoughtful character, DeepSeek for a cost-disciplined character, and Gemma for a fine-tuned roleplay character on the same platform.
Character.AI's own infrastructure is closed. Not available to external developers.
Inworld's bundle is: model-agnostic Realtime API across 200+ LLMs, the #1 realtime TTS on the Artificial Analysis Speech Arena, voice cloning with cross-lingual identity preservation, and the same fabric Janitor and Latitude run on at scale. The right choice depends on whether you want vendor lock-in or per-character model flexibility.
What constraints should I know about?
Honest constraints worth designing around:
- TTS-2 is research preview. The 8-dimension steering and cross-lingual identity are usable today, but the model is not yet GA. TTS 1.5 Max and Mini are fully GA fallbacks.
- Realtime API is WebSocket GA, WebRTC early access. SIP is also early access.
- Pipeline latency varies by configuration. At least one customer trial in May 2026 flagged that Realtime API latency in some configurations runs higher than ElevenLabs. Measure on your specific pipeline.
- STT is multilingual but English remains strongest at production scale. For non-English-first character apps, validate STT accuracy before committing.
- Inference is US-first. EU-first character apps with strict data-residency requirements should factor this into the timeline.
- TTS-2 steering is English-only. The instruction tag is English even when the spoken language is not. The voice still speaks the target language; only the instruction syntax is English.
How do you get started building a character app on Inworld?
- Sign up at platform.inworld.ai and generate an API key
- Clone a voice for your first character using 5 to 15 seconds of reference audio via
/voices/v1/voices:clone
- Pick the LLM for that character via the Realtime Router. Try DeepSeek V4 Pro or a 1P-track Gemma-4 first, then A/B test against frontier models
- Wire a Realtime API session using the WebSocket example above. Set the character's voice in
session.audio.output.voice and the LLM in session.model
- Add cross-session memory by keying summaries on
(user_id, character_id) and interpolating into the system prompt at session start
- Measure cache hit rate and per-character cost as your primary cost metrics
The
TTS API quickstart covers the REST endpoint in detail. The
Realtime API documentation covers session management, semantic VAD, and transport options. For teams migrating from OpenAI Realtime, the Realtime API follows OpenAI's Realtime protocol extended via a
providerData block, so the event schema is largely compatible.
Frequently Asked Questions
What is the right voice AI stack for AI character apps in 2026?
For AI character apps where the character is the product, the production stack pairs the Inworld Realtime API for bidirectional voice with Realtime TTS-2 (research preview, the #1 realtime TTS on the Artificial Analysis Speech Arena) and the Realtime Router across 200+ LLMs. Voice cloning fixes the character's identity; TTS-2 natural-language steering across 8 dimensions (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) adds the expressive range a persona needs to feel real over long sessions. The Router lets each character run on a different LLM personality (Janitor uses a fine-tuned Gemma, Latitude uses DeepSeek V3.2) without changing client code.
How is a character app different from a companion app?
Companion apps optimize for general companionship with one or a few personas users grow attached to. Character apps are platforms where the character is the product, often user-generated, often hundreds or thousands of personas per app, often roleplay-heavy. Janitor and Latitude are character apps. Tolans and Bible Chat sit closer to the companion line. The engineering differences are persona isolation, distinct voice identities at scale, and free-tier economics that work when 90%+ of users never pay.
Which LLM should I use for a character app personality?
Different personalities suit different LLMs. The Realtime Router routes to 200+ models in one API, so each character can use the best model for its persona. The Router has two tracks. The 3P track covers external providers (OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, DeepInfra; gpt-oss-120b via DeepInfra). The 1P track is Realtime Inference: Inworld-hosted, optimized open-source models with sub-second TTFT — Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5. Janitor runs on a fine-tuned Gemma. Latitude runs on DeepSeek V3.2 and beat OpenAI by a point in a 3-way A/B test. Both apps use the same Router.
How does TTS-2 give each character a distinctive voice?
Realtime TTS-2 (research preview) accepts natural-language steering across 8 dimensions: emotion, articulation, intonation, volume, pitch, range, speed, and vocal style. A deliveryMode field of STABLE, BALANCED, or CREATIVE controls expressive variance. Cross-lingual voice identity is preserved across languages, so the same character voice carries through 100+ languages (15 GA, 90+ experimental). Zero-shot voice cloning from 5 to 15 seconds of reference audio fixes each character's identity and reproduces consistently on every request.
What does voice cost at character-app scale?
Character apps have free-tier economics. Most users never pay, but they generate hours of voice traffic. Per-character per-token transparent pricing matters more than headline rates. See
inworld.ai/pricing for current Inworld rates. Production data points: Wishroll/Status reached 1 million users in 19 days and cut total AI costs by 95% on Inworld. Bible Chat scaled voice features from roughly 2M to 20M characters per week with an 85% TTS cost cut after switching. Janitor runs roughly 600 billion tokens per day on the Router.
How does the Inworld Realtime API compare to OpenAI Realtime, ElevenLabs Agents, Cartesia Line, and Hume EVI for character apps?
Each handles a different slice of the problem. OpenAI's Realtime API (gpt-realtime) locks you to OpenAI models, which limits character personality variance. ElevenLabs ConvAI/Agents pairs Eleven v3 TTS with BYO LLM and is strong on voice quality. Cartesia's Line agent platform pairs Sonic 3.5 TTS with their own orchestration. Hume EVI specializes in emotional input understanding. The Inworld Realtime API is model-agnostic by design across 200+ LLMs via the Realtime Router, uses #1-realtime-TTS quality on TTS-2, and is the same fabric Janitor and Latitude run on. The right choice depends on whether you want vendor lock-in or persona-level model flexibility.
Published by Inworld AI. Quality ranking from the Artificial Analysis Speech Arena (May 2026). Product details from public competitor pages as of May 2026 and may change.