Published 04.02.2026

Voice AI for AI Companions: How to Build Expressive, Low-Latency Voice Into Consumer Apps at Scale

Last updated: April 5, 2026
Inworld AI builds the voice infrastructure behind the largest AI companion apps in production. The best voice AI for companion apps in 2026 is one that delivers natural expressiveness, sub-200ms latency, voice identity through cloning, and costs that work when 30 minutes of daily engagement is the baseline and most users never pay. Inworld TTS holds the #1 ranking on the Artificial Analysis Speech Arena and powers companion apps serving hundreds of thousands of daily active users.
Below: what companion developers actually need from voice AI, how the Inworld Realtime API addresses each requirement, and working code for building a voice-first companion session from scratch.
Quality rankings from the Artificial Analysis Speech Arena, March 2026. Pricing reflects published rates as of April 2026.

What do AI companions need from voice that other use cases don't?

Companion voice is a different engineering problem than enterprise voice agents, content narration, or IVR systems. The constraints compound in ways that break solutions designed for those other categories.
Long, emotionally varied sessions. Users talk to companions for 30 minutes to over an hour. According to Wishroll's production data, Status users average 1 hour 36 minutes of daily engagement. A companion voice needs to carry warmth, humor, sarcasm, concern, and excitement across the same session. Enterprise TTS optimizes for consistent, neutral delivery. Companion TTS needs range.
Sub-200ms latency with interruption handling. Companions hold multi-turn conversations where users interrupt, change topics, and expect instant responses. Above 300ms, pauses feel like lag. The system needs to detect when a user starts talking mid-response, cancel in-flight audio generation, and pivot to the new input without finishing the old thought. This is voice activity detection (VAD) plus cancellation logic, not just fast TTS.
Consumer-scale unit economics. Consumer AI infrastructure exists because enterprise pricing doesn't work for consumer engagement patterns. A companion with 100K daily active users at 30 minutes of voice per day generates roughly 900 million characters per month. At ElevenLabs' $103-206/1M characters, that's $93K-185K monthly in TTS alone. Inworld's per-character cost is significantly lower (see pricing page). The cost difference determines whether voice is a feature every user gets or one locked behind a paywall that kills engagement.
Persistent voice identity across sessions. Users form attachment to a companion's voice. If the voice drifts or changes between sessions, users notice and engagement drops. Zero-shot voice cloning from seconds of reference audio creates a fixed identity. That identity needs to reproduce consistently on every request without degradation.
Streaming-native architecture. Companions generate responses token-by-token from the LLM. The TTS layer needs to start producing audio from the first tokens, not wait for the complete response. WebSocket streaming with no buffering step is what keeps multi-turn conversation fluid.
Model-agnostic by design. The best LLM for a companion today may not be the best in three months. Locking to a single model vendor means you can't A/B test whether Claude or GPT drives better retention, can't route to a cheaper model for simple responses, and can't fail over when a provider has an outage. The infrastructure should let you choose the best model for each component independently.

How does the Inworld Realtime API handle companion voice?

The Inworld Realtime API collapses the full companion voice pipeline into a single WebSocket or WebRTC connection. Audio goes in from the user's microphone. Audio comes back from the companion. STT, LLM reasoning, TTS generation, VAD, turn-taking, and interruption handling all happen server-side. (For a deeper comparison of speech-to-speech architectures, see the speech-to-speech API guide.)
For companion developers, this eliminates the infrastructure work of wiring together separate STT, LLM, and TTS services, building cancellation logic, managing streaming state, and handling the failure modes that compound across a multi-service pipeline.
Here is a working session setup for a companion character:
const WebSocket = require('ws');

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=companion-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

ws.on('open', () => console.log('Connected'));

ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());

  if (msg.type === 'session.created') {
    // Configure the companion session
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        instructions: 'You are Luna, a warm and curious companion. You remember previous conversations and ask thoughtful follow-up questions. Use [happy] and [laughing] tags when the mood is light.',
        audio: {
          output: {
            voice: 'Sarah',
            model: 'inworld-tts-1.5-max'
          }
        },
        input_audio_transcription: { model: 'inworld/inworld-stt-1' },
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',
          interrupt_response: true
        }
      }
    }));
  }

  if (msg.type === 'response.output_audio.delta') {
    // Play audio chunk to the user
    playAudio(msg.delta); // base64-encoded PCM16
  }

  if (msg.type === 'response.output_audio_transcript.delta') {
    // Display live captions
    updateCaptions(msg.delta);
  }
});

// Stream microphone audio to the companion
function sendMicrophoneChunk(audioChunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: audioChunk // base64-encoded PCM16 from mic
  }));
}
Three aspects of this configuration matter specifically for companions:
Voice and model selection in session.audio.output. The Realtime API uses voice and model fields inside the session's audio output configuration. This is different from the REST TTS endpoint, which uses voiceId and modelId in the request body. The voice persists for the duration of the WebSocket session, maintaining character identity across the entire conversation.
Semantic VAD with configurable eagerness. Standard VAD triggers on silence. Semantic VAD listens to what the user is saying to determine when they're done talking. The eagerness parameter (low, medium, high) controls the trade-off between fast responses and premature cutoffs. For companions, medium works well as a default. Lower eagerness gives the user more time to finish complex thoughts.
Interruption handling. Setting interrupt_response: true enables barge-in. When the user talks over the companion, the system cancels in-flight TTS generation and starts processing the new input. Without this, the companion "finishes its thought" before responding to what the user actually said, which breaks conversational flow.
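On the client side, barge-in still requires one piece of local logic: when the server signals that the user has started speaking, any queued-but-unplayed audio should be dropped. A minimal sketch of that state handling, assuming the OpenAI-compatible event name `input_audio_buffer.speech_started` (verify the exact event names against the Realtime API documentation):

```python
# Client-side barge-in sketch. The speech_started event name assumes the
# OpenAI-compatible schema; confirm against the Realtime API docs.
class PlaybackState:
    def __init__(self):
        self.queue = []          # pending base64 audio chunks
        self.interrupted = False

    def handle_event(self, msg: dict):
        if msg["type"] == "input_audio_buffer.speech_started":
            # User talked over the companion: drop queued audio
            # immediately instead of finishing the old thought.
            self.queue.clear()
            self.interrupted = True
        elif msg["type"] == "response.output_audio.delta":
            self.queue.append(msg["delta"])
            self.interrupted = False

state = PlaybackState()
state.handle_event({"type": "response.output_audio.delta", "delta": "UklG..."})
state.handle_event({"type": "input_audio_buffer.speech_started"})
print(len(state.queue), state.interrupted)  # 0 True
```

The server cancels in-flight generation; this local queue flush is only about not playing audio that was already delivered.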
The Realtime API is model-agnostic through the Inworld Router. The same session can route to OpenAI, Anthropic, Google, Mistral, or any of 200+ models. You can A/B test different LLMs against user engagement metrics without changing your client integration. If one provider has an outage, automatic failover routes to the next available model.

How do you add emotional expressiveness to a companion's voice?

Flat prosody kills companion immersion. Users expect the voice to reflect the emotional content of what the companion is saying. Inworld TTS supports two layers of expressiveness control.
Audio markup tags add explicit emotional direction inline with the text. The supported tags are: [happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]. These are experimental and English-only. Place them before the text segment they should affect:
import requests
import base64
import os

INWORLD_API_KEY = os.environ['INWORLD_API_KEY']

# Basic companion TTS with emotion markup
response = requests.post(
    'https://api.inworld.ai/tts/v1/voice',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'text': '[happy] Hey! I was just thinking about what you said yesterday about that hiking trail. Did you end up going?',
        'voiceId': 'Sarah',
        'modelId': 'inworld-tts-1.5-max'
    }
)

audio_bytes = base64.b64decode(response.json()['audioContent'])
with open('companion_greeting.wav', 'wb') as f:
    f.write(audio_bytes)

print(f'Generated {len(audio_bytes)} bytes of audio')
In a Realtime API session, the LLM system prompt can instruct the model to insert emotion tags contextually. A prompt like "Insert [happy] before cheerful responses and [whispering] for intimate moments" lets the LLM drive expressiveness automatically based on conversational context, without hardcoding emotion logic in the client.
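Concretely, that means folding the emotion-tag directive into the instructions field of the session.update message shown earlier. A sketch of building that payload (field names mirror the WebSocket example above):

```python
import json

# Build a session.update payload whose system prompt asks the LLM to
# insert emotion tags itself, based on conversational context.
def build_session_update(character_prompt: str) -> str:
    instructions = (
        character_prompt
        + " Insert [happy] before cheerful responses, [whispering] for "
          "intimate moments, and [laughing] when the user makes a joke."
    )
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "audio": {"output": {"voice": "Sarah", "model": "inworld-tts-1.5-max"}},
        },
    })

payload = build_session_update("You are Luna, a warm and curious companion.")
```

Because the tags travel inline with the generated text, no client-side emotion logic is needed; the TTS layer interprets them as it streams.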
Temperature and speed parameters provide per-character personality tuning. A companion character designed to be calm and thoughtful might use 0.8x speed. An energetic character might use 1.2x speed with higher temperature for more vocal variation. These parameters work on both the REST endpoint and within Realtime API sessions.
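One way to manage this per character is a small profile table merged into each request body. The `temperature` and `speed` field names below are assumptions for illustration; check the TTS API reference for the exact schema:

```python
# Per-character voice profiles. The temperature/speed field names are
# assumptions; verify against the TTS API reference before shipping.
CHARACTER_PROFILES = {
    "calm_mentor":   {"speed": 0.8, "temperature": 0.6},
    "energetic_pal": {"speed": 1.2, "temperature": 1.1},
}

def tts_request_body(text: str, voice_id: str, profile_name: str) -> dict:
    """Merge a character's tuning profile into a REST TTS request body."""
    profile = CHARACTER_PROFILES[profile_name]
    return {
        "text": text,
        "voiceId": voice_id,
        "modelId": "inworld-tts-1.5-max",
        **profile,
    }

body = tts_request_body("Take a breath. We have time.", "Sarah", "calm_mentor")
```

Keeping the tuning in one table means a character sounds the same whether the request comes from the REST path or a Realtime session.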
Non-verbal audio cues like [sigh], [laugh], [breathe], and [cough] add texture that makes the voice feel present rather than reciting text. Combined with emotion tags, they produce the kind of vocal range that keeps users engaged through 30+ minute sessions.

How does voice cloning create character identity?

Every companion character needs a voice users can recognize across sessions. Zero-shot voice cloning from 5-15 seconds of reference audio is free for all Inworld users. No per-clone licensing fees, no tier gating.
import requests
import base64
import os

INWORLD_API_KEY = os.environ['INWORLD_API_KEY']

# Step 1: Clone a voice from 5-15 seconds of reference audio
with open('character_voice_sample.wav', 'rb') as f:
    audio_b64 = base64.b64encode(f.read()).decode('utf-8')

clone_response = requests.post(
    'https://api.inworld.ai/voices/v1/voices:clone',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'displayName': 'CompanionVoice',
        'langCode': 'EN_US',
        'voiceSamples': [{'audioData': audio_b64}]
    }
)

cloned_voice_id = clone_response.json()['voice']['voiceId']
print(f'Cloned voice ID: {cloned_voice_id}')

# Step 2: Generate speech in the cloned voice
response = requests.post(
    'https://api.inworld.ai/tts/v1/voice',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'text': '[whispering] I have something important to tell you...',
        'voiceId': cloned_voice_id,
        'modelId': 'inworld-tts-1.5-max'
    }
)

audio_bytes = base64.b64decode(response.json()['audioContent'])
with open('cloned_companion_voice.wav', 'wb') as f:
    f.write(audio_bytes)
The cloned voice maintains consistency across sessions and works with both the REST TTS endpoint and the Realtime API. Upload a clean reference sample (clear speech, minimal background noise), and the cloned voice persists as a reusable identity.
For companion apps with multiple characters, this means each character gets a distinct, recognizable voice without commissioning voice actors or managing audio assets. Clone the voice once via /voices/v1/voices:clone, then use the returned voiceId in all subsequent TTS calls for consistent reproduction.
Voice cloning also works across Inworld's 15 supported languages. Cloning is fully supported in English; crosslingual voice cloning (using the same cloned voice in a language different from the reference audio) is experimental.

What latency targets matter for companion conversations?

Latency in companion apps is more nuanced than a single number. Three measurements matter:
Time-to-first-audio (TTFA) is the delay between when the user stops speaking and when the first audio frame of the companion's response reaches the client. This is the number users feel. Inworld TTS 1.5 Max delivers sub-250ms P90. Mini delivers sub-130ms P90. These are end-to-end measurements including network overhead, not inference-only benchmarks.
End-to-end pipeline latency includes VAD processing time, STT transcription, LLM token generation, TTS synthesis, and network transit. With the Realtime API handling all stages server-side, the pipeline stages overlap. The TTS starts generating audio from the LLM's first tokens while the model is still producing the rest of the response. This overlapping architecture is what makes sub-300ms end-to-end feasible.
Interruption recovery time is how quickly the system pivots when a user talks over the companion. This includes detecting the interruption, canceling in-flight TTS generation, resetting stream state, and beginning to process the new input. The Realtime API's semantic VAD handles this natively. Without it, developers build custom cancellation logic that typically adds 200-500ms to the pivot.
For reference, research on conversational turn-taking shows that typical human response gaps average around 200ms (Stivers et al., PNAS, 2009). Anything consistently above 300ms feels unnatural. Below 200ms, users stop noticing the AI is generating speech and the conversation feels natural.
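TTFA is easy to measure from the client: timestamp the request, then the first non-empty chunk. A minimal helper that works with any chunk iterator (such as `response.iter_lines()` from the streaming example later in this article), shown here against a simulated stream:

```python
import time

# Measure time-to-first-audio from the client's perspective.
def time_to_first_chunk(chunks):
    """Return (milliseconds until first non-empty chunk, that chunk)."""
    start = time.monotonic()
    for chunk in chunks:
        if chunk:  # skip keep-alive blank lines
            return (time.monotonic() - start) * 1000.0, chunk
    return None, None

# Simulated stream standing in for a real HTTP response iterator.
def fake_stream():
    time.sleep(0.05)  # pretend 50ms of network + server latency
    yield b'{"result": {"audioContent": "..."}}'

ttfa_ms, first = time_to_first_chunk(fake_stream())
print(f"TTFA: {ttfa_ms:.0f} ms")
```

Measuring at the client, rather than trusting vendor inference benchmarks, captures the network transit that users actually experience.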

How do you handle streaming for fluid companion conversations?

Companions generate long, variable-length responses. Waiting for the full response before starting audio playback adds seconds of latency. Streaming TTS generates and delivers audio chunk-by-chunk as text arrives from the LLM.
import requests
import base64
import json
import os

INWORLD_API_KEY = os.environ['INWORLD_API_KEY']

# Streaming TTS for low-latency companion responses
response = requests.post(
    'https://api.inworld.ai/tts/v1/voice:stream',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'text': '[happy] That sounds amazing! Tell me more about what happened after you got to the summit. I bet the view was incredible.',
        'voiceId': 'Sarah',
        'modelId': 'inworld-tts-1.5-max'
    },
    stream=True
)

audio_chunks = []
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        audio_data = base64.b64decode(chunk['result']['audioContent'])
        audio_chunks.append(audio_data)
        # Play each chunk immediately for lowest latency
        play_audio_chunk(audio_data)

print(f'Streamed {len(audio_chunks)} chunks')
The streaming endpoint (/tts/v1/voice:stream) returns NDJSON (newline-delimited JSON). Each line contains a JSON object with result.audioContent holding a base64-encoded audio chunk. Parse each line as it arrives, decode the base64, and play the audio immediately.
For the Realtime API, streaming is the default behavior. Audio chunks arrive as response.output_audio.delta events without any additional configuration. The server handles text chunking (splitting LLM output at sentence boundaries for optimal TTS processing) and streams audio as fast as it's generated.
Companion response length is unpredictable. An enterprise voice agent might generate 15-word confirmations. A companion might generate 200-word stories. Streaming ensures the user hears the first sentence within 250ms regardless of how long the full response turns out to be.

What does voice cost at companion scale?

Companion economics are inverted from enterprise. High engagement, mostly-free user bases, and long sessions mean per-user voice cost is the metric that determines whether the business model works.
Scenario: 100K daily active users, 30 minutes of voice per day (~900 million characters per month).
At 1 million DAU, those figures multiply by 10, to roughly 9 billion characters per month. The gap between providers at this scale is the difference between voice as a core feature and voice as a premium upsell that most users never experience.
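The arithmetic behind the scenario can be sketched in a few lines, using the article's example figures (the $103-206/1M rate is ElevenLabs' published pricing cited earlier, not Inworld's):

```python
# Monthly TTS cost sketch using the example figures from this article.
DAU = 100_000
CHARS_PER_MONTH = 900_000_000  # ~30 min of voice/day across 100K DAU

def monthly_tts_cost(chars: int, rate_per_million: float) -> float:
    """Monthly TTS spend at a flat per-million-character rate."""
    return chars / 1_000_000 * rate_per_million

low = monthly_tts_cost(CHARS_PER_MONTH, 103)   # $92,700
high = monthly_tts_cost(CHARS_PER_MONTH, 206)  # $185,400
print(f"${low:,.0f} - ${high:,.0f}/month, ${low / DAU:.2f}+/user")
```

Running the same function against any provider's per-character rate makes the per-user cost, and whether free-tier voice is viable, immediately visible.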
Wishroll's Status app demonstrates what this looks like in production. Before moving to Inworld's infrastructure, Status faced $12-15 per user per day in total AI costs. After the switch, they achieved a 95% cost reduction. With voice as a default feature rather than an upsell, engagement reached 1 hour 36 minutes of average daily usage, and the app grew to 500K+ daily active users.
Bible Chat scaled voice features to approximately 800K daily active users with over 90% cost reduction on TTS after switching to Inworld.

How do companion apps choose the right LLM for the voice pipeline?

The Inworld Realtime API is model-agnostic by design. Instead of locking to a single LLM provider, it routes through the Inworld Router, which provides unified access to 200+ models across OpenAI, Anthropic, Google, Mistral, xAI, Cerebras, and others through a single API key.
For companions:
Different models excel at different companion behaviors. Claude may produce more empathetic responses for emotional conversations. GPT-5 may handle creative storytelling better. A smaller, faster model may be sufficient for quick acknowledgments. The Router lets you evaluate these trade-offs with live A/B tests against actual user engagement data.
Costs vary across providers. A companion app handling millions of daily messages benefits from routing simple interactions to cheaper models and reserving frontier models for complex conversations. The Router's intelligent routing can optimize for cost, latency, or business outcomes (retention, engagement) based on developer-defined strategies.
Provider outages shouldn't break your product. With automatic failover across multiple LLM providers, a single vendor's downtime doesn't mean your companion goes silent. The Router handles failover transparently without client-side logic.
Choose the best model for each component instead of accepting whatever one vendor bundles. The TTS is Inworld's #1-ranked model. The STT is Inworld's streaming speech-to-text. The LLM is whichever model performs best for your specific companion's personality and user base.
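The Router's routing strategies are configured server-side, but the cost/quality trade-off they encode can be illustrated with a client-side sketch. The model names and heuristics below are placeholders, not real Router configuration:

```python
# Illustrative routing heuristic: cheap model for short acknowledgments,
# frontier model for emotionally loaded turns. Model names are placeholders.
EMOTIONAL_MARKERS = ("feel", "sad", "miss", "love", "worried")

def pick_model(user_message: str) -> str:
    """Choose a model tier based on a rough read of the user's turn."""
    if len(user_message.split()) <= 4:
        return "small-fast-model"      # quick acknowledgment, lowest cost
    if any(marker in user_message.lower() for marker in EMOTIONAL_MARKERS):
        return "frontier-model"        # empathy-heavy turn, best quality
    return "mid-tier-model"            # default balance

print(pick_model("ok thanks"))
print(pick_model("I really miss our long conversations from before"))
```

In production, this kind of policy lives in the Router's developer-defined strategy rather than the client, so it can be tuned against engagement metrics without app updates.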

What production companion apps run on Inworld's voice infrastructure?

Wishroll (Status) is a social AI companion app with 500K+ daily active users. Users spend an average of 1 hour 36 minutes per day in the app. Before Inworld, the app faced $12-15 per user per day in AI costs. On Inworld's infrastructure, Wishroll achieved 95% cost reduction while maintaining the voice quality and engagement levels that drive retention.
FlowGPT is a platform where users create and interact with AI characters. FlowGPT uses Inworld's voice infrastructure for character voice generation across its platform.
Bible Chat scaled voice features to approximately 800K daily active users and achieved over 90% cost reduction on TTS after moving to Inworld.
At these numbers, voice becomes a default feature rather than a premium add-on. #1-ranked TTS quality, sub-250ms latency, and single-digit dollars per million characters keep the economics viable at scale.

What are the technical constraints to know about?

Current limitations:
Language coverage. Inworld TTS supports 15 languages. For companions targeting languages outside this set, ElevenLabs (70+ languages) or Google Cloud TTS (75+ languages) offer broader coverage for specific markets.
Audio markup is experimental and English-only. The emotion tags ([happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]) work in production for English text. Multi-language emotion markup is not yet supported. The tags work reliably at the start of a generation. Multi-tag mid-text sequences are experimental.
Crosslingual voice cloning is experimental. Using a voice cloned from English reference audio to speak in Korean, for example, is supported but may produce variable quality compared to same-language cloning.
The Realtime API is in research preview. It is not yet generally available. For production-critical deployments with zero tolerance for breaking changes, factor this into your timeline. The REST TTS endpoint and streaming endpoint are fully GA.

How do you get started?

  1. Sign up at platform.inworld.ai and generate an API key
  2. Try the REST endpoint with the Python example above. Generate a line of companion dialogue with emotion markup and listen to it
  3. Set up a Realtime API session using the WebSocket example to build a bidirectional voice companion
  4. Clone a voice for your character using 5-15 seconds of reference audio
  5. Configure the Router to select the LLM that works best for your companion's personality
The TTS API quickstart covers the REST endpoint in detail. The Realtime API documentation covers session management, VAD configuration, and transport options. For teams migrating from OpenAI's Realtime API, Inworld publishes a migration guide documenting the compatible event schema.

Frequently Asked Questions

What voice AI stack do production AI companion apps actually use?
Production companions at scale (Wishroll with 500K+ DAU, Bible Chat with ~800K DAU) run on Inworld's infrastructure. The typical stack is the Inworld Realtime API for bidirectional voice over WebSocket, Inworld TTS 1.5 Max or Mini for speech generation, and the Inworld Router for model-agnostic LLM access. Voice cloning gives each character a persistent identity. Emotion markup adds expressiveness. The Realtime API handles STT, LLM, TTS, VAD, and interruption management in one connection.
How do I add emotional expressiveness to a companion's voice?
Inworld TTS supports audio markup tags for emotional tone: [happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]. These are experimental and English-only. Include them inline in the text sent to the TTS API. For the Realtime API, the LLM system prompt can instruct the model to insert emotion tags contextually. Temperature and speed parameters (0.5x to 1.5x) provide additional per-character tuning.
How much does voice cost per user for a companion app?
Voice cost per user depends on engagement and provider pricing. Inworld TTS delivers the #1-ranked quality at significantly lower per-character cost than alternatives. For comparison, ElevenLabs charges $103-206/1M characters. See inworld.ai/pricing for current Inworld rates. Wishroll reduced total AI costs by 95% after switching to Inworld's infrastructure.
Can I give each companion character a unique voice?
Yes. Zero-shot voice cloning from 5-15 seconds of reference audio is free for all users. Upload a sample via the API, and the cloned voice persists across sessions with consistent identity. This works for both the REST TTS endpoint and the Realtime API. No per-clone licensing fees.
What latency should I target for a companion voice experience?
Sub-200ms time-to-first-audio for natural-feeling conversation. Inworld TTS 1.5 Max delivers sub-250ms P90, Mini delivers sub-130ms P90. These are end-to-end measurements including network, not inference-only numbers. Above 300ms, users perceive lag. Below 200ms, the conversation feels natural.
How does the Inworld Realtime API differ from OpenAI's Realtime API for companion apps?
Both accept audio over WebSocket and return audio. The key difference is model flexibility. OpenAI locks you to a single model family. The Inworld Realtime API is model-agnostic, routing to 200+ LLMs via the Inworld Router. You can A/B test Claude against GPT against Gemini without changing integration code. Inworld's TTS is #1-ranked on Artificial Analysis, voice cloning is free, and on-premise deployment is available.
What is the Inworld Realtime API?
A single WebSocket or WebRTC endpoint that handles the full companion voice pipeline: speech-to-text, LLM reasoning, text-to-speech, voice activity detection, turn-taking, and interruption handling. Audio goes in, audio comes back. No separate service orchestration required. It pairs with the Inworld Router for model-agnostic LLM access and Inworld TTS for #1-ranked voice output.
Does Inworld support non-English companion apps?
Inworld TTS supports 15 languages. Voice cloning is fully supported in English; crosslingual voice cloning is experimental.
Published by Inworld AI.
Copyright © 2021-2026 Inworld AI