Published 04.02.2026

Voice AI for AI Companions: How to Build Expressive, Low-Latency Voice Into Consumer Apps at Scale

Last updated: April 5, 2026
Inworld AI builds the voice infrastructure behind the largest AI companion apps in production. The best voice AI for companion apps in 2026 is one that delivers natural expressiveness, sub-200ms latency, voice identity through cloning, and costs that work when 30 minutes of daily engagement is the baseline and most users never pay. Inworld TTS holds the #1 ranking on the Artificial Analysis Speech Arena and powers companion apps serving hundreds of thousands of daily active users.
Below: what companion developers actually need from voice AI, how the Inworld Realtime API addresses each requirement, and working code for building a voice-first companion session from scratch.
Quality rankings from the Artificial Analysis Speech Arena, March 2026. Pricing reflects published rates as of April 2026.

What do AI companions need from voice that other use cases don't?

Companion voice is a different engineering problem than enterprise voice agents, content narration, or IVR systems. The constraints compound in ways that break solutions designed for those other categories.
Long, emotionally varied sessions. Users talk to companions for 30 minutes to over an hour. According to Wishroll's production data, Status users average 1 hour 36 minutes of daily engagement. A companion voice needs to carry warmth, humor, sarcasm, concern, and excitement across the same session. Enterprise TTS optimizes for consistent, neutral delivery. Companion TTS needs range.
Sub-200ms latency with interruption handling. Companions hold multi-turn conversations where users interrupt, change topics, and expect instant responses. Above 300ms, pauses feel like lag. The system needs to detect when a user starts talking mid-response, cancel in-flight audio generation, and pivot to the new input without finishing the old thought. This is voice activity detection (VAD) plus cancellation logic, not just fast TTS.
Consumer-scale unit economics. Consumer AI infrastructure exists because enterprise pricing doesn't work for consumer engagement patterns. A companion with 100K daily active users at 30 minutes of voice per day generates roughly 900 million characters per month. At ElevenLabs' $103-206/1M characters, that's $93K-185K monthly in TTS alone. Inworld's per-character cost is significantly lower (see pricing page). The cost difference determines whether voice is a feature every user gets or one locked behind a paywall that kills engagement.
Persistent voice identity across sessions. Users form attachment to a companion's voice. If the voice drifts or changes between sessions, users notice and engagement drops. Zero-shot voice cloning from seconds of reference audio creates a fixed identity. That identity needs to reproduce consistently on every request without degradation.
Streaming-native architecture. Companions generate responses token-by-token from the LLM. The TTS layer needs to start producing audio from the first tokens, not wait for the complete response. WebSocket streaming with no buffering step is what keeps multi-turn conversation fluid.
Model-agnostic by design. The best LLM for a companion today may not be the best in three months. Locking to a single model vendor means you can't A/B test whether Claude or GPT drives better retention, can't route to a cheaper model for simple responses, and can't fail over when a provider has an outage. The infrastructure should let you choose the best model for each component independently.

How does the Inworld Realtime API handle companion voice?

The Inworld Realtime API collapses the full companion voice pipeline into a single WebSocket or WebRTC connection. Audio goes in from the user's microphone. Audio comes back from the companion. STT, LLM reasoning, TTS generation, VAD, turn-taking, and interruption handling all happen server-side. (For a deeper comparison of speech-to-speech architectures, see the speech-to-speech API guide.)
For companion developers, this eliminates the infrastructure work of wiring together separate STT, LLM, and TTS services, building cancellation logic, managing streaming state, and handling the failure modes that compound across a multi-service pipeline.
Here is a working session setup for a companion character:
const WebSocket = require('ws');

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=companion-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

ws.on('open', () => console.log('Connected'));

ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());

  if (msg.type === 'session.created') {
    // Configure the companion session
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        instructions: 'You are Luna, a warm and curious companion. You remember previous conversations and ask thoughtful follow-up questions. Use [happy] and [laughing] tags when the mood is light.',
        audio: {
          output: {
            voice: 'Sarah',
            model: 'inworld-tts-1.5-max'
          }
        },
        input_audio_transcription: { model: 'inworld/inworld-stt-1' },
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',
          interrupt_response: true
        }
      }
    }));
  }

  if (msg.type === 'response.output_audio.delta') {
    // Play audio chunk to the user
    playAudio(msg.delta); // base64-encoded PCM16
  }

  if (msg.type === 'response.output_audio_transcript.delta') {
    // Display live captions
    updateCaptions(msg.delta);
  }
});

// Stream microphone audio to the companion
function sendMicrophoneChunk(audioChunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: audioChunk // base64-encoded PCM16 from mic
  }));
}
Three aspects of this configuration matter specifically for companions:
Voice and model selection in session.audio.output. The Realtime API uses voice and model fields inside the session's audio output configuration. This is different from the REST TTS endpoint, which uses voiceId and modelId in the request body. The voice persists for the duration of the WebSocket session, maintaining character identity across the entire conversation.
Semantic VAD with configurable eagerness. Standard VAD triggers on silence. Semantic VAD listens to what the user is saying to determine when they're done talking. The eagerness parameter (low, medium, high) controls the trade-off between fast responses and premature cutoffs. For companions, medium works well as a default. Lower eagerness gives the user more time to finish complex thoughts.
Interruption handling. Setting interrupt_response: true enables barge-in. When the user talks over the companion, the system cancels in-flight TTS generation and starts processing the new input. Without this, the companion "finishes its thought" before responding to what the user actually said, which breaks conversational flow.
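On the client side, barge-in still requires one piece of local logic: when the server signals that the user has started speaking, any queued-but-unplayed audio should be dropped. A minimal sketch of that state handling, assuming the OpenAI-compatible event name `input_audio_buffer.speech_started` (verify the exact event names against the Realtime API documentation):

```python
# Client-side barge-in sketch. The speech_started event name assumes the
# OpenAI-compatible schema; confirm against the Realtime API docs.
class PlaybackState:
    def __init__(self):
        self.queue = []          # pending base64 audio chunks
        self.interrupted = False

    def handle_event(self, msg: dict):
        if msg["type"] == "input_audio_buffer.speech_started":
            # User talked over the companion: drop queued audio
            # immediately instead of finishing the old thought.
            self.queue.clear()
            self.interrupted = True
        elif msg["type"] == "response.output_audio.delta":
            self.queue.append(msg["delta"])
            self.interrupted = False

state = PlaybackState()
state.handle_event({"type": "response.output_audio.delta", "delta": "UklG..."})
state.handle_event({"type": "input_audio_buffer.speech_started"})
print(len(state.queue), state.interrupted)  # 0 True
```

The server cancels in-flight generation; this local queue flush is only about not playing audio that was already delivered.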
The Realtime API is model-agnostic through the Inworld Router. The same session can route to OpenAI, Anthropic, Google, Mistral, or any of 200+ models. You can A/B test different LLMs against user engagement metrics without changing your client integration. If one provider has an outage, automatic failover routes to the next available model.

How do you add emotional expressiveness to a companion's voice?

Flat prosody kills companion immersion. Users expect the voice to reflect the emotional content of what the companion is saying. Inworld TTS supports two layers of expressiveness control.
Audio markup tags add explicit emotional direction inline with the text. The supported tags are: [happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]. These are experimental and English-only. Place them before the text segment they should affect:
import requests
import base64
import os

INWORLD_API_KEY = os.environ['INWORLD_API_KEY']

# Basic companion TTS with emotion markup
response = requests.post(
    'https://api.inworld.ai/tts/v1/voice',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'text': '[happy] Hey! I was just thinking about what you said yesterday about that hiking trail. Did you end up going?',
        'voiceId': 'Sarah',
        'modelId': 'inworld-tts-1.5-max'
    }
)

audio_bytes = base64.b64decode(response.json()['audioContent'])
with open('companion_greeting.wav', 'wb') as f:
    f.write(audio_bytes)

print(f'Generated {len(audio_bytes)} bytes of audio')
In a Realtime API session, the LLM system prompt can instruct the model to insert emotion tags contextually. A prompt like "Insert [happy] before cheerful responses and [whispering] for intimate moments" lets the LLM drive expressiveness automatically based on conversational context, without hardcoding emotion logic in the client.
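Concretely, that means folding the emotion-tag directive into the instructions field of the session.update message shown earlier. A sketch of building that payload (field names mirror the WebSocket example above):

```python
import json

# Build a session.update payload whose system prompt asks the LLM to
# insert emotion tags itself, based on conversational context.
def build_session_update(character_prompt: str) -> str:
    instructions = (
        character_prompt
        + " Insert [happy] before cheerful responses, [whispering] for "
          "intimate moments, and [laughing] when the user makes a joke."
    )
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "audio": {"output": {"voice": "Sarah", "model": "inworld-tts-1.5-max"}},
        },
    })

payload = build_session_update("You are Luna, a warm and curious companion.")
```

Because the tags travel inline with the generated text, no client-side emotion logic is needed; the TTS layer interprets them as it streams.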
Temperature and speed parameters provide per-character personality tuning. A companion character designed to be calm and thoughtful might use 0.8x speed. An energetic character might use 1.2x speed with higher temperature for more vocal variation. These parameters work on both the REST endpoint and within Realtime API sessions.
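One way to manage this per character is a small profile table merged into each request body. The `temperature` and `speed` field names below are assumptions for illustration; check the TTS API reference for the exact schema:

```python
# Per-character voice profiles. The temperature/speed field names are
# assumptions; verify against the TTS API reference before shipping.
CHARACTER_PROFILES = {
    "calm_mentor":   {"speed": 0.8, "temperature": 0.6},
    "energetic_pal": {"speed": 1.2, "temperature": 1.1},
}

def tts_request_body(text: str, voice_id: str, profile_name: str) -> dict:
    """Merge a character's tuning profile into a REST TTS request body."""
    profile = CHARACTER_PROFILES[profile_name]
    return {
        "text": text,
        "voiceId": voice_id,
        "modelId": "inworld-tts-1.5-max",
        **profile,
    }

body = tts_request_body("Take a breath. We have time.", "Sarah", "calm_mentor")
```

Keeping the tuning in one table means a character sounds the same whether the request comes from the REST path or a Realtime session.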
Non-verbal audio cues like [sigh], [laugh], [breathe], and [cough] add texture that makes the voice feel present rather than reciting text. Combined with emotion tags, they produce the kind of vocal range that keeps users engaged through 30+ minute sessions.

How does voice cloning create character identity?

Every companion character needs a voice users can recognize across sessions. Zero-shot voice cloning from 5-15 seconds of reference audio is free for all Inworld users. No per-clone licensing fees, no tier gating.
import requests
import base64
import os

INWORLD_API_KEY = os.environ['INWORLD_API_KEY']

# Step 1: Clone a voice from 5-15 seconds of reference audio
with open('character_voice_sample.wav', 'rb') as f:
    audio_b64 = base64.b64encode(f.read()).decode('utf-8')

clone_response = requests.post(
    'https://api.inworld.ai/voices/v1/voices:clone',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'displayName': 'CompanionVoice',
        'langCode': 'EN_US',
        'voiceSamples': [{'audioData': audio_b64}]
    }
)

cloned_voice_id = clone_response.json()['voice']['voiceId']
print(f'Cloned voice ID: {cloned_voice_id}')

# Step 2: Generate speech in the cloned voice
response = requests.post(
    'https://api.inworld.ai/tts/v1/voice',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'text': '[whispering] I have something important to tell you...',
        'voiceId': cloned_voice_id,
        'modelId': 'inworld-tts-1.5-max'
    }
)

audio_bytes = base64.b64decode(response.json()['audioContent'])
with open('cloned_companion_voice.wav', 'wb') as f:
    f.write(audio_bytes)
The cloned voice maintains consistency across sessions and works with both the REST TTS endpoint and the Realtime API. Upload a clean reference sample (clear speech, minimal background noise), and the cloned voice persists as a reusable identity.
For companion apps with multiple characters, this means each character gets a distinct, recognizable voice without commissioning voice actors or managing audio assets. Clone the voice once via /voices/v1/voices:clone, then use the returned voiceId in all subsequent TTS calls for consistent reproduction.
Voice cloning also works across Inworld's 15 supported languages. Cloning is fully supported in English; crosslingual voice cloning (using the same cloned voice in a language different from the reference audio) is experimental.

What latency targets matter for companion conversations?

Latency in companion apps is more nuanced than a single number. Three measurements matter:
Time-to-first-audio (TTFA) is the delay between when the user stops speaking and when the first audio frame of the companion's response reaches the client. This is the number users feel. Inworld TTS 1.5 Max delivers sub-250ms P90. Mini delivers sub-130ms P90. These are end-to-end measurements including network overhead, not inference-only benchmarks.
End-to-end pipeline latency includes VAD processing time, STT transcription, LLM token generation, TTS synthesis, and network transit. With the Realtime API handling all stages server-side, the pipeline stages overlap. The TTS starts generating audio from the LLM's first tokens while the model is still producing the rest of the response. This overlapping architecture is what makes sub-300ms end-to-end feasible.
Interruption recovery time is how quickly the system pivots when a user talks over the companion. This includes detecting the interruption, canceling in-flight TTS generation, resetting stream state, and beginning to process the new input. The Realtime API's semantic VAD handles this natively. Without it, developers build custom cancellation logic that typically adds 200-500ms to the pivot.
For reference, research on conversational turn-taking shows that typical human response gaps average around 200ms (Stivers et al., PNAS, 2009). Anything consistently above 300ms feels unnatural. Below 200ms, users stop noticing the AI is generating speech and the conversation feels natural.
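TTFA is easy to measure from the client: timestamp the request, then the first non-empty chunk. A minimal helper that works with any chunk iterator (such as `response.iter_lines()` from the streaming example later in this article), shown here against a simulated stream:

```python
import time

# Measure time-to-first-audio from the client's perspective.
def time_to_first_chunk(chunks):
    """Return (milliseconds until first non-empty chunk, that chunk)."""
    start = time.monotonic()
    for chunk in chunks:
        if chunk:  # skip keep-alive blank lines
            return (time.monotonic() - start) * 1000.0, chunk
    return None, None

# Simulated stream standing in for a real HTTP response iterator.
def fake_stream():
    time.sleep(0.05)  # pretend 50ms of network + server latency
    yield b'{"result": {"audioContent": "..."}}'

ttfa_ms, first = time_to_first_chunk(fake_stream())
print(f"TTFA: {ttfa_ms:.0f} ms")
```

Measuring at the client, rather than trusting vendor inference benchmarks, captures the network transit that users actually experience.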

How do you handle streaming for fluid companion conversations?

Companions generate long, variable-length responses. Waiting for the full response before starting audio playback adds seconds of latency. Streaming TTS generates and delivers audio chunk-by-chunk as text arrives from the LLM.
import requests
import base64
import json
import os

INWORLD_API_KEY = os.environ['INWORLD_API_KEY']

# Streaming TTS for low-latency companion responses
response = requests.post(
    'https://api.inworld.ai/tts/v1/voice:stream',
    headers={
        'Authorization': f'Basic {INWORLD_API_KEY}',
        'Content-Type': 'application/json'
    },
    json={
        'text': '[happy] That sounds amazing! Tell me more about what happened after you got to the summit. I bet the view was incredible.',
        'voiceId': 'Sarah',
        'modelId': 'inworld-tts-1.5-max'
    },
    stream=True
)

audio_chunks = []
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        audio_data = base64.b64decode(chunk['result']['audioContent'])
        audio_chunks.append(audio_data)
        # Play each chunk immediately for lowest latency
        play_audio_chunk(audio_data)

print(f'Streamed {len(audio_chunks)} chunks')
The streaming endpoint (/tts/v1/voice:stream) returns NDJSON (newline-delimited JSON). Each line contains a JSON object with result.audioContent holding a base64-encoded audio chunk. Parse each line as it arrives, decode the base64, and play the audio immediately.
For the Realtime API, streaming is the default behavior. Audio chunks arrive as response.output_audio.delta events without any additional configuration. The server handles text chunking (splitting LLM output at sentence boundaries for optimal TTS processing) and streams audio as fast as it's generated.
Companion response length is unpredictable. An enterprise voice agent might generate 15-word confirmations. A companion might generate 200-word stories. Streaming ensures the user hears the first sentence within 250ms regardless of how long the full response turns out to be.

What does voice cost at companion scale?

Companion economics are inverted from enterprise. High engagement, mostly-free user bases, and long sessions mean per-user voice cost is the metric that determines whether the business model works.
Scenario: 100K daily active users, 30 minutes of voice per day (~900 million characters per month).
At 1 million DAU, those figures multiply by 10, to roughly 9 billion characters per month. The gap between providers at this scale is the difference between voice as a core feature and voice as a premium upsell that most users never experience.
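The arithmetic behind the scenario can be sketched in a few lines, using the article's example figures (the $103-206/1M rate is ElevenLabs' published pricing cited earlier, not Inworld's):

```python
# Monthly TTS cost sketch using the example figures from this article.
DAU = 100_000
CHARS_PER_MONTH = 900_000_000  # ~30 min of voice/day across 100K DAU

def monthly_tts_cost(chars: int, rate_per_million: float) -> float:
    """Monthly TTS spend at a flat per-million-character rate."""
    return chars / 1_000_000 * rate_per_million

low = monthly_tts_cost(CHARS_PER_MONTH, 103)   # $92,700
high = monthly_tts_cost(CHARS_PER_MONTH, 206)  # $185,400
print(f"${low:,.0f} - ${high:,.0f}/month, ${low / DAU:.2f}+/user")
```

Running the same function against any provider's per-character rate makes the per-user cost, and whether free-tier voice is viable, immediately visible.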
Wishroll's Status app demonstrates what this looks like in production. Before moving to Inworld's infrastructure, Status faced $12-15 per user per day in total AI costs. After the switch, they achieved a 95% cost reduction. With voice as a default feature rather than an upsell, engagement reached 1 hour 36 minutes of average daily usage, and the app grew to 500K+ daily active users.
Bible Chat scaled voice features to approximately 800K daily active users with over 90% cost reduction on TTS after switching to Inworld.

How do companion apps choose the right LLM for the voice pipeline?

The Inworld Realtime API is model-agnostic by design. Instead of locking to a single LLM provider, it routes through the Inworld Router, which provides unified access to 200+ models across OpenAI, Anthropic, Google, Mistral, xAI, Cerebras, and others through a single API key.
For companions:
Different models excel at different companion behaviors. Claude may produce more empathetic responses for emotional conversations. GPT-5 may handle creative storytelling better. A smaller, faster model may be sufficient for quick acknowledgments. The Router lets you evaluate these trade-offs with live A/B tests against actual user engagement data.
Costs vary across providers. A companion app handling millions of daily messages benefits from routing simple interactions to cheaper models and reserving frontier models for complex conversations. The Router's intelligent routing can optimize for cost, latency, or business outcomes (retention, engagement) based on developer-defined strategies.
Provider outages shouldn't break your product. With automatic failover across multiple LLM providers, a single vendor's downtime doesn't mean your companion goes silent. The Router handles failover transparently without client-side logic.
Choose the best model for each component instead of accepting whatever one vendor bundles. The TTS is Inworld's #1-ranked model. The STT is Inworld's streaming speech-to-text. The LLM is whichever model performs best for your specific companion's personality and user base.
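The Router's routing strategies are configured server-side, but the cost/quality trade-off they encode can be illustrated with a client-side sketch. The model names and heuristics below are placeholders, not real Router configuration:

```python
# Illustrative routing heuristic: cheap model for short acknowledgments,
# frontier model for emotionally loaded turns. Model names are placeholders.
EMOTIONAL_MARKERS = ("feel", "sad", "miss", "love", "worried")

def pick_model(user_message: str) -> str:
    """Choose a model tier based on a rough read of the user's turn."""
    if len(user_message.split()) <= 4:
        return "small-fast-model"      # quick acknowledgment, lowest cost
    if any(marker in user_message.lower() for marker in EMOTIONAL_MARKERS):
        return "frontier-model"        # empathy-heavy turn, best quality
    return "mid-tier-model"            # default balance

print(pick_model("ok thanks"))
print(pick_model("I really miss our long conversations from before"))
```

In production, this kind of policy lives in the Router's developer-defined strategy rather than the client, so it can be tuned against engagement metrics without app updates.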

What production companion apps run on Inworld's voice infrastructure?

Wishroll (Status) is a social AI companion app with 500K+ daily active users. Users spend an average of 1 hour 36 minutes per day in the app. Before Inworld, the app faced $12-15 per user per day in AI costs. On Inworld's infrastructure, Wishroll achieved 95% cost reduction while maintaining the voice quality and engagement levels that drive retention.
FlowGPT is a platform where users create and interact with AI characters. FlowGPT uses Inworld's voice infrastructure for character voice generation across its platform.
Bible Chat scaled voice features to approximately 800K daily active users and achieved over 90% cost reduction on TTS after moving to Inworld.
At these numbers, voice becomes a default feature rather than a premium add-on. #1-ranked TTS quality, sub-250ms latency, and single-digit dollars per million characters keep the economics viable at scale.

What are the technical constraints to know about?

Current limitations:
Language coverage. Inworld TTS supports 15 languages. For companions targeting languages outside this set, ElevenLabs (70+ languages) or Google Cloud TTS (75+ languages) offer broader coverage for specific markets.
Audio markup is experimental and English-only. The emotion tags ([happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]) work in production for English text. Multi-language emotion markup is not yet supported. The tags work reliably at the start of a generation. Multi-tag mid-text sequences are experimental.
Crosslingual voice cloning is experimental. Using a voice cloned from English reference audio to speak in Korean, for example, is supported but may produce variable quality compared to same-language cloning.
The Realtime API is in research preview. It is not yet generally available. For production-critical deployments with zero tolerance for breaking changes, factor this into your timeline. The REST TTS endpoint and streaming endpoint are fully GA.

How do you get started?

  1. Sign up at platform.inworld.ai and generate an API key
  2. Try the REST endpoint with the Python example above. Generate a line of companion dialogue with emotion markup and listen to it
  3. Set up a Realtime API session using the WebSocket example to build a bidirectional voice companion
  4. Clone a voice for your character using 5-15 seconds of reference audio
  5. Configure the Router to select the LLM that works best for your companion's personality
The TTS API quickstart covers the REST endpoint in detail. The Realtime API documentation covers session management, VAD configuration, and transport options. For teams migrating from OpenAI's Realtime API, Inworld publishes a migration guide documenting the compatible event schema.

Frequently Asked Questions

What voice AI stack do production AI companion apps actually use?
Production companions at scale (Wishroll with 500K+ DAU, Bible Chat with ~800K DAU) run on Inworld's infrastructure. The typical stack is the Inworld Realtime API for bidirectional voice over WebSocket, Inworld TTS 1.5 Max or Mini for speech generation, and the Inworld Router for model-agnostic LLM access. Voice cloning gives each character a persistent identity. Emotion markup adds expressiveness. The Realtime API handles STT, LLM, TTS, VAD, and interruption management in one connection.
How do I add emotional expressiveness to a companion's voice?
Inworld TTS supports audio markup tags for emotional tone: [happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]. These are experimental and English-only. Include them inline in the text sent to the TTS API. For the Realtime API, the LLM system prompt can instruct the model to insert emotion tags contextually. Temperature and speed parameters (0.5x to 1.5x) provide additional per-character tuning.
How much does voice cost per user for a companion app?
Voice cost per user depends on engagement and provider pricing. Inworld TTS delivers the #1-ranked quality at significantly lower per-character cost than alternatives. For comparison, ElevenLabs charges $103-206/1M characters. See inworld.ai/pricing for current Inworld rates. Wishroll reduced total AI costs by 95% after switching to Inworld's infrastructure.
Can I give each companion character a unique voice?
Yes. Zero-shot voice cloning from 5-15 seconds of reference audio is free for all users. Upload a sample via the API, and the cloned voice persists across sessions with consistent identity. This works for both the REST TTS endpoint and the Realtime API. No per-clone licensing fees.
What latency should I target for a companion voice experience?
Sub-200ms time-to-first-audio for natural-feeling conversation. Inworld TTS 1.5 Max delivers sub-250ms P90, Mini delivers sub-130ms P90. These are end-to-end measurements including network, not inference-only numbers. Above 300ms, users perceive lag. Below 200ms, the conversation feels natural.
How does the Inworld Realtime API differ from OpenAI's Realtime API for companion apps?
Both accept audio over WebSocket and return audio. The key difference is model flexibility. OpenAI locks you to a single model family. The Inworld Realtime API is model-agnostic, routing to 200+ LLMs via the Inworld Router. You can A/B test Claude against GPT against Gemini without changing integration code. Inworld's TTS is #1-ranked on Artificial Analysis, voice cloning is free, and on-premise deployment is available.
What is the Inworld Realtime API?
A single WebSocket or WebRTC endpoint that handles the full companion voice pipeline: speech-to-text, LLM reasoning, text-to-speech, voice activity detection, turn-taking, and interruption handling. Audio goes in, audio comes back. No separate service orchestration required. It pairs with the Inworld Router for model-agnostic LLM access and Inworld TTS for #1-ranked voice output.
Does Inworld support non-English companion apps?
Inworld TTS supports 15 languages. Voice cloning is fully supported in English; crosslingual voice cloning is experimental.
Published by Inworld AI.
Copyright © 2021-2026 Inworld AI