Realtime TTS

Engage every user with the #1 ranked, most natural Voice AI

#1 ranked TTS with human-like expression and realtime sub-200ms latency that feels like a real conversation. Custom voices with instant cloning or text-based voice design. Fully multilingual, built for streaming, and a fraction of the cost of other providers.
#1 Ranked Quality · Realtime Latency · 100+ Languages

One API. Streaming, cloning, voice design.

Stream audio chunks back as the model generates them. Sub-200ms first-chunk latency keeps the conversation feeling natural.

curl -X POST https://api.inworld.ai/tts/v1/voice:stream \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi! What can I help you with today?",
    "voice_id": "Clive",
    "model_id": "inworld-tts-2",
    "audio_config": {
      "audio_encoding": "OGG_OPUS",
      "sample_rate_hertz": 16000
    }
  }'
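The same request can be sketched in Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the endpoint, header, and field names are copied from it, while the helper name `build_stream_request` is hypothetical. How the streamed response is framed is not described here, so the sketch only builds the request and leaves sending and decoding to the caller.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
STREAM_URL = "https://api.inworld.ai/tts/v1/voice:stream"


def build_stream_request(text, api_key, voice_id="Clive", model_id="inworld-tts-2"):
    """Build (but do not send) the streaming POST request from the curl example.

    Hypothetical helper: the defaults simply repeat the values shown above.
    """
    payload = {
        "text": text,
        "voice_id": voice_id,
        "model_id": model_id,
        "audio_config": {
            "audio_encoding": "OGG_OPUS",
            "sample_rate_hertz": 16000,
        },
    }
    return urllib.request.Request(
        STREAM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; check the API reference for the response format before parsing audio chunks.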

The top ranked TTS in the world. Proven by real users.

3 of the top 5 models on Artificial Analysis are Inworld. Blind tests by thousands of real users, not internal evals. Realtime TTS 1.5 Max delivers over 30% more expressiveness than previous models, with optimized stability to eliminate hallucinations and artifacts.

#1 ranked on Artificial Analysis

Clone any voice. Localize to any language.

Create a custom voice from 15 seconds of audio, then localize it to speak over 100 languages as a native speaker — same identity, no accent carryover. Production-ready voices you can use in the Playground or via API.

  • Instant cloning: 15 seconds of audio → ready in seconds
  • Localize: one voice, native delivery in over 100 languages

Describe any voice. Generate it instantly.

Skip recording entirely. Describe accent, age, tone, and energy in natural language, and Inworld renders a production-ready voice on the fly. Pick a preset on the card to hear how a single sentence becomes a finished voice.

  • No audio sample required — pure natural-language description
  • Per-preset playback so you can compare styles before generating
  • Same voice IDs work across the TTS API, Playground, and Realtime
Example voice description: A confident and inviting Indian female voice, ideal for customer support and professional training materials.

Realtime latency. Instant responses.

Built for realtime from the ground up: audio streams over WebSocket the instant it is synthesized, with no buffering delay. Latency is comparable to competitors at a fraction of the cost.

  • First-chunk audio in a fraction of a typical human response time
  • Streaming-native via WebSocket
  • Consistent P90 performance under production load
[Chart: Realtime first-chunk latency for Realtime TTS 1.5 Mini, Realtime TTS 1.5 Max, and Realtime TTS-2, compared with a human response]

Direct the voice. Match the moment.

Add bracketed instructions anywhere in your text and Realtime TTS-2 adjusts the utterance. Pair with various non-verbals and adjustable pauses for delivery that matches the moment, not just the words.

  • Natural-language steering for tone, speed, volume, vocal style, and pauses
  • Five reliable non-verbal cues that render as the actual sound, not text
  • Mid-utterance adjustment — no separate prompt engineering pipeline
Example (set the emotional tone):
[flirting] Oh, hi. I haven't seen you around before. Are you new around here?

Over 100 languages. Native-speaker quality.

English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language with cross-lingual cloning. Deploy globally without separate pipelines.


Starting at $15 per million characters.

Realtime TTS 1.5 Mini at $15/million characters. Realtime TTS 1.5 Max and Realtime TTS-2 at $25/million. Comparable providers charge $120/million ($0.12/min) — Inworld is up to 87% cheaper at scale.
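The "up to 87%" figure follows from the $15 and $120 prices quoted above: 1 - 15/120 = 0.875. The arithmetic can be sketched as follows; the helper names are hypothetical and the prices are the ones stated in this section, not values read from an API.

```python
def cost_usd(characters, price_per_million):
    """Cost in USD to synthesize `characters` at a per-million-character price."""
    return characters / 1_000_000 * price_per_million


def savings_fraction(price, reference_price):
    """Fraction saved relative to a reference price (0.875 means 87.5% cheaper)."""
    return 1 - price / reference_price


# Prices per million characters as quoted in the text above.
MINI_PRICE = 15
COMPARABLE_PROVIDER_PRICE = 120

monthly_characters = 2_000_000  # example workload, chosen for illustration
mini_cost = cost_usd(monthly_characters, MINI_PRICE)
reference_cost = cost_usd(monthly_characters, COMPARABLE_PROVIDER_PRICE)
mini_savings = savings_fraction(MINI_PRICE, COMPARABLE_PROVIDER_PRICE)
```

For a workload of 2 million characters this works out to $30 at the Mini price versus $240 at the $120 reference price.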

[Chart: Cost per million characters, Realtime TTS 1.5 Mini vs. ElevenLabs v3]

Built for voice-first applications

Interactive, real-time, and voice-driven experiences. Not batch processing.

Full breakdown

Feature      | Realtime TTS-2                             | Realtime TTS 1.5 Max                 | Realtime TTS 1.5 Mini
Best for     | Most expressive applications               | Most applications                    | Latency-critical applications
Pricing      | $35/million characters                     | $25/million characters               | $15/million characters
P90 latency  | <250ms                                     | <250ms                               | <130ms
Quality      | Highest expressiveness, native steering    | Maximum expressiveness and stability | High expressiveness
Multilingual | Over 100 languages, cross-lingual (BCP-47) | 15 languages                         | 15 languages

[Additional rows whose per-model availability marks were icons and are not preserved here: Natural-language steering; Non-verbal cues; Voice cloning; Professional voice cloning; Character, word, viseme and phoneme timestamps; Custom pronunciation; On-premise; Zero data retention]
Realtime TTS-2 is the new flagship — natural-language steering, reliable non-verbals, and 100+ cross-lingual languages on top of the Realtime TTS 1.5 Max foundation. Pick Realtime TTS 1.5 Mini for the lowest-latency, lowest-cost workloads.

Try Realtime TTS now

Get started with Realtime TTS 1.5 Max, the best balance of quality and speed for most applications.

FAQs

How do I get started with Realtime TTS?
Getting started is simple. You can try Realtime TTS instantly in the TTS Playground, where you can test voices, adjust settings, and experiment with features like instant voice cloning.
When you’re ready to integrate TTS into your application, you can follow the Developer Quickstart to make your first API request in minutes. Just create an API key in the Inworld Portal, then synthesize speech with a single POST request. Inworld supports multiple output formats, including MP3, Linear PCM (WAV), and Opus, making it easy to integrate with almost any system.
How is Realtime TTS quality evaluated?
Realtime TTS is evaluated through blind listening tests by thousands of real users. Realtime TTS 1.5 Max delivers over 30% more expressiveness than its predecessor, with optimized stability and natural conversational delivery.
What is new in Realtime TTS-2?
Realtime TTS-2 is the new flagship Inworld voice model, focused on expressiveness and direct authoring control:
  • Natural-language steering. Bracketed instructions like [Speak sadly] or [Speak softly] let you direct tone and emotion inline, with no separate prompt engineering. The model adapts mid-utterance.
  • Reliable non-verbals. Five bracketed cues are first-class in Realtime TTS-2: [laugh], [breathe], [clear_throat], [sigh], [cough]. Drop them anywhere in the text and they render as the actual sound, not text.
  • Cross-lingual BCP-47 codes. Steering works in over 100 languages with locale-aware output (en-US, ja-JP, etc.), so a single voice can switch languages mid-conversation while staying in character.
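Before sending text, it can help to check which bracketed tags it contains. The sketch below is an authoring-side helper only and does not call any API. It assumes that any bracketed tag other than the five non-verbal cues listed above is meant as a steering instruction; the function name `split_tags` is hypothetical.

```python
import re

# The five non-verbal cues named in the text above.
NON_VERBAL_CUES = {"laugh", "breathe", "clear_throat", "sigh", "cough"}

# Matches one bracketed tag and captures the text between the brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text):
    """Return (non_verbal_cues, steering_instructions) found in bracketed tags.

    Assumption: a tag that is not one of the five cues is a steering instruction.
    """
    cues, instructions = [], []
    for tag in TAG_PATTERN.findall(text):
        tag = tag.strip()
        if tag in NON_VERBAL_CUES:
            cues.append(tag)
        else:
            instructions.append(tag)
    return cues, instructions
```

For example, `split_tags("[Speak softly] I missed you. [sigh] It has been a while.")` separates the `sigh` cue from the `Speak softly` instruction, which makes it easy to catch a misspelled cue such as `[sighs]` showing up in the instruction list.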
Which model should I choose?
For most applications: Realtime TTS 1.5 Max (~200ms latency, $25/1M characters)
Realtime TTS 1.5 Max offers the best balance of quality and speed. The enhanced stability means fewer edge cases, better voice cloning fidelity, and more consistent output across languages.
For latency-critical applications: Realtime TTS 1.5 Mini (<100ms latency, $15/1M characters)
Choose Realtime TTS 1.5 Mini only if minimal latency is your absolute top priority — for example, real-time gaming or ultra-responsive voice agents where every millisecond matters.
What latency can I expect?
Realtime TTS 1.5 Mini achieves <120ms P90 latency. Realtime TTS 1.5 Max delivers ~200ms with enhanced stability and quality. Both support real-time streaming via WebSocket. For most applications, we recommend Realtime TTS 1.5 Max: the quality improvement is worth the marginal latency tradeoff.
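To compare your own measurements against a P90 figure, you can compute the 90th percentile of first-chunk latencies you record yourself. The source does not say how its P90 numbers are calculated, so the sketch below uses the nearest-rank definition as an assumption; the function name is hypothetical.

```python
def p90(latencies_ms):
    """Nearest-rank 90th percentile of a list of latency samples.

    Returns the smallest sample such that at least 90% of samples
    are less than or equal to it.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.9 * n), computed with integers to avoid float rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]
```

With ten samples of 100, 110, ..., 190 ms, the nearest-rank P90 is the ninth smallest value, 180 ms.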
Does Inworld support voice cloning?
Yes. Inworld provides two types of voice cloning:

Instant (zero-shot) voice cloning

  • Available to all users in the Portal
  • Creates a custom voice from just 15 seconds of audio
  • Ready to use in minutes

Professional voice cloning

  • Fine-tuned using 30+ minutes of clean audio (minimum ~5 minutes, 20+ minutes recommended for best results)
  • Recommended for uncommon voice types such as children's voices or unique accents, where instant cloning may not perform well
  • Currently available by contacting the Inworld sales team
Both methods allow you to use your cloned voice directly in the Playground or via API using its unique voiceId.
Which languages does Realtime TTS 1.5 support?
Realtime TTS 1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew.
For multilingual applications, we recommend Realtime TTS 1.5 Max for the best pronunciation, intonation, and naturalness across all supported languages.
Can I customize how the speech sounds?
Absolutely. Realtime TTS provides several ways to customize how the speech sounds:

Voice parameters

  • Temperature: Controls expressiveness and randomness
  • Talking speed: 0.5× to 1.5× of the native speaking rate
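Since talking speed is limited to 0.5× to 1.5×, a client can clamp user-supplied values before building a request. This is a minimal sketch of that check only: the API field names for these parameters are not given here, so the code validates the value and does not construct a request. The names below are hypothetical.

```python
MIN_SPEED = 0.5  # lower bound of the talking-speed range stated above
MAX_SPEED = 1.5  # upper bound of the talking-speed range stated above


def clamp_speed(multiplier):
    """Clamp a requested talking-speed multiplier into the 0.5x to 1.5x range."""
    return max(MIN_SPEED, min(MAX_SPEED, multiplier))
```

Clamping is one design choice; rejecting out-of-range values with an error may be preferable when silent adjustment would surprise the caller.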
Can I get timestamps for the generated audio?
Yes. Realtime TTS supports timestamp alignment at the word, character, phoneme, and viseme level. This can be helpful for subtitles, captions, lip-sync, and more.
You can enable it in your API request by setting timestampType to WORD or CHARACTER.
The API response includes:
  • word or character tokens
  • start and end timestamps (in seconds)
  • structured alignment data matching the generated audio
  • phoneme-level timing and viseme symbols for lip-sync (TTS 1.5 models only)
Timestamp alignment currently supports English, with other languages available experimentally. Note that enabling timestamps currently adds roughly 100 ms of additional latency.
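As an illustration of the subtitle use case, word-level timestamps can be grouped into SRT caption cues. The sketch below assumes the alignment data has already been converted into `(word, start_seconds, end_seconds)` tuples; that input shape and the function names are assumptions for this example, not the API's response schema.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, words_per_cue=5):
    """Group (word, start_s, end_s) tuples into numbered SRT cues.

    Each cue spans from the start of its first word to the end of its last word.
    """
    cues = []
    for i in range(0, len(words), words_per_cue):
        group = words[i:i + words_per_cue]
        text = " ".join(word for word, _, _ in group)
        start, end = group[0][1], group[-1][2]
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)
```

Grouping by a fixed word count keeps the example short; a production captioner would more likely break cues on punctuation or pauses.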
What is new in Realtime TTS 1.5?
Realtime TTS 1.5 is a major update delivering improvements across speed, quality, and accessibility:
  • The Fastest: <120ms P90 latency with Realtime TTS 1.5 Mini, the fastest realtime TTS available. Realtime TTS 1.5 Max delivers ~200ms with enhanced quality.
  • The Highest Quality: Optimized stability to minimize hallucinations, cutoffs, and artifacts. Over 30% more expressive than Realtime TTS 1.
  • The Most Accessible: 15 languages (including Hindi), enhanced voice cloning, on-premise H100/B200 deployment, and 25x lower cost than alternatives.
Which model should I use? For most applications, we recommend Realtime TTS 1.5 Max. Use Realtime TTS 1.5 Mini only when minimal latency is the top priority.
Can I migrate my custom voices from ElevenLabs?
Yes. Inworld provides a free, open-source ElevenLabs Migration Tool that lets you batch-transfer your custom voice clones from ElevenLabs to Inworld. The tool automatically downloads your ElevenLabs voice samples, handles audio processing (format conversion, padding, and trimming), and re-clones them in Inworld.
The migration runs entirely on your local machine with direct API communication — no data is proxied through any intermediary servers. You can also preview your migrated voices with Realtime TTS before finalizing.
Copyright © 2021-2026 Inworld AI