Realtime TTS

Engage every user with the #1 realtime, most natural Voice AI

Human-like expression at sub-200ms latency, with instant voice cloning and full multilingual support, for a fraction of the cost.

#1 realtime TTS with human-like expression and sub-200ms latency that feels like a real conversation. Create custom voices with instant cloning or text-based voice design. Fully multilingual, built for streaming, and a fraction of the cost of other providers.

Realtime latency · $10/1M characters at scale

Top-ranked voice, a fraction of the cost

$ per 1M characters · Inworld shown on the Growth plan ($25 to $12.50, ~50% off)

[Chart: Text-to-speech price vs the market. Legend: other providers, Inworld.]

Provider API rates, June 2026. Inworld on the Growth plan; $10 at enterprise scale. *Gemini 3.1 Flash TTS is billed by audio output tokens; ~$180/1M is the effective rate on a typical query. †Cartesia estimated from published tier pricing.

One API. Streaming, cloning, voice design.

Stream audio chunks back as the model generates them. Sub-200ms first-chunk latency keeps the conversation feeling natural.

# Stream speech as it is generated
curl -X POST https://api.inworld.ai/tts/v1/voice:stream \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi! What can I help you with today?",
    "voice_id": "Clive",
    "model_id": "inworld-tts-2",
    "audio_config": {
      "audio_encoding": "OGG_OPUS",
      "sample_rate_hertz": 16000
    }
  }'

# Clone a voice from an audio sample
curl -X POST https://api.inworld.ai/voices/v1/voices:clone \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -F "audio_bytes=@sample.wav" \
  -F "voice_name=my-custom-voice"

# Then use it
curl -X POST https://api.inworld.ai/tts/v1/voice \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is my cloned voice.",
    "voice_id": "my-custom-voice",
    "model_id": "inworld-tts-2"
  }'

The top-ranked TTS in the world. Proven by real users.

3 of the top 5 models on Artificial Analysis are Inworld. Blind tests by thousands of real users, not internal evals. Realtime TTS 1.5 Max delivers over 30% more expressiveness than previous models, with optimized stability to eliminate hallucinations and artifacts.

Test out Quality

Ranked on Artificial Analysis

Clone any voice. Localize to any language.

Create a custom voice from 5 to 15 seconds of audio, then localize it to speak over 100 languages as a native speaker: same identity, no accent carryover. Production-ready voices you can use in the Playground or via API.

Instant cloning: 5 to 15 seconds of audio → ready in seconds
Localize: one voice, native delivery in over 100 languages

Test out Cloning

Describe any voice. Generate it instantly.

Skip recording entirely. Describe accent, age, tone, and energy in natural language, and Inworld renders a production-ready voice on the fly. Pick a preset on the card to hear how a single sentence becomes a finished voice.

No audio sample required: pure natural-language description
Per-preset playback so you can compare styles before generating
Same voice IDs work across the TTS API, Playground, and Realtime

Test out Voice Design

Voice Description

A confident and inviting Indian female voice, ideal for customer support and professional training materials.

Realtime latency. Instant responses.

Built for realtime from the ground up: audio streams over WebSocket the instant it's synthesized, with no buffering delay. Latency is comparable to competitors at a fraction of the cost.

First-chunk audio arrives in a fraction of a typical human response time
Streaming-native via WebSocket
Consistent P90 performance under production load

Test out Latency

[Chart: Realtime first-chunk latency. Bars: Realtime TTS 1.5 Mini, Realtime TTS 1.5 Max, Realtime TTS-2, Human.]

Direct the voice. Match the moment.

Add bracketed instructions anywhere in your text and Realtime TTS-2 adjusts the utterance. Pair with various non-verbals and adjustable pauses for delivery that matches the moment, not just the words.

Natural-language steering for tone, speed, volume, vocal style, and pauses
Five reliable non-verbal cues that render as the actual sound, not text
Mid-utterance adjustment, no separate prompt engineering pipeline

Test out Steering

Steering

Set the emotional tone.

[flirting] Oh, hi. I haven't seen you around before. Are you new around here?

Over 100 languages. Native-speaker quality.

English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language with cross-lingual cloning. Deploy globally without separate pipelines.

Test out Languages

English
Spanish
French
German
Japanese
Korean

Down to $10 per million characters.

Realtime TTS-2 goes down to $10 per million characters at scale, about one cent per minute of audio, and Realtime TTS 1.5 Mini goes down to $5. We cut prices by half or more for most developers so realtime voice can stay always-on at consumer scale, and the rate only falls as you grow. Today's prices are the ceiling, not the floor.
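
The "one cent per minute" figure can be sanity-checked with a short calculation. The speaking rate of about 1,000 characters per minute is an assumption made here for illustration (roughly 150 to 170 words per minute); it is not a figure stated on this page.

```latex
\[
r = \frac{\$10}{10^{6}\ \text{characters}} = \$0.00001\ \text{per character}
\]
\[
\text{cost per minute} \approx r \times 1000\ \frac{\text{characters}}{\text{minute}} = \$0.01
\]
```

Faster or denser speech pushes the per-minute cost up proportionally, so treat the result as an order-of-magnitude estimate.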

Bible Chat cut TTS costs by about 85 percent and Talkpal by about 40 percent at 10M+ learners, self-reported
One stack, one commit: spend on TTS lowers your rate on STT, LLM routing, and compute

View pricing

[Chart: Cost per million characters. Bars: Realtime TTS-2 at scale, ElevenLabs v3 standard.]

ElevenLabs standard API rate, June 2026. Inworld Realtime TTS-2 at enterprise scale.

Built for voice-first applications

Interactive, real-time, and voice-driven experiences. Not batch processing.

Voice Agents

Sub-200ms latency and streaming-native architecture for conversational AI that feels real.

Full breakdown

| Feature | Realtime TTS-2 | Realtime TTS 1.5 Max | Realtime TTS 1.5 Mini |
|---|---|---|---|
| Best for | Most expressive applications | Most applications | Latency-critical applications |
| Pricing | Down to $10/million characters | See pricing | Down to $5/million characters |
| P90 latency | <250ms | <250ms | <130ms |
| Quality | Highest expressiveness, native steering | Maximum expressiveness and stability | High expressiveness |
| Multilingual | Over 100 languages, cross-lingual (BCP-47) | 15 languages | 15 languages |
| Natural-language steering | Yes | No | No |
| Non-verbal cues | Yes | No | No |

Listed for all three models: voice cloning; professional voice cloning; character, word, viseme and phoneme timestamps; custom pronunciation; on-premise; zero data retention.

Realtime TTS-2 is the new flagship: natural-language steering, reliable non-verbals, and 100+ cross-lingual languages on top of the Realtime TTS 1.5 Max foundation. Pick Realtime TTS 1.5 Mini for the lowest-latency, lowest-cost workloads, or Realtime TTS 1.5 Max when stability matters most.

Research

Cutting-edge research

Publications

Explore our latest research advancing the state of the art in speech synthesis, voice cloning, and real-time TTS

TTS-1 Technical Report

Training code available

Open source

We’ve open-sourced the full training framework behind Realtime TTS-1, everything from codec to SpeechLM fine-tuning, so you can build your own high-quality TTS models faster.

Inworld Text-To-Speech Trainer

Integrations

LiveKit: Real-time web and mobile voice AI with low latency and streaming.
NLX: No-code/low-code platform for multichannel voice experiences.
Pipecat: Open-source video/audio API with streaming TTS for interactive apps.
Vapi: Cloud telephony for voice agents with PSTN and SIP support.
Stream: Vision Agents, Stream's open-source framework that helps developers quickly build low-latency vision AI applications.
Ultravox: Real-time voice AI infrastructure layer that delivers fast, natural, and scalable voice agents.
Voximplant: Serverless Voice AI orchestration platform and cloud communications stack for building real-time voice agents over the phone and the web.

Try Realtime TTS now

Get started with Realtime TTS-2, the flagship for expressiveness and steering. Pick Realtime TTS 1.5 Mini for the latency and cost floor, or Realtime TTS 1.5 Max when stability matters most.

FAQs

Getting started is simple. You can try Realtime TTS instantly in the TTS Playground, where you can test voices, adjust settings, and experiment with features like instant voice cloning.

When you’re ready to integrate TTS into your application, you can follow the Developer Quickstart to make your first API request in minutes. Just create an API key in the Inworld Portal, then synthesize speech with a single POST request. Inworld supports multiple output formats, including MP3, Linear PCM (WAV), and Opus, making it easy to integrate with almost any system.

Realtime TTS-2 starts at $25 per million characters on demand, falls to $15 at $300 per month and $12.50 at $1,500 per month, and reaches $10 at Enterprise scale. Realtime TTS 1.5 Mini falls to $5 per million characters.

Voice cloning itself is free. You pay only for synthesis, and the rate falls further as your total spend grows. Today's prices are the ceiling, not the floor. Full plan details and the cost calculator are at inworld.ai/pricing.
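
As a rough way to compare these rates, the sketch below multiplies a character volume by the per-million rate quoted above for Realtime TTS-2. The tier labels (`on_demand`, `commit_300`, `commit_1500`, `enterprise`) and function names are hypothetical, chosen only for this example; actual billing rules are on the pricing page.

```shell
#!/usr/bin/env bash
# Rates in USD per 1M characters for Realtime TTS-2, as quoted on this page.
# The tier labels are illustrative names, not official plan identifiers.
tts2_rate() {
  case "$1" in
    on_demand)   echo 25 ;;
    commit_300)  echo 15 ;;
    commit_1500) echo 12.50 ;;
    enterprise)  echo 10 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# estimate_cost TIER CHARACTERS
# Prints the synthesis cost in USD with two decimals.
estimate_cost() {
  local rate
  rate=$(tts2_rate "$1") || return 1
  awk -v rate="$rate" -v chars="$2" \
    'BEGIN { printf "%.2f\n", rate * chars / 1000000 }'
}

estimate_cost on_demand 20000000      # 20M characters at $25/1M    -> 500.00
estimate_cost commit_300 20000000     # 20M characters at $15/1M    -> 300.00
estimate_cost commit_1500 120000000   # 120M characters at $12.50/1M -> 1500.00
```

The calculation covers synthesis only, since the page states that voice cloning itself is free.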

Realtime TTS is evaluated through blind listening tests by thousands of real users. Realtime TTS 1.5 Max delivers over 30% more expressiveness than its predecessor, with optimized stability and natural conversational delivery.

Realtime TTS-2 is the new flagship Inworld voice model, focused on expressiveness and direct authoring control:

Natural-language steering. Bracketed instructions like [Speak sadly] or [Speak softly] let you direct tone and emotion inline, with no separate prompt engineering. The model adapts mid-utterance.

Reliable non-verbals. Five bracketed cues are first-class in Realtime TTS-2: [laugh], [breathe], [clear_throat], [sigh], [cough]. Drop them anywhere in the text and they render as the actual sound, not text.

Cross-lingual BCP-47 codes. Steering works across over 100 languages with locale-aware output (en-US, ja-JP, etc.), so a single voice can switch languages mid-conversation while keeping the same identity.
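
A minimal sketch of composing steered text from the bracket syntax described above. The helper names `steer` and `cue` are hypothetical; only the bracketed-instruction form and the five cue names come from this page. The resulting string would go in the "text" field of a synthesis request.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for composing steered text. Only the bracket syntax
# and the five cue names are taken from the page.

# steer INSTRUCTION TEXT -> "[INSTRUCTION] TEXT"
steer() {
  printf '[%s] %s\n' "$1" "$2"
}

# cue NAME -> "[NAME]" if NAME is one of the five listed non-verbal cues,
# otherwise an error on stderr and a non-zero exit status.
cue() {
  case "$1" in
    laugh|breathe|clear_throat|sigh|cough) printf '[%s]\n' "$1" ;;
    *) echo "unsupported cue: $1" >&2; return 1 ;;
  esac
}

line="$(steer "Speak softly" "I missed you.") $(cue sigh) $(steer "Speak sadly" "It has been a long year.")"
echo "$line"
# prints: [Speak softly] I missed you. [sigh] [Speak sadly] It has been a long year.
```

Rejecting unknown cue names up front is a design choice for this sketch, so that an unlisted bracketed token is caught before it reaches the model.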

For most applications: Realtime TTS-2 (<250ms P90 latency)

Realtime TTS-2 is the flagship model, with the most expressive output, natural-language steering, and inline non-verbal cues. Start here unless you have a specific latency, cost, or stability constraint.

For latency-critical or cost-sensitive applications: Realtime TTS 1.5 Mini (<130ms P90, ~100ms median, down to $5 per million characters)

Choose Realtime TTS 1.5 Mini when minimal latency or cost is your top priority, for example, ultra-responsive voice agents where every millisecond matters.

For maximum stability: Realtime TTS 1.5 Max (<250ms P90, ~200ms median)

Realtime TTS 1.5 Max offers enhanced stability: fewer edge cases, better voice cloning fidelity, and more consistent output across languages.

Realtime TTS 1.5 Mini achieves <130ms P90 latency (~100ms median). Realtime TTS 1.5 Max delivers <250ms P90 (~200ms median) with enhanced stability and quality. Realtime TTS-2 delivers <250ms P90. All support real-time streaming via WebSocket. For most applications, we recommend Realtime TTS-2; choose Mini when every millisecond matters.
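
To compare the median and P90 figures with your own measurements, both can be computed from a list of first-chunk latencies in milliseconds. The sketch below uses the nearest-rank method; the function names and the sample values are made up for illustration and are not Inworld measurements.

```shell
#!/usr/bin/env bash
# percentile P
# Reads one latency (ms) per line on stdin and prints the P-th percentile
# using the nearest-rank method: rank = ceil(P/100 * N).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      r = int((p * NR + 99) / 100)   # integer ceiling of p*NR/100
      if (r < 1) r = 1
      print v[r]
    }'
}

# Illustrative sample of ten first-chunk latencies, in milliseconds.
sample_latencies() {
  printf '%s\n' 95 102 98 110 120 99 101 97 240 105
}

sample_latencies | percentile 50   # prints 101
sample_latencies | percentile 90   # prints 120
```

The single 240 ms outlier does not move the P90 here, which is why a P90 figure says more about typical production behavior than a maximum does.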

Inworld provides two types of voice cloning:

Instant (zero-shot) voice cloning

Available to all users in the Portal
Creates a custom voice from just 5 to 15 seconds of audio
Ready to use in minutes

Professional voice cloning

Fine-tuned using 30+ minutes of clean audio (minimum ~5 minutes, 20+ minutes recommended for best results)
Recommended for uncommon voice types such as children's voices or unique accents, where instant cloning may not perform well
Currently available by contacting the Inworld sales team

Both methods allow you to use your cloned voice directly in the Playground or via API using its unique voiceId.

Realtime TTS supports over 100 languages, with full core support across our most-used languages and experimental support for the rest. Cross-lingual capabilities let you reuse a single voice across multiple languages, designed for multilingual products that don't want a separate voice per market.

See the full list of supported languages and current per-model support tiers in the multilingual support docs.

Realtime TTS provides several ways to customize how the speech sounds:

Voice parameters

Temperature: Controls expressiveness and randomness
Talking speed: 0.5× to 1.5× of the native speaking rate

Realtime TTS supports timestamp alignment at the word, character, phoneme, and viseme level, which is useful for subtitles, captions, lip-sync, and more.

You can enable it in your API request by setting timestampType to WORD or CHARACTER.

The API response includes:

word or character tokens
start and end timestamps (in seconds)
structured alignment data matching the generated audio
phoneme-level timing and viseme symbols for lip-sync (TTS 1.5 models only)

Timestamp alignment currently supports English, with other languages available experimentally. Note that enabling timestamps currently adds roughly 100 ms of additional latency.
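
A hedged sketch of a request body with timestamps enabled. The `timestampType` field and its WORD and CHARACTER values come from the text above; placing it at the top level of the body, next to the fields used in the earlier curl example, is an assumption, and the function name is hypothetical. Check the API reference for the exact shape.

```shell
#!/usr/bin/env bash
# timestamp_request_body TYPE
# Prints a JSON request body with timestampType set to WORD or CHARACTER.
# Assumption: timestampType sits at the top level of the body.
timestamp_request_body() {
  case "$1" in
    WORD|CHARACTER) ;;
    *) echo "timestampType must be WORD or CHARACTER" >&2; return 1 ;;
  esac
  printf '{"text": "Hello there.", "voice_id": "Clive", "model_id": "inworld-tts-2", "timestampType": "%s"}\n' "$1"
}

timestamp_request_body WORD

# The body could then be sent the same way as the other requests on this page:
#   curl -X POST https://api.inworld.ai/tts/v1/voice \
#     -H "Authorization: Basic $INWORLD_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d "$(timestamp_request_body WORD)"
```

Because enabling timestamps adds roughly 100 ms of latency, it may be worth requesting them only on the calls that need captions or lip-sync.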

Realtime TTS 1.5 is a major update delivering improvements across speed, quality, and accessibility:

The Fastest: <130ms P90 latency (~100ms median) on Realtime TTS 1.5 Mini, among the lowest-latency realtime TTS models available. Realtime TTS 1.5 Max delivers <250ms P90 (~200ms median) with enhanced quality.

The Highest Quality: Optimized stability to minimize hallucinations, cutoffs, and artifacts. Over 30% more expressive than Realtime TTS 1.

The Most Accessible: 100+ languages, enhanced voice cloning, on-premise H100/B200 deployment, and pricing down to $5 per million characters with Realtime TTS 1.5 Mini.

Which model should I use? For most applications, we recommend Realtime TTS-2 for its expressiveness and steering. Use Realtime TTS 1.5 Mini when minimal latency or cost is the top priority, and Realtime TTS 1.5 Max when stability matters most.

Inworld provides a free, open-source ElevenLabs Migration Tool that lets you batch-transfer your custom voice clones from ElevenLabs to Inworld. The tool automatically downloads your ElevenLabs voice samples, handles audio processing (format conversion, padding, and trimming), and re-clones them in Inworld.

The migration runs entirely on your local machine with direct API communication; no data is proxied through any intermediary servers. You can also preview your migrated voices with Realtime TTS before finalizing.
