Inworld TTS

Engage every user with the #1 ranked, most natural Voice AI

#1 ranked TTS with human-like expression and realtime sub-200ms latency that feels like a real conversation. Custom voices with instant cloning or text-based voice design. Fully multilingual, built for streaming, and a fraction of the cost of other providers.

Get started | Read the docs | Talk to an architect

Ranked Quality | Realtime Latency | 15+ Languages

# Stream speech from text
curl -X POST https://api.inworld.ai/tts/v1/voice:stream \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi! What can I help you with today?",
    "voice_id": "Clive",
    "model_id": "inworld-tts-1.5-max",
    "audio_config": {
      "audio_encoding": "OGG_OPUS",
      "sample_rate_hertz": 16000
    }
  }'

# Clone a voice from an audio sample
curl -X POST https://api.inworld.ai/voices/v1/voices:clone \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -F "audio_bytes=@sample.wav" \
  -F "voice_name=my-custom-voice"

# Then use it
curl -X POST https://api.inworld.ai/tts/v1/voice \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is my cloned voice.",
    "voice_id": "my-custom-voice",
    "model_id": "inworld-tts-1.5-max"
  }'

The top ranked TTS in the world. Proven by real users.

Three of the top five models on Artificial Analysis are Inworld models, ranked through blind tests by thousands of real users rather than internal evals. TTS-1.5 Max delivers over 30% more expressiveness than previous models, with optimized stability to eliminate hallucinations and artifacts.

Test quality in Playground

Ranked on Artificial Analysis:

#1 Inworld TTS 1.5 Max: ELO 1,215
#2 ElevenLabs Eleven v3: ELO 1,179
#3 Inworld TTS 1 Max: ELO 1,169
#4 MiniMax Speech 2.8 HD: ELO 1,169
#5 Inworld TTS 1.5 Mini: ELO 1,165
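
To get a feel for what the gaps between these scores mean, here is a minimal sketch that converts two ELO scores into an expected head-to-head preference rate. It assumes the leaderboard follows the conventional Elo logistic curve with a 400-point scale, which the page does not state; `elo_expected` is a hypothetical helper name.

```shell
# Expected share of blind comparisons won by model A over model B,
# assuming the conventional Elo curve with a 400-point scale.
elo_expected() {
  # $1 = rating of model A, $2 = rating of model B (no thousands separators)
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", 1 / (1 + 10 ^ ((b - a) / 400)) }'
}

# TTS 1.5 Max (1215) against Eleven v3 (1179), scores taken from the ranking above
elo_expected 1215 1179
```

Under that assumption, a 36-point lead corresponds to listeners preferring the higher-ranked model in roughly 55% of pairings.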

Instant custom voice creation.

Create custom voices instantly from 15 seconds of audio or a text description. Fine-tune with professional voice cloning for maximum fidelity. All methods produce production-ready voices you can use in the Playground or via API.

Instant cloning: 15 seconds of audio → ready in seconds
Text-based voice design for full creative control
Professional cloning: 30+ minutes for maximum fidelity

Try voice cloning

15s audio sample or text description → Voice created ("my-cloned-voice") → Use via API or Playground

Realtime latency. Feels instant.

Built for realtime from the ground up: audio streams over WebSocket as soon as it is synthesized, with no buffering delay. Comparable latency to competitors at a fraction of the cost.

First-chunk audio arrives in a fraction of a typical human response time
Streaming-native via WebSocket
Consistent P90 performance under production load

Test speed in Playground

Realtime first-chunk latency:

TTS-1.5 Mini: ~130ms
TTS-1.5 Max: ~250ms
Human: 350ms

15+ languages. Native-speaker quality.

English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language with cross-lingual cloning. Deploy globally without separate pipelines.

Explore voices

🇺🇸 Hello world | 🇨🇳 你好世界 | 🇮🇳 नमस्ते दुनिया | 🇪🇸 Hola mundo | 🇫🇷 Bonjour le monde | 🇰🇷 안녕하세요 세계 | 🇮🇹 Ciao mondo | 🇯🇵 こんにちは世界 | And more

Starting at $15 per million characters.

TTS-1.5 Mini starts at $15/million characters. TTS-1.5 Max at $30/million. The next best option is over $150. Scale to millions of users without scale-related cost anxiety.

View pricing

Price per million characters:

Inworld (TTS-1.5): $15
ElevenLabs: >$150
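
As a rough illustration of how per-character pricing turns into a bill, the sketch below multiplies a character count by a per-million price. The prices are the ones quoted in this section; the 50 million character volume and the `estimate_cost` helper name are made up for the example.

```shell
# Estimate TTS spend in USD from a character count and a per-million-character price.
estimate_cost() {
  # $1 = characters synthesized, $2 = price in USD per million characters
  awk -v chars="$1" -v price="$2" 'BEGIN { printf "%.2f\n", chars / 1000000 * price }'
}

# Hypothetical workload of 50 million characters per month
estimate_cost 50000000 15   # TTS-1.5 Mini at $15/million
estimate_cost 50000000 30   # TTS-1.5 Max at $30/million
```

For that example volume, this prints 750.00 for Mini and 1500.00 for Max.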

Built for voice-first applications

Interactive, real-time, and voice-driven experiences. Not batch processing.

Voice Agents

Sub-200ms latency and streaming-native architecture for conversational AI that feels real.

Full breakdown

Feature | TTS-1.5 Max | TTS-1.5 Mini
Best for | Most applications | Latency-critical applications
Pricing | $30/million characters | $15/million characters
P90 latency | <250ms | <130ms
Quality | Maximum expressiveness and stability | High expressiveness
Multilingual | 15 languages

Also listed: Voice cloning; Professional voice cloning; Character, word, viseme and phoneme timestamps; Custom pronunciation; On-premise; Zero data retention

We recommend TTS-1.5 Max for most use cases. The enhanced stability and quality are worth the marginal latency tradeoff for the vast majority of applications.

Research

Cutting-edge research

Publications

Explore our latest research advancing the state of the art in speech synthesis, voice cloning, and real-time TTS

TTS-1 Technical Report

Training code available

Open source

We’ve open-sourced the full training framework behind Inworld TTS-1 — everything from codec to SpeechLM fine-tuning — so you can build your own high-quality TTS models faster.

Inworld Text-To-Speech Trainer

Integrations

LiveKit: Real-time web and mobile voice AI with low latency and streaming.
NLX: No-code/low-code platform for multichannel voice experiences.
Pipecat: Open-source video/audio API with streaming TTS for interactive apps.
Vapi: Cloud telephony for voice agents with PSTN and SIP support.
Stream: Stream (Vision Agents) is Stream’s open-source framework that helps developers quickly build low-latency vision AI applications.
Ultravox: Ultravox is a real-time voice AI infrastructure layer that delivers fast, natural, and scalable voice agents.
Voximplant: Voximplant is a serverless Voice AI orchestration platform and cloud communications stack for building real-time voice agents over the phone and the web.

Try Inworld TTS now

Get started with TTS-1.5 Max, the best balance of quality and speed for most applications.

Try TTS Playground | Try API | Talk to our team

FAQs

How do I use text-to-speech?

Getting started is simple. You can try Inworld TTS instantly in the TTS Playground, where you can test voices, adjust settings, and experiment with features like instant voice cloning.

When you’re ready to integrate TTS into your application, you can follow the Developer Quickstart to make your first API request in minutes. Just create an API key in the Inworld Portal, then synthesize speech with a single POST request. Inworld supports multiple output formats, including MP3, Linear PCM (WAV), and Opus, making it easy to integrate with almost any system.

How is Inworld TTS quality evaluated?

Inworld TTS is evaluated through blind listening tests by thousands of real users. TTS-1.5 Max delivers over 30% more expressiveness than its predecessor, with optimized stability and natural conversational delivery.

Which TTS-1.5 model should I use?

For most applications: TTS-1.5 Max (~200ms latency, $10/1M characters)

TTS-1.5 Max offers the best balance of quality and speed. The enhanced stability means fewer edge cases, better voice cloning fidelity, and more consistent output across languages.

For latency-critical applications: TTS-1.5 Mini (<100ms latency, $5/1M characters)

Choose TTS-1.5 Mini only if minimal latency is your absolute top priority — for example, real-time gaming or ultra-responsive voice agents where every millisecond matters.

What is the latency and time-to-first-byte (TTFB) of Inworld TTS?

TTS-1.5 Mini achieves <120ms P90 latency. TTS-1.5 Max delivers ~200ms with enhanced stability and quality. Both support real-time streaming via WebSocket. For most applications, we recommend TTS-1.5 Max — the quality improvement is worth the marginal latency tradeoff.

Does Inworld offer voice cloning?

Yes. Inworld provides two types of voice cloning:

Instant (zero-shot) voice cloning

Available to all users in the Portal
Creates a custom voice from just 15 seconds of audio
Ready to use in minutes

Professional voice cloning

Fine-tuned using 30+ minutes of clean audio (minimum ~5 minutes, 20+ minutes recommended for best results)
Recommended for uncommon voice types such as children's voices or unique accents, where instant cloning may not perform well
Currently available by contacting the Inworld sales team

Both methods allow you to use your cloned voice directly in the Playground or via API using its unique voiceId.

Which languages does Inworld TTS support?

Inworld TTS-1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew.

For multilingual applications, we recommend TTS-1.5 Max for the best pronunciation, intonation, and naturalness across all supported languages.

Can I control emotion, speed, and other voice characteristics?

Absolutely. Inworld TTS provides several ways to customize how the speech sounds:

Voice parameters

Temperature: Controls expressiveness and randomness
Talking speed: 0.5× to 1.5× of the native speaking rate
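
Since the talking speed is bounded, it can help to check a value on the client before sending a request. The following is only a sketch: the 0.5 to 1.5 range comes from the list above, while `valid_talking_speed` is a hypothetical helper and the name of the corresponding API field is not given here.

```shell
# Hypothetical client-side check: succeeds only for a plain decimal in [0.5, 1.5].
valid_talking_speed() {
  awk -v s="$1" 'BEGIN {
    if (s !~ /^[0-9]+(\.[0-9]+)?$/) exit 1   # reject anything that is not a plain decimal
    exit !(s >= 0.5 && s <= 1.5)             # status 0 = in range, 1 = out of range
  }'
}

for speed in 0.5 1.0 1.5 2.0; do
  if valid_talking_speed "$speed"; then
    echo "$speed: ok"
  else
    echo "$speed: out of range"
  fi
done
```

The loop reports the first three values as ok and 2.0 as out of range.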

Does Inworld support lipsync, word highlighting, or timestamp alignment?

Yes. Inworld TTS supports timestamp alignment for word, character, phoneme, and viseme level synchronization. This can be helpful for subtitles, captions, lipsync, and more.

You can enable it in your API request by setting timestampType to WORD or CHARACTER.

The API response includes:

word or character tokens
start and end timestamps (in seconds)
structured alignment data matching the generated audio
phoneme-level timing and viseme symbols for lip-sync (TTS 1.5 models only)

Timestamp alignment currently supports English, with other languages available experimentally. Note that enabling timestamps currently adds roughly 100 ms of additional latency.
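
A request body that opts into timestamps might be assembled as in the sketch below. The `timestampType` name and its WORD and CHARACTER values come from the answer above, and the other fields mirror the curl example at the top of this page. Placing `timestampType` at the top level of the body is an assumption, and `build_timestamp_body` is a hypothetical helper, so check the API reference before relying on this shape.

```shell
# Build a JSON body that requests timestamp alignment (assumed field placement).
build_timestamp_body() {
  # $1 = text to speak, $2 = voice id, $3 = WORD or CHARACTER
  text=$1 voice=$2 ts_type=$3
  case "$ts_type" in
    WORD|CHARACTER) ;;
    *) echo "timestampType must be WORD or CHARACTER" >&2; return 1 ;;
  esac
  # No JSON escaping is done here, so keep quotes and backslashes out of the text.
  printf '{"text":"%s","voice_id":"%s","model_id":"inworld-tts-1.5-max","timestampType":"%s"}\n' \
    "$text" "$voice" "$ts_type"
}

build_timestamp_body "Hello there." "Clive" "WORD"
```

The printed body could then be passed to curl with -d "$(build_timestamp_body ...)" against the synthesis endpoint shown earlier.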

What's new in TTS-1.5?

TTS-1.5 is a major update delivering improvements across speed, quality, and accessibility:

The Fastest: TTS-1.5 Mini achieves <120ms P90 latency, the fastest realtime TTS available. TTS-1.5 Max delivers ~200ms with enhanced quality.

The Highest Quality: Optimized stability to minimize hallucinations, cutoffs, and artifacts. Over 30% more expressive than TTS-1.

The Most Accessible: 15 languages (including Hindi), enhanced voice cloning, on-premise H100/B200 deployment, and 25x lower cost than alternatives.

Which model should I use? For most applications, we recommend TTS-1.5 Max. Use TTS-1.5 Mini only when minimal latency is the top priority.

Can I migrate my voices from ElevenLabs to Inworld?

Yes. Inworld provides a free, open-source ElevenLabs Migration Tool that lets you batch-transfer your custom voice clones from ElevenLabs to Inworld. The tool automatically downloads your ElevenLabs voice samples, handles audio processing (format conversion, padding, and trimming), and re-clones them in Inworld.

The migration runs entirely on your local machine with direct API communication — no data is proxied through any intermediary servers. You can also preview your migrated voices with Inworld TTS before finalizing.
