Inworld TTS-1.5

The #1 ranked TTS model. Production-grade realtime latency under 200ms. Expression and stability optimized for user engagement.

Production-grade realtime latency

Professional voice actor quality at human-native speeds
P90 Latency
<250ms P90 first chunk latency for Max, <130ms for Mini. ~4x faster than TTS-1. Even your slowest requests feel instant.
Median Latency
<200ms median first chunk latency for Max, <100ms for Mini. ~4x faster than TTS-1. Voice agents respond before users notice a delay.
Streaming-native
Built for realtime from the ground up. Audio streams over WebSocket the instant it's synthesized. No buffering delay.

Engagement-optimized quality

The expression and stability you need to keep users engaged
#1 on public benchmarks. Truly expressive.
#1 on Artificial Analysis. Blind tests by thousands of real users, not internal evals. TTS-1.5 adds over 30% more expressiveness.
Optimized stability
Engineered to minimize hallucinations, word cutoffs, and audio artifacts. TTS-1.5 has a 40% lower word error rate.
Enhanced voice cloning
Improved stability and realism. Create custom voices instantly from 5–15 seconds of audio, or fine-tune with professional voice cloning for maximum fidelity.

Built for consumer-scale

Half a cent per minute of interaction. Enhanced multilingual support. On-prem deployment options.
Enhanced multilingual
English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language. Deploy globally without separate pipelines.
[Image: language-selection dropdown showing Korean, Spanish, and German]
On-prem deployment
Full data sovereignty for enterprises with compliance requirements. Same quality, your environment.
Half a cent per minute
Compare that with 25 cents per minute for the next best option. Inworld costs $5-10 per million characters; others charge over $120.

Full breakdown

Feature         TTS-1.5 Max                      TTS-1.5 Mini
Best for        Most applications                Latency-critical applications
Pricing         1c per min ($10/million char)    0.5c per min ($5/million char)
P90 Latency     <250ms                           <130ms
Quality         #1 ranked, maximum stability     #1 ranked
Multilingual    15 languages                     15 languages
Voice cloning
Professional voice cloning
Character, word, viseme and phoneme timestamps
Custom pronunciation
On-Premise

We recommend TTS-1.5 Max for most use cases. The enhanced stability and quality are worth the marginal latency tradeoff for the vast majority of applications.
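
As a rough illustration of how the per-character and per-minute prices listed above relate, the sketch below estimates cost from a character count. The per-million-character prices are the ones quoted on this page; the function and dictionary names are hypothetical. The roughly 1,000 characters per minute it derives is only what the two quoted price units imply together, not a published speaking rate.

```python
# Prices per million characters, as quoted in the comparison above.
PRICE_PER_MILLION_CHARS_USD = {
    "max": 10.0,   # TTS-1.5 Max
    "mini": 5.0,   # TTS-1.5 Mini
}


def estimate_cost_usd(num_chars: int, model: str = "max") -> float:
    """Estimate synthesis cost in USD for a given number of characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars * PRICE_PER_MILLION_CHARS_USD[model] / 1_000_000


def implied_chars_per_minute(cents_per_minute: float, model: str) -> float:
    """Characters per minute implied by quoting both a per-minute
    and a per-character price for the same model."""
    dollars_per_char = PRICE_PER_MILLION_CHARS_USD[model] / 1_000_000
    return (cents_per_minute / 100) / dollars_per_char
```

For example, `estimate_cost_usd(500_000, "mini")` gives 2.5, and both models' quoted price pairs work out to about 1,000 characters per minute of audio.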

Try Inworld TTS now

Get started with TTS-1.5 Max, the best balance of quality and speed for most applications.

FAQs

How do I use text-to-speech?
Getting started is simple. You can try Inworld TTS instantly in the TTS Playground, where you can test voices, adjust settings, and experiment with features like instant voice cloning.
When you’re ready to integrate TTS into your application, you can follow the Developer Quickstart to make your first API request in minutes. Just create an API key in the Inworld Portal, then synthesize speech with a single POST request. Inworld supports multiple output formats, including MP3, Linear PCM (WAV), and Opus, making it easy to integrate with almost any system.
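
A minimal sketch of what such a POST request might look like, using only the Python standard library. Everything specific here is an assumption for illustration: the endpoint URL is a placeholder, the `modelId` and `audioConfig` field names and the `Basic` authorization scheme are guesses, and only `voiceId` is named on this page. Check the Developer Quickstart for the real values.

```python
import json
import urllib.request

# Placeholder endpoint: replace with the URL given in the Developer Quickstart.
TTS_ENDPOINT = "https://api.example.com/tts/v1/voice"


def build_tts_request(api_key: str, text: str, voice_id: str,
                      model_id: str = "tts-1.5-max",
                      audio_encoding: str = "MP3") -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis.

    Apart from voiceId, the field names, the model identifier, and the
    Authorization scheme are assumptions made for illustration.
    """
    payload = {
        "text": text,
        "voiceId": voice_id,
        "modelId": model_id,                               # assumed name and value
        "audioConfig": {"audioEncoding": audio_encoding},  # assumed field names
    }
    return urllib.request.Request(
        TTS_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request would then be a matter of passing the returned object to `urllib.request.urlopen` and reading the response body.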
How do Inworld's models perform on public benchmarks?
Inworld TTS Max launched as the #1 model on the Artificial Analysis TTS Arena.
Which TTS-1.5 model should I use?
For most applications: TTS-1.5 Max (~200ms latency, $10/1M characters)
TTS-1.5 Max offers the best balance of quality and speed. The enhanced stability means fewer edge cases, better voice cloning fidelity, and more consistent output across languages.
For latency-critical applications: TTS-1.5 Mini (<100ms latency, $5/1M characters)
Choose TTS-1.5 Mini only if minimal latency is your absolute top priority — for example, real-time gaming or ultra-responsive voice agents where every millisecond matters.
What is the latency and time-to-first-byte (TTFB) of Inworld TTS?
TTS-1.5 Mini achieves <120ms P90 latency. TTS-1.5 Max delivers ~200ms with enhanced stability and quality. Both support real-time streaming via WebSocket. For most applications, we recommend TTS-1.5 Max — the quality improvement is worth the marginal latency tradeoff.
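
To check figures like these against your own setup, one approach is to time how long the first audio chunk takes to arrive and then summarize many such samples as a median and a P90. The sketch below is one way to do that, not Inworld's measurement method: `chunks` stands for any iterable that yields audio chunks as they arrive (for example, a wrapper around a WebSocket stream), and both function names are hypothetical.

```python
import math
import time
from typing import Iterable, List


def time_to_first_chunk(chunks: Iterable[bytes]) -> float:
    """Return the seconds elapsed until the first chunk arrives from a stream."""
    start = time.perf_counter()
    iterator = iter(chunks)
    next(iterator)  # blocks until the first audio chunk is available
    return time.perf_counter() - start


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting `time_to_first_chunk` over a few hundred requests and calling `percentile(samples, 50)` and `percentile(samples, 90)` gives numbers comparable in kind to the median and P90 figures quoted on this page, though network distance and client overhead will affect the result.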
Does Inworld offer voice cloning?
Yes. Inworld provides two types of voice cloning:

Instant (zero-shot) voice cloning

  • Available to all users in the Portal
  • Creates a custom voice from just 15 seconds of audio
  • Ready to use in minutes

Professional voice cloning

  • Fine-tuned using 30+ minutes of clean audio (minimum ~5 minutes, 20+ minutes recommended for best results)
  • Recommended for uncommon voice types such as children's voices or unique accents, where instant cloning may not perform well
  • Currently available by contacting the Inworld sales team
Both methods allow you to use your cloned voice directly in the Playground or via API using its unique voiceId.
Which languages does Inworld TTS support?
Inworld TTS-1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew.
For multilingual applications, we recommend TTS-1.5 Max for the best pronunciation, intonation, and naturalness across all supported languages.
Can I control emotion, speed, and other voice characteristics?
Absolutely. Inworld TTS provides several ways to customize how the speech sounds:

Voice parameters

  • Temperature: Controls expressiveness and randomness
  • Talking speed: 0.5× to 1.5× of the native speaking rate
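
A small sketch of how a client might validate these two parameters before sending a request. The 0.5 to 1.5 speed range comes from the list above; the key names `temperature` and `talkingSpeed` and the non-negative check on temperature are assumptions, since this page does not specify them.

```python
MIN_SPEED = 0.5  # lower bound stated above, relative to the native speaking rate
MAX_SPEED = 1.5  # upper bound stated above


def build_voice_params(temperature: float, talking_speed: float) -> dict:
    """Validate voice parameters and return them as a dict.

    The key names and the temperature lower bound are hypothetical.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not MIN_SPEED <= talking_speed <= MAX_SPEED:
        raise ValueError(
            f"talking_speed must be between {MIN_SPEED} and {MAX_SPEED}"
        )
    return {"temperature": temperature, "talkingSpeed": talking_speed}
```

Rejecting out-of-range values on the client, rather than silently clamping them, makes a misconfigured speed visible before any audio is generated.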
Does Inworld support lipsync, word highlighting, or timestamp alignment?
Yes. Inworld TTS supports timestamp alignment at the word, character, phoneme, and viseme level. This is useful for subtitles, captions, lipsync, and more.
You can enable it in your API request by setting timestampType to WORD or CHARACTER.
The API response includes:
  • word or character tokens
  • start and end timestamps (in seconds)
  • structured alignment data matching the generated audio
  • phoneme-level timing and viseme symbols for lip-sync (TTS-1.5 models only)
Timestamp alignment currently supports English, with other languages available experimentally. Note that enabling timestamps currently adds roughly 100 ms of additional latency.
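
As an example of putting word-level timestamps to use, the sketch below turns them into SRT-style caption cues. The input shape is an assumption: a list of dicts with `word`, `start`, and `end` keys (times in seconds), which you would build from the alignment data in the API response. The actual response field names are not given on this page, and both function names are hypothetical.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group word-level timestamps into numbered SRT caption cues.

    Each cue spans from the start of its first word to the end of its last.
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")
```

The same start and end times can drive word highlighting in a UI by comparing them with the audio player's current playback position.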
What's new in TTS-1.5?
TTS-1.5 is a major update delivering improvements across speed, quality, and accessibility:
The Fastest: <120ms P90 latency — the fastest realtime TTS available. TTS-1.5 Max delivers ~200ms with enhanced quality.
The Highest Quality: Optimized stability to minimize hallucinations, cutoffs, and artifacts. #1 on Artificial Analysis.
The Most Accessible: 15 languages (including Hindi), enhanced voice cloning, on-premise H100/B200 deployment, and 25x lower cost than alternatives.
Which model should I use? For most applications, we recommend TTS-1.5 Max. Use TTS-1.5 Mini only when minimal latency is the top priority.
Copyright © 2021-2026 Inworld AI