We've been listening to developers building with Inworld TTS, and today's release addresses your most-requested features. These updates give you more control, better performance, and new capabilities for creating expressive voice experiences, at the same accessible price.
Whether you're building interactive game characters, consumer applications, or call center agents, these updates address the most common pain points developers face when integrating text-to-speech into their products.
Here's what's new.
Performance improvements - now #1 on the Artificial Analysis TTS Leaderboard
Speed and quality are critical for real-time voice. Inworld TTS is now faster, smoother, and more natural across production workloads. Inworld TTS 1 Max just ranked #1 on the Artificial Analysis Text to Speech Leaderboard, which benchmarks the leading TTS models on realism and performance.
Quality improvements
New TTS models deliver clearer, more consistent, and more human-like speech.
- Clearer articulation: Lower word error rate (WER) and better intelligibility on long or complex sentences.
- Improved voice cloning: Higher speaker-similarity scores; voices retain tone, pacing, and emotion even across languages.
- More accurate multilingual output: Fewer accent mismatches and more natural pronunciation across supported languages.
Latency improvements
We’ve reduced latency across multiple layers of our stack:
- Infrastructure migration: New server placements cut internal round-trip time by ~50 ms, especially benefiting users in the US and Europe.
- Optional text normalization: Disable text normalization in the API to save 30–40 ms for English (up to 300 ms on complex text) and up to 1 second in other languages.
- WebSocket streaming: Persistent connections reduce handshakes, enabling faster starts and smoother real-time dialogue.
- Faster inference: Inworld TTS Max now runs on an optimized hardware stack, enabling responses that are ~15% faster.
WebSocket support
For real-time conversational applications, our new WebSocket API offers persistent connections with comprehensive streaming controls.
HTTP requests work fine for simple TTS, but they add overhead when you're building voice agents, interactive characters, or phone call agents, as each request requires connection setup.
WebSockets keep a persistent connection open. You can stream text as it arrives from your LLM, maintain conversation context, and handle interruptions gracefully.
Three ways WebSockets give you more control:
- Context management: Run multiple independent audio streams over a single connection. Each context maintains its own voice settings, prosody, and buffer state.
- Smart buffering: Configure when synthesis begins with maxBufferDelayMs and bufferCharThreshold. Start generating audio before complete text arrives, or wait for full sentences (see the sketch after this list).
- Dynamic control: Update voice parameters mid-stream, flush contexts manually, or handle user interruptions without dropping the connection.
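To make this concrete, here's a minimal sketch of streaming LLM output over a persistent connection. Only maxBufferDelayMs and bufferCharThreshold are named above; the endpoint URL and message schema in this sketch are illustrative assumptions, so check the API reference for the exact protocol.

```python
# Minimal sketch of streaming LLM output over a persistent TTS WebSocket.
# Only maxBufferDelayMs and bufferCharThreshold are documented above; the
# endpoint URL and message schema here are illustrative assumptions.
import asyncio
import json

import websockets


async def stream_speech(text_chunks):
    # Hypothetical endpoint; take the real URL from the API reference.
    async with websockets.connect("wss://api.inworld.ai/tts/v1/stream") as ws:
        # Open a context with buffering controls: begin synthesis once 40
        # characters accumulate, or after 200 ms, whichever comes first.
        await ws.send(json.dumps({
            "type": "open_context",
            "contextId": "ctx-1",
            "voiceId": "my-voice",
            "maxBufferDelayMs": 200,
            "bufferCharThreshold": 40,
        }))
        # Forward text as it arrives from the LLM instead of waiting for
        # the full response.
        for chunk in text_chunks:
            await ws.send(json.dumps(
                {"type": "text", "contextId": "ctx-1", "text": chunk}
            ))
        await ws.send(json.dumps({"type": "flush", "contextId": "ctx-1"}))
        # Collect audio events until the server closes out the context.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "audio":
                play(event["audio"])  # hand the chunk to your audio pipeline
            elif event.get("type") == "context_done":
                break


def play(audio_chunk):
    ...  # decode and queue for playback


asyncio.run(stream_speech(["Hello", " there,", " how can I help?"]))
```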
Perfect for:
- Interactive voice agents that require low latency
- Dynamic conversations where barge-in or interruption support is needed
Timestamp alignment: Sync audio with visuals & actions
Building lipsync for 3D avatars? Highlighting words as they're spoken? Triggering gameplay actions at specific moments in speech? Handling barge-in and interruptions? You need timestamps.
Timestamp alignment returns precise timing information that matches your generated audio. Choose the granularity that fits your use case:
Word-level timestamps for:
- Karaoke-style caption highlighting
- Triggering character actions when specific words play
- Tracking where users interrupt the AI
- Syncing UI elements with speech
Character-level timestamps are most common for lipsync animation, where they can be converted to phonemes and visemes.
Timestamps are currently supported for English in non-streaming mode; support for other languages is experimental.
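As an illustration of the caption-highlighting use case, here's a minimal sketch that assumes a response shape of (word, startMs, endMs); the actual field names are defined in the API reference.

```python
# Sketch: driving karaoke-style caption highlights from word timestamps.
# The (word, startMs, endMs) shape is an assumption for illustration;
# the API reference defines the actual field names.
import time

timestamps = [
    {"word": "Welcome", "startMs": 0, "endMs": 320},
    {"word": "back,", "startMs": 320, "endMs": 610},
    {"word": "traveler.", "startMs": 610, "endMs": 1150},
]


def highlight_words(timestamps, playback_start):
    # Highlight each word as the playhead reaches its start time.
    for entry in timestamps:
        delay = playback_start + entry["startMs"] / 1000 - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        print(f"highlight: {entry['word']}")


highlight_words(timestamps, playback_start=time.monotonic())
```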
Voice cloning API for programmatic voice creation
Voice cloning is no longer limited to our UI. Now you can create custom voices directly through the API. Available in beta to select customers.
Why this matters:
If you're building a platform where end users need to clone their own voices, you can now integrate that experience directly into your app, without redirecting users to Inworld's interface. You can also create voices in bulk using a simple script.
Use cases:
- Games where players create their own character voices
- Social platforms where users create their own avatars
- Games or call centers where a large number of voices need to be created in bulk from pre-recorded audio samples
The voice cloning API enables third-party platforms to offer voice creation as a native feature in their own workflows, or to create voices in bulk.
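As an illustration of the bulk workflow, here's a rough sketch of a script that creates one voice per pre-recorded sample. The endpoint, field names, and response shape are assumptions (the API is in beta), so treat this as a starting point rather than the actual contract.

```python
# Sketch of bulk voice creation from pre-recorded samples. The endpoint,
# field names, and response shape are assumptions (the API is in beta);
# treat this as a starting point, not the actual contract.
import os

import requests

API_KEY = os.environ["INWORLD_API_KEY"]
SAMPLES_DIR = "voice_samples"  # one audio file per voice to create

for filename in sorted(os.listdir(SAMPLES_DIR)):
    voice_name = os.path.splitext(filename)[0]
    with open(os.path.join(SAMPLES_DIR, filename), "rb") as audio:
        resp = requests.post(
            "https://api.inworld.ai/tts/v1/voices",  # hypothetical URL
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"sample": audio},
            data={"displayName": voice_name, "tags": "npc,batch-import"},
        )
    resp.raise_for_status()
    print(f"created {voice_name}: {resp.json().get('voiceId')}")
```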
Custom voice tags
When creating a custom voice in the UI or API, you can now apply tags for grouping and filtering.
Why this matters:
You can now easily manage a large database of voices and filter for the appropriate voice at runtime, which is highly valuable in games and related applications, where characters are often generated on the fly.
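For instance, here's a minimal sketch of runtime voice selection by tag, assuming a fetched voice list where each entry carries the tags applied at creation time; the structure shown is illustrative.

```python
# Sketch: selecting a voice at runtime by tag. Assumes a fetched voice
# list where each entry carries the tags applied at creation time; the
# structure shown is illustrative.
import random

voices = [
    {"voiceId": "v-101", "tags": {"female", "elf", "calm"}},
    {"voiceId": "v-102", "tags": {"male", "dwarf", "gruff"}},
    {"voiceId": "v-103", "tags": {"female", "dwarf", "gruff"}},
]


def pick_voice(voices, required_tags):
    # Randomize among matches so on-the-fly characters don't all
    # end up with the same voice.
    matches = [v for v in voices if required_tags <= v["tags"]]
    return random.choice(matches) if matches else None


npc_voice = pick_voice(voices, {"dwarf", "gruff"})
```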
Use cases:
- Gaming platforms where characters are generated on the fly and need to be matched to an appropriate voice
- Enterprise apps where the optimal voice is chosen at runtime based on the user profile
- Applications that are still in development, where managing and iterating on a large number of voices is an essential workflow in the design process
Voice tags are the first step toward a larger voice library and management system.
Custom pronunciation: Say it your way
Getting AI voices to pronounce words correctly matters. Brand names, character names, technical terms, and regional dialects are often misspoken by standard TTS models because they aren't represented well in the training data.
You can now insert phonetic notation directly into your text for consistent, accurate pronunciation of key words. Not sure which phonemes to use? Ask ChatGPT or your favorite AI assistant for the IPA transcription, or check reference sites like Vocabulary.com's IPA Pronunciation Guide.
Common use cases:
- Brand names that need to sound perfect every time
- Unique personal or character names
- Medical, legal, or technical terminology
- Regional pronunciation variations
- Fictional locations and proper nouns
We support International Phonetic Alphabet (IPA) notation.
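As a rough illustration, here's what tagged input might look like. The SSML-style <phoneme> markup below is an assumption for readability; the docs define the exact notation Inworld accepts.

```python
# Sketch: inline IPA for a fictional place name. The SSML-style <phoneme>
# tag is an assumption for illustration; the docs define the exact
# notation Inworld accepts.
text = (
    'Welcome to <phoneme ph="ˈzaɪ.lɒθ">Xyloth</phoneme>, '
    "the city beneath the glass sea."
)
```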
Russian support and multilingual improvements
Inworld TTS now speaks Russian, bringing our total to 12 supported languages: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Dutch, Polish, and Russian.
Clone a voice and label it as Russian, or choose one of our pre-built Russian voices. As with all languages, voices perform best when synthesizing text in their native language, though cross-language synthesis is possible.
We've also made quality improvements across all non-English languages: better pronunciation accuracy, more natural intonation, and smoother speech patterns.
For multilingual applications, Inworld TTS Max delivers the strongest results, with superior pronunciation and more contextually aware speech across languages.
Try these features today
All features are available now through our API and TTS Playground, at the same accessible pricing.
Frequently asked questions
How do I convert timestamps to visemes for lipsync?
The typical pipeline: character timestamps → phonemes (using tools like PocketSphinx) → visemes (using your game engine's mapping). Our timestamps provide the timing foundation.
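As a sketch of that final step, here's an illustrative phoneme-to-viseme mapping; the viseme names below are placeholders for whatever set your engine uses.

```python
# Sketch of the final mapping step: timed phonemes to visemes. The
# viseme names are placeholders for whatever set your engine uses
# (e.g., a 15-viseme mouth-shape set).
PHONEME_TO_VISEME = {
    "AA": "aa", "AE": "aa", "AH": "aa",
    "P": "PP", "B": "PP", "M": "PP",
    "F": "FF", "V": "FF",
    "IY": "ih", "IH": "ih",
    # ...extend to cover your engine's full set
}


def to_viseme_track(phonemes):
    # phonemes: [(phoneme, start_ms, end_ms), ...] from your aligner;
    # unmapped phonemes fall back to the silence viseme "sil".
    return [
        (PHONEME_TO_VISEME.get(p, "sil"), start, end)
        for p, start, end in phonemes
    ]


track = to_viseme_track([("HH", 0, 80), ("AH", 80, 200), ("L", 200, 260)])
```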
How do I gracefully handle interruptions with WebSockets?
The WebSocket endpoint supports multiple independent contexts, enabling seamless barge-in handling. When a user interrupts, you can start a new, independent context and send the post-interruption agent response to it. The old context can be closed when the interruption occurs.
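A rough sketch of that flow, reusing the assumed message schema from the streaming example above:

```python
# Rough sketch of barge-in handling over one WebSocket connection,
# reusing the assumed message schema from the streaming example above.
import json


async def handle_barge_in(ws, old_context_id, new_context_id, agent_reply):
    # Stop the turn the user talked over...
    await ws.send(json.dumps(
        {"type": "close_context", "contextId": old_context_id}
    ))
    # ...then synthesize the post-interruption reply in a fresh,
    # independent context on the same connection.
    await ws.send(json.dumps({
        "type": "open_context",
        "contextId": new_context_id,
        "voiceId": "my-voice",
    }))
    await ws.send(json.dumps(
        {"type": "text", "contextId": new_context_id, "text": agent_reply}
    ))
```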
What are some techniques to optimize end-to-end latency?
To reduce latency, consider using the TTS streaming API, keeping a persistent WebSocket connection open, and disabling text normalization while instructing your LLM to produce speech-ready text via a system prompt.
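For that last point, here's a minimal example of a system prompt that yields speech-ready text; the wording is just illustrative.

```python
# Example system prompt that yields speech-ready text, so text
# normalization can be safely disabled. Wording is illustrative.
SYSTEM_PROMPT = (
    "You are a voice agent. Write replies exactly as they should be "
    "spoken: spell out numbers, dates, times, currencies, and "
    "abbreviations, e.g. 'three forty-five PM', not '3:45pm'."
)
```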