We've been listening to developers building with Inworld TTS, and today's release addresses your most-requested features. These updates give you more control, better performance, and new capabilities for creating expressive voice experiences, at the same accessible price.
Whether you're building interactive game characters, consumer applications, or call center agents, these updates address the most common pain points developers face when integrating text-to-speech into their products.
Here's what's new.
Performance improvements - now #1 on the Artificial Analysis TTS Leaderboard
Speed and quality are critical for real-time voice. Inworld TTS is now faster, smoother, and more natural across production workloads. Inworld TTS 1 Max just ranked #1 on the Artificial Analysis Text to Speech Leaderboard, which benchmarks the leading TTS models on realism and performance.
Quality improvements
New TTS models deliver clearer, more consistent, and more human-like speech.
- Clearer articulation: Lower word error rate (WER) and better intelligibility on long or complex sentences.
- Improved voice cloning: Higher speaker-similarity scores; voices retain tone, pacing, and emotion even across languages.
- More accurate multilingual output: Fewer accent mismatches and more natural pronunciation across supported languages.
Latency improvements
We’ve reduced latency across multiple layers of our stack:
- Infrastructure migration: New server placements cut internal round-trip time by ~50 ms, especially benefiting users in the US and Europe.
- Optional text normalization: Disable text normalization in the API to save 30–40 ms for English (up to 300 ms on complex text) and up to 1 second in other languages.
- WebSocket streaming: Persistent connections reduce handshakes, enabling faster starts and smoother real-time dialogue.
- Faster inference: Inworld TTS Max now runs on an optimized hardware stack, enabling responses that are ~15% faster.
WebSocket support
For real-time conversational applications, our new WebSocket API offers persistent connections with comprehensive streaming controls.
HTTP requests work fine for simple TTS, but they add overhead when you're building voice agents, interactive characters, or phone call agents, as each request requires connection setup.
WebSockets keep a persistent connection open. You can stream text as it arrives from your LLM, maintain conversation context, and handle interruptions gracefully.
Three ways WebSockets give you more control:
- Context management: Run multiple independent audio streams over a single connection. Each context maintains its own voice settings, prosody, and buffer state.
- Smart buffering: Configure when synthesis begins with maxBufferDelayMs and bufferCharThreshold. Start generating audio before complete text arrives, or wait for full sentences (see the sketch after this list).
- Dynamic control: Update voice parameters mid-stream, flush contexts manually, or handle user interruptions without dropping the connection.
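To make this concrete, here's a minimal sketch of streaming LLM output over a persistent connection. Only maxBufferDelayMs and bufferCharThreshold are named above; the endpoint URL and message schema in this sketch are illustrative assumptions, so check the API reference for the exact protocol.

```python
# Minimal sketch of streaming LLM output over a persistent TTS WebSocket.
# Only maxBufferDelayMs and bufferCharThreshold are documented above; the
# endpoint URL and message schema here are illustrative assumptions.
import asyncio
import json

import websockets


async def stream_speech(text_chunks):
    # Hypothetical endpoint; take the real URL from the API reference.
    async with websockets.connect("wss://api.inworld.ai/tts/v1/stream") as ws:
        # Open a context with buffering controls: begin synthesis once 40
        # characters accumulate, or after 200 ms, whichever comes first.
        await ws.send(json.dumps({
            "type": "open_context",
            "contextId": "ctx-1",
            "voiceId": "my-voice",
            "maxBufferDelayMs": 200,
            "bufferCharThreshold": 40,
        }))
        # Forward text as it arrives from the LLM instead of waiting for
        # the full response.
        for chunk in text_chunks:
            await ws.send(json.dumps(
                {"type": "text", "contextId": "ctx-1", "text": chunk}
            ))
        await ws.send(json.dumps({"type": "flush", "contextId": "ctx-1"}))
        # Collect audio events until the server closes out the context.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "audio":
                play(event["audio"])  # hand the chunk to your audio pipeline
            elif event.get("type") == "context_done":
                break


def play(audio_chunk):
    ...  # decode and queue for playback


asyncio.run(stream_speech(["Hello", " there,", " how can I help?"]))
```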
Perfect for:
- Interactive voice agents that require low latency
- Dynamic conversations where barge-in or interruption support is needed
Timestamp alignment: Sync audio with visuals & actions
Building lipsync for 3D avatars? Highlighting words as they're spoken? Triggering gameplay actions at specific moments in speech? Handling barge-in and interruptions? You need timestamps.
Timestamp alignment returns precise timing information that matches your generated audio. Choose the granularity that fits your use case:
Word-level timestamps for:
- Karaoke-style caption highlighting
- Triggering character actions when specific words play
- Tracking where users interrupt the AI
- Syncing UI elements with speech
Character-level timestamps are most common for lipsync animation, where they can be converted to phonemes and visemes.
Timestamps are currently supported for English in non-streaming mode; support for other languages is experimental.
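As an illustration of the caption-highlighting use case, here's a minimal sketch that assumes a response shape of (word, startMs, endMs); the actual field names are defined in the API reference.

```python
# Sketch: driving karaoke-style caption highlights from word timestamps.
# The (word, startMs, endMs) shape is an assumption for illustration;
# the API reference defines the actual field names.
import time

timestamps = [
    {"word": "Welcome", "startMs": 0, "endMs": 320},
    {"word": "back,", "startMs": 320, "endMs": 610},
    {"word": "traveler.", "startMs": 610, "endMs": 1150},
]


def highlight_words(timestamps, playback_start):
    # Highlight each word as the playhead reaches its start time.
    for entry in timestamps:
        delay = playback_start + entry["startMs"] / 1000 - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        print(f"highlight: {entry['word']}")


highlight_words(timestamps, playback_start=time.monotonic())
```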
Voice cloning API for programmatic voice creation
Voice cloning is no longer limited to our UI. Now you can create custom voices directly through the API. Available in beta to select customers.
Why this matters:
If you're building a platform where end users need to clone their own voices, you can now integrate that experience directly into your app, without redirecting users to Inworld's interface. You can also create voices in bulk using a simple script.
Use cases:
- Games where players create their own character voices
- Social platforms where users create their own avatars
- Games or call centers where a large number of voices need to be created in bulk from pre-recorded audio samples
The voice cloning API enables third-party platforms to offer voice creation as a native feature in their own workflows, or to create voices in bulk.
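As an illustration of the bulk workflow, here's a rough sketch of a script that creates one voice per pre-recorded sample. The endpoint, field names, and response shape are assumptions (the API is in beta), so treat this as a starting point rather than the actual contract.

```python
# Sketch of bulk voice creation from pre-recorded samples. The endpoint,
# field names, and response shape are assumptions (the API is in beta);
# treat this as a starting point, not the actual contract.
import os

import requests

API_KEY = os.environ["INWORLD_API_KEY"]
SAMPLES_DIR = "voice_samples"  # one audio file per voice to create

for filename in sorted(os.listdir(SAMPLES_DIR)):
    voice_name = os.path.splitext(filename)[0]
    with open(os.path.join(SAMPLES_DIR, filename), "rb") as audio:
        resp = requests.post(
            "https://api.inworld.ai/tts/v1/voices",  # hypothetical URL
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"sample": audio},
            data={"displayName": voice_name, "tags": "npc,batch-import"},
        )
    resp.raise_for_status()
    print(f"created {voice_name}: {resp.json().get('voiceId')}")
```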
Custom voice tags
When creating a custom voice in the UI or API, you can now apply tags for grouping and filtering.
Why this matters:
You can now easily manage a large database of voices and filter for the appropriate voice at runtime, which is highly valuable in games and related applications, where characters are often generated on the fly.
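For instance, here's a minimal sketch of runtime voice selection by tag, assuming a fetched voice list where each entry carries the tags applied at creation time; the structure shown is illustrative.

```python
# Sketch: selecting a voice at runtime by tag. Assumes a fetched voice
# list where each entry carries the tags applied at creation time; the
# structure shown is illustrative.
import random

voices = [
    {"voiceId": "v-101", "tags": {"female", "elf", "calm"}},
    {"voiceId": "v-102", "tags": {"male", "dwarf", "gruff"}},
    {"voiceId": "v-103", "tags": {"female", "dwarf", "gruff"}},
]


def pick_voice(voices, required_tags):
    # Randomize among matches so on-the-fly characters don't all
    # end up with the same voice.
    matches = [v for v in voices if required_tags <= v["tags"]]
    return random.choice(matches) if matches else None


npc_voice = pick_voice(voices, {"dwarf", "gruff"})
```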
Use cases:
- Gaming platforms where characters are generated on the fly and need to be matched to an appropriate voice
- Enterprise apps where the optimal voice is chosen at runtime based on the user profile
- Applications that are still in development, where managing and iterating on a large number of voices is an essential workflow in the design process
Voice tags are the first step toward a larger voice library and management system.
Custom pronunciation: Say it your way
Getting AI voices to pronounce words correctly matters. Brand names, character names, technical terms, and regional dialects are often misspoken by standard TTS models because they aren't represented well in the training data.
You can now insert phonetic notation directly into your text for consistent, accurate pronunciation of key words. Not sure which phonemes to use? Ask ChatGPT or your favorite AI assistant for the IPA transcription, or check reference sites like Vocabulary.com's IPA Pronunciation Guide.
Common use cases:
- Brand names that need to sound perfect every time
- Unique personal or character names
- Medical, legal, or technical terminology
- Regional pronunciation variations
- Fictional locations and proper nouns
We support International Phonetic Alphabet (IPA) notation.
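As a rough illustration, here's what tagged input might look like. The SSML-style <phoneme> markup below is an assumption for readability; the docs define the exact notation Inworld accepts.

```python
# Sketch: inline IPA for a fictional place name. The SSML-style <phoneme>
# tag is an assumption for illustration; the docs define the exact
# notation Inworld accepts.
text = (
    'Welcome to <phoneme ph="ˈzaɪ.lɒθ">Xyloth</phoneme>, '
    "the city beneath the glass sea."
)
```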
Russian support and multilingual improvements
Inworld TTS now speaks Russian, bringing our total to 12 supported languages: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Dutch, Polish, and Russian.
Clone a voice and label it as Russian, or choose one of our pre-built Russian voices. As with all languages, voices perform best when synthesizing text in their native language, though cross-language synthesis is possible.
We've also made quality improvements across all non-English languages: better pronunciation accuracy, more natural intonation, and smoother speech patterns.
For multilingual applications, Inworld TTS Max delivers the strongest results, with superior pronunciation and more contextually aware speech across languages.
Try these features today
All features are available now through our API and TTS Playground, at the same accessible pricing.
Frequently asked questions
How do I convert timestamps to visemes for lipsync?
The typical pipeline: character timestamps → phonemes (using tools like PocketSphinx) → visemes (using your game engine's mapping). Our timestamps provide the timing foundation.
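As a sketch of that final step, here's an illustrative phoneme-to-viseme mapping; the viseme names below are placeholders for whatever set your engine uses.

```python
# Sketch of the final mapping step: timed phonemes to visemes. The
# viseme names are placeholders for whatever set your engine uses
# (e.g., a 15-viseme mouth-shape set).
PHONEME_TO_VISEME = {
    "AA": "aa", "AE": "aa", "AH": "aa",
    "P": "PP", "B": "PP", "M": "PP",
    "F": "FF", "V": "FF",
    "IY": "ih", "IH": "ih",
    # ...extend to cover your engine's full set
}


def to_viseme_track(phonemes):
    # phonemes: [(phoneme, start_ms, end_ms), ...] from your aligner;
    # unmapped phonemes fall back to the silence viseme "sil".
    return [
        (PHONEME_TO_VISEME.get(p, "sil"), start, end)
        for p, start, end in phonemes
    ]


track = to_viseme_track([("HH", 0, 80), ("AH", 80, 200), ("L", 200, 260)])
```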
How do I gracefully handle interruptions with WebSockets?
The WebSocket endpoint supports multiple independent contexts, enabling seamless barge-in handling. When a user interrupts, you can start a new, independent context and send the post-interruption agent response to it. The old context can be closed when the interruption occurs.
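A rough sketch of that flow, reusing the assumed message schema from the streaming example above:

```python
# Rough sketch of barge-in handling over one WebSocket connection,
# reusing the assumed message schema from the streaming example above.
import json


async def handle_barge_in(ws, old_context_id, new_context_id, agent_reply):
    # Stop the turn the user talked over...
    await ws.send(json.dumps(
        {"type": "close_context", "contextId": old_context_id}
    ))
    # ...then synthesize the post-interruption reply in a fresh,
    # independent context on the same connection.
    await ws.send(json.dumps({
        "type": "open_context",
        "contextId": new_context_id,
        "voiceId": "my-voice",
    }))
    await ws.send(json.dumps(
        {"type": "text", "contextId": new_context_id, "text": agent_reply}
    ))
```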
What are some techniques to optimize end-to-end latency?
To reduce latency, consider using the TTS streaming API, keeping a persistent WebSocket connection open, and disabling text normalization while instructing your LLM to produce speech-ready text via a system prompt.
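For that last point, here's a minimal example of a system prompt that yields speech-ready text; the wording is just illustrative.

```python
# Example system prompt that yields speech-ready text, so text
# normalization can be safely disabled. Wording is illustrative.
SYSTEM_PROMPT = (
    "You are a voice agent. Write replies exactly as they should be "
    "spoken: spell out numbers, dates, times, currencies, and "
    "abbreviations, e.g. 'three forty-five PM', not '3:45pm'."
)
```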