Introducing Inworld TTS

Published on June 25, 2025

Today, we are launching Inworld TTS, a new generation of text-to-speech models that deliver cutting-edge quality and latency at the most accessible price on the market. Our flagship model, Inworld TTS-1, offers realistic, context-aware speech synthesis and precise zero-shot voice cloning, outperforming comparable solutions from leading labs.

Inworld TTS-1 is available today via API and can be experienced in the TTS Playground, where you can test pre-built voices or clone your own from a short audio sample.

We are also releasing Inworld TTS-1-Max, a larger, more expressive model, as an experimental release.

Powering the Next Generation of AI Applications

For too long, developers have faced a false choice: use high-quality, expressive speech that is slow and expensive, or settle for affordable solutions that lack realism. Our goal was to eliminate this trade-off and build the voice layer for the next generation of consumer AI applications. Here’s what makes TTS-1 different.

Unmatched quality. Inworld TTS delivers rich, emotionally nuanced speech that is virtually indistinguishable from a human speaker. It captures subtle shifts in tone and prosody, making interactions feel natural and engaging. This quality is now available in 11 languages with Inworld TTS-1 and TTS-1-Max [1]. We’re also releasing a research preview of audio markups, such as [happy] or [whispering], which give users a new level of control over how the model speaks, not just what it says.

[Chart: quality comparison of text-to-speech providers] [2]

Blazing-fast for real-time interactions. With the first 2-second audio chunk ready in as little as 200 ms [3], Inworld TTS-1 is built for real-time applications. The model is already available through popular AI voice platforms like LiveKit and Vapi, with additional integrations coming soon, and can power everything from educational companions and fitness trainers to shopping assistants and open-world games. Partners like Modular and Lightning AI accelerated the development and technical achievements of Inworld TTS-1. We’ll be sharing more about each of these partnerships and use cases in the coming weeks.

[Chart: latency comparison of streaming speech synthesis APIs] [4]

Radically affordable for every developer. State-of-the-art AI should not be a luxury. We’ve optimized our entire stack to offer Inworld TTS-1 at a disruptive price of $5 per 1 million characters. On top of that, we’ve made our powerful zero-shot voice cloning free for all users. Now, every developer and team, from indie hacker to enterprise, can integrate production-grade voice AI into their products without breaking the budget.

[Chart: cost comparison of text-to-speech providers] [5]
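To make the pricing concrete, here is a minimal Python sketch that turns the $5 per 1 million characters rate into a monthly estimate. Only the rate comes from this post. The function name, the workload numbers, and the assumption that every character of input text is billed equally are illustrative, so check the pricing page for how characters are actually counted.

```python
# Rate quoted in this post for Inworld TTS-1.
PRICE_PER_MILLION_CHARS_USD = 5.00


def estimate_cost_usd(num_chars: int,
                      price_per_million: float = PRICE_PER_MILLION_CHARS_USD) -> float:
    """Estimate synthesis cost, assuming every character is billed equally."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * price_per_million


# Hypothetical workload (not from the post): a voice agent that speaks
# 10,000 replies per day at roughly 400 characters per reply.
chars_per_reply = 400
replies_per_day = 10_000
monthly_chars = chars_per_reply * replies_per_day * 30

print(f"{monthly_chars:,} characters per month")
print(f"Estimated cost: ${estimate_cost_usd(monthly_chars):,.2f}")
```

Under these assumed numbers, the workload comes to 120 million characters per month.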

We are excited to see how developers across all verticals will leverage our tech to build experiences we haven't even imagined.

A Commitment to Open Innovation

We believe that transparency and community collaboration are the catalysts for true progress. In that spirit, we are making our research accessible to all. In the coming weeks, we will publish a detailed technical report on Inworld TTS’s architecture and training methodology.

Furthermore, we will open source our ready-to-use training repository on GitHub under a commercially permissive license. This will provide a step-by-step guide to recreating our work, from SpeechLM pre-training to SFT and RLHF, empowering researchers and developers to build upon our foundation.

This is just the beginning. We’ll keep improving our models’ quality and affordability. This TTS architecture has proven to be an incredibly flexible framework, and we are already experimenting with new capabilities, such as creating voices from natural-language descriptions, which we plan to release later this year.

Trust & Safety

Powerful technology demands profound responsibility. We are committed to ensuring our voice generation technology is used safely and ethically.

- All synthesized audio from our platform contains an imperceptible watermark to ensure it can be identified as AI-generated.
- We have implemented robust safeguards to prevent the cloning of voices without explicit consent.
- We actively prohibit and will act against any uses that violate our policies, such as malicious impersonation or fraudulent activity.

We are dedicated to collaborating with the broader research community to advance safety standards for all voice AI.

How to Get Started

Experience the Inworld TTS difference today:

- Try the TTS Playground to hear the quality for yourself.
- Clone your voice instantly with just a few seconds of audio.
- Read the API Docs and start building now.

For even higher-fidelity fine-tuned voice clones, or for customized enterprise plans covering high-volume use cases, please reach out to our team.

Let's Build Together

Your feedback is crucial as we refine and expand our TTS capabilities. If you have suggestions or encounter issues, please share them with our team via the feedback form in the Inworld Portal. We can't wait to see what you build.

Appendix

[1] Language support.

The Inworld TTS-1 and TTS-1-Max models are available in the following languages:

- Production-ready: English (including all accents), Chinese, Korean, Dutch, French, and Spanish.

- Experimental: Japanese, German, Italian, Polish, and Portuguese.

[2] Quality comparison methodology for text-to-speech providers.

To compare the quality of different text-to-speech providers, we ran a zero-shot voice cloning test using four randomly selected speakers (two male, two female) from an open-source audio collection (e.g., LibriTTS). The evaluation covered eight distinct TTS application domains. For each domain, 200 text samples were generated using Gemini 2.5 Pro, for a total of 1,600 samples per provider (Inworld, 11Labs, and Cartesia).

The quality of the generated audio was assessed using two standard metrics: Word Error Rate (WER) and Speaker Similarity (SIM). The WER was calculated using the Whisper-large-v3 ASR model, and the SIM was measured with the wavLM-large-finetune model. This evaluation process follows common practices in the field.
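For readers unfamiliar with the two metrics, the following Python sketch shows how each is typically computed. It is an illustration, not our evaluation code: in the actual test the transcript comes from Whisper-large-v3 and the speaker embeddings from the wavLM-large-finetune model, whereas here plain strings and toy vectors stand in for both. The text normalization (lowercasing and whitespace splitting) and the use of cosine similarity between embeddings are assumptions based on common practice.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy inputs: the ASR transcript dropped one of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# Toy 3-dimensional "embeddings"; real speaker embeddings are much larger.
sim = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
print(f"WER: {wer:.3f}  SIM: {sim:.3f}")
```

Lower WER means the synthesized speech is more intelligible to the ASR model, and higher SIM means the cloned voice is closer to the reference speaker.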

[3] Median latency excluding networking.

[4] The latency of streaming speech synthesis APIs was tested with audio generations of approximately 10 seconds in length. The tests were conducted with the API client located in the US-west region and included network latency in the measurements. Audio was synthesized as 24 kHz, 16-bit, mono PCM for all providers except Hume AI, which was tested at a 48 kHz sample rate because 24 kHz was not a configurable option. To reflect real-world application performance, latency was measured to the first two-second chunk of audio, which ensures the chunk contains meaningful content for playback.
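The measurement described in this note can be sketched in Python as follows. This is a simplified illustration rather than the benchmark harness: `fake_stream` is a hypothetical stand-in for a provider's streaming response, and all function names are made up for this example. For 24 kHz, 16-bit, mono PCM, two seconds of audio is 24,000 × 2 × 1 × 2 = 96,000 bytes, so the timer stops once that many bytes have arrived.

```python
import time
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 24_000   # 24 kHz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def bytes_for_duration(seconds: float) -> int:
    """Number of raw PCM bytes needed to hold `seconds` of audio."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def time_to_first_audio(chunks: Iterable[bytes], seconds: float = 2.0) -> float:
    """Return the elapsed time until `seconds` of audio has been received."""
    target = bytes_for_duration(seconds)
    # With a real client, start the timer just before sending the request.
    start = time.perf_counter()
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received >= target:
            return time.perf_counter() - start
    raise RuntimeError("stream ended before enough audio arrived")


def fake_stream(chunk_bytes: int = 9_600, n_chunks: int = 50,
                delay_s: float = 0.005) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (silence)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)       # simulated network and synthesis delay
        yield bytes(chunk_bytes)  # 9,600 bytes = 0.2 s of audio


latency = time_to_first_audio(fake_stream())
print(f"First 2 s of audio after {latency * 1000:.0f} ms")
```

Repeating this over many requests and taking the median gives a figure comparable in spirit to the ones reported here, although real results depend on region, network, and client implementation.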

[5] The cost comparison is based on monthly pricing plans where applicable. The specific plans used for this comparison were the Scale plan for Cartesia and the Business plans for 11Labs and Hume AI.