Voice Generator

The AI voice generator that sounds like a person

Pick a voice from the library, clone a real one, or design a brand new one. Every output uses the voice quality that real listeners rank highest, and it arrives fast enough to power a live conversation. Built for production apps, not weekend demos.
Voice generation
Input
modelId inworld-tts-1.5-max · voiceId Sarah · langCode EN_US

Welcome back. Let's pick up where you left off.

Output
first_chunk 148ms · format streaming NDJSON

MP3 · 24kHz · 2.1s of audio

Works with
TTS · Realtime API

The voice stack production apps already ship on.

The voice users vote best, fast enough for live conversation, commercial-ready out of the box.
#1 ranked voice AI

The voice that real users vote best.

Real listeners pick our voices over every other TTS on the Artificial Analysis Speech Arena, an independent human-vote leaderboard. The top spot, and most of the voices around it, are ours.
Artificial Analysis · Speech Arena
#1 Inworld TTS 1.5 Max
#2 Next best voice
#3 Inworld TTS 1 Max
#5 Inworld TTS 1.5 Mini
3 of top 5 are Inworld voices
Three ways to get a voice

Pick from the library, clone a real one, or design a new one.

One TTS API, three routes to a voice. Choose a library voice, clone from 5-15 seconds of audio, or design from a text prompt. All return at top-ranked quality.
Three routes to a voiceId
Library · Instant
Clone · ~1 min
Design · Seconds
One voiceId, every pipeline
Sub-200ms streaming
Works with
TTS · Realtime API

Fast enough to drive a live conversation.

The median first audio chunk lands in under 200ms. Streaming NDJSON lets you start playback before synthesis finishes. Built for live voice agents, not pre-rendered audio.
TTS 1.5 · streaming latency
<200ms
Median first audio chunk
Built for realtime agents, not pre-rendered audio
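To check the first-chunk figure against your own network path, one option is to time the first item of the stream and take the median over several runs. The sketch below is only an illustration: `timeToFirstChunk` and `median` are hypothetical helpers, not part of any Inworld SDK, and they accept any async iterable rather than a specific client.

```typescript
// Hypothetical helpers for illustration; not part of any Inworld SDK.

/** Milliseconds from the call until the stream yields its first chunk. */
async function timeToFirstChunk<T>(stream: AsyncIterable<T>): Promise<number> {
  const start = Date.now();
  for await (const _chunk of stream) {
    // Returning here also closes the iterator, so no further chunks are read.
    return Date.now() - start;
  }
  throw new Error("stream ended without producing a chunk");
}

/** Median of a non-empty list of latency samples, in the same unit as the input. */
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Measured values include your own network round trip, so they will not necessarily match the published median.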
Expressive by design

Your voice actually performs the line.

Drop emotion and non-verbal cues inline with the text. TTS reads them and delivers the moment. No extra emotion dial to tune.
Expressive control
audio markup
[happy] You made it.
I was starting to get worried[sigh].
[whispering] Quick, before they see us[laugh].
[happy]
[sad]
[angry]
[surprised]
[sigh]
[whispering]
[laugh]
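Because the cues live inside the text itself, a typo in a tag is easy to miss. A minimal sketch of a pre-flight check follows; it assumes the seven tags listed above are the ones you intend to use (the full supported set may be larger), and `findUnknownTags` is a hypothetical helper, not an Inworld function.

```typescript
// Tags shown on this page; the service may accept others.
const KNOWN_TAGS = new Set([
  "happy", "sad", "angry", "surprised", "sigh", "whispering", "laugh",
]);

/** Returns every bracketed tag in `text` that is not in KNOWN_TAGS. */
function findUnknownTags(text: string): string[] {
  const unknown: string[] = [];
  const pattern = /\[([^\[\]]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const tag = match[1].trim().toLowerCase();
    if (!KNOWN_TAGS.has(tag)) unknown.push(tag);
  }
  return unknown;
}
```

Running it on a script before synthesis flags anything outside the list, such as a misspelled `[whispring]`.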
15 languages

One voice, every market on your roadmap.

Generate across 15 languages from a single voice. Cross-lingual synthesis lets a cloned or designed voice speak every market on your roadmap without re-recording.
15 languages, one voice
cross-lingual
English
Spanish
French
German
Italian
Portuguese
Polish
Dutch
Russian
Arabic
Japanese
Korean
Chinese
Hindi
Hebrew
Design or clone once. Deploy across every market on your roadmap.
Production-ready today
Works with
TTS · Realtime API

Already powering voices at millions of DAU.

Wishroll, Talkpal, Death by AI, and OtherHalf generate Inworld voices at real production scale. SOC 2 Type II and GDPR compliance, zero data retention, and on-premise deployment are available.
Powering voices for
Wishroll: Consumer app · 500K+ DAU · published case study
Talkpal: Language learning · 5M+ learners
Death by AI: Interactive media · 20M+ players
OtherHalf: AI companion app

Generate audio in three lines

Auth, text, voice. Playback starts before the sentence finishes generating.
import { InworldTTS } from '@inworld/tts';

const tts = InworldTTS(); // reads INWORLD_API_KEY

// 1. Stream audio back as NDJSON chunks
const stream = tts.synthesizeStream({
  text: 'Welcome back. Let\'s pick up where you left off.',
  voiceId: 'Sarah',
  modelId: 'inworld-tts-1.5-max',
  audioConfig: {
    audioEncoding: 'MP3',
    sampleRateHertz: 48000,
  },
});

for await (const chunk of stream) {
  // chunk.result.audioContent is base64 MP3
  audioPlayer.append(Buffer.from(chunk.result.audioContent, 'base64'));
}
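If you read the streaming response yourself instead of going through the SDK, the NDJSON framing means each complete line is one JSON object, and network reads can split a line in two. The following is a rough sketch of a line buffer under that assumption. `NdjsonParser` is a hypothetical class, and the `TtsChunk` shape is inferred from the `chunk.result.audioContent` access in the snippet above rather than from a published schema.

```typescript
// Assumed chunk shape, inferred from the snippet above; not a confirmed schema.
interface TtsChunk {
  result: { audioContent: string }; // base64-encoded audio
}

/** Accumulates raw text and emits one parsed object per complete NDJSON line. */
class NdjsonParser {
  private buffer = "";

  push(text: string): TtsChunk[] {
    this.buffer += text;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or "" if the text ended on "\n").
    this.buffer = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as TtsChunk);
  }
}
```

Feed it each decoded text fragment as it arrives; it returns only the chunks whose lines are complete and holds the remainder until the next call.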

Prefer clicking? Generate in the playground.

Open the TTS Playground, pick a voice, type or paste your script, add emotion tags inline, and hit Generate. Tweak speed, pitch, or model. When it sounds right, copy the voiceId and drop it into your code.

FAQ

How does Inworld TTS rank against other voice generators?
Inworld TTS is independently ranked at the top of the Artificial Analysis Speech Arena, a human-vote leaderboard. Inworld voices hold three of the top five spots.
Which languages are supported?
15 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Dutch, Russian, Arabic, Japanese, Korean, Chinese, Hindi, Hebrew. Cross-lingual synthesis lets one cloned or designed voice speak every supported language.
How does voice cloning work?
Instant cloning needs 5-15 seconds of audio. Professional cloning uses 30+ minutes of audio for higher fidelity. The account limit is 1,000 cloned voices; contact sales for higher limits. See voice cloning docs.
Can I design a new voice from a text description?
Yes. Voice Design generates a voice from a 30-250 character English description. Get up to three previews per call, publish the one you want as a permanent voiceId, and use it anywhere in the Inworld pipeline.
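The 30-250 character limit is easy to check before sending a design request. A minimal sketch, assuming the limit counts characters of the trimmed description (how the service counts is not stated here); `checkVoiceDescription` is a hypothetical helper:

```typescript
// Limits quoted in the answer above.
const MIN_DESCRIPTION_CHARS = 30;
const MAX_DESCRIPTION_CHARS = 250;

/** Returns null if the description fits the limits, otherwise a short reason. */
function checkVoiceDescription(description: string): string | null {
  // Assumption: length is measured on the trimmed string, in UTF-16 code units.
  const length = description.trim().length;
  if (length < MIN_DESCRIPTION_CHARS) {
    return `too short: ${length} characters, minimum is ${MIN_DESCRIPTION_CHARS}`;
  }
  if (length > MAX_DESCRIPTION_CHARS) {
    return `too long: ${length} characters, maximum is ${MAX_DESCRIPTION_CHARS}`;
  }
  return null;
}
```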
How fast is it?
The median first audio chunk streams in under 200ms on TTS 1.5 Max and in roughly 120ms on TTS 1.5 Mini. The streaming endpoint returns NDJSON, so playback can start before synthesis finishes.
Can I use the voices commercially?
Yes. Voices generated, cloned, or designed through Inworld are yours to use commercially under the Inworld terms of service. See pricing for current rates and tiers.
Do these voices work with the Realtime API?
Yes. Every voice generated here works in the Realtime API as a live voice-agent voice: same voiceId, same quality, same streaming latency.
Which audio formats are supported?
MP3, WAV, PCM, LINEAR16, OGG_OPUS, μ-law, A-law, FLAC. Sample rates 8-48kHz on compressed formats. Max output audio size is 16MB per request; chunk long content at sentence or paragraph breaks.
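One way to follow the advice about chunking long content is to split the script at sentence boundaries before sending each piece as its own request. The sketch below is a simple illustration: `chunkBySentence` is a hypothetical helper, and `maxChars` is an assumed stand-in budget, since the 16MB limit applies to output audio rather than input characters. The sentence detection is naive and will also split on abbreviations and decimals.

```typescript
/**
 * Splits text into chunks of at most maxChars, breaking only after
 * sentence-ending punctuation. A single sentence longer than maxChars
 * is kept whole rather than cut mid-sentence.
 */
function chunkBySentence(text: string, maxChars: number): string[] {
  // A sentence is text up to and including ., !, or ? plus trailing whitespace,
  // or an unterminated tail at the end of the input.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Each returned chunk can then be synthesized separately and the audio played or concatenated in order.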

The voice generator your production can ship on.

Top-ranked quality. 15 languages. Sub-200ms streaming. Already running at real scale.
Copyright © 2021-2026 Inworld AI