Voice Generator

The AI voice generator that sounds like a person

Q: What makes the voice generator the best?

Realtime TTS is independently ranked at the top of the Artificial Analysis Speech Arena, a human-vote leaderboard. Inworld voices hold three of the top five spots.

Q: Which languages are supported?

Over 100 languages with Realtime TTS-2, including English, Spanish, French, German, Italian, Portuguese, Polish, Dutch, Russian, Arabic, Japanese, Korean, Chinese, Hindi, and Hebrew. Cross-lingual synthesis preserves one voice identity across every language, including mid-utterance switching.

Q: Can I clone an existing voice?

Yes. Instant cloning needs 5-15 seconds of audio; professional cloning uses 30+ minutes for higher fidelity. The account limit is 1,000 cloned voices; contact sales for higher limits. See the voice cloning docs.
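As a rough pre-flight check against the 5-15 second range, the duration of a raw audio sample can be worked out from its byte length. This is only a sketch: the function names are hypothetical, it assumes uncompressed PCM (16-bit by default), and it is not part of any Inworld SDK.

```javascript
// Hypothetical helpers, not part of any Inworld SDK.
// Assumes raw, uncompressed PCM with a fixed sample size.
function pcmDurationSeconds(byteLength, sampleRateHz, channels = 1, bytesPerSample = 2) {
  return byteLength / (sampleRateHz * channels * bytesPerSample);
}

// 5-15 seconds is the instant-cloning range quoted in the FAQ answer above.
function fitsInstantCloning(durationSeconds, min = 5, max = 15) {
  return durationSeconds >= min && durationSeconds <= max;
}

// 10 seconds of mono 16-bit audio at 24 kHz is 480,000 bytes.
const sampleSeconds = pcmDurationSeconds(480000, 24000);
console.log(sampleSeconds, fitsInstantCloning(sampleSeconds)); // 10 true
```

Compressed formats such as MP3 need a real decoder or a metadata reader to get the duration; byte length alone is not enough there.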

Q: Can I design a voice without audio input?

Yes. Voice Design generates a voice from a 30-250 character English description. Get up to three previews per call, publish the one you want as a permanent voiceId, and use it anywhere in the Inworld pipeline.
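Since the description has a 30-250 character window, it can help to check the length before sending it. The helper below is a hypothetical client-side sketch; whether the service counts characters the same way (code points, after trimming) is an assumption.

```javascript
// Hypothetical client-side check; the 30-250 range comes from the FAQ answer above.
function checkVoiceDescription(description, min = 30, max = 250) {
  // Count code points rather than UTF-16 units so an emoji counts once.
  // How the service itself counts characters is an assumption.
  const length = Array.from(description.trim()).length;
  if (length < min) {
    return { ok: false, length, reason: `too short: need at least ${min} characters` };
  }
  if (length > max) {
    return { ok: false, length, reason: `too long: at most ${max} characters` };
  }
  return { ok: true, length };
}

console.log(checkVoiceDescription("Warm narrator").ok); // false
```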

Q: How fast is the voice generator?

The median first audio chunk streams in under 200ms on TTS 1.5 Max and in roughly 120ms on TTS 1.5 Mini. The streaming endpoint returns NDJSON, so playback can start before synthesis finishes.
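NDJSON is one JSON object per line, and a network chunk can end in the middle of a line, so a client usually buffers until a newline arrives. Here is a minimal sketch of that buffering. The parser name is hypothetical, the payloads are made up, and the `result.audioContent` field name is borrowed from the SDK sample further down this page rather than from a documented wire format.

```javascript
// Minimal NDJSON line buffer. Incomplete text is held back until its
// newline arrives, then each complete line is parsed as one JSON object.
function createNdjsonParser(onObject) {
  let pending = "";
  return {
    push(text) {
      pending += text;
      const lines = pending.split("\n");
      pending = lines.pop(); // last piece is either "" or an unfinished line
      for (const line of lines) {
        if (line.trim() !== "") onObject(JSON.parse(line));
      }
    },
    end() {
      // Flush a final line that arrived without a trailing newline.
      if (pending.trim() !== "") onObject(JSON.parse(pending));
      pending = "";
    },
  };
}

// Usage with placeholder payloads; the second object is split across pushes.
const audioParts = [];
const ndjsonParser = createNdjsonParser((msg) => {
  audioParts.push(Buffer.from(msg.result.audioContent, "base64"));
});
ndjsonParser.push('{"result":{"audioContent":"AAEC"}}\n{"result":{"audio');
ndjsonParser.push('Content":"AwQF"}}\n');
ndjsonParser.end();
console.log(Buffer.concat(audioParts)); // <Buffer 00 01 02 03 04 05>
```

Each decoded part can be handed to the player as soon as it is parsed, which is what lets playback begin before the full response has arrived.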

Q: Can I use these voices commercially?

Yes. Voices generated, cloned, or designed through Inworld are yours to use commercially under the Inworld terms of service. See pricing for current rates and tiers.

Q: Can I use these voices in a voice agent?

Yes. Every voice generated here works in the Realtime API as a live voice-agent voice: same voiceId, same quality, same streaming latency.

Q: What audio formats are supported?

MP3, WAV, PCM, LINEAR16, OGG_OPUS, μ-law, A-law, FLAC. Sample rates 8-48kHz on compressed formats. Max output audio size 16MB per request; chunk long content at sentence or paragraph breaks.
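One way to chunk long content at paragraph and sentence breaks is sketched below. The helper is hypothetical, and `maxChars` is an assumed text budget chosen for illustration; the 16MB figure above is a limit on output audio, not a documented character count.

```javascript
// Hypothetical helper: split long text into request-sized pieces, breaking at
// paragraph boundaries first and at sentence ends when a paragraph is too long.
// maxChars is an assumed budget, not a documented Inworld limit.
function chunkText(text, maxChars = 1000) {
  const chunks = [];
  for (const paragraph of text.split(/\n\s*\n/)) {
    const trimmed = paragraph.trim();
    if (trimmed === "") continue;
    if (trimmed.length <= maxChars) {
      chunks.push(trimmed);
      continue;
    }
    // A sentence is text up to a run of ., !, or ?; the second alternative
    // catches trailing text that has no closing punctuation.
    const sentences = trimmed.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [trimmed];
    let current = "";
    for (const sentence of sentences) {
      if (current !== "" && (current + sentence).trim().length > maxChars) {
        chunks.push(current.trim());
        current = "";
      }
      current += sentence;
    }
    // A single sentence longer than maxChars is emitted as-is.
    if (current.trim() !== "") chunks.push(current.trim());
  }
  return chunks;
}

console.log(chunkText("One. Two! Three?", 10)); // [ 'One. Two!', 'Three?' ]
```

Each chunk can then be sent as its own request, and the resulting audio played or concatenated in order.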

Pick a voice from the library, clone a real one, or design a brand new one. Every output ships at the quality real users vote best, fast enough to power a live conversation. Built for production apps, not weekend demos.

Voice generation

Input

modelId inworld-tts-2 · voiceId Sarah · langCode EN_US

Welcome back. Let's pick up where you left off.

Output

first_chunk 148ms · format streaming NDJSON

MP3 · 24kHz · 2.1s of audio

Works with

TTS Realtime API

The voice stack production apps already ship on.

The voice users vote best, fast enough for live conversation, commercial-ready out of the box.

#1 realtime voice AI

The voice that real users vote best.

Real listeners pick our voices over every other TTS on the Artificial Analysis Speech Arena, an independent human-vote leaderboard. The top spot is ours, and so are three of the top five.

Artificial Analysis · Speech Arena

Realtime TTS 1.5 Max

Next best voice

Realtime TTS 1 Max

Realtime TTS 1.5 Mini

3 of top 5 are Inworld voices

Three ways to get a voice

Pick from the library, clone a real one, or design a new one.

One TTS API, three routes to a voice. Choose a library voice, clone from 5-15 seconds of audio, or design from a text prompt. All return at top-ranked quality.

See Voice Design

Three routes to a voiceId

Library · Instant

Clone · ~1 min

Design · Seconds

One voiceId, every pipeline

Sub-200ms streaming

Works with

TTS

Realtime API

Fast enough to drive a live conversation.

The median first audio chunk lands in under 200ms. Streaming NDJSON lets you start playback before synthesis finishes. Built for live voice agents, not pre-rendered audio.

TTS 1.5 · streaming latency

<200ms

Median first audio chunk

Built for realtime agents, not pre-rendered audio

Expressive by design

Your voice actually performs the line.

Drop emotion and non-verbal cues inline with the text. TTS reads them and delivers the moment. No extra emotion dial to tune.

Expressive control

audio markup

[happy] You made it.
I was starting to get worried[sigh].
[whispering] Quick, before they see us[laugh].

[happy]

[sad]

[angry]

[surprised]

[sigh]

[whispering]

[laugh]
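A typo in an inline tag is easy to miss in a long script, so a small lint pass can be useful before synthesis. This is a hypothetical sketch: the tag set is just the seven tags shown on this page and may not be exhaustive, and how the service treats an unrecognized tag is not something this page states.

```javascript
// Tags listed on this page; treat the set as illustrative, not exhaustive.
const KNOWN_TAGS = new Set(["happy", "sad", "angry", "surprised", "sigh", "whispering", "laugh"]);

// Hypothetical lint helper: returns every [tag] found, plus any not in KNOWN_TAGS.
function lintMarkup(text) {
  const found = [];
  const unknown = [];
  for (const match of text.matchAll(/\[([a-z_]+)\]/gi)) {
    const tag = match[1].toLowerCase();
    found.push(tag);
    if (!KNOWN_TAGS.has(tag)) unknown.push(tag);
  }
  return { found, unknown };
}

const script =
  "[happy] You made it.\n" +
  "I was starting to get worried[sigh].\n" +
  "[whispering] Quick, before they see us[laugh].";
console.log(lintMarkup(script).found); // [ 'happy', 'sigh', 'whispering', 'laugh' ]
console.log(lintMarkup("[hapy] Hi").unknown); // [ 'hapy' ]
```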

over 100 languages

One voice, every market on your roadmap.

Generate speech in over 100 languages from a single voice. Cross-lingual synthesis lets a cloned or designed voice speak each of them without re-recording.

over 100 languages, one voice

cross-lingual

English

Spanish

French

German

Italian

Portuguese

Polish

Dutch

Russian

Arabic

Japanese

Korean

Chinese

Hindi

Hebrew

Design or clone once. Deploy across every market on your roadmap.

Production-ready today

Works with

TTS

Realtime API

Already powering voices at millions of DAU.

Wishroll, Talkpal, Death by AI, and OtherHalf generate Inworld voices at real production scale. SOC 2 Type II, GDPR, zero data retention, on-premise available.

Powering voices for

Wishroll · Consumer app · 500K+ DAU · published case study

Talkpal · Language learning · 5M+ learners

Death by AI · Interactive media · 20M+ players

OtherHalf · AI companion app

Generate audio in three lines

Auth, text, voice. Playback starts before the sentence finishes generating.

import { InworldTTS } from '@inworld/tts';

const tts = InworldTTS(); // reads INWORLD_API_KEY

// Stream audio back as NDJSON chunks
const stream = tts.synthesizeStream({
  text: "Welcome back. Let's pick up where you left off.",
  voiceId: 'Sarah',
  modelId: 'inworld-tts-2',
  audioConfig: {
    audioEncoding: 'MP3',
    sampleRateHertz: 48000,
  },
});

// audioPlayer stands in for your app's own playback buffer
for await (const chunk of stream) {
  // chunk.result.audioContent is base64-encoded MP3
  audioPlayer.append(Buffer.from(chunk.result.audioContent, 'base64'));
}

Prefer clicking? Generate in the playground.

Open the TTS Playground, pick a voice, type or paste your script, add emotion tags inline, and hit Generate. Tweak speed, pitch, or model. When it sounds right, copy the voiceId and drop it into your code.

Open the playground
