Last updated: April 5, 2026
Inworld TTS is ranked #1 on Artificial Analysis, with the highest quality score on independent benchmarks.
Set your API key and get audio:
curl -X POST https://api.inworld.ai/tts/v1/voice \
-H "Authorization: Basic $INWORLD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"text": "Hello from Inworld.", "voiceId": "Sarah", "modelId": "inworld-tts-1.5-max"}' \
| python3 -c "import sys,json,base64; open('hello.wav','wb').write(base64.b64decode(json.load(sys.stdin)['audioContent']))"
That writes a .wav file to disk. Three fields, one endpoint, done.
Get an API Key
- Sign up at platform.inworld.ai
- Generate an API key in the Inworld Portal
- Set it as an environment variable:
export INWORLD_API_KEY=your_key_here
Every example on this page reads from INWORLD_API_KEY. Set it once, and everything works.
API Reference
Endpoint: POST https://api.inworld.ai/tts/v1/voice
Headers:
- Authorization: Basic $INWORLD_API_KEY (API-key auth, as in the examples on this page)
- Content-Type: application/json

Request body:
- text: the text to synthesize
- voiceId: the voice to use (e.g. "Sarah")
- modelId: the model to use (e.g. "inworld-tts-1.5-max")

Response:
The response is JSON with an audioContent field containing base64-encoded audio. Decode it to get the raw audio bytes.
{
"audioContent": "base64-encoded-audio...",
"usage": { ... }
}
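The decode step can be exercised without calling the API at all. In this sketch the audioContent value is a placeholder (the base64 encoding of the four bytes "RIFF", the WAV magic number), not real audio:

```python
import base64
import json

# Stub response shaped like the API's JSON. "UklGRg==" is base64 for the
# four bytes b"RIFF" (the WAV magic number), standing in for real audio.
raw = '{"audioContent": "UklGRg==", "usage": {}}'

data = json.loads(raw)
audio_bytes = base64.b64decode(data["audioContent"])
print(audio_bytes)  # b'RIFF'
```

Every example below follows exactly this shape: parse JSON, read audioContent, base64-decode.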
Models
The models share the same API. Swap modelId in the request body to switch between them.
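One way to keep that swap down to a single argument is a small payload helper. The function name and defaults here are illustrative, not part of the API:

```python
def tts_payload(text, voice_id="Sarah", model_id="inworld-tts-1.5-max"):
    """Build the TTS request body; change model_id to switch models."""
    return {"text": text, "voiceId": voice_id, "modelId": model_id}

# The same call site works for any model:
body = tts_payload("Hello.", model_id="inworld-tts-1.5-max")
print(body["modelId"])  # inworld-tts-1.5-max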
Full Non-Streaming Example
This synthesizes a longer sentence and saves the output to a file. The response comes back as a single JSON object with all audio in one base64 string.
import requests
import base64
import os
# pip install requests
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Inworld TTS generates speech that sounds like a real person. Try a longer sentence to hear the natural prosody and pacing.",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
    },
    timeout=30,
)
response.raise_for_status()

data = response.json()
audio_bytes = base64.b64decode(data["audioContent"])

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print(f"Wrote {len(audio_bytes)} bytes to output.wav")
print(f"Usage: {data.get('usage', {})}")
Streaming
For conversational applications, you want audio playing before the full response is generated. The streaming endpoint returns chunks as they become available.
Endpoint: POST https://api.inworld.ai/tts/v1/voice:stream
The request body is identical to the non-streaming endpoint. The response is newline-delimited JSON (NDJSON): one JSON object per line, each containing a chunk of audio.
Each line looks like this:
{"result": {"audioContent": "base64-encoded-chunk..."}}
Parse each line as JSON, decode result.audioContent from base64, and play or buffer the audio bytes.
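That parse-and-decode loop can be checked on stub lines before wiring up the HTTP call. The two base64 payloads here encode the bytes b"ab" and b"cd", not real audio:

```python
import base64
import json

# Two stub NDJSON lines shaped like the streaming response.
ndjson = (
    '{"result": {"audioContent": "YWI="}}\n'
    '{"result": {"audioContent": "Y2Q="}}\n'
)

chunks = []
for line in ndjson.splitlines():
    if not line:
        continue  # skip blank lines, as with real NDJSON
    obj = json.loads(line)
    chunks.append(base64.b64decode(obj["result"]["audioContent"]))

print(b"".join(chunks))  # b'abcd'
```

The real streaming example below uses the same loop over response.iter_lines().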
import requests
import base64
import json
import os
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Streaming delivers audio chunks as they are generated. This reduces time-to-first-byte significantly, which matters for conversational applications where users are waiting for a response.",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
    },
    stream=True,
    timeout=30,
)
response.raise_for_status()

audio_chunks = []
for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    audio_data = base64.b64decode(chunk["result"]["audioContent"])
    audio_chunks.append(audio_data)
    print(f"Received chunk: {len(audio_data)} bytes")

full_audio = b"".join(audio_chunks)
with open("streamed_output.wav", "wb") as f:
    f.write(full_audio)

print(f"Total: {len(full_audio)} bytes across {len(audio_chunks)} chunks")
If you test the streaming endpoint with curl, pass the -N flag to disable output buffering so you see chunks as they arrive instead of all at once when the response completes.
Voice Cloning
Create a custom voice from 5-15 seconds of reference audio. No training step, no additional cost. First, clone the voice via the /voices/v1/voices:clone endpoint. Then use the returned voiceId in your TTS calls.
import requests
import base64
import os
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]
# Step 1: Clone a voice from a reference audio file (5-15 seconds of clear speech)
with open("reference_voice.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

clone_response = requests.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "displayName": "MyClonedVoice",
        "langCode": "EN_US",
        "voiceSamples": [{"audioData": audio_b64}],
    },
    timeout=60,
)
clone_response.raise_for_status()
cloned_voice_id = clone_response.json()["voice"]["voiceId"]
print(f"Cloned voice ID: {cloned_voice_id}")

# Step 2: Use the cloned voice for TTS
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "text": "This sentence will be spoken in the cloned voice.",
        "voiceId": cloned_voice_id,
        "modelId": "inworld-tts-1.5-max",
    },
    timeout=30,
)
response.raise_for_status()

audio_bytes = base64.b64decode(response.json()["audioContent"])
with open("cloned_output.wav", "wb") as f:
    f.write(audio_bytes)
Use clear, single-speaker audio with minimal background noise for best results.
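A quick local check of the reference clip's length can save a failed upload. This sketch uses only the standard-library wave module; the helper name is illustrative, and the 5-15 second bounds mirror the guidance above:

```python
import struct
import wave

def reference_duration_ok(path, min_s=5.0, max_s=15.0):
    """Return (duration_in_seconds, within_bounds) for a WAV reference clip."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return duration, min_s <= duration <= max_s

# Demo: write 8 seconds of 16 kHz mono silence, then check it.
with wave.open("demo_reference.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 0) * 16000 * 8)

duration, ok = reference_duration_ok("demo_reference.wav")
print(duration, ok)  # 8.0 True
```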
Error Handling
Production code should handle network timeouts, invalid API keys, and unexpected response formats. Here is a complete example:
import requests
import base64
import os
INWORLD_API_KEY = os.environ.get("INWORLD_API_KEY")
if not INWORLD_API_KEY:
    raise ValueError("Set the INWORLD_API_KEY environment variable")

try:
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {INWORLD_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "text": "Handle errors gracefully.",
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-max",
        },
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    audio_bytes = base64.b64decode(data["audioContent"])
    with open("output.wav", "wb") as f:
        f.write(audio_bytes)
except requests.exceptions.HTTPError as e:
    print(f"API error {e.response.status_code}: {e.response.text}")
except requests.exceptions.Timeout:
    print("Request timed out. Check your network or try again.")
except KeyError:
    print(f"Unexpected response format: {response.text[:200]}")
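For transient failures (timeouts, rate limiting), production code often adds retries with exponential backoff. This generic sketch is not part of any Inworld SDK; it wraps an arbitrary zero-argument callable, so the pattern can be tested with a stub before pointing it at requests.post:

```python
import time

def with_retries(fn, retries=3, base_delay=0.5):
    """Call fn(); on exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Demo with a stub that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice you would retry only on Timeout and retryable HTTP statuses, not on every exception.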
Common error responses (standard HTTP semantics):
- 400: malformed request body (missing or invalid text, voiceId, or modelId)
- 401: missing or invalid API key
- 429: rate limit exceeded; retry with backoff
When to Use Streaming vs Non-Streaming
Use non-streaming when you need the complete audio file before doing anything with it: saving to storage, post-processing, embedding in other media, or any batch workflow.
Use streaming when a user is waiting to hear the output: voice agents, realtime assistants, interactive applications, or any scenario where time-to-first-audio matters.
In most production voice applications, streaming is the right default. The difference between hearing audio in 150ms versus waiting 800ms for a complete response is the difference between a conversation that feels natural and one that feels broken.
Next Steps
- Realtime voice agents: The Realtime API handles TTS, STT, and LLM processing over a single WebSocket for full speech-to-speech agents.
- API reference: Full endpoint documentation at docs.inworld.ai
- Voice library: Browse available voices and create custom clones in the Inworld Portal
- Model details: Read the TTS model comparison for independent benchmark data.