Published 04.03.2026

Inworld TTS API Quickstart: Get Audio in 3 Lines

Last updated: April 5, 2026
Inworld TTS is ranked #1 on Artificial Analysis with the highest quality score on independent benchmarks.
Set your API key and get audio:
curl -X POST https://api.inworld.ai/tts/v1/voice \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Inworld.", "voiceId": "Sarah", "modelId": "inworld-tts-1.5-max"}' \
  | python3 -c "import sys,json,base64; open('hello.wav','wb').write(base64.b64decode(json.load(sys.stdin)['audioContent']))"
That writes a .wav file to disk. Three fields, one endpoint, done.

Get an API Key

  1. Sign up at platform.inworld.ai
  2. Generate an API key in the Inworld Portal
  3. Set it as an environment variable:
export INWORLD_API_KEY=your_key_here
Every example on this page reads from INWORLD_API_KEY. Set it once, and everything works.

API Reference

Endpoint: POST https://api.inworld.ai/tts/v1/voice
Headers:
  • Authorization: Basic <your API key>
  • Content-Type: application/json
Request body:
  • text: the text to synthesize
  • voiceId: the voice to use, e.g. "Sarah"
  • modelId: the model to use, e.g. "inworld-tts-1.5-max"
Response:
The response is JSON with an audioContent field containing base64-encoded audio. Decode it to get raw audio bytes.
{
  "audioContent": "base64-encoded-audio...",
  "usage": { ... }
}

Models

Both models use the same API. Swap modelId in the request body to switch between them.
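Keeping the request body in one helper makes the model swap a one-argument change. A minimal sketch (build_tts_request is illustrative, not part of any SDK; the field names match the API reference above):

```python
def build_tts_request(text: str, voice_id: str = "Sarah",
                      model_id: str = "inworld-tts-1.5-max") -> dict:
    """Build the JSON body for POST /tts/v1/voice."""
    return {"text": text, "voiceId": voice_id, "modelId": model_id}

# Switching models is one argument ("other-model-id" is a placeholder):
body = build_tts_request("Hello.", model_id="other-model-id")
```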

Full Non-Streaming Example

This synthesizes a longer sentence and saves the output to a file. The response comes back as a single JSON object with all audio in one base64 string.
import requests
import base64
import os

# pip install requests
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "text": "Inworld TTS generates speech that sounds like a real person. Try a longer sentence to hear the natural prosody and pacing.",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max"
    },
    timeout=30
)
response.raise_for_status()

data = response.json()
audio_bytes = base64.b64decode(data["audioContent"])

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print(f"Wrote {len(audio_bytes)} bytes to output.wav")
print(f"Usage: {data.get('usage', {})}")

Streaming

For conversational applications, you want audio playing before the full response is generated. The streaming endpoint returns chunks as they become available.
Endpoint: POST https://api.inworld.ai/tts/v1/voice:stream
The request body is identical to the non-streaming endpoint. The response is newline-delimited JSON (NDJSON): one JSON object per line, each containing a chunk of audio.
Each line looks like this:
{"result": {"audioContent": "base64-encoded-chunk..."}}
Parse each line as JSON, decode result.audioContent from base64, and play or buffer the audio bytes.
import requests
import base64
import json
import os

INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "text": "Streaming delivers audio chunks as they are generated. This reduces time-to-first-byte significantly, which matters for conversational applications where users are waiting for a response.",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max"
    },
    stream=True,
    timeout=30
)
response.raise_for_status()

audio_chunks = []
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        audio_data = base64.b64decode(chunk["result"]["audioContent"])
        audio_chunks.append(audio_data)
        print(f"Received chunk: {len(audio_data)} bytes")

full_audio = b"".join(audio_chunks)
with open("streamed_output.wav", "wb") as f:
    f.write(full_audio)

print(f"Total: {len(full_audio)} bytes across {len(audio_chunks)} chunks")
To watch the stream from the shell, run curl with the -N flag, which disables output buffering so chunks print as they arrive.
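A minimal curl invocation of the streaming endpoint might look like this (a sketch; the request body mirrors the Python example above):

```shell
# -N disables curl's output buffering so each NDJSON line prints as it arrives.
curl -N -X POST https://api.inworld.ai/tts/v1/voice:stream \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Inworld.", "voiceId": "Sarah", "modelId": "inworld-tts-1.5-max"}'
```

Each printed line is a JSON object carrying a result.audioContent chunk, as described above.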

Voice Cloning

Create a custom voice from 5-15 seconds of reference audio. No training step, no additional cost. First, clone the voice via the /voices/v1/voices:clone endpoint. Then use the returned voiceId in your TTS calls.
import requests
import base64
import os

INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]

# Step 1: Clone a voice from a reference audio file (5-15 seconds of clear speech)
with open("reference_voice.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

clone_response = requests.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "displayName": "MyClonedVoice",
        "langCode": "EN_US",
        "voiceSamples": [{"audioData": audio_b64}]
    },
    timeout=60
)
clone_response.raise_for_status()

cloned_voice_id = clone_response.json()["voice"]["voiceId"]
print(f"Cloned voice ID: {cloned_voice_id}")

# Step 2: Use the cloned voice for TTS
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "text": "This sentence will be spoken in the cloned voice.",
        "voiceId": cloned_voice_id,
        "modelId": "inworld-tts-1.5-max"
    },
    timeout=30
)
response.raise_for_status()

audio_bytes = base64.b64decode(response.json()["audioContent"])
with open("cloned_output.wav", "wb") as f:
    f.write(audio_bytes)
Use clear, single-speaker audio with minimal background noise for best results.
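Clips outside the 5-15 second window are a common cause of poor clones, so it can help to check the reference length locally before uploading. A sketch using Python's standard wave module (assumes an uncompressed PCM .wav file; the bounds come from the guidance above):

```python
import os
import wave

def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from frame count and sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

if os.path.exists("reference_voice.wav"):
    duration = wav_duration_seconds("reference_voice.wav")
    if not 5 <= duration <= 15:
        raise ValueError(f"Reference clip is {duration:.1f}s; aim for 5-15 seconds")
```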

Error Handling

Production code should handle network timeouts, invalid API keys, and unexpected response formats. Here is a complete example:
import requests
import base64
import os

INWORLD_API_KEY = os.environ.get("INWORLD_API_KEY")
if not INWORLD_API_KEY:
    raise ValueError("Set the INWORLD_API_KEY environment variable")

try:
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {INWORLD_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "text": "Handle errors gracefully.",
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-max"
        },
        timeout=30
    )
    response.raise_for_status()

    data = response.json()
    audio_bytes = base64.b64decode(data["audioContent"])

    with open("output.wav", "wb") as f:
        f.write(audio_bytes)

except requests.exceptions.HTTPError as e:
    print(f"API error {e.response.status_code}: {e.response.text}")
except requests.exceptions.Timeout:
    print("Request timed out. Check your network or try again.")
except KeyError:
    print(f"Unexpected response format: {response.text[:200]}")
Common error responses follow standard HTTP semantics:
  • 400 Bad Request: malformed body or unknown voiceId/modelId
  • 401 Unauthorized: missing or invalid API key
  • 429 Too Many Requests: rate limit exceeded; back off and retry
  • 5xx: transient server error; safe to retry
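Timeouts, rate limits, and 5xx responses are usually transient, so a retry wrapper with exponential backoff is a reasonable pattern. A sketch (the helper and its delay schedule are illustrative, not part of the API; in production, retry only on exceptions you know are transient, such as requests.exceptions.Timeout):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage sketch wrapping the request above (synthesize_once is hypothetical):
# data = with_retries(lambda: synthesize_once())
```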

When to Use Streaming vs Non-Streaming

Use non-streaming when you need the complete audio file before doing anything with it: saving to storage, post-processing, embedding in other media, or any batch workflow.
Use streaming when a user is waiting to hear the output: voice agents, realtime assistants, interactive applications, or any scenario where time-to-first-audio matters.
In most production voice applications, streaming is the right default. The difference between hearing audio in 150ms versus waiting 800ms for a complete response is the difference between a conversation that feels natural and one that feels broken.

Next Steps

  • Realtime voice agents: The Realtime API handles TTS, STT, and LLM processing over a single WebSocket for full speech-to-speech agents.
  • API reference: Full endpoint documentation at docs.inworld.ai
  • Voice library: Browse available voices and create custom clones in the Inworld Portal
  • Model details: Read the TTS model comparison for independent benchmark data.
Copyright © 2021-2026 Inworld AI