Last updated: April 5, 2026
Inworld TTS is ranked #1 on Artificial Analysis, with the highest quality score on independent benchmarks.
Set your API key and get audio:
curl -X POST https://api.inworld.ai/tts/v1/voice \
-H "Authorization: Basic $INWORLD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"text": "Hello from Inworld.", "voiceId": "Sarah", "modelId": "inworld-tts-1.5-max"}' \
| python3 -c "import sys,json,base64; open('hello.wav','wb').write(base64.b64decode(json.load(sys.stdin)['audioContent']))"
That writes a .wav file to disk. Three fields, one endpoint, done.
Get an API Key
- Sign up at platform.inworld.ai
- Generate an API key in the Inworld Portal
- Set it as an environment variable:
export INWORLD_API_KEY=your_key_here
Every example on this page reads from INWORLD_API_KEY. Set it once, and everything works.
API Reference
Endpoint: POST https://api.inworld.ai/tts/v1/voice
Headers:
- Authorization: Basic $INWORLD_API_KEY (API-key auth, as in the examples on this page)
- Content-Type: application/json

Request body:
- text: the text to synthesize
- voiceId: the voice to use (e.g. "Sarah")
- modelId: the model to use (e.g. "inworld-tts-1.5-max")

Response:
The response is JSON with an audioContent field containing base64-encoded audio. Decode it to get the raw audio bytes.
{
"audioContent": "base64-encoded-audio...",
"usage": { ... }
}
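The decode step can be exercised without calling the API at all. In this sketch the audioContent value is a placeholder (the base64 encoding of the four bytes "RIFF", the WAV magic number), not real audio:

```python
import base64
import json

# Stub response shaped like the API's JSON. "UklGRg==" is base64 for the
# four bytes b"RIFF" (the WAV magic number), standing in for real audio.
raw = '{"audioContent": "UklGRg==", "usage": {}}'

data = json.loads(raw)
audio_bytes = base64.b64decode(data["audioContent"])
print(audio_bytes)  # b'RIFF'
```

Every example below follows exactly this shape: parse JSON, read audioContent, base64-decode.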
Models
The models share the same API. Swap modelId in the request body to switch between them.
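One way to keep that swap down to a single argument is a small payload helper. The function name and defaults here are illustrative, not part of the API:

```python
def tts_payload(text, voice_id="Sarah", model_id="inworld-tts-1.5-max"):
    """Build the TTS request body; change model_id to switch models."""
    return {"text": text, "voiceId": voice_id, "modelId": model_id}

# The same call site works for any model:
body = tts_payload("Hello.", model_id="inworld-tts-1.5-max")
print(body["modelId"])  # inworld-tts-1.5-max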
Full Non-Streaming Example
This synthesizes a longer sentence and saves the output to a file. The response comes back as a single JSON object with all audio in one base64 string.
import requests
import base64
import os
# pip install requests
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Inworld TTS generates speech that sounds like a real person. Try a longer sentence to hear the natural prosody and pacing.",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
    },
    timeout=30,
)
response.raise_for_status()

data = response.json()
audio_bytes = base64.b64decode(data["audioContent"])

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print(f"Wrote {len(audio_bytes)} bytes to output.wav")
print(f"Usage: {data.get('usage', {})}")
Streaming
For conversational applications, you want audio playing before the full response is generated. The streaming endpoint returns chunks as they become available.
Endpoint: POST https://api.inworld.ai/tts/v1/voice:stream
The request body is identical to the non-streaming endpoint. The response is newline-delimited JSON (NDJSON): one JSON object per line, each containing a chunk of audio.
Each line looks like this:
{"result": {"audioContent": "base64-encoded-chunk..."}}
Parse each line as JSON, decode result.audioContent from base64, and play or buffer the audio bytes.
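That parse-and-decode loop can be checked on stub lines before wiring up the HTTP call. The two base64 payloads here encode the bytes b"ab" and b"cd", not real audio:

```python
import base64
import json

# Two stub NDJSON lines shaped like the streaming response.
ndjson = (
    '{"result": {"audioContent": "YWI="}}\n'
    '{"result": {"audioContent": "Y2Q="}}\n'
)

chunks = []
for line in ndjson.splitlines():
    if not line:
        continue  # skip blank lines, as with real NDJSON
    obj = json.loads(line)
    chunks.append(base64.b64decode(obj["result"]["audioContent"]))

print(b"".join(chunks))  # b'abcd'
```

The real streaming example below uses the same loop over response.iter_lines().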
import requests
import base64
import json
import os
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Streaming delivers audio chunks as they are generated. This reduces time-to-first-byte significantly, which matters for conversational applications where users are waiting for a response.",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
    },
    stream=True,
    timeout=30,
)
response.raise_for_status()

audio_chunks = []
for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    audio_data = base64.b64decode(chunk["result"]["audioContent"])
    audio_chunks.append(audio_data)
    print(f"Received chunk: {len(audio_data)} bytes")

full_audio = b"".join(audio_chunks)
with open("streamed_output.wav", "wb") as f:
    f.write(full_audio)

print(f"Total: {len(full_audio)} bytes across {len(audio_chunks)} chunks")
If you test the streaming endpoint with curl, pass the -N flag to disable output buffering so you see chunks as they arrive instead of all at once when the response completes.
Voice Cloning
Create a custom voice from 5-15 seconds of reference audio. No training step, no additional cost. First, clone the voice via the /voices/v1/voices:clone endpoint. Then use the returned voiceId in your TTS calls.
import requests
import base64
import os
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]
# Step 1: Clone a voice from a reference audio file (5-15 seconds of clear speech)
with open("reference_voice.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

clone_response = requests.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "displayName": "MyClonedVoice",
        "langCode": "EN_US",
        "voiceSamples": [{"audioData": audio_b64}],
    },
    timeout=60,
)
clone_response.raise_for_status()
cloned_voice_id = clone_response.json()["voice"]["voiceId"]
print(f"Cloned voice ID: {cloned_voice_id}")

# Step 2: Use the cloned voice for TTS
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "text": "This sentence will be spoken in the cloned voice.",
        "voiceId": cloned_voice_id,
        "modelId": "inworld-tts-1.5-max",
    },
    timeout=30,
)
response.raise_for_status()

audio_bytes = base64.b64decode(response.json()["audioContent"])
with open("cloned_output.wav", "wb") as f:
    f.write(audio_bytes)
Use clear, single-speaker audio with minimal background noise for best results.
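A quick local check of the reference clip's length can save a failed upload. This sketch uses only the standard-library wave module; the helper name is illustrative, and the 5-15 second bounds mirror the guidance above:

```python
import struct
import wave

def reference_duration_ok(path, min_s=5.0, max_s=15.0):
    """Return (duration_in_seconds, within_bounds) for a WAV reference clip."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return duration, min_s <= duration <= max_s

# Demo: write 8 seconds of 16 kHz mono silence, then check it.
with wave.open("demo_reference.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 0) * 16000 * 8)

duration, ok = reference_duration_ok("demo_reference.wav")
print(duration, ok)  # 8.0 True
```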
Error Handling
Production code should handle network timeouts, invalid API keys, and unexpected response formats. Here is a complete example:
import requests
import base64
import os
INWORLD_API_KEY = os.environ.get("INWORLD_API_KEY")
if not INWORLD_API_KEY:
    raise ValueError("Set the INWORLD_API_KEY environment variable")

try:
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {INWORLD_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "text": "Handle errors gracefully.",
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-max",
        },
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    audio_bytes = base64.b64decode(data["audioContent"])
    with open("output.wav", "wb") as f:
        f.write(audio_bytes)
except requests.exceptions.HTTPError as e:
    print(f"API error {e.response.status_code}: {e.response.text}")
except requests.exceptions.Timeout:
    print("Request timed out. Check your network or try again.")
except KeyError:
    print(f"Unexpected response format: {response.text[:200]}")
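For transient failures (timeouts, rate limiting), production code often adds retries with exponential backoff. This generic sketch is not part of any Inworld SDK; it wraps an arbitrary zero-argument callable, so the pattern can be tested with a stub before pointing it at requests.post:

```python
import time

def with_retries(fn, retries=3, base_delay=0.5):
    """Call fn(); on exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Demo with a stub that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice you would retry only on Timeout and retryable HTTP statuses, not on every exception.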
Common error responses (standard HTTP semantics):
- 400: malformed request body (missing or invalid text, voiceId, or modelId)
- 401: missing or invalid API key
- 429: rate limit exceeded; retry with backoff
When to Use Streaming vs Non-Streaming
Use non-streaming when you need the complete audio file before doing anything with it: saving to storage, post-processing, embedding in other media, or any batch workflow.
Use streaming when a user is waiting to hear the output: voice agents, realtime assistants, interactive applications, or any scenario where time-to-first-audio matters.
In most production voice applications, streaming is the right default. The difference between hearing audio in 150ms versus waiting 800ms for a complete response is the difference between a conversation that feels natural and one that feels broken.
Next Steps
- Realtime voice agents: The Realtime API handles TTS, STT, and LLM processing over a single WebSocket for full speech-to-speech agents.
- API reference: Full endpoint documentation at docs.inworld.ai
- Voice library: Browse available voices and create custom clones in the Inworld Portal
- Model details: Read the TTS model comparison for independent benchmark data.