Last updated: April 5, 2026
Inworld AI ranks #1 on Artificial Analysis for text-to-speech quality with an ELO of 1,236. This tutorial shows how to call the Inworld TTS API from Python, starting with a minimal example and building up to streaming, voice cloning, and a full voice pipeline. Every code block below is copy-paste ready. You just need a free API key from platform.inworld.ai and the requests library (pip install requests).
```python
import requests, base64

API_KEY = "your_api_key_here"  # from https://platform.inworld.ai

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={"Authorization": f"Basic {API_KEY}"},
    json={"voiceId": "Sarah", "modelId": "inworld-tts-1.5-max", "text": "Hello world"}
)
audio = base64.b64decode(response.json()["audioContent"])
with open("output.mp3", "wb") as f:
    f.write(audio)
```
That is all it takes to generate speech. The rest of this guide covers streaming, voice cloning, long-text chunking, and a full STT-to-LLM-to-TTS pipeline.
How Do I Call the Inworld TTS API from Python?
The synchronous endpoint accepts a JSON payload with three required fields: voiceId, modelId, and text. Authentication uses Basic auth with your API key. The response returns a JSON object with base64-encoded audio in the audioContent field. You must decode it before writing to a file.
```python
import requests
import base64

# pip install requests
API_KEY = "your_api_key_here"  # From https://platform.inworld.ai

session = requests.Session()
response = session.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Welcome to our application. This audio was generated with the Inworld TTS API."
    },
    timeout=30
)
response.raise_for_status()

result = response.json()
audio_bytes = base64.b64decode(result["audioContent"])

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

print(f"Saved {len(audio_bytes)} bytes to output.mp3")
```
A few details:
- requests.Session() reuses the underlying TCP and TLS connection across multiple calls. If you are generating audio in a loop or handling user requests in a server, this avoids repeated handshake overhead and cuts latency on subsequent requests.
- voiceId selects the voice. The default is "Sarah" (a fast-talking young adult woman with a curious tone). There are 271+ voices available across 15 languages. Call GET /voices/v1/voices to list them all.
- modelId selects the model. inworld-tts-1.5-max optimizes for quality. inworld-tts-1.5-mini cuts latency to around 120ms median if speed matters more than fidelity.
- Audio format defaults to MP3 at 24kHz. You can change this with an audioConfig object: set audioEncoding to LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, or FLAC, and sampleRateHertz to your preferred rate (8000-48000).
- Max input is 2,000 characters per request. For longer text, chunk at sentence boundaries (see the chunking section below).
```python
import requests

API_KEY = "your_api_key_here"

response = requests.get(
    "https://api.inworld.ai/voices/v1/voices",
    headers={"Authorization": f"Basic {API_KEY}"}
)
voices = response.json()["voices"]
print(f"Available voices: {len(voices)}")
for voice in voices[:10]:
    print(f"  {voice['voiceId']}: {voice['displayName']}")
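The audioConfig options described above can be wrapped in a small helper. The sketch below builds a request body with an explicit encoding and sample rate, then posts it; `build_tts_payload` and `synthesize` are hypothetical helper names, not part of the API, and the session is assumed to be a requests.Session as in the earlier examples.

```python
import base64

API_KEY = "your_api_key_here"  # placeholder, as in the examples above

def build_tts_payload(text, voice_id="Sarah", model_id="inworld-tts-1.5-max",
                      encoding="LINEAR16", sample_rate=16000):
    """Build a /tts/v1/voice request body with an explicit audioConfig."""
    return {
        "voiceId": voice_id,
        "modelId": model_id,
        "text": text,
        "audioConfig": {
            "audioEncoding": encoding,       # MP3 (default), LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, FLAC
            "sampleRateHertz": sample_rate,  # 8000-48000
        },
    }

def synthesize(session, text, **kwargs):
    """POST the payload with the given requests.Session and return decoded audio bytes."""
    response = session.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={"Authorization": f"Basic {API_KEY}"},
        json=build_tts_payload(text, **kwargs),
        timeout=30,
    )
    response.raise_for_status()
    return base64.b64decode(response.json()["audioContent"])
```

Calling `synthesize(session, "Hello", encoding="WAV", sample_rate=48000)` would then request 48kHz WAV output instead of the MP3 default.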
How Do I Stream TTS Audio in Python?
Streaming is the recommended approach for any interactive application. Instead of waiting for the entire audio file to be generated, the streaming endpoint (/tts/v1/voice:stream) returns audio chunks as they are synthesized. First audio arrives in under 200ms.
The response format is NDJSON (newline-delimited JSON). Each line is a standalone JSON object containing a base64-encoded audio chunk. This is not raw binary. You must parse each line with json.loads and then decode the base64 audioContent.
```python
import requests
import base64
import json

API_KEY = "your_api_key_here"
session = requests.Session()

text = """This is a longer passage that benefits from streaming.
The API returns audio chunks as they are generated,
so playback can start before the full synthesis is complete."""

response = session.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": text
    },
    stream=True,
    timeout=30
)
response.raise_for_status()

audio_chunks = []
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        audio_chunks.append(chunk)

audio = b"".join(audio_chunks)
with open("streamed_output.mp3", "wb") as f:
    f.write(audio)
print(f"Received {len(audio_chunks)} chunks, {len(audio)} bytes total")
```
Realtime Playback with PyAudio
For applications that need to play audio as it arrives (voice agents, chatbots, accessibility tools), combine streaming with pyaudio to write PCM chunks directly to the speaker:
```python
import requests
import base64
import json
import pyaudio  # pip install pyaudio

API_KEY = "your_api_key_here"
session = requests.Session()

response = session.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Streaming audio plays back in realtime as chunks arrive.",
        "audioConfig": {
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 24000
        }
    },
    stream=True
)

# Play each chunk as it arrives
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()
```
The key difference from the file-saving example: set audioEncoding to LINEAR16 and sampleRateHertz to 24000, then write raw PCM bytes to the audio stream. Each chunk plays the moment it arrives. Users hear the first word in under 200ms while the rest of the sentence is still being generated.
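If you also want to keep a copy of the LINEAR16 stream on disk, the raw PCM bytes can be wrapped in a WAV container with the stdlib wave module. This is a sketch, assuming mono 16-bit PCM at the 24kHz rate used above; `save_pcm_as_wav` and `pcm_chunks` are hypothetical names for illustration.

```python
import wave

def save_pcm_as_wav(pcm_chunks, path, sample_rate=24000):
    """Wrap raw 16-bit mono PCM chunks (LINEAR16) in a standard WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(pcm_chunks))
```

In the playback loop, append each decoded chunk to a list as well as writing it to the stream, then call `save_pcm_as_wav(chunks, "copy.wav")` after the loop ends.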
Common Streaming Mistakes
Avoid these patterns that look correct but produce broken audio:
- Iterating with response.iter_content(chunk_size=4096), which treats the response as raw binary. The stream is NDJSON, not binary audio. You will get corrupted output.
- Writing response.content directly to a file without base64 decoding. The response body is JSON, not audio bytes. You will get a text file with base64 strings.
- Using voice instead of voiceId in the request body. The REST TTS API uses voiceId. The voice field is for the Realtime API WebSocket protocol.
How Do I Clone a Voice with Python?
Voice cloning creates a custom voice from a 5-15 second audio sample. The cloned voice can then be used in any TTS call. Samples longer than 15 seconds are automatically trimmed. Supported formats: wav, mp3, webm. Maximum file size: 4MB.
```python
import requests
import base64

API_KEY = "your_api_key_here"
session = requests.Session()

# Step 1: Clone a voice from an audio sample (5-15 seconds, wav/mp3/webm, max 4MB)
with open("voice_sample.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

clone_response = session.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "displayName": "my-custom-voice",
        "langCode": "EN_US",
        "voiceSamples": [{"audioData": audio_data}]
    },
    timeout=60
)
clone_response.raise_for_status()

cloned_voice = clone_response.json()
custom_voice_id = cloned_voice["voice"]["voiceId"]
print(f"Cloned voice ID: {custom_voice_id}")

# Step 2: Use the cloned voice for TTS
tts_response = session.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": custom_voice_id,
        "modelId": "inworld-tts-1.5-max",
        "text": "This speech uses my cloned voice."
    }
)
audio = base64.b64decode(tts_response.json()["audioContent"])
with open("cloned_voice_output.mp3", "wb") as f:
    f.write(audio)
```
The cloned voiceId works exactly like any built-in voice. Pass it to the synchronous or streaming endpoint. Each account can create up to 1,000 cloned voices.
How Do I Build a Full Voice Pipeline in Python?
A voice pipeline chains three APIs into one flow: STT transcribes audio input, an LLM generates a response, and TTS converts that response back to speech. With Inworld, all three steps use the same API key and authentication.
```python
import requests
import base64
import json

API_KEY = "your_api_key_here"
session = requests.Session()
headers = {
    "Authorization": f"Basic {API_KEY}",
    "Content-Type": "application/json"
}

# Step 1: Transcribe audio with Inworld STT
with open("user_audio.wav", "rb") as f:
    audio_input = base64.b64encode(f.read()).decode()

stt_response = session.post(
    "https://api.inworld.ai/stt/v1/transcribe",
    headers=headers,
    json={
        "transcribeConfig": {
            "modelId": "groq/whisper-large-v3",
            "audioEncoding": "AUTO_DETECT",
            "language": "en-US"
        },
        "audioData": {
            "content": audio_input
        }
    },
    timeout=30
)
stt_response.raise_for_status()
transcript = stt_response.json()["transcription"]["transcript"]
print(f"User said: {transcript}")

# Step 2: Send transcript to an LLM via Inworld Router
llm_response = session.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers=headers,
    json={
        "model": "openai/gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a helpful voice assistant. Keep responses under 200 words."},
            {"role": "user", "content": transcript}
        ]
    },
    timeout=30
)
llm_response.raise_for_status()
reply = llm_response.json()["choices"][0]["message"]["content"]
print(f"Assistant: {reply}")

# Step 3: Convert the LLM reply to speech with Inworld TTS
tts_response = session.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers=headers,
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": reply
    },
    stream=True
)

audio_chunks = []
for line in tts_response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        audio_chunks.append(chunk)

audio = b"".join(audio_chunks)
with open("response.mp3", "wb") as f:
    f.write(audio)
print("Full pipeline complete: transcription -> reasoning -> speech")
```
This pipeline uses three Inworld APIs:
- STT (/stt/v1/transcribe) converts user audio to text
- Router (/v1/chat/completions) sends the transcript to any LLM (the Router routes to 200+ models from 17 providers)
- TTS (/tts/v1/voice:stream) converts the LLM reply back to speech with streaming
For production voice agents that need lower latency and bidirectional audio, the Inworld Realtime API handles all three steps over a single WebSocket connection with built-in turn detection and barge-in support.
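One way to keep the pipeline script maintainable and testable is to put the three stages behind plain callables, so each HTTP call can be swapped for a mock in tests. A minimal sketch; `voice_turn`, `transcribe_fn`, `llm_fn`, and `tts_fn` are hypothetical names, not part of the Inworld API.

```python
def voice_turn(audio_bytes, transcribe_fn, llm_fn, tts_fn):
    """Run one STT -> LLM -> TTS turn and return (transcript, reply, audio).

    Each *_fn is a callable wrapping one of the three HTTP calls:
    transcribe_fn(audio_bytes) -> str      e.g. POST /stt/v1/transcribe
    llm_fn(transcript) -> str              e.g. POST /v1/chat/completions
    tts_fn(reply) -> bytes                 e.g. POST /tts/v1/voice:stream
    """
    transcript = transcribe_fn(audio_bytes)
    reply = llm_fn(transcript)
    audio = tts_fn(reply)
    return transcript, reply, audio
```

In production, pass functions that wrap the requests calls from the pipeline above; in unit tests, pass lambdas that return canned values.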
How Do I Handle Long Text in Python?
The TTS API accepts a maximum of 2,000 characters per request. For longer content (articles, documentation, email bodies), split the text at sentence boundaries and synthesize each chunk separately:
```python
import re

def chunk_text(text, max_chars=1500):
    """Split text at sentence boundaries, keeping chunks under max_chars."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 > max_chars:
            if current:
                chunks.append(current.strip())
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current.strip())
    return chunks

# Usage: synthesize long text in chunks
long_text = "..."  # Any length
for i, chunk in enumerate(chunk_text(long_text)):
    response = session.post(
        "https://api.inworld.ai/tts/v1/voice:stream",
        headers={"Authorization": f"Basic {API_KEY}"},
        json={"voiceId": "Sarah", "modelId": "inworld-tts-1.5-max", "text": chunk},
        stream=True
    )
    # Process each chunk's streaming response...
```
Keep chunks between 500 and 1,600 characters. Splitting mid-sentence creates unnatural pauses. Splitting at paragraph or sentence boundaries preserves natural prosody.
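Since each chunk is an independent request, long documents can also be synthesized concurrently while preserving playback order. A sketch using the stdlib ThreadPoolExecutor; `synthesize_chunks` is a hypothetical helper, and `synthesize_fn` stands in for any function that takes a text chunk and returns audio bytes, such as a wrapper around the TTS calls above.

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_chunks(chunks, synthesize_fn, max_workers=4):
    """Synthesize text chunks in parallel, returning audio in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which request happens to finish first.
        return list(pool.map(synthesize_fn, chunks))
```

Keep max_workers modest (roughly 2-8) so concurrent requests do not hit rate limits; the session-per-thread details depend on your HTTP wrapper.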
How Does Inworld Compare to Other Python TTS Options?
| Feature | Inworld AI TTS | ElevenLabs Python SDK | OpenAI TTS |
|---|---|---|---|
| Quality ranking (Artificial Analysis) | #1 (ELO 1,236) | #5 (ELO 1,108) | #3 (ELO 1,111) |
| Median latency | <200ms (Max), ~120ms (Mini) | ~300ms | ~250ms |
| Streaming | JSON streaming (line-by-line) | Chunked binary | Chunked transfer encoding |
| Python integration | requests (no SDK required) | elevenlabs Python SDK | openai Python SDK |
| Voice cloning | 5-15s sample, included | Instant + Professional (paid tiers) | Not publicly available |
| Voices available | 271+ | 10,000+ (community library) | 13 built-in |
| Languages | 15 | 70+ (v3) | 57+ |
| Auth | Basic (single API key) | Custom header (xi-api-key) | Bearer token |
| Full voice pipeline | STT + Router + TTS under one key | TTS + STT + Conversational AI | TTS + Realtime API |
| On-premise deployment | Available | Partial (AWS Marketplace only) | Not available |
The biggest differentiator for Python developers: Inworld requires zero SDK installation. The entire API surface is accessible with requests, a ubiquitous library that is a single pip install away in any Python environment. No proprietary client, no version conflicts, no dependency tree.
Frequently Asked Questions
How do I add text-to-speech to a Python app?
Install requests, then call the Inworld AI TTS API with a POST request to https://api.inworld.ai/tts/v1/voice. Pass your text, voiceId, and modelId in the JSON body. Decode the base64 audioContent from the response and write it to a file. A few lines of code, no SDK required.
Does Inworld TTS support streaming in Python?
Yes. Use the /tts/v1/voice:stream endpoint with stream=True in your requests call. The response is NDJSON where each line contains a JSON object with base64-encoded audio. Parse line-by-line with json.loads and base64.b64decode. First audio arrives in under 200ms.
How do I clone a voice with the Inworld Python API?
POST to https://api.inworld.ai/voices/v1/voices:clone with a displayName, langCode, and voiceSamples array containing base64-encoded audio (5-15 seconds, wav/mp3/webm, max 4MB). The API returns a voice object with a custom voiceId you can use in any subsequent TTS call. Up to 1,000 cloned voices per account.
What is the best TTS model for Python developers?
Inworld TTS 1.5 Max is ranked #1 on Artificial Analysis with an ELO of 1,236 from thousands of blind comparisons. It delivers sub-200ms median latency with 271+ voices across 15 languages. For latency-sensitive applications, TTS 1.5 Mini drops median latency to around 120ms.
Can I build a full voice pipeline in Python?
Yes. Combine Inworld STT (transcription), Router (LLM reasoning across 200+ models), and TTS (speech output) in a single Python script. All three APIs share the same API key and Basic auth. For realtime bidirectional voice, the Inworld Realtime API handles the full pipeline over WebSocket.
What is the maximum text length for Inworld TTS?
2,000 characters per request. For longer text, chunk at sentence boundaries (500-1,600 characters per chunk) and make multiple streaming requests. The chunking example above shows how to split text cleanly without breaking mid-sentence.
What audio formats does Inworld TTS support?
MP3 (default), LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, and FLAC. Sample rates from 8kHz to 48kHz. Set these via the audioConfig object in your request. Use LINEAR16 at 24kHz for realtime playback with PyAudio.
Do I need an SDK to use Inworld TTS in Python?
No. The API is a standard REST endpoint. The Python requests library (one pip install away in any environment) is all you need. No proprietary SDK, no version conflicts. This makes Inworld TTS straightforward to integrate into any Python project, framework, or deployment environment.