Python TTS API Tutorial: Text-to-Speech in 5 Lines (2026)

Q: Does Realtime TTS support streaming in Python?

Yes. Use the /tts/v1/voice:stream endpoint. The response is NDJSON where each line contains a base64-encoded audio chunk. Parse line-by-line with json.loads and base64.b64decode.

Q: How do I clone a voice with the Inworld Python API?

POST to /voices/v1/voices:clone with a displayName, langCode, and voiceSamples array. The API returns a voice object containing a custom voiceId you can use in subsequent TTS calls.

Last updated: April 5, 2026

Inworld AI's Realtime TTS-2 (research preview) is a first-party realtime voice model built for expressive, low-latency speech. Inworld's Realtime TTS-2 is the #1 realtime TTS. This tutorial shows how to call the Realtime TTS API from Python, starting with a 5-line example and building up to streaming, voice cloning, and a full voice pipeline. Every code block below is copy-paste ready. You just need a free API key from platform.inworld.ai and the requests library (pip install requests).

import requests, base64

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={"Authorization": f"Basic {API_KEY}"},
    json={"voiceId": "Sarah", "modelId": "inworld-tts-1.5-max", "text": "Hello world"}
)
audio = base64.b64decode(response.json()["audioContent"])
with open("output.mp3", "wb") as f:
    f.write(audio)

That is all it takes to generate speech. The rest of this guide covers streaming, voice cloning, long-text chunking, and a full STT-to-LLM-to-TTS pipeline.

How Do I Call the Realtime TTS API from Python?

The synchronous endpoint accepts a JSON payload with three required fields: voiceId, modelId, and text. Authentication uses Basic auth with your API key. The response returns a JSON object with base64-encoded audio in the audioContent field. You must decode it before writing to a file.

import requests
import base64

# pip install requests
API_KEY = "your_api_key_here"  # From https://platform.inworld.ai

session = requests.Session()

response = session.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Welcome to our application. This audio was generated with the Realtime TTS API."
    },
    timeout=30
)
response.raise_for_status()

result = response.json()
audio_bytes = base64.b64decode(result["audioContent"])

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

print(f"Saved {len(audio_bytes)} bytes to output.mp3")

A few details:

requests.Session() reuses the underlying TCP and TLS connection across multiple calls. If you are generating audio in a loop or handling user requests in a server, this avoids repeated handshake overhead and cuts latency on subsequent requests.
voiceId selects the voice. The default is "Sarah." Browse the full voice library across 15 GA languages (TTS 1.5) or 100+ languages (TTS-2 cross-lingual) via GET /voices/v1/voices.
modelId selects the model. inworld-tts-1.5-max optimizes for quality. inworld-tts-1.5-mini optimizes for realtime latency if speed matters more than fidelity.
Audio format defaults to MP3 at 24kHz. You can change this with an audioConfig object: set audioEncoding to LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, or FLAC, and sampleRateHertz to your preferred rate (8000-48000).
Max input is 2,000 characters per request. For longer text, chunk at sentence boundaries (see the chunking section below).

import requests

API_KEY = "your_api_key_here"

response = requests.get(
    "https://api.inworld.ai/voices/v1/voices",
    headers={"Authorization": f"Basic {API_KEY}"}
)

voices = response.json()["voices"]
print(f"Available voices: {len(voices)}")
for voice in voices[:10]:
    print(f"  {voice['voiceId']}: {voice['displayName']}")

How Do I Stream TTS Audio in Python?

Streaming is the recommended approach for any interactive application. Instead of waiting for the entire audio file to be generated, the streaming endpoint (/tts/v1/voice:stream) returns audio chunks as they are synthesized at realtime latency.

The response format is NDJSON (newline-delimited JSON). Each line is a standalone JSON object containing a base64-encoded audio chunk. This is not raw binary. You must parse each line with json.loads and then decode the base64 audioContent.

import requests
import base64
import json

API_KEY = "your_api_key_here"

session = requests.Session()

text = """This is a longer passage that benefits from streaming.
The API returns audio chunks as they are generated,
so playback can start before the full synthesis is complete."""

response = session.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": text
    },
    stream=True,
    timeout=30
)
response.raise_for_status()

audio_chunks = []
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        audio_chunks.append(chunk)

audio = b"".join(audio_chunks)
with open("streamed_output.mp3", "wb") as f:
    f.write(audio)

print(f"Received {len(audio_chunks)} chunks, {len(audio)} bytes total")

Realtime Playback with PyAudio

For applications that need to play audio as it arrives (voice agents, chatbots, accessibility tools), combine streaming with pyaudio to write PCM chunks directly to the speaker:

import requests
import base64
import json
import pyaudio

API_KEY = "your_api_key_here"

session = requests.Session()

response = session.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Streaming audio plays back in realtime as chunks arrive.",
        "audioConfig": {
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 24000
        }
    },
    stream=True
)

# Play each chunk as it arrives
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()

The key difference from the file-saving example: set audioEncoding to LINEAR16 and sampleRateHertz to 24000, then write raw PCM bytes to the audio stream. Each chunk plays the moment it arrives. Users hear the first word at realtime latency while the rest of the sentence is still being generated.

Common Streaming Mistakes

Avoid these patterns that look correct but produce broken audio:

response.iter_content(chunk_size=4096) treats the response as raw binary. The stream is NDJSON, not binary audio. You will get corrupted output.
Writing response.content directly to a file without base64 decoding. The response body is JSON, not audio bytes. You will get a text file with base64 strings.
Using voice instead of voiceId in the request body. The REST TTS API uses voiceId. The voice field is for the Realtime API WebSocket protocol.

How Do I Clone a Voice with Python?

Voice cloning creates a custom voice from a 5-15 second audio sample. The cloned voice can then be used in any TTS call. Samples longer than 15 seconds are automatically trimmed. Supported formats: wav, mp3, webm. Maximum file size: 4MB.

import requests
import base64

API_KEY = "your_api_key_here"

session = requests.Session()

# Step 1: Clone a voice from an audio sample (5-15 seconds, wav/mp3/webm, max 4MB)
with open("voice_sample.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

clone_response = session.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "displayName": "my-custom-voice",
        "langCode": "EN_US",
        "voiceSamples": [{"audioData": audio_data}]
    },
    timeout=60
)
clone_response.raise_for_status()

cloned_voice = clone_response.json()
custom_voice_id = cloned_voice["voice"]["voiceId"]
print(f"Cloned voice ID: {custom_voice_id}")

# Step 2: Use the cloned voice for TTS
tts_response = session.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "voiceId": custom_voice_id,
        "modelId": "inworld-tts-1.5-max",
        "text": "This speech uses my cloned voice."
    }
)

audio = base64.b64decode(tts_response.json()["audioContent"])
with open("cloned_voice_output.mp3", "wb") as f:
    f.write(audio)

The cloned voiceId works exactly like any built-in voice. Pass it to the synchronous or streaming endpoint. Each account can create up to 1,000 cloned voices.

How Do I Build a Full Voice Pipeline in Python?

A voice pipeline chains three APIs into one flow: STT transcribes audio input, an LLM generates a response, and TTS converts that response back to speech. With Inworld, all three steps use the same API key and authentication.

import requests
import base64
import json

API_KEY = "your_api_key_here"

session = requests.Session()
headers = {
    "Authorization": f"Basic {API_KEY}",
    "Content-Type": "application/json"
}

# Step 1: Transcribe audio with Realtime STT
with open("user_audio.wav", "rb") as f:
    audio_input = base64.b64encode(f.read()).decode()

stt_response = session.post(
    "https://api.inworld.ai/stt/v1/transcribe",
    headers=headers,
    json={
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "AUTO_DETECT",
            "language": "en-US"
        },
        "audioData": {
            "content": audio_input
        }
    },
    timeout=30
)
stt_response.raise_for_status()
transcript = stt_response.json()["transcription"]["transcript"]
print(f"User said: {transcript}")

# Step 2: Send transcript to an LLM via Inworld Router
llm_response = session.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers=headers,
    json={
        "model": "openai/gpt-5.5",
        "messages": [
            {"role": "system", "content": "You are a helpful voice assistant. Keep responses under 200 words."},
            {"role": "user", "content": transcript}
        ]
    },
    timeout=30
)
llm_response.raise_for_status()
reply = llm_response.json()["choices"][0]["message"]["content"]
print(f"Assistant: {reply}")

# Step 3: Convert the LLM reply to speech with Realtime TTS
tts_response = session.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers=headers,
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": reply
    },
    stream=True
)

audio_chunks = []
for line in tts_response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        audio_chunks.append(chunk)

audio = b"".join(audio_chunks)
with open("response.mp3", "wb") as f:
    f.write(audio)

print(f"Full pipeline complete: transcription -> reasoning -> speech")

This pipeline uses three Inworld APIs:

STT (/stt/v1/transcribe) converts user audio to text
Router (/v1/chat/completions) sends the transcript to any LLM (the Router routes to 220+ models from major providers)
TTS (/tts/v1/voice:stream) converts the LLM reply back to speech with streaming

For production voice agents that need lower latency and bidirectional audio, the Inworld Realtime API handles all three steps over a single WebSocket connection with built-in turn detection and barge-in support.

Note how the Router uses model (not modelId) because it follows the OpenAI Chat Completions format. The TTS endpoint uses voiceId and modelId. Different APIs, different field names.

How Do I Handle Long Text in Python?

The TTS API accepts a maximum of 2,000 characters per request. For longer content (articles, documentation, email bodies), split the text at sentence boundaries and synthesize each chunk separately:

def chunk_text(text, max_chars=1500):
    """Split text at sentence boundaries, keeping chunks under max_chars."""
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = ""

    for sentence in sentences:
        if len(current) + len(sentence) + 1 > max_chars:
            if current:
                chunks.append(current.strip())
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence

    if current:
        chunks.append(current.strip())

    return chunks

# Usage: synthesize long text in chunks
long_text = "..." # Any length
for i, chunk in enumerate(chunk_text(long_text)):
    response = session.post(
        "https://api.inworld.ai/tts/v1/voice:stream",
        headers={"Authorization": f"Basic {API_KEY}"},
        json={"voiceId": "Sarah", "modelId": "inworld-tts-1.5-max", "text": chunk},
        stream=True
    )
    # Process each chunk's streaming response...

Keep chunks between 500 and 1,600 characters. Splitting mid-sentence creates unnatural pauses. Splitting at paragraph or sentence boundaries preserves natural prosody.

How Does Inworld Compare to Other Python TTS Options?

Feature	Inworld AI TTS	ElevenLabs Python SDK	OpenAI TTS
Voice quality approach	#1 realtime TTS, expressive and steerable	High-fidelity, large voice library	Solid general-purpose voices
Median latency	Realtime latency	~300ms	~250ms
Streaming	JSON streaming (line-by-line)	Chunked binary	Chunked transfer encoding
Python integration	requests (no SDK required)	elevenlabs Python SDK	openai Python SDK
Voice cloning	5-15s sample, included	Instant + Professional (paid tiers)	Not publicly available
Voices available	Large built-in library + cloning	10,000+ (community library)	Limited built-in
Languages	15 GA (TTS 1.5) / 100+ (TTS-2 cross-lingual)	70+ (v3)	57+
Auth	Basic (single API key)	Custom header (xi-api-key)	Bearer token
Full voice pipeline	STT + Router + TTS under one key	TTS + STT + Conversational AI	TTS + Realtime API
On-premise deployment	Available	Available (on-premise + on-device)	Not available

Sources: ElevenLabs docs, OpenAI TTS docs.

The biggest differentiator for Python developers: Inworld requires zero SDK installation. The entire API surface is accessible with requests, a library already installed in virtually every Python environment. No proprietary client, no version conflicts, no dependency tree.

Frequently Asked Questions

How do I add text-to-speech to a Python app?

Install requests, then call the Inworld AI TTS API with a POST request to the /tts/v1/voice endpoint at api.inworld.ai. Pass your text, voiceId, and modelId in the JSON body. Decode the base64 audioContent from the response and write it to a file. Five lines of code, no SDK required.

Does Realtime TTS support streaming in Python?

Yes. Use the /tts/v1/voice:stream endpoint with stream=True in your requests call. The response is NDJSON where each line contains a JSON object with base64-encoded audio. Parse line-by-line with json.loads and base64.b64decode. First audio arrives at realtime latency.

How do I clone a voice with the Inworld Python API?

POST to the /voices/v1/voices:clone endpoint at api.inworld.ai with a displayName, langCode, and voiceSamples array containing base64-encoded audio (5-15 seconds, wav/mp3/webm, max 4MB). The API returns a voice object with a custom voiceId you can use in any subsequent TTS call. Up to 1,000 cloned voices per account.

What is the best TTS model for Python developers?

Inworld AI's Realtime TTS-2 (research preview) is a first-party realtime voice model built for expressive, low-latency speech. It delivers sub-200ms TTFT with cross-lingual voice identity across 100+ languages (15 GA + 90+ experimental). For the lowest latency, choose inworld-tts-1.5-mini.

Can I build a full voice pipeline in Python?

Yes. Combine Realtime STT (transcription), Router (LLM reasoning across 220+ models from major providers), and TTS (speech output) in a single Python script. All three APIs share the same API key and Basic auth. For realtime bidirectional voice, the Inworld Realtime API handles the full pipeline over WebSocket.

What is the maximum text length for Realtime TTS?

2,000 characters per request. For longer text, chunk at sentence boundaries (500-1,600 characters per chunk) and make multiple streaming requests. The chunking example above shows how to split text cleanly without breaking mid-sentence.

What audio formats does Realtime TTS support?

MP3 (default), LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, and FLAC. Sample rates from 8kHz to 48kHz. Set these via the audioConfig object in your request. Use LINEAR16 at 24kHz for realtime playback with PyAudio.

Do I need an SDK to use Realtime TTS in Python?

No. The API is a standard REST endpoint. The Python requests library (included in virtually every Python environment) is all you need. No proprietary SDK, no version conflicts. This makes Realtime TTS straightforward to integrate into any Python project, framework, or deployment environment.

How to Add Text-to-Speech to a Python App with Inworld AI