
How to Build an STT-LLM-TTS Voice Pipeline (Python, 100 lines)

By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
A complete real-time voice pipeline (speech in → language reasoning → speech out) now fits in under 100 lines of Python. Inworld AI's Realtime API collapses the three components into a single WebSocket connection: Realtime STT for transcription, the Realtime Router for LLM selection across hundreds of models, and Realtime TTS (#1 on the Artificial Analysis Speech Arena, with three of the top five entries) for synthesis.
This tutorial walks through three implementations: the simplest possible synchronous three-step pipeline, a streaming voice loop, and the production-ready architecture used in voice agents and AI companions running at scale.

The Architecture

[microphone] --> Realtime STT --> Realtime Router --> Realtime TTS --> [speaker]
                          \                            /
                           \---- one WebSocket -------/
Three components, one WebSocket. The Realtime API surfaces them as a single OpenAI-compatible event stream, so you can ship a working voice agent in an afternoon.

Implementation 1: Sync Three-Step Pipeline (50 lines)

The simplest version: record audio, transcribe it, generate a response, synthesize speech. Good for batch use cases like voicemail summarization or async voice replies.
import requests
import base64

API = "https://api.inworld.ai"
AUTH = {"Authorization": "Basic <your-api-key>"}

def stt(audio_path: str) -> str:
    """Transcribe a local audio file with Realtime STT."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    r = requests.post(
        f"{API}/stt/v1/transcribe",
        headers=AUTH,
        json={
            "transcribeConfig": {
                "modelId": "groq/whisper-large-v3",
                "audioEncoding": "AUTO_DETECT",
                "language": "en-US"
            },
            "audioData": {"content": audio_b64}
        }
    )
    return r.json()["transcription"]["transcript"]

def llm(prompt: str) -> str:
    """Generate a text reply via the Realtime Router (OpenAI-compatible)."""
    r = requests.post(
        f"{API}/v1/chat/completions",
        headers=AUTH,
        json={
            "model": "gpt-5.5",
            "messages": [{"role": "user", "content": prompt}]
        }
    )
    return r.json()["choices"][0]["message"]["content"]

def tts(text: str, out_path: str) -> None:
    """Synthesize speech with Realtime TTS and write it to an MP3 file."""
    r = requests.post(
        f"{API}/tts/v1/voice",
        headers=AUTH,
        json={
            "text": text,
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-max",
            "audioConfig": {
                "audioEncoding": "MP3",
                "sampleRateHertz": 24000
            }
        }
    )
    audio = base64.b64decode(r.json()["audioContent"])
    with open(out_path, "wb") as f:
        f.write(audio)

# Run the full pipeline
transcript = stt("user_audio.wav")
print(f"User: {transcript}")
reply = llm(transcript)
print(f"Assistant: {reply}")
tts(reply, "assistant_reply.mp3")
That is the entire sync pipeline. About 40 lines of real code.
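The example assumes user_audio.wav already exists on disk. One way to capture it, sketched here with the sounddevice library (an assumption; any recorder that produces a WAV file works):
import wave
import sounddevice as sd

def record_wav(path: str, seconds: int = 5, rate: int = 16000) -> None:
    # Record mono 16-bit PCM from the default microphone.
    frames = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="int16")
    sd.wait()  # block until recording finishes
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(frames.tobytes())

record_wav("user_audio.wav")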

Implementation 2: Streaming Loop (~100 lines)

For real-time conversation, switch the LLM and TTS calls to streaming. The LLM streams tokens; the TTS picks them up and starts synthesizing audio while the LLM is still generating.
import requests
import base64
import json

API = "https://api.inworld.ai"
AUTH = {"Authorization": "Basic <your-api-key>"}

def stream_llm(prompt: str):
    """Yield tokens from the Realtime Router as they arrive."""
    r = requests.post(
        f"{API}/v1/chat/completions",
        headers=AUTH,
        json={
            "model": "gpt-5.5",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True
        },
        stream=True
    )
    for line in r.iter_lines():
        if not line:
            continue
        if line.startswith(b"data: "):
            data = line[6:]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0]["delta"].get("content", "")
            if delta:
                yield delta

def stream_tts(text: str):
    """Yield PCM audio chunks from the Realtime TTS streaming endpoint."""
    with requests.post(
        f"{API}/tts/v1/voice:stream",
        headers=AUTH,
        json={
            "text": text,
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-mini",
            "audioConfig": {
                "audioEncoding": "PCM",
                "sampleRateHertz": 24000
            }
        },
        stream=True
    ) as r:
        for line in r.iter_lines():
            if not line:
                continue
            audio = base64.b64decode(
                json.loads(line)["result"]["audioContent"]
            )
            yield audio

# In a real loop, you would buffer LLM tokens by sentence boundary
# and start TTS as soon as the first sentence is ready.
def voice_reply(user_text: str, audio_callback):
    sentence_buffer = ""
    for token in stream_llm(user_text):
        sentence_buffer += token
        # Naive sentence split; replace with proper segmenter in production.
        if any(p in token for p in ".!?"):
            for chunk in stream_tts(sentence_buffer.strip()):
                audio_callback(chunk)
            sentence_buffer = ""
    if sentence_buffer.strip():
        for chunk in stream_tts(sentence_buffer.strip()):
            audio_callback(chunk)

# audio_callback writes PCM bytes to your audio output
# (PortAudio, sounddevice, WebSocket to browser, etc.)
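One possible audio_callback for local playback, sketched with the sounddevice library (an assumption; any sink that accepts raw PCM16 at 24 kHz works):
import sounddevice as sd

out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16")
out.start()

def audio_callback(chunk: bytes) -> None:
    out.write(chunk)  # chunk is raw PCM16 from stream_tts

voice_reply("Give me a one-sentence weather report.", audio_callback)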
This pattern delivers the first audible word within ~200ms of the user finishing speaking, which is the threshold for a conversation to feel natural.

Implementation 3: Production-Ready Realtime API (~80 lines)

For real production voice agents, the Realtime API replaces all of the above with a single WebSocket. You stream microphone audio in and receive synthesized audio out, while STT, the Router, and TTS run in concert behind one connection. The components reinforce each other in ways they cannot when assembled separately: STT acoustic signals (emotion, hesitation, speaker profile) feed the Router, so model choice adapts to who is speaking.
import asyncio
import json
import base64
import websockets

URL = (
    "wss://api.inworld.ai/api/v1/realtime/session"
    "?key=<session-id>&protocol=realtime"
)
AUTH_HEADERS = {"Authorization": "Basic <your-api-key>"}

async def voice_session(mic_stream, speaker_callback):
    # Note: newer releases of the websockets library use additional_headers
    # in place of extra_headers; adjust to match your installed version.
    async with websockets.connect(URL, extra_headers=AUTH_HEADERS) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "gpt-5.5",
                "instructions": "You are a helpful voice assistant. Be concise.",
                "output_modalities": ["audio", "text"],
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "auto"
                        }
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "speed": 1.0
                    }
                }
            }
        }))

        async def send_mic():
            async for chunk in mic_stream:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode()
                }))

        async def receive_audio():
            async for raw in ws:
                event = json.loads(raw)
                if event["type"] == "response.audio.delta":
                    audio_bytes = base64.b64decode(event["delta"])
                    speaker_callback(audio_bytes)
                elif event["type"] == "response.done":
                    pass  # turn complete; ready for next input

        await asyncio.gather(send_mic(), receive_audio())

# mic_stream is an async generator yielding PCM16 24kHz chunks (e.g., 100ms each).
# speaker_callback writes PCM16 bytes to your audio output.
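A sketch of mic_stream built on sounddevice (again an assumption; any source of PCM16 24 kHz mono chunks works):
import asyncio
import sounddevice as sd

async def mic_stream(chunk_ms: int = 100):
    """Async generator yielding PCM16 24 kHz mono chunks from the default mic."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def on_audio(indata, frames, time_info, status):
        # Called on the PortAudio thread; hand each chunk to the event loop.
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    blocksize = 24000 * chunk_ms // 1000  # frames per chunk
    with sd.RawInputStream(samplerate=24000, channels=1, dtype="int16",
                           blocksize=blocksize, callback=on_audio):
        while True:
            yield await queue.get()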
The Realtime API uses the OpenAI Realtime event format, extended with additional Inworld features. Use WebSocket on the server side (Basic auth) and WebRTC in the browser (with a JWT minted by your backend).

What the Realtime API Adds Over the Sync Pipeline

  • Voice activity detection. The server detects when the user finishes speaking using semantic VAD, not just silence. No client-side push-to-talk required.
  • Interruption handling. When the user starts talking while the assistant is speaking, the Realtime API stops audio output and starts processing the new input within milliseconds (a client-side sketch follows this list).
  • Pipelined latency. STT, LLM, and TTS overlap. The first audible word reaches the user within ~200ms of the user finishing.
  • Voice-aware routing. STT acoustic signals (emotion, hesitation, speaker profile) are passed to the Router for model selection.
  • Tool calling inside the audio loop. Function calls happen without dropping the audio stream.
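On the client side, the main job during an interruption is to stop playing stale audio. A minimal sketch, assuming the server emits an OpenAI-style input_audio_buffer.speech_started event when barge-in is detected (the exact event name is an assumption here):
import base64
from collections import deque

playback_buffer = deque()  # chunks waiting to be written to the speaker

def handle_event(event: dict) -> None:
    if event["type"] == "input_audio_buffer.speech_started":
        # The user started talking over the assistant: drop queued audio
        # so the old reply stops almost immediately.
        playback_buffer.clear()
    elif event["type"] == "response.audio.delta":
        playback_buffer.append(base64.b64decode(event["delta"]))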

FAQ

What is the simplest way to build a voice agent in Python?

Three sequential REST calls: POST /stt/v1/transcribe (audio in, transcript out), POST /v1/chat/completions (transcript in, reply out), POST /tts/v1/voice (reply in, audio out). About 40 lines of Python. Good for batch use cases. For real-time conversation, switch to streaming versions of each call or use the Realtime API.

How do I get sub-1-second response time?

Use the Realtime API WebSocket. STT, LLM, and TTS overlap inside one connection. First audible word reaches the user within ~200ms of the user finishing. Total round-trip stays under 800ms.

What audio format does the Realtime API expect?

PCM16 at 24 kHz mono, base64-encoded inside JSON events. Recommended chunk size is 100-200ms. For telephony use cases (8 kHz MULAW), the API also accepts 8 kHz formats; configure them in the session.update event.
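At PCM16 mono and 24 kHz, a 100 ms chunk works out to 24,000 samples/s × 2 bytes × 0.1 s = 4,800 bytes of raw audio per event, before base64 encoding.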

Can I run the LLM on a different provider?

Yes. The Realtime Router routes to hundreds of models from OpenAI, Anthropic, Google, Mistral, Meta, DeepSeek, xAI, and others. Specify the model in the session.update event (gpt-5.5, claude-opus-4-7, gemini-3.1-pro, llama-4-maverick, etc.) or use auto for dynamic selection.
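For example, inside voice_session from Implementation 3, switching providers or enabling dynamic selection is a one-line change to the session.update message:
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "model": "auto",  # or "claude-opus-4-7", "gemini-3.1-pro", ...
        # ...instructions and audio config unchanged from Implementation 3...
    }
}))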

Do I need to handle interruptions and turn-taking myself?

Not with the Realtime API. Semantic voice activity detection, turn-taking, and interruption handling are managed server-side. With the sync pipeline (Implementation 1), you handle silence detection on the client. With the streaming loop (Implementation 2), you handle turn boundaries in your own code.