Published 04.02.2026

Migrate from ElevenLabs to Inworld TTS: Complete Developer Guide

Last updated: April 5, 2026
Moving a production TTS integration from ElevenLabs to Inworld is mostly mechanical: swap the endpoint, adjust auth, and handle the response format difference. Most teams finish in under an hour.

Why migrate

Inworld TTS-1.5 Max ranks #1 on the Artificial Analysis Speech Arena (ELO 1,236), and the model family holds 3 of the top 5 positions.
The quality gap is significant, but the architectural differences matter more:
  • Unified voice pipeline. Inworld includes STT, LLM Router (200+ models from any provider), TTS, and Realtime API under a single API key. ElevenLabs offers TTS, STT (Scribe), and Conversational AI, but locks you to their models.
  • Model-agnostic architecture. Inworld's Router lets you choose from OpenAI, Anthropic, Google, Groq, Fireworks, and others, and swap underlying LLMs without changing your application code. Your voice pipeline is not locked to one model provider.
  • On-premise deployment. Full on-premise deployment with zero latency penalty. ElevenLabs only offers private VPC on AWS Marketplace/SageMaker.
  • Voice cloning included. Instant cloning from 5-15 seconds of audio. Professional cloning (30+ minutes of audio) available via sales.

API differences at a glance

Three structural differences between the APIs:
  • Authentication: the xi-api-key header becomes a standard Authorization: Basic header.
  • Voice selection: the voice ID moves from the URL path into the voiceId field of the request body.
  • Response format: raw binary audio becomes JSON with a base64-encoded audioContent field.

The biggest change in your code will be the response handling. ElevenLabs returns raw binary audio that you write directly to a file or pipe to a player. Inworld returns JSON, so you parse the response and base64-decode the audioContent field.

Step 1: Standard TTS (non-streaming)

This is the most common integration pattern. Send text, get audio back.

curl

Before (ElevenLabs):
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM" \
  -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test of text to speech.",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  }' \
  --output speech.mp3
After (Inworld):
curl -X POST "https://api.inworld.ai/tts/v1/voice" \
  -H "Authorization: Basic YOUR_INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voiceId": "Sarah",
    "modelId": "inworld-tts-1.5-max",
    "text": "Hello, this is a test of text to speech."
  }' \
  | python3 -c "import sys,json,base64; d=json.load(sys.stdin); open('speech.mp3','wb').write(base64.b64decode(d['audioContent']))"
Key differences in the curl example:
  • The ElevenLabs voice ID (21m00Tcm4TlvDq8ikWAM) is in the URL path. Inworld puts voiceId in the request body.
  • ElevenLabs uses --output speech.mp3 because the response is raw bytes. Inworld returns JSON, so you need to parse and decode the audioContent field. The pipe to python3 handles that.
  • Auth switches from xi-api-key header to standard Authorization: Basic.

Python

Before (ElevenLabs):
import requests

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello, this is a test of text to speech.",
        "model_id": "eleven_flash_v2_5",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
        },
    },
)

# ElevenLabs returns raw audio bytes
with open("speech.mp3", "wb") as f:
    f.write(response.content)
After (Inworld):
import requests
import base64

# pip install requests
INWORLD_API_KEY = "YOUR_INWORLD_API_KEY"  # From https://platform.inworld.ai

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Hello, this is a test of text to speech.",
    },
    timeout=30,
)
response.raise_for_status()

# Inworld returns JSON with base64-encoded audio
data = response.json()
audio_bytes = base64.b64decode(data["audioContent"])

with open("speech.mp3", "wb") as f:
    f.write(audio_bytes)
The critical difference: ElevenLabs gives you response.content (raw bytes), while Inworld gives you response.json() with an audioContent string that you pass to base64.b64decode(). Everything else is a straightforward find-and-replace.
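If you want to confine the migration to one place, you can hide the JSON/base64 step behind a small wrapper that keeps returning raw bytes to the rest of your application, just as the ElevenLabs response did. A minimal sketch, using the endpoint and field names shown above (the function names synthesize and decode_tts_response are illustrative, not part of either API):

```python
import base64

import requests


def decode_tts_response(payload):
    """Extract raw audio bytes from an Inworld JSON response body."""
    return base64.b64decode(payload["audioContent"])


def synthesize(text, api_key, voice_id="Sarah", model_id="inworld-tts-1.5-max"):
    """Return raw MP3 bytes for `text`, hiding the JSON/base64 step.

    Callers keep receiving bytes, exactly as they did with
    ElevenLabs' raw-binary responses.
    """
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        json={"voiceId": voice_id, "modelId": model_id, "text": text},
        timeout=30,
    )
    response.raise_for_status()
    return decode_tts_response(response.json())
```

With this in place, call sites that used to write response.content to disk can call synthesize() unchanged in behavior.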

JavaScript (Node.js)

Before (ElevenLabs):
const fs = require("fs");

const ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY";
const VOICE_ID = "21m00Tcm4TlvDq8ikWAM";

async function synthesize() {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: "Hello, this is a test of text to speech.",
        model_id: "eleven_flash_v2_5",
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
      }),
    }
  );

  // ElevenLabs returns raw audio bytes
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync("speech.mp3", buffer);
}

synthesize();
After (Inworld):
const fs = require("fs");

const INWORLD_API_KEY = "YOUR_INWORLD_API_KEY";

async function synthesize() {
  const response = await fetch(
    "https://api.inworld.ai/tts/v1/voice",
    {
      method: "POST",
      headers: {
        "Authorization": `Basic ${INWORLD_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        voiceId: "Sarah",
        modelId: "inworld-tts-1.5-max",
        text: "Hello, this is a test of text to speech.",
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }

  // Inworld returns JSON with base64-encoded audio
  const data = await response.json();
  const audioBuffer = Buffer.from(data.audioContent, "base64");
  fs.writeFileSync("speech.mp3", audioBuffer);
}

synthesize();
Same pattern as Python: response.arrayBuffer() (ElevenLabs raw bytes) becomes response.json() followed by Buffer.from(data.audioContent, "base64").

Step 2: Streaming TTS

For realtime applications where you need to start playing audio before the full response is generated. The streaming APIs differ more significantly between the two platforms.

curl

Before (ElevenLabs):
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM/stream" \
  -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a streaming test.",
    "model_id": "eleven_flash_v2_5"
  }' \
  --output streamed.mp3
After (Inworld):
curl -X POST "https://api.inworld.ai/tts/v1/voice:stream" \
  -H "Authorization: Basic YOUR_INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voiceId": "Sarah",
    "modelId": "inworld-tts-1.5-max",
    "text": "Hello, this is a streaming test."
  }'
ElevenLabs streams raw binary chunks that you write directly to a file. Inworld streams NDJSON (newline-delimited JSON), where each line looks like:
{"result": {"audioContent": "base64encodedaudiodata..."}}
You parse each line as JSON and base64-decode the audioContent field to get the audio bytes for that chunk.

Python

Before (ElevenLabs):
import requests

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
    headers={
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello, this is a streaming test.",
        "model_id": "eleven_flash_v2_5",
    },
    stream=True,
)

# ElevenLabs streams raw audio bytes via chunked transfer
with open("streamed.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
After (Inworld):
import requests
import base64
import json

INWORLD_API_KEY = "YOUR_INWORLD_API_KEY"  # From https://platform.inworld.ai

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Hello, this is a streaming test.",
    },
    stream=True,
    timeout=30,
)
response.raise_for_status()

# Inworld streams JSON lines: each line is {"result": {"audioContent": "base64..."}}
audio_chunks = []
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        audio_chunks.append(chunk)

with open("streamed.mp3", "wb") as f:
    for chunk in audio_chunks:
        f.write(chunk)
With ElevenLabs, iter_content gives you raw audio bytes. With Inworld, iter_lines gives you JSON strings that you parse and decode. In production, you would feed each decoded chunk to your audio player as it arrives rather than collecting them all first.
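That incremental style fits naturally as a generator that yields decoded audio as each NDJSON line arrives, instead of collecting chunks into a list first. A sketch under the same response format as above (decode_stream_lines and the player object are illustrative names, not part of the Inworld SDK):

```python
import base64
import json


def decode_stream_lines(lines):
    """Yield raw audio bytes from an iterable of Inworld NDJSON lines.

    Each non-empty line is expected to be a JSON object of the form
    {"result": {"audioContent": "<base64>"}}, as in the streaming
    example above. Empty keep-alive lines are skipped.
    """
    for line in lines:
        if line:
            data = json.loads(line)
            yield base64.b64decode(data["result"]["audioContent"])


# Usage with the streaming response from the example above:
# for chunk in decode_stream_lines(response.iter_lines()):
#     player.feed(chunk)  # hypothetical audio player
```

Because the generator yields as soon as a line is complete, playback can begin while the rest of the stream is still in flight.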

JavaScript (Node.js)

Before (ElevenLabs):
const fs = require("fs");

const ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY";
const VOICE_ID = "21m00Tcm4TlvDq8ikWAM";

async function streamSynthesize() {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: "Hello, this is a streaming test.",
        model_id: "eleven_flash_v2_5",
      }),
    }
  );

  // ElevenLabs streams raw audio bytes
  const chunks = [];
  const reader = response.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }

  fs.writeFileSync("streamed.mp3", Buffer.concat(chunks));
}

streamSynthesize();
After (Inworld):
const fs = require("fs");

const INWORLD_API_KEY = "YOUR_INWORLD_API_KEY";

async function streamSynthesize() {
  const response = await fetch(
    "https://api.inworld.ai/tts/v1/voice:stream",
    {
      method: "POST",
      headers: {
        "Authorization": `Basic ${INWORLD_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        voiceId: "Sarah",
        modelId: "inworld-tts-1.5-max",
        text: "Hello, this is a streaming test.",
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }

  // Inworld streams NDJSON lines
  const decoder = new TextDecoder();
  const reader = response.body.getReader();
  const audioChunks = [];
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop();
    for (const line of lines) {
      if (line.trim()) {
        const data = JSON.parse(line);
        const chunk = Buffer.from(
          data.result.audioContent, "base64"
        );
        audioChunks.push(chunk);
      }
    }
  }

  fs.writeFileSync("streamed.mp3", Buffer.concat(audioChunks));
}

streamSynthesize();
The Inworld streaming code is slightly more involved because you need to handle NDJSON line splitting on top of the ReadableStream API. The buffer variable accumulates partial lines across chunks to handle the case where a JSON line is split across two read() calls.
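The partial-line buffering logic is the same in any language, and it is easiest to get right as a small pure function you can unit-test in isolation. Here is the pattern in Python (split_ndjson is an illustrative name; Python's requests handles this for you via iter_lines, so this is only needed if you are reading raw byte chunks yourself):

```python
def split_ndjson(buffer, chunk):
    """Append `chunk` to `buffer` and split out complete NDJSON lines.

    Returns (complete_lines, remaining_buffer). Complete lines end with
    a newline; a trailing partial line is carried over so it can be
    completed by the next chunk, mirroring the `buffer` variable in the
    JavaScript example above.
    """
    buffer += chunk
    *lines, buffer = buffer.split("\n")
    return [line for line in lines if line.strip()], buffer
```

Feeding it a JSON line split across two chunks reassembles the line correctly, which is exactly the failure mode the JavaScript buffer guards against.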

Step 3: Update your voice configuration

ElevenLabs and Inworld use different voice ID systems. You cannot carry over ElevenLabs voice IDs directly.
Pre-built voices: Browse the Inworld voice catalog to find voices that match your current selection. The catalog includes pre-built voices across languages and styles.
Custom/cloned voices: If you are using ElevenLabs voice cloning, you can recreate your custom voices on Inworld. Inworld cloning requires 5-15 seconds of reference audio (compared to 30 seconds to 5 minutes on ElevenLabs). Upload your source audio through the Inworld platform to generate new voice IDs. Important: Use your original audio recordings for cloning, not audio generated by ElevenLabs. AI-generated output from one provider does not produce good clones on another because the synthesis artifacts of the source model interfere with the target model's voice extraction.
Model selection: Replace your ElevenLabs model ID with an Inworld model ID. For example, eleven_flash_v2_5 in the snippets above becomes inworld-tts-1.5-max.
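During the cutover it helps to keep an explicit mapping from old ElevenLabs IDs to their Inworld replacements, so every call site reads from one table you can audit. A sketch using the IDs that appear in this guide (the mapping values are placeholders you fill in after auditioning voices; map_voice is an illustrative helper, not an SDK function):

```python
# Map each ElevenLabs voice ID to the Inworld voice selected for it.
# Right-hand values are placeholders; fill in after testing each voice.
VOICE_MAP = {
    "21m00Tcm4TlvDq8ikWAM": "Sarah",  # example IDs from this guide
}

# Same idea for model IDs.
MODEL_MAP = {
    "eleven_flash_v2_5": "inworld-tts-1.5-max",
}


def map_voice(elevenlabs_voice_id):
    """Look up the Inworld replacement, failing loudly on unmapped IDs."""
    try:
        return VOICE_MAP[elevenlabs_voice_id]
    except KeyError:
        raise KeyError(
            f"no Inworld voice mapped for {elevenlabs_voice_id!r}"
        )
```

Failing loudly on an unmapped ID is deliberate: a silent fallback voice in production is much harder to notice than a raised error in staging.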

Migration checklist

Use this checklist to track your migration:
  1. Get your API key. Sign up at platform.inworld.ai and generate an API key.
  2. Select voices. Map your existing ElevenLabs voice IDs to Inworld voices. Test each one to confirm the voice characteristics meet your requirements.
  3. Update auth. Replace xi-api-key: KEY headers with Authorization: Basic KEY.
  4. Update endpoints. Standard: https://api.inworld.ai/tts/v1/voice. Streaming: https://api.inworld.ai/tts/v1/voice:stream.
  5. Update request bodies. Move voice ID from the URL to voiceId in the body. Rename model_id to modelId. Remove voice_settings (Inworld handles this per-voice).
  6. Update response handling. Parse JSON and base64-decode audioContent instead of reading raw bytes. For streaming, parse NDJSON lines instead of reading binary chunks.
  7. Test in staging. Run your test suite against the new endpoints. Both Inworld and ElevenLabs default to MP3, but verify your audio pipeline handles the base64 JSON response format.
  8. Deploy. Swap your production configuration.
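For step 7, a staging smoke test can be as small as one request plus a shape check on the response. A sketch, reusing the voice and model IDs from this guide (check_tts_payload, smoke_test, and the min_bytes threshold are illustrative; adjust to your configuration):

```python
import base64
import os

import requests


def check_tts_payload(payload, min_bytes=1000):
    """Return decoded audio if the JSON payload looks like valid TTS output.

    A response that decodes but is only a few bytes long usually means
    something went wrong upstream, so treat it as a failure too.
    """
    audio = base64.b64decode(payload["audioContent"])
    if len(audio) < min_bytes:
        raise ValueError(f"audio suspiciously small: {len(audio)} bytes")
    return audio


def smoke_test():
    """Hit the standard endpoint once and validate the response shape."""
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {os.environ['INWORLD_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-max",
            "text": "Smoke test.",
        },
        timeout=30,
    )
    response.raise_for_status()
    return check_tts_payload(response.json())
```

Running this once per voice in your mapping catches auth, endpoint, and decoding mistakes before any production traffic sees them.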

Beyond TTS: the full pipeline

Once you are on Inworld for TTS, you have access to the rest of the platform without additional vendor integrations:
  • Speech-to-Text with multiple provider options (Inworld STT-1, Groq Whisper, AssemblyAI) through a single API. See pricing for current rates.
  • LLM Router with 200+ models. Route requests across providers, run experiments, and switch models without code changes. Free during research preview.
  • Realtime API for bidirectional voice. Audio in, audio out, over a single WebSocket or WebRTC connection. Handles turn-taking, interruption, and voice output so you do not need to orchestrate STT + LLM + TTS yourself.
If you are currently using ElevenLabs for TTS and a separate vendor for STT and LLM, migrating to Inworld consolidates your entire voice pipeline into one platform, one auth system, and one bill.

Getting help

Copyright © 2021-2026 Inworld AI