Published 04.02.2026

Migrate from ElevenLabs to Inworld TTS: Complete Developer Guide

Last updated: April 5, 2026
Moving a production TTS integration from ElevenLabs to Inworld is mostly mechanical: swap the endpoint, adjust auth, and handle the response format difference. Most teams finish in under an hour.

Why migrate

Inworld TTS-1.5 Max ranks #1 on the Artificial Analysis Speech Arena (ELO 1,236), and the model family holds 3 of the top 5 positions.
The quality gap is significant, but the architectural differences matter more:
  • Unified voice pipeline. Inworld includes STT, LLM Router (200+ models from any provider), TTS, and Realtime API under a single API key. ElevenLabs offers TTS, STT (Scribe), and Conversational AI, but locks you to their models.
  • Model-agnostic architecture. Inworld's Router lets you choose from OpenAI, Anthropic, Google, Groq, Fireworks, and others, and swap underlying LLMs without changing your application code. Your voice pipeline is not locked to one model provider.
  • On-premise deployment. Full on-premise deployment with zero latency penalty. ElevenLabs only offers private VPC on AWS Marketplace/SageMaker.
  • Voice cloning included. Instant cloning from 5-15 seconds of audio. Professional cloning (30+ minutes of audio) available via sales.

API differences at a glance

Three structural differences between the APIs:
  • Authentication: the xi-api-key header becomes a standard Authorization: Basic header.
  • Voice selection: the voice ID moves from the URL path into the voiceId field of the request body.
  • Response format: raw binary audio becomes JSON with a base64-encoded audioContent field.

The biggest change in your code will be the response handling. ElevenLabs returns raw binary audio that you write directly to a file or pipe to a player. Inworld returns JSON, so you parse the response and base64-decode the audioContent field.

Step 1: Standard TTS (non-streaming)

This is the most common integration pattern. Send text, get audio back.

curl

Before (ElevenLabs):
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM" \
  -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test of text to speech.",
    "model_id": "eleven_flash_v2_5",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  }' \
  --output speech.mp3
After (Inworld):
curl -X POST "https://api.inworld.ai/tts/v1/voice" \
  -H "Authorization: Basic YOUR_INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voiceId": "Sarah",
    "modelId": "inworld-tts-1.5-max",
    "text": "Hello, this is a test of text to speech."
  }' \
  | python3 -c "import sys,json,base64; d=json.load(sys.stdin); open('speech.mp3','wb').write(base64.b64decode(d['audioContent']))"
Key differences in the curl example:
  • The ElevenLabs voice ID (21m00Tcm4TlvDq8ikWAM) is in the URL path. Inworld puts voiceId in the request body.
  • ElevenLabs uses --output speech.mp3 because the response is raw bytes. Inworld returns JSON, so you need to parse and decode the audioContent field. The pipe to python3 handles that.
  • Auth switches from xi-api-key header to standard Authorization: Basic.

Python

Before (ElevenLabs):
import requests

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello, this is a test of text to speech.",
        "model_id": "eleven_flash_v2_5",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
        },
    },
)

# ElevenLabs returns raw audio bytes
with open("speech.mp3", "wb") as f:
    f.write(response.content)
After (Inworld):
import requests
import base64

# pip install requests
INWORLD_API_KEY = "YOUR_INWORLD_API_KEY"  # From https://platform.inworld.ai

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Hello, this is a test of text to speech.",
    },
    timeout=30,
)
response.raise_for_status()

# Inworld returns JSON with base64-encoded audio
data = response.json()
audio_bytes = base64.b64decode(data["audioContent"])

with open("speech.mp3", "wb") as f:
    f.write(audio_bytes)
The critical difference: ElevenLabs gives you response.content (raw bytes), while Inworld gives you response.json() with an audioContent string that you pass to base64.b64decode(). Everything else is a straightforward find-and-replace.
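If you want to confine the migration to one place, you can hide the JSON/base64 step behind a small wrapper that keeps returning raw bytes to the rest of your application, just as the ElevenLabs response did. A minimal sketch, using the endpoint and field names shown above (the function names synthesize and decode_tts_response are illustrative, not part of either API):

```python
import base64

import requests


def decode_tts_response(payload):
    """Extract raw audio bytes from an Inworld JSON response body."""
    return base64.b64decode(payload["audioContent"])


def synthesize(text, api_key, voice_id="Sarah", model_id="inworld-tts-1.5-max"):
    """Return raw MP3 bytes for `text`, hiding the JSON/base64 step.

    Callers keep receiving bytes, exactly as they did with
    ElevenLabs' raw-binary responses.
    """
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        json={"voiceId": voice_id, "modelId": model_id, "text": text},
        timeout=30,
    )
    response.raise_for_status()
    return decode_tts_response(response.json())
```

With this in place, call sites that used to write response.content to disk can call synthesize() unchanged in behavior.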

JavaScript (Node.js)

Before (ElevenLabs):
const fs = require("fs");

const ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY";
const VOICE_ID = "21m00Tcm4TlvDq8ikWAM";

async function synthesize() {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: "Hello, this is a test of text to speech.",
        model_id: "eleven_flash_v2_5",
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
      }),
    }
  );

  // ElevenLabs returns raw audio bytes
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync("speech.mp3", buffer);
}

synthesize();
After (Inworld):
const fs = require("fs");

const INWORLD_API_KEY = "YOUR_INWORLD_API_KEY";

async function synthesize() {
  const response = await fetch(
    "https://api.inworld.ai/tts/v1/voice",
    {
      method: "POST",
      headers: {
        "Authorization": `Basic ${INWORLD_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        voiceId: "Sarah",
        modelId: "inworld-tts-1.5-max",
        text: "Hello, this is a test of text to speech.",
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }

  // Inworld returns JSON with base64-encoded audio
  const data = await response.json();
  const audioBuffer = Buffer.from(data.audioContent, "base64");
  fs.writeFileSync("speech.mp3", audioBuffer);
}

synthesize();
Same pattern as Python: response.arrayBuffer() (ElevenLabs raw bytes) becomes response.json() followed by Buffer.from(data.audioContent, "base64").

Step 2: Streaming TTS

For realtime applications where you need to start playing audio before the full response is generated. The streaming APIs differ more significantly between the two platforms.

curl

Before (ElevenLabs):
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM/stream" \
  -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a streaming test.",
    "model_id": "eleven_flash_v2_5"
  }' \
  --output streamed.mp3
After (Inworld):
curl -X POST "https://api.inworld.ai/tts/v1/voice:stream" \
  -H "Authorization: Basic YOUR_INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voiceId": "Sarah",
    "modelId": "inworld-tts-1.5-max",
    "text": "Hello, this is a streaming test."
  }'
ElevenLabs streams raw binary chunks that you write directly to a file. Inworld streams NDJSON (newline-delimited JSON), where each line looks like:
{"result": {"audioContent": "base64encodedaudiodata..."}}
You parse each line as JSON and base64-decode the audioContent field to get the audio bytes for that chunk.

Python

Before (ElevenLabs):
import requests

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
    headers={
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello, this is a streaming test.",
        "model_id": "eleven_flash_v2_5",
    },
    stream=True,
)

# ElevenLabs streams raw audio bytes via chunked transfer
with open("streamed.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
After (Inworld):
import requests
import base64
import json

INWORLD_API_KEY = "YOUR_INWORLD_API_KEY"  # From https://platform.inworld.ai

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-max",
        "text": "Hello, this is a streaming test.",
    },
    stream=True,
    timeout=30,
)
response.raise_for_status()

# Inworld streams JSON lines: each line is {"result": {"audioContent": "base64..."}}
audio_chunks = []
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        audio_chunks.append(chunk)

with open("streamed.mp3", "wb") as f:
    for chunk in audio_chunks:
        f.write(chunk)
With ElevenLabs, iter_content gives you raw audio bytes. With Inworld, iter_lines gives you JSON strings that you parse and decode. In production, you would feed each decoded chunk to your audio player as it arrives rather than collecting them all first.
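That incremental style fits naturally as a generator that yields decoded audio as each NDJSON line arrives, instead of collecting chunks into a list first. A sketch under the same response format as above (decode_stream_lines and the player object are illustrative names, not part of the Inworld SDK):

```python
import base64
import json


def decode_stream_lines(lines):
    """Yield raw audio bytes from an iterable of Inworld NDJSON lines.

    Each non-empty line is expected to be a JSON object of the form
    {"result": {"audioContent": "<base64>"}}, as in the streaming
    example above. Empty keep-alive lines are skipped.
    """
    for line in lines:
        if line:
            data = json.loads(line)
            yield base64.b64decode(data["result"]["audioContent"])


# Usage with the streaming response from the example above:
# for chunk in decode_stream_lines(response.iter_lines()):
#     player.feed(chunk)  # hypothetical audio player
```

Because the generator yields as soon as a line is complete, playback can begin while the rest of the stream is still in flight.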

JavaScript (Node.js)

Before (ElevenLabs):
const fs = require("fs");

const ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY";
const VOICE_ID = "21m00Tcm4TlvDq8ikWAM";

async function streamSynthesize() {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: "Hello, this is a streaming test.",
        model_id: "eleven_flash_v2_5",
      }),
    }
  );

  // ElevenLabs streams raw audio bytes
  const chunks = [];
  const reader = response.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }

  fs.writeFileSync("streamed.mp3", Buffer.concat(chunks));
}

streamSynthesize();
After (Inworld):
const fs = require("fs");

const INWORLD_API_KEY = "YOUR_INWORLD_API_KEY";

async function streamSynthesize() {
  const response = await fetch(
    "https://api.inworld.ai/tts/v1/voice:stream",
    {
      method: "POST",
      headers: {
        "Authorization": `Basic ${INWORLD_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        voiceId: "Sarah",
        modelId: "inworld-tts-1.5-max",
        text: "Hello, this is a streaming test.",
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }

  // Inworld streams NDJSON lines
  const decoder = new TextDecoder();
  const reader = response.body.getReader();
  const audioChunks = [];
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop();
    for (const line of lines) {
      if (line.trim()) {
        const data = JSON.parse(line);
        const chunk = Buffer.from(
          data.result.audioContent, "base64"
        );
        audioChunks.push(chunk);
      }
    }
  }

  fs.writeFileSync("streamed.mp3", Buffer.concat(audioChunks));
}

streamSynthesize();
The Inworld streaming code is slightly more involved because you need to handle NDJSON line splitting on top of the ReadableStream API. The buffer variable accumulates partial lines across chunks to handle the case where a JSON line is split across two read() calls.
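The partial-line buffering logic is the same in any language, and it is easiest to get right as a small pure function you can unit-test in isolation. Here is the pattern in Python (split_ndjson is an illustrative name; Python's requests handles this for you via iter_lines, so this is only needed if you are reading raw byte chunks yourself):

```python
def split_ndjson(buffer, chunk):
    """Append `chunk` to `buffer` and split out complete NDJSON lines.

    Returns (complete_lines, remaining_buffer). Complete lines end with
    a newline; a trailing partial line is carried over so it can be
    completed by the next chunk, mirroring the `buffer` variable in the
    JavaScript example above.
    """
    buffer += chunk
    *lines, buffer = buffer.split("\n")
    return [line for line in lines if line.strip()], buffer
```

Feeding it a JSON line split across two chunks reassembles the line correctly, which is exactly the failure mode the JavaScript buffer guards against.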

Step 3: Update your voice configuration

ElevenLabs and Inworld use different voice ID systems. You cannot carry over ElevenLabs voice IDs directly.
Pre-built voices: Browse the Inworld voice catalog to find voices that match your current selection. The catalog includes pre-built voices across languages and styles.
Custom/cloned voices: If you are using ElevenLabs voice cloning, you can recreate your custom voices on Inworld. Inworld cloning requires 5-15 seconds of reference audio (compared to 30 seconds to 5 minutes on ElevenLabs). Upload your source audio through the Inworld platform to generate new voice IDs. Important: Use your original audio recordings for cloning, not audio generated by ElevenLabs. AI-generated output from one provider does not produce good clones on another because the synthesis artifacts of the source model interfere with the target model's voice extraction.
Model selection: Replace your ElevenLabs model ID with an Inworld model ID. For example, eleven_flash_v2_5 in the snippets above becomes inworld-tts-1.5-max.
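During the cutover it helps to keep an explicit mapping from old ElevenLabs IDs to their Inworld replacements, so every call site reads from one table you can audit. A sketch using the IDs that appear in this guide (the mapping values are placeholders you fill in after auditioning voices; map_voice is an illustrative helper, not an SDK function):

```python
# Map each ElevenLabs voice ID to the Inworld voice selected for it.
# Right-hand values are placeholders; fill in after testing each voice.
VOICE_MAP = {
    "21m00Tcm4TlvDq8ikWAM": "Sarah",  # example IDs from this guide
}

# Same idea for model IDs.
MODEL_MAP = {
    "eleven_flash_v2_5": "inworld-tts-1.5-max",
}


def map_voice(elevenlabs_voice_id):
    """Look up the Inworld replacement, failing loudly on unmapped IDs."""
    try:
        return VOICE_MAP[elevenlabs_voice_id]
    except KeyError:
        raise KeyError(
            f"no Inworld voice mapped for {elevenlabs_voice_id!r}"
        )
```

Failing loudly on an unmapped ID is deliberate: a silent fallback voice in production is much harder to notice than a raised error in staging.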

Migration checklist

Use this checklist to track your migration:
  1. Get your API key. Sign up at platform.inworld.ai and generate an API key.
  2. Select voices. Map your existing ElevenLabs voice IDs to Inworld voices. Test each one to confirm the voice characteristics meet your requirements.
  3. Update auth. Replace xi-api-key: KEY headers with Authorization: Basic KEY.
  4. Update endpoints. Standard: https://api.inworld.ai/tts/v1/voice. Streaming: https://api.inworld.ai/tts/v1/voice:stream.
  5. Update request bodies. Move voice ID from the URL to voiceId in the body. Rename model_id to modelId. Remove voice_settings (Inworld handles this per-voice).
  6. Update response handling. Parse JSON and base64-decode audioContent instead of reading raw bytes. For streaming, parse NDJSON lines instead of reading binary chunks.
  7. Test in staging. Run your test suite against the new endpoints. Both Inworld and ElevenLabs default to MP3, but verify your audio pipeline handles the base64 JSON response format.
  8. Deploy. Swap your production configuration.
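For step 7, a staging smoke test can be as small as one request plus a shape check on the response. A sketch, reusing the voice and model IDs from this guide (check_tts_payload, smoke_test, and the min_bytes threshold are illustrative; adjust to your configuration):

```python
import base64
import os

import requests


def check_tts_payload(payload, min_bytes=1000):
    """Return decoded audio if the JSON payload looks like valid TTS output.

    A response that decodes but is only a few bytes long usually means
    something went wrong upstream, so treat it as a failure too.
    """
    audio = base64.b64decode(payload["audioContent"])
    if len(audio) < min_bytes:
        raise ValueError(f"audio suspiciously small: {len(audio)} bytes")
    return audio


def smoke_test():
    """Hit the standard endpoint once and validate the response shape."""
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {os.environ['INWORLD_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "voiceId": "Sarah",
            "modelId": "inworld-tts-1.5-max",
            "text": "Smoke test.",
        },
        timeout=30,
    )
    response.raise_for_status()
    return check_tts_payload(response.json())
```

Running this once per voice in your mapping catches auth, endpoint, and decoding mistakes before any production traffic sees them.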

Beyond TTS: the full pipeline

Once you are on Inworld for TTS, you have access to the rest of the platform without additional vendor integrations:
  • Speech-to-Text with multiple provider options (Inworld STT-1, Groq Whisper, AssemblyAI) through a single API. See pricing for current rates.
  • LLM Router with 200+ models. Route requests across providers, run experiments, and switch models without code changes. Free during research preview.
  • Realtime API for bidirectional voice. Audio in, audio out, over a single WebSocket or WebRTC connection. Handles turn-taking, interruption, and voice output so you do not need to orchestrate STT + LLM + TTS yourself.
If you are currently using ElevenLabs for TTS and a separate vendor for STT and LLM, migrating to Inworld consolidates your entire voice pipeline into one platform, one auth system, and one bill.

Getting help

Copyright © 2021-2026 Inworld AI