What stays the same between OpenAI Realtime and Inworld Realtime?

The event schema, the turn lifecycle, and the streaming format. Both APIs use session.update, input_audio_buffer.append, input_audio_buffer.commit, response.create, response.output_audio.delta, response.done, and conversation.item.create with identical semantics. Audio is base64-encoded PCM16 at 24kHz in both directions. server_vad and semantic_vad turn detection work the same way from the client's perspective. Tool calling uses the same response.function_call_arguments.delta and response.function_call_arguments.done events.

What changes when migrating to Inworld Realtime?

Four things. First, the endpoint URL becomes wss://api.inworld.ai/api/v1/realtime/session. Second, auth is HTTP Basic with a base64-encoded key colon secret pair, not Bearer. Third, model and voice live inside session.audio.output (audio.output.model and audio.output.voice) rather than at the session root. Fourth, server_vad on Inworld is Inworld's own Silero VAD plus a Smart Turn detector, not OpenAI's default VAD, so endpointing behavior differs slightly. The Realtime API also accepts image content parts as of May 2026.

Can I keep using the OpenAI SDK after migrating?

Yes for the Router. The Inworld Realtime Router exposes an OpenAI Chat Completions compatible endpoint at https://api.inworld.ai/v1, so the official OpenAI Python and JavaScript SDKs work by changing base_url. For the Realtime API itself, both OpenAI and Inworld use a raw WebSocket protocol that is identical at the event-schema level. Most client libraries built against OpenAI's Realtime API will work against Inworld with the four field changes listed above.

Does Inworld Realtime API support voice cloning?

Yes. Voice cloning is a separate two-step API. POST your reference audio to /voices/v1/voices:clone to receive a voiceId, then pass that voiceId on the Realtime session.update as audio.output.voice. Use original human-recorded audio for cloning, not audio generated by another TTS provider. Generation-on-generation cloning compounds synthesis artifacts.

Migrate from OpenAI Realtime API to Inworld Realtime API

Q: How do I migrate from OpenAI Realtime API to Inworld Realtime API?

Inworld AI implements the OpenAI Realtime protocol over WebSocket, so the migration is mostly a base URL swap. Change wss://api.openai.com/v1/realtime to wss://api.inworld.ai/api/v1/realtime/session, replace the Bearer auth with Authorization Basic, then update the session.update payload to use audio.output.voice and audio.output.model (Inworld's WebSocket fields) instead of OpenAI's top-level voice and model fields. Existing event handlers for input_audio_buffer.append, response.create, and response.done continue to work.

Q: Why migrate from OpenAI Realtime to Inworld?

Three reasons. Model choice: the Inworld Realtime API runs on the Inworld Realtime Router, which routes to 220+ models from OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, Meta, Groq, and Inworld-optimized open-source models. You are not locked to GPT for reasoning. Voice quality: Realtime TTS-2 (research preview) is engineered for realtime streaming, with sub-200ms time-to-first-audio and expressive steering. Voice features: cross-lingual voice identity across 100+ languages and natural-language emotion steering on TTS-2.

Last updated: May 28, 2026

Inworld AI implements the OpenAI Realtime protocol on the Inworld Realtime API, so most existing OpenAI voice-agent clients move over by swapping the WebSocket URL, the auth header, and a handful of field names inside session.update. The structural events (input_audio_buffer.append, response.create, response.output_audio.delta, response.done) keep the same semantics. The differences live in three places: where you put the model and voice in session.update, how server_vad actually behaves, and what the Router unlocks once you are off gpt-realtime.

Why migrate from OpenAI Realtime to Inworld Realtime

OpenAI Realtime (gpt-realtime) defined the WebSocket event schema the rest of the industry now builds against. That ubiquity is its strength. The constraint is that the model, the voice, and the STT are all OpenAI. You cannot drop in Claude as the reasoning engine, you cannot point the voice output at an independent, purpose-built realtime TTS, and you cannot run a workload on optimized open-source models when you want to.

The Inworld Realtime API addresses the lock-in without breaking the protocol contract:

Model choice over the Inworld Realtime Router. The Realtime API runs on top of the Realtime Router, which routes to 220+ models from OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, Meta, Groq, and DeepInfra (including deepinfra/openai/gpt-oss-120b on the 3P track), plus Realtime Inference: Inworld-optimized open-source models on the 1P track (Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2). Production companion apps swap the reasoning model without changing client code.
#1 realtime TTS for the voice output. Realtime TTS-2 (research preview) is the #1 realtime TTS, engineered for streaming with sub-200ms time-to-first-audio and expressive steering. The OpenAI Realtime voices are bundled with the model and cannot be selected independently.
Cross-lingual voice identity. TTS-2 preserves a single voice identity across 100+ languages (15 GA plus 90+ experimental), with natural-language emotion steering through bracketed tags like [say warmly] at the start of text.
OpenAI SDK still works for the Router. The Inworld Router exposes an OpenAI Chat Completions endpoint at https://api.inworld.ai/v1. You can keep using the OpenAI Python and JavaScript SDKs for non-realtime calls by changing base_url.

Trade-offs worth knowing before you cut over. Inworld's realtime inference is currently US-hosted, which can be a blocker for EU-resident workloads with strict data-residency requirements. In an internal customer benchmark (May 2026), end-to-end Realtime API latency landed above ElevenLabs in one specific pipeline. The Realtime API is GA on WebSocket; WebRTC and SIP are early access. None of that breaks the migration path, but it is worth knowing up front.

What stays the same between the two APIs

The event surface is the same. If you have working code against gpt-realtime, the following events keep their schema and semantics on Inworld:

Audio is base64-encoded PCM16 at 24 kHz on both sides. Output modalities are configured with the same output_modalities array. Tool calling uses the same response.function_call_arguments.delta and response.function_call_arguments.done events.

If you wrote your own client library against the OpenAI Realtime API, the loop body keeps working. The only changes are at the edges: connection, auth, and the shape of session.update.

What changes in the migration

Four concrete changes. None of them touch the event loop.

The most common gotcha is the model fields. On OpenAI Realtime, gpt-realtime is one bundled audio model so the top-level model is everything. On Inworld, the Realtime API is a cascaded pipeline (STT + LLM + TTS), so the LLM lives at the top of session and the TTS model lives inside session.audio.output. If you forget the second one, you get the default TTS model rather than inworld-tts-2.

How session.update compares side by side

The simplest possible voice session looks like this in each API. Same modality, same audio format, same prompt.

OpenAI Realtime (`gpt-realtime`)

{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "model": "gpt-realtime",
    "output_modalities": ["audio"],
    "audio": {
      "input": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "turn_detection": { "type": "server_vad" }
      },
      "output": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "voice": "marin"
      }
    },
    "instructions": "You are a helpful voice assistant."
  }
}

Inworld Realtime API

{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "model": "openai/gpt-5.5",
    "output_modalities": ["audio"],
    "audio": {
      "input": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "transcription": { "model": "inworld/inworld-stt-1" },
        "turn_detection": { "type": "server_vad" }
      },
      "output": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "voice": "Sarah",
        "model": "inworld-tts-2"
      }
    },
    "instructions": "You are a helpful voice assistant."
  }
}

Two structural differences are visible. First, Inworld names the STT model explicitly inside audio.input.transcription because you can swap it (Inworld STT-1, AssemblyAI streaming, Soniox WebSocket). Second, Inworld puts the TTS model (inworld-tts-2) inside audio.output alongside the voice, while OpenAI bundles both into gpt-realtime at the session root.

session.audio.input.turn_detection.type accepts the same values (server_vad and semantic_vad) on both APIs. The behavior of server_vad is different in practice. On Inworld, server_vad runs Inworld's Silero VAD plus a Smart Turn detector that considers what was said, not only whether the user has gone silent. If you were tuning OpenAI's VAD thresholds for false-positive endpointing, expect to retune on Inworld.

How semantic VAD compares

Semantic VAD is supported on both APIs but the underlying detectors differ. Both let the model decide turn boundaries based on speech content rather than silence, and both expose a configurable eagerness setting (low, medium, high, auto).

The high-eagerness setting on Inworld is well suited to long-session companion apps where users pause mid-thought; Status (Wishroll) runs 90-plus-minute sessions on Inworld with this configuration. Low-eagerness is closer to the OpenAI default and a safer starting point for transactional flows.

How the Python client changes

The Python loop body is unchanged. The diff is the URL, the auth header, and the session.update shape.

Before (OpenAI)

import asyncio
import json
import os
import websockets

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

async def main():
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "realtime",
                "model": "gpt-realtime",
                "output_modalities": ["audio"],
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {"type": "server_vad"},
                    },
                    "output": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "voice": "marin",
                    },
                },
                "instructions": "You are a helpful voice assistant.",
            },
        }))

        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                # event['delta'] is base64-encoded PCM16 audio
                pass
            elif event["type"] == "response.done":
                break

asyncio.run(main())

After (Inworld)

import asyncio
import base64
import json
import os
import uuid
import websockets

# pip install websockets
# Inworld auth is HTTP Basic with key:secret, base64-encoded
KEY = os.environ["INWORLD_API_KEY"]
SECRET = os.environ["INWORLD_API_SECRET"]
BASIC = base64.b64encode(f"{KEY}:{SECRET}".encode()).decode()

SESSION_ID = str(uuid.uuid4())
URL = (
    f"wss://api.inworld.ai/api/v1/realtime/session"
    f"?key={SESSION_ID}&protocol=realtime"
)

async def main():
    headers = {"Authorization": f"Basic {BASIC}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "realtime",
                "model": "openai/gpt-5.5",
                "output_modalities": ["audio"],
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "transcription": {"model": "inworld/inworld-stt-1"},
                        "turn_detection": {"type": "server_vad"},
                    },
                    "output": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "voice": "Sarah",
                        "model": "inworld-tts-2",
                    },
                },
                "instructions": "You are a helpful voice assistant.",
            },
        }))

        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                # event['delta'] is base64-encoded PCM16 audio
                pass
            elif event["type"] == "response.done":
                break

asyncio.run(main())

The Inworld example uses requests-style Basic auth (base64-encoded key:secret) and threads a session-id query parameter on the URL. The inworld-framework-py package exists but its main repo has not seen commits since August 2025, so the raw websockets library plus requests is the recommended Python pattern.

How the JavaScript client changes

Same pattern in Node. The event handlers stay identical; only the connection and session payload change.

Before (OpenAI)

import WebSocket from "ws";

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime";

const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "gpt-realtime",
      output_modalities: ["audio"],
      audio: {
        input: {
          format: { type: "audio/pcm", rate: 24000 },
          turn_detection: { type: "server_vad" },
        },
        output: {
          format: { type: "audio/pcm", rate: 24000 },
          voice: "marin",
        },
      },
      instructions: "You are a helpful voice assistant.",
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.output_audio.delta") {
    // event.delta is base64-encoded PCM16 audio
  } else if (event.type === "response.done") {
    ws.close();
  }
});

After (Inworld)

import { randomUUID } from "crypto";
import WebSocket from "ws";

const KEY = process.env.INWORLD_API_KEY;
const SECRET = process.env.INWORLD_API_SECRET;
const BASIC = Buffer.from(`${KEY}:${SECRET}`).toString("base64");

const sessionId = randomUUID();
const url =
  `wss://api.inworld.ai/api/v1/realtime/session` +
  `?key=${sessionId}&protocol=realtime`;

const ws = new WebSocket(url, {
  headers: { Authorization: `Basic ${BASIC}` },
});

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "openai/gpt-5.5",
      output_modalities: ["audio"],
      audio: {
        input: {
          format: { type: "audio/pcm", rate: 24000 },
          transcription: { model: "inworld/inworld-stt-1" },
          turn_detection: { type: "server_vad" },
        },
        output: {
          format: { type: "audio/pcm", rate: 24000 },
          voice: "Sarah",
          model: "inworld-tts-2",
        },
      },
      instructions: "You are a helpful voice assistant.",
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.output_audio.delta") {
    // event.delta is base64-encoded PCM16 audio
  } else if (event.type === "response.done") {
    ws.close();
  }
});

For browsers, prefer the WebRTC transport. Mint a short-lived JWT server-side with POST /auth/v1/tokens/token:generate (IW1-HMAC-SHA256 signed-request) and pass it as Authorization: Bearer <jwt> from the client. The same event schema applies.

How to migrate cloned voices

If your OpenAI Realtime deployment uses custom voices via an enterprise agreement, you cannot carry the voice IDs over directly. Recreate them through the Inworld voice-cloning API. It is two steps, by design: POST /voices/v1/voices:clone returns a voiceId, then you pass that voiceId to the Realtime session as audio.output.voice. There is no referenceAudio field on the Realtime or TTS endpoints.

Two practical rules for migration:

Use the original human-recorded source audio, not audio generated by OpenAI's TTS or any other provider. Cloning on top of synthesized audio compounds the artifacts of both systems and degrades quality measurably. This applies across every TTS vendor, not just Inworld.
5 to 15 seconds is enough for instant cloning. Professional cloning (30-plus minutes of clean audio) is a Growth-tier add-on delivered as a professional service rather than a self-serve flow.

Browse the pre-built voice catalog via GET /voices/v1/voices (the legacy /tts/v1/voices endpoint is deprecated July 1, 2026).

What the Router unlocks once you are on Inworld

The Realtime API runs on the Inworld Realtime Router under the hood. That is what changes the conversation around model choice.

Swap the LLM without rewriting the client. Set session.model to openai/gpt-5.5, anthropic/claude-sonnet-4-6, google-ai-studio/gemini-3.5-flash, deepseek/deepseek-v4-pro, or deepinfra/openai/gpt-oss-120b. The Realtime event schema does not change.
Run live A/B tests on production traffic. Production companion apps A/B test multiple providers on live traffic without changing client code.
Route on metadata. Pass user, language, country, tier, or intent and let the Router pick the right model variant.
Failover for free. Specify a fallback pool via extra_body.models; the Router records each attempt in metadata.attempts so you can debug.
OpenAI SDK still works for non-realtime calls. Set base_url="https://api.inworld.ai/v1" and the official Python or JavaScript SDK behaves as expected.

Migration checklist

Generate an Inworld API key and secret in the Inworld portal and base64-encode key:secret for the Basic auth header.
Swap the WebSocket URL from wss://api.openai.com/v1/realtime?model=gpt-realtime to wss://api.inworld.ai/api/v1/realtime/session?key=<session-id>&protocol=realtime. Generate session-id as a UUID per session.
Replace Authorization: Bearer ... with Authorization: Basic <base64(key:secret)>. Drop the OpenAI-Beta: realtime=v1 header.
Update session.update:
- Move the LLM into session.model as a routed model ID (e.g. openai/gpt-5.5 or anthropic/claude-sonnet-4-6).
- Add session.audio.input.transcription.model (e.g. inworld/inworld-stt-1).
- Move the TTS model into session.audio.output.model (e.g. inworld-tts-2).
- Replace the OpenAI voice with an Inworld voice or a cloned voiceId.
Retune server_vad thresholds. Inworld's Silero VAD plus Smart Turn behaves differently from OpenAI's default detector.
Recreate any custom cloned voices using original human-recorded source audio via POST /voices/v1/voices:clone.
For browser deployments, switch to WebRTC and mint short-lived JWTs server-side with POST /auth/v1/tokens/token:generate.
Run a parallel deployment, compare turn-latency and barge-in behavior, then cut over.

How does the Inworld Realtime API differ architecturally?

OpenAI's gpt-realtime is a bundled multimodal model. The audio model, the reasoning, the STT, and the TTS are one artifact. Lower theoretical latency, zero component flexibility.

The Inworld Realtime API is a cascaded pipeline (STT + LLM + TTS) co-located in one realtime service. The cost of a cascaded design is one extra cross-stage hop; the benefit is that every stage is independently selectable. A C++ rewrite of the TTS serving path cut realtime latency by 10 to 15% and the Node-to-Go migration of the orchestrator removed another 150 to 200ms.

For workloads where natural-language voice steering, cross-lingual voice identity, and model choice matter, the cascaded design wins. For workloads where you want the simplest possible single-model surface and are comfortable inside the OpenAI ecosystem, gpt-realtime remains the cleanest option.

When should you not migrate yet

We try to be honest about where Inworld is not the right answer today.

EU-resident workloads with strict data-residency requirements. Inworld's realtime inference is currently US-hosted. If your contract demands EU-only processing today, wait until EU inference lands.
Hard guarantee of full-pipeline latency leadership. OpenAI Realtime and ElevenLabs both ran ahead of Inworld in at least one customer benchmark (May 2026). Realtime TTS is engineered for streaming quality; full end-to-end latency is workload-dependent. Run your own benchmark.
SIP-native deployments. SIP is early access on Inworld. If you need a hardened SIP bridge today, OpenAI Realtime's SIP support is older.

Beyond migration: what you get on the full Inworld pipeline

Once your Realtime API integration is on Inworld, the rest of the pipeline is already available under the same API key:

Realtime Router routes to 220+ models from OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, Meta, Groq, and Inworld-optimized open-source models. Same OpenAI-compatible event schema, same auth.
Realtime TTS for non-realtime synthesis at the same quality bar (Realtime TTS-2 and TTS 1.5 Max are purpose-built for streaming with sub-200ms time-to-first-audio and expressive steering).
Realtime STT with Inworld STT-1, Soniox stt-rt-v4 (WebSocket-only, new May 2026), or AssemblyAI streaming, selectable per session.
Voice cloning via the two-step /voices/v1/voices:clone API.
Inworld GitHub samples for working code examples across REST TTS, STT, and the Realtime API.

Getting help

Inworld Realtime API documentation
OpenAI to Inworld Realtime migration guide
Realtime API reference
Inworld Discord for developer support
Talk to an architect for enterprise deployments

Migrate from OpenAI Realtime API to Inworld Realtime API

Why migrate from OpenAI Realtime to Inworld Realtime

What stays the same between the two APIs

What changes in the migration

How session.update compares side by side

OpenAI Realtime (gpt-realtime)

Inworld Realtime API

How semantic VAD compares

How the Python client changes

Before (OpenAI)

After (Inworld)

How the JavaScript client changes

Before (OpenAI)

After (Inworld)

How to migrate cloned voices

What the Router unlocks once you are on Inworld

Migration checklist

How does the Inworld Realtime API differ architecturally?

When should you not migrate yet

Beyond migration: what you get on the full Inworld pipeline

Getting help

OpenAI Realtime (`gpt-realtime`)