Last updated: May 28, 2026
Inworld AI implements the OpenAI Realtime protocol on the Inworld Realtime API, so most existing OpenAI voice-agent clients move over by swapping the WebSocket URL, the auth header, and a handful of field names inside session.update. The structural events (input_audio_buffer.append, response.create, response.output_audio.delta, response.done) keep the same semantics. The differences live in three places: where you put the model and voice in session.update, how server_vad actually behaves, and what the Router unlocks once you are off gpt-realtime.
Why migrate from OpenAI Realtime to Inworld Realtime
OpenAI Realtime (gpt-realtime) defined the WebSocket event schema the rest of the industry now builds against. That ubiquity is its strength. The constraint is that the model, the voice, and the STT are all OpenAI. You cannot drop in Claude as the reasoning engine, you cannot point the voice output at a top-ranked realtime TTS, and you cannot run a workload on optimized open-source models when you want to.
The Inworld Realtime API addresses the lock-in without breaking the protocol contract:
- Model choice over the Inworld Realtime Router. The Realtime API runs on top of the Realtime Router, which routes to 200+ LLMs from OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, Meta, Groq, and DeepInfra (including
deepinfra/openai/gpt-oss-120b on the 3P track), plus Realtime Inference: Inworld-optimized open-source models on the 1P track (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5). Latitude (the heaviest realtime user on Inworld) beat OpenAI by a point in a 3-way A/B by switching the reasoning model without changing client code.
- #1 realtime TTS for the voice output. Realtime TTS-2 (research preview) is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena. Realtime TTS 1.5 Max also ranks among the top realtime models. The OpenAI Realtime voices are bundled with the model and not independently ranked at that level.
- Cross-lingual voice identity. TTS-2 preserves a single voice identity across 100+ languages (15 GA plus 90+ experimental), with natural-language emotion steering through bracketed tags like
[say warmly] at the start of text.
- OpenAI SDK still works for the Router. The Inworld Router exposes an OpenAI Chat Completions endpoint at
https://api.inworld.ai/v1. You can keep using the OpenAI Python and JavaScript SDKs for non-realtime calls by changing base_url.
Trade-offs worth knowing before you cut over. Inworld's realtime inference is currently US-hosted, which can be a blocker for EU-resident workloads with strict data-residency requirements. In at least one customer benchmark (Microvoz, May 2026), end-to-end Realtime API latency landed above ElevenLabs in that specific pipeline. The Realtime API is GA on WebSocket; WebRTC and SIP are early access. None of that breaks the migration path, but it is worth knowing up front.
What stays the same between the two APIs
The event surface is the same. If you have working code against gpt-realtime, the following events keep their schema and semantics on Inworld:
Audio is base64-encoded PCM16 at 24 kHz on both sides. Output modalities are configured with the same output_modalities array. Tool calling uses the same response.function_call_arguments.delta and response.function_call_arguments.done events.
If you wrote your own client library against the OpenAI Realtime API, the loop body keeps working. The only changes are at the edges: connection, auth, and the shape of session.update.
What changes in the migration
Four concrete changes. None of them touch the event loop.
The most common gotcha is the model fields. On OpenAI Realtime, gpt-realtime is one bundled audio model so the top-level model is everything. On Inworld, the Realtime API is a cascaded pipeline (STT + LLM + TTS), so the LLM lives at the top of session and the TTS model lives inside session.audio.output. If you forget the second one, you get the default TTS model rather than inworld-tts-2.
How session.update compares side by side
The simplest possible voice session looks like this in each API. Same modality, same audio format, same prompt.
OpenAI Realtime (gpt-realtime)
{
"type": "session.update",
"session": {
"type": "realtime",
"model": "gpt-realtime",
"output_modalities": ["audio"],
"audio": {
"input": {
"format": { "type": "audio/pcm", "rate": 24000 },
"turn_detection": { "type": "server_vad" }
},
"output": {
"format": { "type": "audio/pcm", "rate": 24000 },
"voice": "marin"
}
},
"instructions": "You are a helpful voice assistant."
}
}
Inworld Realtime API
{
"type": "session.update",
"session": {
"type": "realtime",
"model": "openai/gpt-5.5",
"output_modalities": ["audio"],
"audio": {
"input": {
"format": { "type": "audio/pcm", "rate": 24000 },
"transcription": { "model": "inworld/inworld-stt-1" },
"turn_detection": { "type": "server_vad" }
},
"output": {
"format": { "type": "audio/pcm", "rate": 24000 },
"voice": "Sarah",
"model": "inworld-tts-2"
}
},
"instructions": "You are a helpful voice assistant."
}
}
Two structural differences are visible. First, Inworld names the STT model explicitly inside audio.input.transcription because you can swap it (Inworld STT-1, AssemblyAI streaming, Soniox WebSocket). Second, Inworld puts the TTS model (inworld-tts-2) inside audio.output alongside the voice, while OpenAI bundles both into gpt-realtime at the session root.
session.audio.input.turn_detection.type accepts the same values (server_vad and semantic_vad) on both APIs. The behavior of server_vad is different in practice. On Inworld, server_vad runs Inworld's Silero VAD plus a Smart Turn detector that considers what was said, not only whether the user has gone silent. If you were tuning OpenAI's VAD thresholds for false-positive endpointing, expect to retune on Inworld.
How semantic VAD compares
Semantic VAD is supported on both APIs but the underlying detectors differ. Both let the model decide turn boundaries based on speech content rather than silence, and both expose a configurable eagerness setting (low, medium, high, auto).
The high-eagerness setting on Inworld is well suited to long-session companion apps where users pause mid-thought; Status (Wishroll) runs 90-plus-minute sessions on Inworld with this configuration. Low-eagerness is closer to the OpenAI default and a safer starting point for transactional flows.
How the Python client changes
The Python loop body is unchanged. The diff is the URL, the auth header, and the session.update shape.
Before (OpenAI)
import asyncio
import json
import os
import websockets
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
async def main():
headers = {
"Authorization": f"Bearer {OPENAI_API_KEY}",
"OpenAI-Beta": "realtime=v1",
}
async with websockets.connect(URL, additional_headers=headers) as ws:
await ws.send(json.dumps({
"type": "session.update",
"session": {
"type": "realtime",
"model": "gpt-realtime",
"output_modalities": ["audio"],
"audio": {
"input": {
"format": {"type": "audio/pcm", "rate": 24000},
"turn_detection": {"type": "server_vad"},
},
"output": {
"format": {"type": "audio/pcm", "rate": 24000},
"voice": "marin",
},
},
"instructions": "You are a helpful voice assistant.",
},
}))
async for raw in ws:
event = json.loads(raw)
if event["type"] == "response.output_audio.delta":
# event['delta'] is base64-encoded PCM16 audio
pass
elif event["type"] == "response.done":
break
asyncio.run(main())
After (Inworld)
import asyncio
import base64
import json
import os
import uuid
import websockets
# pip install websockets
# Inworld auth is HTTP Basic with key:secret, base64-encoded
KEY = os.environ["INWORLD_API_KEY"]
SECRET = os.environ["INWORLD_API_SECRET"]
BASIC = base64.b64encode(f"{KEY}:{SECRET}".encode()).decode()
SESSION_ID = str(uuid.uuid4())
URL = (
f"wss://api.inworld.ai/api/v1/realtime/session"
f"?key={SESSION_ID}&protocol=realtime"
)
async def main():
headers = {"Authorization": f"Basic {BASIC}"}
async with websockets.connect(URL, additional_headers=headers) as ws:
await ws.send(json.dumps({
"type": "session.update",
"session": {
"type": "realtime",
"model": "openai/gpt-5.5",
"output_modalities": ["audio"],
"audio": {
"input": {
"format": {"type": "audio/pcm", "rate": 24000},
"transcription": {"model": "inworld/inworld-stt-1"},
"turn_detection": {"type": "server_vad"},
},
"output": {
"format": {"type": "audio/pcm", "rate": 24000},
"voice": "Sarah",
"model": "inworld-tts-2",
},
},
"instructions": "You are a helpful voice assistant.",
},
}))
async for raw in ws:
event = json.loads(raw)
if event["type"] == "response.output_audio.delta":
# event['delta'] is base64-encoded PCM16 audio
pass
elif event["type"] == "response.done":
break
asyncio.run(main())
The Inworld example uses requests-style Basic auth (base64-encoded key:secret) and threads a session-id query parameter on the URL. The inworld-framework-py package exists but its main repo has not seen commits since August 2025, so the raw websockets library plus requests is the recommended Python pattern.
How the JavaScript client changes
Same pattern in Node. The event handlers stay identical; only the connection and session payload change.
Before (OpenAI)
import WebSocket from "ws";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime";
const ws = new WebSocket(url, {
headers: {
Authorization: `Bearer ${OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1",
},
});
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
type: "realtime",
model: "gpt-realtime",
output_modalities: ["audio"],
audio: {
input: {
format: { type: "audio/pcm", rate: 24000 },
turn_detection: { type: "server_vad" },
},
output: {
format: { type: "audio/pcm", rate: 24000 },
voice: "marin",
},
},
instructions: "You are a helpful voice assistant.",
},
}));
});
ws.on("message", (raw) => {
const event = JSON.parse(raw.toString());
if (event.type === "response.output_audio.delta") {
// event.delta is base64-encoded PCM16 audio
} else if (event.type === "response.done") {
ws.close();
}
});
After (Inworld)
import { randomUUID } from "crypto";
import WebSocket from "ws";
const KEY = process.env.INWORLD_API_KEY;
const SECRET = process.env.INWORLD_API_SECRET;
const BASIC = Buffer.from(`${KEY}:${SECRET}`).toString("base64");
const sessionId = randomUUID();
const url =
`wss://api.inworld.ai/api/v1/realtime/session` +
`?key=${sessionId}&protocol=realtime`;
const ws = new WebSocket(url, {
headers: { Authorization: `Basic ${BASIC}` },
});
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
type: "realtime",
model: "openai/gpt-5.5",
output_modalities: ["audio"],
audio: {
input: {
format: { type: "audio/pcm", rate: 24000 },
transcription: { model: "inworld/inworld-stt-1" },
turn_detection: { type: "server_vad" },
},
output: {
format: { type: "audio/pcm", rate: 24000 },
voice: "Sarah",
model: "inworld-tts-2",
},
},
instructions: "You are a helpful voice assistant.",
},
}));
});
ws.on("message", (raw) => {
const event = JSON.parse(raw.toString());
if (event.type === "response.output_audio.delta") {
// event.delta is base64-encoded PCM16 audio
} else if (event.type === "response.done") {
ws.close();
}
});
For browsers, prefer the WebRTC transport. Mint a short-lived JWT server-side with POST /auth/v1/tokens/token:generate (IW1-HMAC-SHA256 signed-request) and pass it as Authorization: Bearer <jwt> from the client. The same event schema applies.
How to migrate cloned voices
If your OpenAI Realtime deployment uses custom voices via an enterprise agreement, you cannot carry the voice IDs over directly. Recreate them through the Inworld voice-cloning API. It is two steps, by design: POST /voices/v1/voices:clone returns a voiceId, then you pass that voiceId to the Realtime session as audio.output.voice. There is no referenceAudio field on the Realtime or TTS endpoints.
Two practical rules for migration:
- Use the original human-recorded source audio, not audio generated by OpenAI's TTS or any other provider. Cloning on top of synthesized audio compounds the artifacts of both systems and degrades quality measurably. This applies across every TTS vendor, not just Inworld.
- 5 to 15 seconds is enough for instant cloning. Professional cloning (30-plus minutes of clean audio) is a Growth-tier add-on delivered as a professional service rather than a self-serve flow.
Browse the pre-built voice catalog via GET /voices/v1/voices (the legacy /tts/v1/voices endpoint is deprecated July 1, 2026).
What the Router unlocks once you are on Inworld
The Realtime API runs on the Inworld Realtime Router under the hood. That is what changes the conversation around model choice.
- Swap the LLM without rewriting the client. Set
session.model to openai/gpt-5.5, anthropic/claude-sonnet-4-6, google-ai-studio/gemini-3.1-pro, deepseek/deepseek-v4-pro, or deepinfra/openai/gpt-oss-120b. The Realtime event schema does not change.
- Run live A/B tests on production traffic. Latitude beat OpenAI by a point on their workload by A/B testing three providers without changing client code.
- Route on metadata. Pass
user, language, country, tier, or intent and let the Router pick the right model variant.
- Failover for free. Specify a fallback pool via
extra_body.models; the Router records each attempt in metadata.attempts so you can debug.
- OpenAI SDK still works for non-realtime calls. Set
base_url="https://api.inworld.ai/v1" and the official Python or JavaScript SDK behaves as expected.
Migration checklist
- Generate an Inworld API key and secret in the Inworld portal and base64-encode
key:secret for the Basic auth header.
- Swap the WebSocket URL from
wss://api.openai.com/v1/realtime?model=gpt-realtime to wss://api.inworld.ai/api/v1/realtime/session?key=<session-id>&protocol=realtime. Generate session-id as a UUID per session.
- Replace
Authorization: Bearer ... with Authorization: Basic <base64(key:secret)>. Drop the OpenAI-Beta: realtime=v1 header.
- Update
session.update:
- Move the LLM into
session.model as a routed model ID (e.g. openai/gpt-5.5 or anthropic/claude-sonnet-4-6).
- Add
session.audio.input.transcription.model (e.g. inworld/inworld-stt-1).
- Move the TTS model into
session.audio.output.model (e.g. inworld-tts-2).
- Replace the OpenAI voice with an Inworld voice or a cloned
voiceId.
- Retune
server_vad thresholds. Inworld's Silero VAD plus Smart Turn behaves differently from OpenAI's default detector.
- Recreate any custom cloned voices using original human-recorded source audio via
POST /voices/v1/voices:clone.
- For browser deployments, switch to WebRTC and mint short-lived JWTs server-side with
POST /auth/v1/tokens/token:generate.
- Run a parallel deployment, compare turn-latency and barge-in behavior, then cut over.
How does the Inworld Realtime API differ architecturally?
OpenAI's gpt-realtime is a bundled multimodal model. The audio model, the reasoning, the STT, and the TTS are one artifact. Lower theoretical latency, zero component flexibility.
The Inworld Realtime API is a cascaded pipeline (STT + LLM + TTS) co-located in one realtime service. The cost of a cascaded design is one extra cross-stage hop; the benefit is that every stage is independently selectable. Pavel's C++ rewrite of the TTS serving path cut realtime latency by 10 to 15% and the Node-to-Go migration of the orchestrator removed another 150 to 200ms.
For workloads where natural-language voice steering, cross-lingual voice identity, and model choice matter, the cascaded design wins. For workloads where you want the simplest possible single-model surface and are comfortable inside the OpenAI ecosystem, gpt-realtime remains the cleanest option.
When should you not migrate yet
We try to be honest about where Inworld is not the right answer today.
- EU-resident workloads with strict data-residency requirements. Inworld's realtime inference is currently US-hosted. If your contract demands EU-only processing today, wait until EU inference lands.
- Hard guarantee of full-pipeline latency leadership. OpenAI Realtime and ElevenLabs both ran ahead of Inworld in at least one customer benchmark (Microvoz, May 2026). Realtime TTS is #1 on the Artificial Analysis Realtime TTS Arena; full end-to-end latency is workload-dependent. Run your own benchmark.
- SIP-native deployments. SIP is early access on Inworld. If you need a hardened SIP bridge today, OpenAI Realtime's SIP support is older.
Beyond migration: what you get on the full Inworld pipeline
Once your Realtime API integration is on Inworld, the rest of the pipeline is already available under the same API key:
- Realtime Router routes to 200+ LLMs from OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, Meta, Groq, and Inworld-optimized open-source models. Same OpenAI-compatible event schema, same auth.
- Realtime TTS for non-realtime synthesis at the same quality bar (Realtime TTS-2 is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena; TTS 1.5 Max also ranks among the top realtime models).
- Realtime STT with Inworld STT-1, Soniox
stt-rt-v4 (WebSocket-only, new May 2026), or AssemblyAI streaming, selectable per session.
- Voice cloning via the two-step
/voices/v1/voices:clone API.
- Inworld GitHub samples for working code examples across REST TTS, STT, and the Realtime API.
Getting help