Published 03.30.2026

OpenAI Realtime API Alternatives: Best APIs for Speech In and Speech Out

The OpenAI Realtime API was the first widely available speech-to-speech API, and it proved an important concept. Developers could send audio to a single API over a WebSocket and get audio back, skipping the need to build a full pipeline. The Realtime API collapsed STT, LLM, and TTS into a single connection and set the bar for how voice agents should feel.
But roughly a year and a half after launch, the OpenAI Realtime API is no longer state of the art. It locks you into one specific architecture with one vendor for the model, the voice, and the transport. You can't swap in a different LLM or choose a leading TTS engine.
The good news: the field has expanded. In the span of about a year, we went from one viable speech-to-speech API to five. At Inworld, we built a Realtime API powered by top-ranked Realtime TTS on the Artificial Analysis Speech Arena, with access to 200+ LLMs through a single endpoint and full compatibility with the OpenAI Realtime protocol. Google launched Gemini 3.1 Flash Live in March 2026. xAI shipped the Grok Voice Agent API in December 2025. Hume released EVI 3. This guide compares these four alternatives against the original and breaks down which one fits which use case.

What Is a Realtime Voice API?

A realtime voice API accepts audio from a user and returns audio from an AI agent through a persistent streaming connection, usually a WebSocket.
The simplest way to understand it: you speak into a microphone, and an AI voice speaks back. Behind the scenes, the API handles three jobs. A speech-to-text (STT) model converts your voice to text. An LLM processes the transcribed speech and decides what to say in response. Then a text-to-speech (TTS) model turns the LLM response into spoken audio.
Most voice APIs wait for you to finish speaking, then process your words, then build a full audio response, then play it back. That sequence creates a noticeable pause even if each step is fast. Realtime APIs work differently. They start speaking back to you while they're still figuring out the rest of the answer.
Three trends are shaping this space right now:
  • The OpenAI Realtime protocol is becoming a de facto standard. Inworld and xAI both follow OpenAI's event schema. If you've built a client for OpenAI's Realtime API, you can point it at either provider with minimal code changes.
  • Modular architectures are winning on flexibility. OpenAI and Google process audio natively inside the model. That means tight coupling and no component swapping. Inworld uses a modular architecture (STT + LLM + TTS) that lets you choose each piece independently and route across 200+ LLMs through a single API.
  • Pricing models vary widely. xAI publishes a flat per-minute rate. OpenAI prices audio per token. Inworld pricing is published at inworld.ai/pricing with LLM costs passing through at provider rates.

Who Needs a Realtime Voice API (and When)?

Teams migrating off OpenAI's Realtime API. You built a prototype on OpenAI, it works, and now you're staring at per-minute costs that scale poorly. Or you want to use Claude or Gemini as the reasoning model instead of being locked to GPT. Or your enterprise customers need data residency options OpenAI doesn't offer.
Teams building voice agents from scratch. If you're starting a new voice project, you can either assemble a pipeline yourself or use a single realtime API that handles everything. Inworld covers both: the Realtime API ships the pipeline in one endpoint, and the individual TTS, STT, and Router APIs give you component-level control if you need custom processing between stages.
Enterprise teams with compliance requirements. Healthcare applications need HIPAA. Financial services need SOC 2 and data residency. Inworld offers full on-premise deployment, SOC 2 Type II, GDPR compliance, and EU/India data residency.
When a realtime API is the wrong tool: batch audio processing (transcribing recordings, generating audiobooks) and content production workflows where latency doesn't matter. These use cases are better served by standalone TTS and STT APIs.

How We Evaluated These APIs

Every API on this list meets the same baseline: a single WebSocket endpoint that accepts audio input and returns audio output in a streaming, bidirectional connection. Beyond that, we evaluated on six criteria:
TTS quality. We referenced the Artificial Analysis Speech Arena. Listeners compare speech samples side-by-side without knowing which model produced them. ELO scores from this benchmark are the most objective quality signal available.
Latency. Specifically P90 time-to-first-audio: the delay between the end of your speech and the first audible output frame. Averages hide tail latency. P90 shows what your users actually experience.
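To make the difference between an average and P90 concrete, here is a minimal sketch using the nearest-rank percentile method. The latency samples are invented for illustration and are not measurements of any API in this guide.

```python
import math
import statistics


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical time-to-first-audio samples in milliseconds (not measured data).
ttfa_ms = [210, 220, 225, 230, 240, 245, 250, 260, 900, 1200]

mean_ms = statistics.mean(ttfa_ms)      # pulled upward by the two slow turns
median_ms = statistics.median(ttfa_ms)  # what a typical turn looks like
p90_ms = percentile_nearest_rank(ttfa_ms, 90)

print(f"mean={mean_ms} ms, median={median_ms} ms, p90={p90_ms} ms")
```

With these made-up samples the median sits under 250 ms and the mean near 400 ms, while P90 is 900 ms. One turn in ten is slow enough to feel broken, and only the percentile shows it.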
Protocol compatibility. Does the API follow the OpenAI Realtime event schema? If so, your existing client code transfers with minimal changes. Proprietary protocols mean starting from scratch.
Model flexibility. Can you swap the underlying LLM, TTS, or STT? Or are you locked to the provider's own models?
Transport options. WebSocket only, or WebSocket plus WebRTC? WebRTC is built for browsers and handles unstable network conditions automatically. WebSocket gives you more control for server-side applications.
Pricing model. Per-minute flat rate, per-token, or per-character? The structure determines how your costs scale.

The 4 Best OpenAI Realtime API Alternatives in 2026

1. Inworld Realtime API

Top-ranked Realtime TTS on Artificial Analysis. Audio output is powered by Inworld's Realtime TTS. The Realtime TTS-2 preview ranks #1 among realtime TTS models on the Artificial Analysis Realtime TTS Arena (~1,208 ELO, May 2026), and Realtime TTS 1.5 Max also ranks among the top realtime models. The ranking comes from blind preference testing across thousands of listener comparisons, not self-reported metrics. See inworld.ai/pricing for current rates.
Full pipeline coverage in one endpoint. Inworld's Realtime API handles the voice agent pipeline in one place: STT, LLM, TTS, VAD, turn-taking, interruption handling, and image content parts (May 2026). Audio goes in over WebSocket or WebRTC. Audio comes back. No middleware required. You can build a working voice agent in minutes.
Model flexibility across 200+ models via Router. This is the biggest differentiator. Inworld's Router gives you unified access to OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, Meta, Groq, DeepInfra, and an optimized 1P track of Inworld-hosted open-source models through a single API key. You can swap the reasoning model mid-session without changing your integration code. A/B test Claude against GPT against Gemini and measure the impact on user outcomes.
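One way to run that kind of A/B test is to assign each session to a model deterministically, so a returning session always lands in the same arm. The sketch below is a generic hashing approach, not Inworld's Router API. The model identifiers are hypothetical placeholders, and how the chosen model is passed to the Router is left out because the article doesn't specify it.

```python
import hashlib

# Hypothetical model identifiers; real Router model names will differ.
CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]


def assign_model(session_id: str, models=CANDIDATE_MODELS) -> str:
    """Map a session ID to one model, stably and roughly uniformly."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(models)
    return models[bucket]


# The same session always gets the same arm of the experiment.
print(assign_model("session-1234"))
```

Hashing the session ID avoids storing an assignment table and keeps the split stable across restarts. Logging the assigned model next to your outcome metric is what makes the comparison measurable.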
Automatic interruption handling and semantic VAD. Setting interrupt_response: true enables barge-in. The agent stops speaking and begins processing new input when the user talks over it. The semantic VAD is built on Inworld's Silero VAD plus a Smart Turn detector that listens to what you're saying, not just whether you've gone silent, to decide when you're done talking. You control the tradeoff between fast responses and premature cutoffs with a configurable eagerness parameter (low, medium, high).
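As a rough illustration, the settings described above could be sent in a session.update event. The interrupt_response flag and the low/medium/high eagerness values come from the description above. The nesting under session.turn_detection and the "semantic_vad" type string are assumptions modeled on OpenAI-style clients, so check the provider's schema before relying on them.

```python
import json

VALID_EAGERNESS = {"low", "medium", "high"}


def build_turn_detection_update(eagerness: str = "medium",
                                interrupt_response: bool = True) -> str:
    """Build a session.update event that configures semantic VAD (assumed layout)."""
    if eagerness not in VALID_EAGERNESS:
        raise ValueError(f"eagerness must be one of {sorted(VALID_EAGERNESS)}")
    event = {
        "type": "session.update",
        "session": {
            # Assumed field layout, modeled on OpenAI-style turn detection.
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": eagerness,  # higher = faster replies, more cutoffs
                "interrupt_response": interrupt_response,  # True enables barge-in
            }
        },
    }
    return json.dumps(event)


print(build_turn_detection_update("low"))
```

A low eagerness setting suits users who pause mid-thought, while high suits short command-style exchanges.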
WebSocket and WebRTC as first-class transports. WebRTC is built for voice in the browser. It handles unstable connections, adapts audio quality on the fly, and works through firewalls without extra configuration. WebSocket works for server-side orchestration and telephony bridges. Both transports share the same event model. You can run WebRTC on the client and WebSocket on the backend without maintaining separate codepaths.
Drop-in OpenAI migration. The API follows the OpenAI Realtime protocol. Events like session.update, input_audio_buffer.append, response.create, and response.done work with the same semantics. If you're already running on OpenAI's Realtime API, you migrate by changing the endpoint URL and API key. Inworld extends the base protocol with router support, semantic VAD configuration, and dynamic session updates. None of that breaks compatibility with existing OpenAI-shaped clients.
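A minimal sketch of what that looks like in client code: the endpoint and key live in one lookup table, and the event builders stay identical across providers. The Inworld URL below is a placeholder (the real endpoint isn't given here), the environment variable names are hypothetical, and authentication headers are omitted because each provider documents its own scheme.

```python
import base64
import json
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RealtimeProvider:
    url: str          # WebSocket endpoint
    api_key_env: str  # environment variable holding the key (hypothetical names)


PROVIDERS = {
    "openai": RealtimeProvider("wss://api.openai.com/v1/realtime", "OPENAI_API_KEY"),
    "xai": RealtimeProvider("wss://api.x.ai/v1/realtime", "XAI_API_KEY"),
    # Placeholder URL: take the real endpoint from Inworld's migration guide.
    "inworld": RealtimeProvider("wss://inworld-realtime.example/v1/realtime",
                                "INWORLD_API_KEY"),
}


def connection_settings(provider: str, env=os.environ) -> tuple[str, str]:
    """Return (endpoint URL, API key): the only provider-specific part."""
    cfg = PROVIDERS[provider]
    return cfg.url, env.get(cfg.api_key_env, "")


def audio_append_event(pcm_chunk: bytes) -> str:
    """Serialize one audio chunk as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def response_create_event() -> str:
    """Ask the server to generate a response for the buffered input."""
    return json.dumps({"type": "response.create"})
```

Switching providers then means changing the string passed to connection_settings; the functions that build events are untouched.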
Best for: Teams that want top-ranked realtime TTS and multi-model LLM flexibility in a single realtime API. Strongest fit for teams migrating from OpenAI, since the protocol is compatible and your client code transfers directly.
Pros:
  • Top-ranked Realtime TTS on Artificial Analysis Realtime TTS Arena: Realtime TTS-2 preview at ~1,208 ELO (#1 realtime), Realtime TTS 1.5 Max also among the top-ranked realtime models
  • Model-agnostic: Router across 200+ LLMs with automatic failover and A/B testing
  • Semantic VAD built on Inworld's Silero VAD plus Smart Turn detector, with configurable eagerness
  • Image content parts in Realtime API (May 2026)
  • WebSocket and WebRTC with shared event model
  • Built-in observability, telemetry, and per-stage latency tracing for production debugging
  • OpenAI protocol compatible with documented migration path
  • SOC 2 Type II, GDPR compliance with on-premise deployment option
  • Individual TTS, STT, and Router APIs available for teams that want pipeline-level control
Cons:
  • Realtime API is in research preview, not yet GA. For production-critical deployments with zero tolerance for breaking changes, this matters.
  • 15 GA languages (90+ experimental in TTS-2 preview) versus 50+ (OpenAI) or 90+ (Google). If your application requires broad multilingual coverage today across GA voices, this is a limitation.
Pricing: See inworld.ai/pricing for current TTS rates. LLM costs pass through at provider rates.

2. Google Gemini 3.1 Flash Live

Google launched Gemini 3.1 Flash Live on March 26, 2026. It's natively multimodal. Audio goes directly into the model and audio comes directly out. No separate STT or TTS stage. The model processes speech, reasons over it, and generates a spoken response in one pass.
The architecture is different from modular approaches. Gemini 3.1 Flash Live is built on Gemini 3 Pro and accepts audio, images, video, and text as inputs with a 128K token context window. The Live API uses a bidirectional WebSocket connection. Raw 16-bit PCM audio at 16kHz goes in. Raw PCM audio comes back.
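For readers who haven't handled raw PCM before, the sketch below shows what "16-bit PCM at 16kHz" means in bytes. It converts floating-point samples to 16-bit integers and slices them into fixed-size chunks for streaming. Mono audio, little-endian byte order, and the 100 ms chunk size are assumptions for illustration; this is generic audio handling, not a call into Google's SDK.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit samples


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 100):
    """Split PCM bytes into chunks of chunk_ms milliseconds (last may be short)."""
    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


# One second of a 440 Hz test tone standing in for microphone input.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ)]
pcm = float_to_pcm16(tone)
chunks = chunk_pcm(pcm)

print(len(pcm), len(chunks), len(chunks[0]))
```

One second of mono audio at this format is 32,000 bytes, so a 100 ms chunk is 3,200 bytes. Smaller chunks lower latency at the cost of more messages on the socket.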
The benchmark results are strong. It scores 90.8% on ComplexFuncBench Audio (multi-step function calling via voice) and 36.1% on Scale AI's Audio MultiChallenge (instruction following during interruptions and background noise). It supports 90+ languages for realtime conversations.
The trade-off is lock-in. Flash Live uses Google's proprietary Live API protocol, not the OpenAI event schema. If you're migrating from OpenAI, you're rewriting your client code. You're also locked to Google's models for reasoning, TTS, and STT. No swapping in Claude or a different TTS engine.
Best for: Teams already on Google Cloud who want native multimodal speech-to-speech without managing separate pipeline components. Strong for applications requiring broad language coverage or deep GCP integration (Dialogflow, Contact Center AI, Vertex AI).
Pros:
  • Natively multimodal: no pipeline stages, lower theoretical latency
  • 90+ languages
  • 90.8% on ComplexFuncBench Audio (voice-based function calling)
  • 128K context window
  • Generous free tier (no credit card required for Google AI Studio)
Cons:
  • Proprietary protocol. Not OpenAI-compatible. Migration from OpenAI means rewriting client code.
  • Locked to Google's models. No swapping LLM, TTS, or STT providers.
  • WebSocket only. No WebRTC support documented at launch.
  • Preview status. Not yet GA.
  • Per-turn billing re-processes accumulated context tokens, which can surprise you on cost for long conversations.
Pricing: Token-based. Audio input at $1.00/1M tokens, text output at $3.00/1M tokens via the Gemini API. Free tier available for development. Per-turn billing means costs accumulate across conversation turns as context grows.
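To see why re-processed context matters, consider a simplified model in which every turn adds the same number of input tokens and each turn is billed for all context accumulated so far. The per-turn token count is a hypothetical figure, the $1.00/1M rate is the audio input price quoted above, and real billing has more terms (output tokens, caching), so treat this as a shape-of-the-curve sketch only.

```python
def billed_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when turn k re-processes k turns of context."""
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))


def input_cost_usd(turns: int, tokens_per_turn: int,
                   usd_per_million: float = 1.00) -> float:
    """Input cost under the simplified re-processing model."""
    return billed_input_tokens(turns, tokens_per_turn) * usd_per_million / 1_000_000


# Hypothetical conversation: 20 turns, 500 new input tokens per turn.
new_tokens = 20 * 500
billed = billed_input_tokens(20, 500)
print(new_tokens, billed, round(input_cost_usd(20, 500), 3))
```

Under these assumptions the conversation contains 10,000 new tokens but bills 105,000, because billed tokens grow with the square of the turn count. Doubling the conversation length roughly quadruples input cost.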

3. xAI Grok Voice Agent API

xAI launched the Grok Voice Agent API in December 2025. The same stack powers Grok Voice in millions of Tesla vehicles and the Grok mobile app. It was battle-tested at scale before the API went public.
The API follows the OpenAI Realtime protocol. WebSocket connection at wss://api.x.ai/v1/realtime, same event schema. If you've built on OpenAI, your client code transfers with minimal changes.
The pricing is aggressive: $0.05 per minute flat. No token math. No separate input/output rates. A 10-minute voice agent conversation costs $0.50 total. The same conversation on OpenAI's Realtime API might run $2-3 depending on output length.
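The arithmetic behind that comparison can be sketched as follows. The Grok rate is the published flat price; the OpenAI per-minute figures are this guide's approximate conversions ($0.06/min input, $0.24/min output). The assumption that input audio is billed for the whole call while output is billed only while the agent speaks is a simplification made for this example.

```python
GROK_USD_PER_MIN = 0.05           # flat, per minute of connection time
OPENAI_INPUT_USD_PER_MIN = 0.06   # approximate figure used in this guide
OPENAI_OUTPUT_USD_PER_MIN = 0.24  # approximate figure used in this guide


def grok_cost(connection_minutes: float) -> float:
    """Flat-rate cost: only the connection length matters."""
    return connection_minutes * GROK_USD_PER_MIN


def openai_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimated cost from minutes of audio sent in and spoken back."""
    return (input_minutes * OPENAI_INPUT_USD_PER_MIN
            + output_minutes * OPENAI_OUTPUT_USD_PER_MIN)


# A 10-minute call; the agent speaks for 6 minutes or for all 10 (assumed).
print(round(grok_cost(10), 2))
print(round(openai_cost(10, 6), 2), round(openai_cost(10, 10), 2))
```

Under these assumptions the flat rate gives $0.50, while the per-minute estimate ranges from about $2.04 to $3.00 depending on how long the agent talks.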
Built-in tools are a differentiator. The API natively supports web search and X (Twitter) search as first-party tools. The model can invoke them mid-conversation without custom function definitions. If your voice agent needs to look up real-time information during a call, you don't need to build that integration yourself.
The voice quality is solid but not independently benchmarked on Artificial Analysis. xAI claims the API ranks #1 on Big Bench Audio, but that measures reasoning capability, not TTS quality. You get 5 voice options with expressive tags for controlling delivery style. No voice cloning.
Best for: Teams that want the cheapest flat-rate realtime voice API with OpenAI protocol compatibility. Strong for applications that benefit from built-in web and X search capabilities during conversations.
Pros:
  • $0.05/min flat rate. Simplest, most predictable pricing in this comparison.
  • OpenAI protocol compatible
  • Built-in web search and X search as native tools
  • 100+ languages with automatic detection and mid-conversation switching
  • Battle-tested stack (Tesla, Grok mobile app)
Cons:
  • Locked to Grok models. No swapping in a different LLM.
  • WebSocket only. No WebRTC support.
  • 5 preset voices. No voice cloning.
  • TTS quality not independently ranked on Artificial Analysis Speech Arena.
  • Limited voice customization compared to Inworld's audio markup or Hume's natural language control.
Pricing: $0.05/min of connection time, flat rate. See xAI's pricing page for details.

4. Hume EVI 3

Hume shipped EVI 3 as a speech language model that understands and generates emotion natively. The model doesn't just process words. It analyzes prosody, tone, and emotional cues in your voice and adjusts its delivery accordingly.
The voice control is the standout feature. You describe the voice you want in plain English: "Sound hesitant, like someone delivering bad news." Or: "Speak with warm enthusiasm, like a favorite teacher." No SSML tags. No audio markup syntax. The model interprets the instruction and generates speech that matches. You can also design entirely new voices from text descriptions without providing any audio sample.
EVI 3 targets sub-300ms end-to-end latency. It supports voice cloning from under 30 seconds of audio. You can plug in Claude, GPT, Gemini, or your own custom model as a supplementary reasoning engine. EVI 3 generates the initial response while the heavier LLM processes in parallel, then integrates the LLM's output once it's ready.
The trade-off is specialization. Hume is built for applications where emotional nuance is the product, not just a feature. AI companions, therapy bots, coaching apps, social experiences. For transactional voice agents where the goal is speed and accuracy, EVI's emotion processing adds overhead the use case doesn't need.
Best for: Applications where emotional intelligence and voice expressiveness are core differentiators: AI companions, mental health, coaching, and social AI.
Pros:
  • Natural language voice control and voice design from text descriptions
  • Emotional prosody analysis of user input
  • Sub-300ms end-to-end latency
  • Interoperable with external LLMs (Claude, GPT, Gemini, custom)
  • Voice cloning from under 30 seconds of audio
  • 200K+ designed voices on platform
Cons:
  • 11 languages at launch (20+ expansion announced but not yet shipped)
  • Not ranked on Artificial Analysis Speech Arena, making it hard to compare TTS quality objectively
  • Subscription-based pricing with per-minute overage charges adds billing complexity
  • For transactional voice agents, the emotion processing adds overhead without proportional value
Pricing: Tiered subscriptions from free (5 EVI minutes/month) to Business ($500/month for 12,500 EVI minutes). Pro plan overage at $0.06/min. Enterprise pricing is custom. See Hume's pricing page for the full plan list.

Overview of OpenAI Realtime API

OpenAI's Realtime API reached GA in August 2025 with the gpt-realtime model. It's the most mature option here. Its event schema has become the protocol other vendors build against.
The API is natively multimodal. Audio flows directly into the model without a separate STT step. The model reasons over the audio and generates both text and audio responses. Fewer moving parts. In theory, lower latency.
WebSocket and WebRTC are both supported. The gpt-realtime model improved on its predecessor with better audio quality, stronger instruction following, and more reliable function calling (66.5% on ComplexFuncBench Audio, up from 49.7%). It now supports image inputs alongside audio and MCP server integration for tool calling.
The limitation hasn't changed since launch: you're locked to OpenAI. The LLM is OpenAI's. The TTS is OpenAI's. The STT is OpenAI's. If you want Claude for reasoning, or Inworld's TTS for voice quality, or a specialized STT for domain-specific accuracy, you can't use them. The pricing reflects that bundled nature: roughly $0.06/min for audio input and $0.24/min for audio output. That adds up quickly at scale.
Best for: Teams building directly on OpenAI's models who want the most mature production track record in this category.
Pros:
  • Established realtime voice API. GA since August 2025.
  • Natively multimodal. No pipeline stages.
  • WebSocket + WebRTC
  • Image inputs alongside audio (gpt-realtime)
  • MCP server support for tool calling
  • 50+ languages
  • Natural language voice instructions via gpt-4o-mini-tts
Cons:
  • Locked to OpenAI models. No LLM, TTS, or STT swapping.
  • TTS quality ranks below Realtime TTS-2 preview (~1,208 ELO, #1 realtime) and Realtime TTS 1.5 Max on the Artificial Analysis Realtime TTS Arena
  • Audio output is priced per token; costs compound for long conversations
  • No voice cloning (9 built-in voices, custom voices only via enterprise agreement)
  • No on-premise deployment
Pricing: $32/1M audio input tokens (~$0.06/min), $64/1M audio output tokens (~$0.24/min). Text tokens are billed separately. See OpenAI's pricing page for full rates.

Comparison Table

API | Architecture | Protocol | Transports | Languages | TTS Ranking (Artificial Analysis) | Pricing | Best For
Inworld | Modular (STT+LLM+TTS) | OpenAI-compatible | WebSocket + WebRTC | 15 GA / 90+ preview | #1 realtime (~1,208 ELO) | See pricing | Model flexibility + top realtime TTS
Gemini Flash Live | Natively multimodal | Google proprietary | WebSocket | 90+ | Not ranked | Token-based (~$1-3/1M tokens) | Multilingual + GCP integration
Grok Voice Agent | In-house full stack | OpenAI-compatible | WebSocket | 100+ | Not ranked | $0.05/min flat | Cheapest flat rate + built-in search
Hume EVI 3 | Speech language model | Hume proprietary | WebSocket | 11 | Not in top tier | $0.06/min (Pro overage) | Emotional AI + companion apps
OpenAI Realtime | Natively multimodal | OpenAI (original) | WebSocket + WebRTC | 50+ | Outside top 5 | ~$0.30/min blended | Tightest OpenAI model integration

Why Inworld Is the Strongest Alternative

The OpenAI Realtime API proved that speech-to-speech should be a single API call. Every alternative on this list agrees. The question is which one gives you the most capability per dollar without recreating the lock-in problem.
We at Inworld answer on three axes. Realtime TTS is top-ranked on the Artificial Analysis Realtime TTS Arena (Realtime TTS-2 preview at ~1,208 ELO, #1 realtime; Realtime TTS 1.5 Max also among the top-ranked realtime models). Model flexibility through Router covers 200+ LLMs with automatic failover, A/B testing, and intelligent routing. And the OpenAI protocol compatibility means you can migrate without rewriting your client code.
Gemini Flash Live is the strongest option for multilingual applications. Grok wins on pricing simplicity. Hume owns the emotional AI niche. OpenAI still has the longest production track record. But if you want top-ranked realtime voice quality, multi-provider model flexibility, and a migration path that doesn't require starting over, Inworld is the clear choice.

FAQs

What is a realtime voice API?

A realtime voice API accepts audio input and returns audio output through a persistent streaming connection. Usually a WebSocket. It handles speech recognition, turn detection, language model inference, and text-to-speech synthesis server-side. You don't need to wire together separate services for each stage. The pipeline stages overlap rather than running sequentially, so latency stays low enough for the exchange to feel natural.

How is a realtime API different from chaining STT + LLM + TTS?

A chained pipeline processes each stage one at a time. Transcribe the full utterance. Send the text to an LLM. Wait for the complete response. Synthesize the full audio clip. Play it back. Each handoff adds latency. A realtime API overlaps these stages. The TTS engine starts synthesizing from the first tokens of the LLM response while the model is still generating the rest. It also handles VAD, turn detection, and interruption recovery server-side. In a chained pipeline, you'd build all of that yourself.
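A back-of-the-envelope model makes the difference visible. In a chained pipeline, the first audio waits for every stage to finish; in an overlapped one, it waits only for each stage to produce its first output. All timings below are hypothetical values chosen for illustration, not benchmarks of any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimings:
    """Hypothetical per-stage timings in milliseconds."""
    stt_final_ms: int          # transcript finalized after the user stops talking
    llm_first_token_ms: int    # LLM emits its first token
    llm_full_response_ms: int  # LLM finishes the whole response
    tts_first_chunk_ms: int    # TTS emits its first audio chunk
    tts_full_clip_ms: int      # TTS finishes the whole clip


def chained_ttfa(t: StageTimings) -> int:
    """Time to first audio when each stage waits for the previous to finish."""
    return t.stt_final_ms + t.llm_full_response_ms + t.tts_full_clip_ms


def overlapped_ttfa(t: StageTimings) -> int:
    """Time to first audio when TTS starts on the LLM's first tokens."""
    return t.stt_final_ms + t.llm_first_token_ms + t.tts_first_chunk_ms


timings = StageTimings(stt_final_ms=150, llm_first_token_ms=250,
                       llm_full_response_ms=1800, tts_first_chunk_ms=130,
                       tts_full_clip_ms=900)

print(chained_ttfa(timings), overlapped_ttfa(timings))
```

With these assumed numbers the chained pipeline takes 2,850 ms to produce sound and the overlapped one 530 ms, even though every individual stage runs at the same speed in both.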

Can I migrate from OpenAI's Realtime API to Inworld?

Yes. Inworld's Realtime API follows the OpenAI event schema. Events like session.update, input_audio_buffer.append, response.create, and response.done work with the same semantics. Inworld publishes a migration guide documenting the process. In practice, you change the WebSocket endpoint URL and API key, then optionally configure Inworld-specific extensions like semantic VAD, router settings, and model selection.

Is Inworld's Realtime API better than Gemini Flash Live?

They serve different needs. Gemini Flash Live processes audio natively inside a multimodal model. That eliminates pipeline stages but locks you to Google's models and a proprietary protocol. Inworld uses a modular architecture with top-ranked Realtime TTS on Artificial Analysis (Realtime TTS-2 preview at ~1,208 ELO, #1 realtime), OpenAI protocol compatibility, and Router access to 200+ models. If you need more than 15 GA languages, Gemini's 90+ language support is a clear advantage. If you want model flexibility, top-ranked realtime voice quality, and the ability to migrate without rewriting client code, Inworld is the stronger fit.

What latency should I expect from a realtime voice API?

A well-optimized pipeline typically achieves 500-800ms end-to-end, which is fast enough for conversations to feel natural. Time-to-first-audio is the most important metric. For the TTS stage, Inworld's Realtime TTS 1.5 Mini delivers under 130ms P90 and Realtime TTS 1.5 Max delivers under 250ms P90. These TTS figures are measured end to end, including network overhead, not inference only. Ask vendors for P90 benchmarks rather than averages.

When should I use WebRTC vs WebSocket?

Use WebRTC when your voice agent runs in a browser or mobile app over variable network conditions. It adapts to connection quality automatically and works through firewalls without extra setup. Use WebSocket when you're connecting through a telephony bridge, running behind a proxy, or need fine-grained control over every message in the session. Inworld and OpenAI support both transports with the same event model. Gemini, Grok, and Hume are WebSocket only.

What's the most cost-effective realtime voice API?

Grok publishes a flat per-minute rate, which makes billing simple. But cheapest per minute and best value are different questions. Inworld's Realtime TTS is top-ranked on Artificial Analysis (Realtime TTS-2 preview #1 realtime, ~1,208 ELO); Grok's TTS isn't independently ranked there. If you're optimizing for top-ranked voice output plus LLM choice, Inworld's Realtime API plus Router is the strongest fit. Gemini's token-based pricing is harder to predict because per-turn billing reprocesses accumulated context.

Do I need a realtime API, or can I build my own pipeline?

Inworld supports both approaches. The Inworld Realtime API handles STT, LLM, TTS, VAD, interruption handling, and turn-taking in a single WebSocket or WebRTC connection. You send audio, you get audio back. Most teams ship faster this way because they skip the weeks of infrastructure work that go into wiring pipeline stages together.
If you need more control, Inworld also provides tools for building custom pipelines. You can wire together specific STT, LLM, and TTS nodes, add custom processing logic between stages, run parallel LLM evaluations, and deploy to cloud. These tools use the same Realtime TTS (top-ranked realtime on Artificial Analysis) and the same Inworld Router (200+ models) as the Realtime API, so you get component-level flexibility without giving up model quality or multi-provider access.
Frameworks like Pipecat and LiveKit Agents also support custom pipelines, but they require you to bring your own models and manage your own infrastructure. Inworld gives you both a managed realtime API and a self-hostable pipeline toolkit from the same vendor.
Copyright © 2021-2026 Inworld AI