Published 03.18.2026

Best Speech-to-Speech APIs in 2026: Architecture, Latency, and Code

Voice is becoming the default interface for software, and building for it has never been more accessible. Models are more natural, speeds are faster, and modern APIs handle the hard parts that used to require weeks of custom engineering. This guide breaks down the best speech-to-speech APIs available in 2026, what each one is good at, and how to choose.
Most teams don't think about voice latency until users start complaining. By then, the VAD has been waiting too long to trigger, the TTS has been buffering full sentences before sending audio, and cancellation hasn't been working at all. Those decisions were locked in early, before anyone ran a single P90 benchmark.

What "speech-to-speech" actually means

The term covers two different product categories that rarely overlap.
Voice-agent speech-to-speech is what this guide focuses on: user speaks, system responds with speech, optimized for low latency, barge-in, and multi-turn dialogue. The internals are almost always STT → LLM → TTS. The goal is making that pipeline behave like a single audio-in/audio-out system.
Voice conversion speech-to-speech is different: you feed recorded audio and get back that same content in a different voice, accent, or language. Tools like Resemble and ElevenLabs' voice conversion feature own this space. It's closer to dubbing than conversation.
A well-optimized composed pipeline can feel as seamless as native audio-in/audio-out if streaming and cancellation are done right. The user doesn't know or care what's happening inside.

Production checklist

Before evaluating any API, there's a short list of capabilities that separate a voice agent from an IVR:
  • Streaming input. Audio arrives continuously via WebSocket or WebRTC. No waiting for the user to finish before processing starts.
  • Fast endpointing. VAD plus end-of-turn detection to handle interruptions and decide when to respond.
  • Streaming output. Audio chunks arrive as they're synthesized, not after the full response is buffered. This is what makes sub-300ms time-to-first-audio possible.
  • Cancellation. Barge-in requires stopping TTS playback, canceling in-flight TTS generation, canceling LLM generation, and resetting stream state. If any of those steps is missing, the agent talks over users or "finishes the old thought" after being interrupted.
  • Observability. Per-turn traces with STT, LLM, and TTS timings broken out separately. Without this, you're guessing which component is causing latency spikes.
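Of the items above, cancellation is the one teams most often get wrong, because it's four separate actions that all have to fire. A minimal sketch of the barge-in path; `player`, `ttsStream`, and `llmStream` are placeholders for your own pipeline objects, and only AbortController is a real API:

```javascript
// Minimal barge-in handler: all four steps must run, in order.
class TurnController {
  constructor() {
    this.abort = new AbortController(); // shared signal for LLM + TTS requests
  }

  // Call this the moment VAD detects the user speaking over the agent.
  onBargeIn(player, ttsStream, llmStream) {
    player.stop();       // 1. stop audio already buffered client-side
    ttsStream.close();   // 2. cancel in-flight TTS generation
    llmStream.close();   // 3. cancel LLM token generation
    this.abort.abort();  //    (or cancel both via the shared signal)
    this.abort = new AbortController(); // 4. reset stream state for the next turn
  }
}
```

Skipping step 1 gives you an agent that keeps playing buffered audio over the user; skipping step 3 gives you one that "finishes the old thought" after the interruption.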

Reference architecture

An example streaming STT → LLM → TTS pipeline that hits those targets looks like this:
Audio in (WebSocket / WebRTC)
  └─ VAD + endpointing
      └─ Streaming STT (partials)
          └─ LLM (streaming tokens)
              └─ Text chunking (sentence-level for TTS)
                  └─ Streaming TTS (audio chunks)
                      └─ Client audio playback
                          └─ Cancellation path (on barge-in: back to VAD)
The difference between a fast pipeline and a slow one is usually in the middle steps. If the LLM waits for a full response before sending to TTS, you've added 1–3 seconds. If the TTS waits for a full sentence before generating, you've added another 200–500ms. Chunking LLM output at sentence boundaries and streaming TTS against those chunks is how you get first-audio under 300ms.
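The sentence-level chunking step can be sketched as a small accumulator that flushes to TTS the moment a sentence closes, instead of waiting for the full LLM response. The boundary regex and the `sendToTTS` callback are illustrative, not any particular vendor's API:

```javascript
// Accumulate streamed LLM tokens; flush each complete sentence to TTS
// as soon as it closes, so first-audio doesn't wait on the full response.
function makeSentenceChunker(sendToTTS) {
  let buffer = '';
  const boundary = /([.!?]["')\]]?)\s+/; // naive sentence-end heuristic
  return {
    push(token) {
      buffer += token;
      let m;
      while ((m = boundary.exec(buffer)) !== null) {
        const end = m.index + m[1].length;
        sendToTTS(buffer.slice(0, end).trim()); // first sentence goes out immediately
        buffer = buffer.slice(end);
      }
    },
    flush() { // call when the LLM stream ends
      if (buffer.trim()) sendToTTS(buffer.trim());
      buffer = '';
    },
  };
}
```

Production chunkers handle abbreviations, numbers, and markup that a regex like this mishandles, but the shape is the same: TTS starts on the first sentence while the LLM is still generating the rest.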
WebSocket vs. WebRTC: WebSocket is the right default for most product teams. It's easier to deploy, debug, and monitor, and it handles streaming STT and TTS cleanly. WebRTC makes sense when you need media-grade jitter handling, NAT traversal, or extremely tight real-time constraints. Telephony is the clearest case. The Softcery architecture writeup covers the tradeoffs in detail if you want a deeper look.
When P90 spikes, which layer do you blame? This is the operational question the architecture diagram doesn't answer. In practice: if end-of-turn latency spikes but TTFA is stable, the problem is almost always VAD. A conservative endpointing threshold is waiting too long to declare the user has stopped speaking. If TTFA spikes intermittently, the culprit is usually the LLM tier (token generation time under load) or the text chunking step failing to send the first sentence to TTS quickly enough. If audio sounds choppy or cuts off mid-stream, that's TTS chunk delivery. Either a slow connection or the TTS provider buffering before sending. Isolating this requires per-turn traces that break out STT time, LLM time-to-first-token, and TTS time-to-first-chunk separately. Without that breakdown, a 900ms P90 tells you something is wrong but not what to fix.
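The per-turn trace doesn't require a vendor SDK; a plain timing record per turn is enough to attribute a spike to a layer. A sketch, where the mark and field names are our own rather than any standard schema:

```javascript
// One record per conversational turn. Marks are wall-clock timestamps;
// the derived deltas are what you actually alert on.
function turnTrace() {
  const marks = {};
  return {
    mark(name) { marks[name] = performance.now(); },
    report() {
      return {
        // user stopped speaking -> endpointer declared end of turn (VAD layer)
        endpointingMs: marks.turnDetected - marks.speechEnd,
        // end of turn -> first LLM token (includes STT finalization)
        llmTtftMs: marks.firstToken - marks.turnDetected,
        // first sentence sent to TTS -> first audio chunk back (TTS layer)
        ttsTtfaMs: marks.firstAudio - marks.ttsSent,
        // the number users feel: speech end -> first audio out
        totalMs: marks.firstAudio - marks.speechEnd,
      };
    },
  };
}
```

With these four fields per turn, the diagnosis rules above become queries: a stable `ttsTtfaMs` alongside a spiking `endpointingMs` points at the VAD, not the TTS provider.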

The 6 Best Speech-to-Speech APIs in 2026

1. Inworld AI — Realtime API + TTS

Best for: Developers building speech-to-speech AI applications
Building speech-to-speech the traditional way means wiring together an STT provider, an LLM, a TTS model, chunking logic, VAD, cancellation handling, and observability. Inworld's Realtime API collapses that entire pipeline into one endpoint: stream audio in, stream high-quality audio back. It is effectively a drop-in voice agent built on Inworld's #1-ranked TTS model and managed for you.
Realtime API
Connect via WebSocket or WebRTC and audio flows in both directions in real time. VAD, end-of-turn detection, streaming, interruption handling, and cancellation are all handled natively server-side. The Realtime API leverages Inworld's high-accuracy streaming speech-to-text model, which captures the profile, context, and state of your users to contextualize responses. This is paired with Inworld's chart-topping speech generation models, namely Inworld TTS-1.5-Max, the fastest and highest-quality model globally (ranked #1 on the Artificial Analysis Speech Arena leaderboard with an ELO of 1,162).
In addition, unlike OpenAI's Realtime API, you're not locked to a single LLM. Inworld's implementation is model-agnostic, leveraging the Inworld Router to route requests across OpenAI, Anthropic, Google, and 200+ models through a single API, with built-in failover, A/B testing, and intelligent model selection without any code changes.
Code Example:
Connect via WebSocket for streaming conversation:
import WebSocket from 'ws'; // Node: npm install ws

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

// One handler for all server events: session setup and agent audio out
ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === 'session.created') {
    ws.send(JSON.stringify({ type: 'session.update', session: { ... } }));
  }
  if (msg.type === 'response.output_audio.delta') {
    playAudio(msg.delta); // base64 PCM16, play chunks as they arrive
  }
});

// Send audio chunks as they arrive from the microphone
function sendAudio(audioChunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: audioChunk // base64-encoded PCM16
  }));
}
For most voice agents, this is where to start. One endpoint replaces weeks of pipeline work.
TTS quality and latency
  • TTS-1.5-Max: sub-250ms P90 time-to-first-audio
  • TTS-1.5-Mini: sub-130ms P90 (trades some quality for speed)
  • Pricing: $10/1M characters (Max), $5/1M characters (Mini). At 100M chars/month, that's $1,000 vs. $20,600 on ElevenLabs.
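The cost comparison above is straightforward arithmetic, worth making explicit because it's the whole case for price-per-character pricing at scale:

```javascript
// Monthly TTS cost = (characters / 1,000,000) * price per 1M characters.
const monthlyCost = (chars, pricePerMillion) => (chars / 1e6) * pricePerMillion;

monthlyCost(100e6, 10);  // Inworld TTS Max at $10/1M: $1,000
monthlyCost(100e6, 206); // ElevenLabs Multilingual v2 at $206/1M: $20,600
```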
Voice cloning
Zero-shot cloning from 5–15 seconds of audio is free and included for all users. No tier gating. Our voice cloning docs include persona recording scripts to improve clone quality.
Enterprise controls
SOC 2 Type II, HIPAA with BAAs, GDPR, zero-retention mode, EU and India data residency, full on-premise deployment. Enterprise pricing via inworld.ai/pricing.
Where it falls short
15 supported languages. For any application requiring broad multilingual coverage today, ElevenLabs (29–74 languages) or Google (75+) is the more honest answer. The audio markup emotion tags ([happy], [whisper], etc.) are production-supported for single use at the start of a generation but experimental for multi-tag mid-text sequences.

2. Deepgram — Flux STT + Aura TTS

Best for: Teams building telephony or call center voice agents where STT accuracy and domain-specific pronunciation matter more than TTS naturalness.
Deepgram's speech-to-speech story is built around two models. Flux is a conversational STT model with integrated end-of-turn detection (~260ms cited), designed so that VAD and endpointing are part of the model rather than bolted on afterward. Aura-2 TTS streams over WebSocket and includes domain-specific pronunciation models for healthcare, finance, and legal. Drug names, financial instruments, legal citations: the terms that general-purpose TTS reliably mispronounces.
Deepgram's Aura-2 doesn't appear in the top 17 of Artificial Analysis' Speech Arena. For a call center transcribing medical dictation, that's a reasonable trade. For a consumer product where voice quality shapes user trust, it isn't.
Pricing: $30/1M characters for TTS. STT pricing varies by model tier.
Where it falls short: In blind preference tests, Aura-2 consistently loses to Inworld and OpenAI TTS on naturalness. Acceptable for internal tooling; a real problem for consumer-facing products. If your vocabulary set is standard English, Deepgram's vertical optimization is solving a problem you don't have.

3. Hume AI — EVI (Empathic Voice Interface)

Best for: AI companions, therapy, coaching, and social applications where emotional responsiveness is the core product.
Hume's EVI is architecturally distinct from everything else here. It's an LLM-backbone TTS that analyzes the user's tone of voice and adjusts response delivery accordingly. You don't write SSML tags or prompt a voice style. The model infers sarcasm, urgency, and warmth from conversational context and responds in kind. The EVI overview describes end-of-turn detection using prosody rather than just silence, which reduces false triggers on natural speech pauses.
The natural language voice control is genuinely differentiating: you describe the voice you want in plain English. "Sound hesitant, like someone delivering bad news." No tags required.
EVI 3 targets under 300ms end-to-end. Pricing is $7.60/1M characters, cheaper than Inworld Max. 11 languages at launch, with expansion announced.
Where it falls short: Hume doesn't appear in Artificial Analysis' top 17. For applications where voice quality on blind preference tests is the differentiator, the leaderboard data isn't there yet. For transactional voice agents (customer service, scheduling, any workflow where the goal is speed and accuracy rather than emotional resonance) Hume's architecture is doing work the use case doesn't require.

4. OpenAI Realtime API

Best for: Teams already on OpenAI's platform who want a working voice agent fast and are willing to pay for the simplicity.
The OpenAI Realtime API was one of the first native audio-in/audio-out endpoints on the market. There's no STT transcript, no text passed to an LLM, no TTS render. The same model handles the full loop. That difference has two concrete effects: it eliminates transcription latency in a composed pipeline, and it means the model can respond to tone, pacing, and affect that text transcription would have stripped out. VAD, function calling, and interruption handling all live in the same API surface, which cuts setup time significantly.
Commit to the Realtime API and you're on GPT-4o. You can't swap to Claude Sonnet if it outperforms on your specific domain, and you can't route to a cheaper model during off-peak hours. At scale, that single-vendor dependency pushes per-minute costs well above what a composed stack would run. Inworld's Realtime API offers the same audio-in/audio-out simplicity without the model lock-in. OpenAI also offers a separate TTS endpoint, gpt-4o-mini-tts, that accepts natural language style instructions ("sound skeptical and measured") rather than SSML or markup tags. It's worth knowing about for batch or non-realtime use cases. It's a standalone product, separate from the Realtime API.
The Realtime API has been GA since August 2025. 50+ language support.
Where it falls short: Model lock-in is the main cost. OpenAI does not offer on-premise deployment or free voice cloning. OpenAI doesn't publish P90 end-to-end latency for the Realtime API the way Inworld does, which makes it harder to set production targets before you've already built against it.

5. ElevenLabs — Voice Conversion and Content Production

Best for: Content production workflows: audiobooks, podcast dubbing, voiceover, any application where the input is a recording and the output is a different voice or language.
ElevenLabs' positioning in 2026 is content creation first, voice agents second. ElevenLabs has a 10,000+ community voice library, a dubbing product that preserves speaker voice across languages, voice isolation for noisy source audio, and sound effects generation. For a real-time voice agent, you don't need any of that.
On Artificial Analysis, ElevenLabs Multilingual v2 sits at #5 (ELO 1,105), Flash v2.5 at #12. The quality is competitive. But at $103–206/1M characters versus Inworld's $10/1M for a higher-ranked model, the price-performance gap is hard to justify for high-volume voice agent applications.
74 languages in v3 (currently alpha) is a genuine differentiator if broad multilingual coverage is a hard requirement today.
Where it falls short: Building a voice agent on ElevenLabs means wiring interruption handling yourself. ElevenLabs has no barge-in model for conversational agents. The WebSocket TTS endpoint streams audio, but there's no session state, no cancellation primitive tied to a conversation context, and no orchestration layer that tracks what the agent is mid-sentence on when a user talks over it. You end up building all of that. At $103–206/1M characters, you're paying a content-creation premium for an infrastructure gap you still have to close. For voice agents specifically, that combination is hard to justify.

6. Cartesia — Sonic Streaming TTS

Best for: Ultra-low latency voice agents in telephony, IVR, and real-time voice interfaces on constrained hardware.
Cartesia Sonic 3 ranks 20th on Artificial Analysis (ELO 1,054) at roughly $46.70/1M characters. The headline number is 40ms time-to-first-audio, about 3x faster than Inworld's Mini model (sub-130ms P90). The underlying architecture is a State Space Model rather than a transformer, which scales linearly with sequence length rather than quadratically — a structural throughput advantage at high concurrency that matters for telephony and IVR deployments handling large numbers of simultaneous sessions.
Where it falls short: If your LLM is adding 700–800ms on top of TTS generation, the difference between 40ms and 130ms is largely absorbed before users hear it. Optimize the full pipeline before treating TTS speed as the primary variable. On languages, Cartesia advertises 40+ but has 15 fully deployed — the same number Inworld supports today. Verify your target languages against the actual supported list, not the roadmap.

Summary Table

Tool | Role in Stack | Transport | Barge-in | Price | Best For
Inworld AI | Realtime API | WebSocket | Built-in (Realtime API) / Graph-level (Runtime) | $5–10/1M for TTS; usage-based for LLMs | Voice agents, consumer AI, cost-sensitive scale
Deepgram | STT + TTS | WebSocket | Flux integrated turn-taking | $30/1M | Telephony, domain-specific vocabulary
Hume EVI | Full STS service | WebSocket | Prosody-based turn detection | $7.60/1M | Companions, coaching, emotional AI
OpenAI Realtime | Native audio-in/out | WebSocket | Built-in VAD + cancellation | Usage-based | OpenAI-native stacks, rapid prototyping
ElevenLabs | TTS / voice conversion | REST + WebSocket | None (conversion, not agent) | $103–206/1M | Content production, dubbing, broad languages
Cartesia | TTS | WebSocket | cancel: true context ID | ~$47/1M | Ultra-low latency, IVR, high concurrency
Build voice agents faster with Inworld AI → Start free today

How to choose

If you're optimizing for quality at scale: Inworld TTS-1.5-Max holds #1 on Artificial Analysis at $10/1M characters. Running 100M characters per month costs $1,000. The same volume on ElevenLabs Multilingual v2 costs $20,600 for a lower-ranked model. The case for Inworld at scale is just math.
If you're optimizing for latency: Cartesia's 40ms TTFA is the fastest in the market. But raw TTFA isn't the only latency that matters. End-of-turn detection and LLM token time matter just as much. If your endpointing adds 500ms and your LLM adds 800ms, the difference between a 40ms and a 250ms TTS barely registers. Optimize the whole pipeline before obsessing over the TTS layer.
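That absorption effect is easy to make concrete. A sketch with illustrative numbers (the component figures are examples, not measurements of any vendor):

```javascript
// End-to-end voice latency is roughly additive across stages, so a TTS
// improvement only matters in proportion to the whole budget.
const totalLatency = (stages) => Object.values(stages).reduce((a, b) => a + b, 0);

const slowTts = { endpointing: 500, llmTtft: 800, ttsTtfa: 250 }; // ms, illustrative
const fastTts = { endpointing: 500, llmTtft: 800, ttsTtfa: 40 };

totalLatency(slowTts); // 1550 ms
totalLatency(fastTts); // 1340 ms: ~14% faster end-to-end, despite a 6x faster TTS
```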
If you need broad language coverage today: ElevenLabs (29–74 languages) or Google Cloud TTS (75+). Inworld's 15 languages cover the major commercial markets, but if you're shipping to Southeast Asia or Eastern Europe, the coverage constraint is real and the roadmap isn't a substitute for a working product. Google Cloud TTS is worth knowing here: WaveNet voices start at $16/1M characters with 1M free characters per month, Standard voices are $4/1M with 4M free monthly, and the newer Chirp 3 HD tier runs $30/1M. For teams already on GCP, it integrates natively with Dialogflow and Cloud Functions without any glue code. The Gemini 2.5 TTS models add natural language prompt-based voice control (similar to OpenAI's instruction field) across 75+ languages. It doesn't rank on Artificial Analysis, but at that language breadth and with hyperscaler infrastructure behind it, it's the practical default for multilingual enterprise deployments.
If you're building emotional AI: Hume EVI's architecture is built for this. Inworld's model delivers strong expressiveness (30% improvement in TTS-1.5 over TTS-1, per Inworld's internal metrics), but emotional context adaptation is central to Hume's product in a way it isn't for Inworld.
If you want to not manage a stack: Inworld's Realtime API is now the cleaner answer here. Send audio in via WebSocket or REST, get audio back, with model flexibility intact. OpenAI's Realtime API gets you to the same place faster if you're already on GPT-4o, but you're locked in from day one.

Why Inworld Is the Default Choice for Voice Agents

Most teams spend weeks wiring together VAD, cancellation, text chunking, and observability before they've built any actual product. The Inworld Realtime API skips all of that. Connect via WebSocket or WebRTC, send audio in, and get production-quality audio back with interruption handling and streaming already sorted. Teams that need more control over the pipeline can use the Agent Runtime, but for most use cases the Realtime API is the faster path.

FAQs

What is a speech-to-speech API?

A speech-to-speech API accepts audio input, processes it, and returns audio output, typically by chaining speech-to-text, a language model, and text-to-speech together. The user experience is conversational: you speak, the system responds in speech. In production, what determines quality is not just model capability but streaming architecture, cancellation behavior, and how well the pipeline handles interruptions. Some platforms offer pre-built endpoints that handle the full pipeline. Inworld's Realtime API, OpenAI's Realtime API, and Hume EVI all fall into this category. Others require you to assemble and wire the components yourself.

How do I choose the right speech-to-speech API?

Start with P90 end-to-end latency, not median or inference-only benchmarks. Then look at whether the API supports streaming output natively (not buffered) and whether barge-in cancellation is documented clearly. For quality evaluation, independent leaderboards like Artificial Analysis Speech Arena are more reliable than vendor-published comparisons. Cost per million characters matters significantly at scale: a 20x price gap between ElevenLabs and Inworld compounds quickly. Language requirements and compliance needs (HIPAA, on-premise) will narrow the field fast.
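Computing that P90 from your own per-turn measurements is a few lines; a nearest-rank sketch, with sample values invented for illustration:

```javascript
// Nearest-rank percentile over per-turn latency samples (ms).
// Median hides tail behavior; P90 is what users actually complain about.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const turns = [320, 310, 1450, 300, 340, 900, 330, 315, 325, 335];
percentile(turns, 50); // 325 ms: the median looks fine
percentile(turns, 90); // 900 ms: the tail tells the real story
```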

Is Inworld AI better than ElevenLabs for voice agents?

For real-time voice agents at scale, yes. Inworld TTS-1.5-Max ranks #1 on Artificial Analysis (ELO 1,162); ElevenLabs Multilingual v2 ranks #5 (ELO 1,105) at $206/1M characters versus Inworld's $10. The Agent Runtime adds orchestration, observability, and cancellation handling that ElevenLabs doesn't offer for voice agent pipelines. Where ElevenLabs is the right choice: applications requiring 29–74 languages, content production workflows (audiobooks, dubbing, podcast generation), or teams that need a 10,000+ voice library without building custom clones. ElevenLabs was built for content creation; Inworld was built for real-time voice agents. The right answer depends on which product you're building.

How does speech-to-speech relate to TTS?

TTS (text-to-speech) is one component of a speech-to-speech system, the step that converts the language model's text output into audio. A speech-to-speech system also requires STT (speech-to-text) to transcribe the user's voice, an LLM to generate a response, and orchestration logic to stream, cancel, and sequence those components in real time. TTS quality is a major input to overall speech-to-speech quality, but a great TTS API plugged into a poorly structured pipeline will still produce a slow, broken experience. Inworld addresses this by pairing its TTS API with a free Agent Runtime that handles the orchestration layer.

If I'm already using OpenAI for LLM, should I use the OpenAI Realtime API?

Model lock-in is the real cost here, and it's worth naming before the convenience argument takes over. The Realtime API commits you to GPT-4o; you can't swap in Claude, Gemini, or other frontier models anywhere in your stack. For teams scaling to millions of interactions, the Inworld Realtime API will typically deliver better quality at lower cost, and switching from OpenAI Realtime to Inworld Realtime is largely a matter of changing the router reference.

How quickly can a voice agent go live with Inworld?

The Realtime API is the fastest path. Connect via WebSocket or WebRTC, make a call, send audio in, get audio back. No pipeline assembly, no template scaffolding required.

How easy is it to switch from the OpenAI Realtime API to the Inworld Realtime API?

If you're already using the OpenAI Realtime API, you can switch to Inworld with minimal code changes. The event schema, session structure, and client/server events are compatible. With the Inworld Realtime API you can also leverage your Inworld Router to have a single voice agent dynamically handle many different user cohorts. A full migration guide is available here: https://docs.inworld.ai/docs/realtime/openai-migration.

What's the best ElevenLabs alternative for voice agents?

$10 versus $206 per million characters, for a higher-ranked model. That's the headline. Inworld TTS-1.5-Max sits at #1 on Artificial Analysis; ElevenLabs Multilingual v2 is at #5 and costs 20x more. The Agent Runtime replaces the orchestration work you'd otherwise build yourself, and the compliance stack (SOC 2, HIPAA, on-premise deployment) covers use cases ElevenLabs can't serve. The only reason to stay on ElevenLabs for a voice agent is language coverage: if you need more than 15 languages today, ElevenLabs or Google Cloud TTS are the more realistic options. For English-primary or limited-language voice agents at any meaningful scale, the switch to Inworld is straightforward.
Copyright © 2021-2026 Inworld AI