Published 03.18.2026

Best Speech-to-Speech APIs in 2026: Architecture, Latency, and Code

Voice is becoming the default interface for software, and building for it has never been more accessible. Models are more natural, speeds are faster, and modern APIs handle the hard parts that used to require weeks of custom engineering. This guide breaks down the best speech-to-speech APIs available in 2026, what each one is good at, and how to choose.
Most teams don't think about voice latency until users start complaining. By then, the VAD has been waiting too long to trigger, the TTS has been buffering full sentences before sending audio, and cancellation hasn't been working at all. Those decisions were locked in early, before anyone ran a single P90 benchmark.

What "speech-to-speech" actually means

The term covers two different product categories that rarely overlap.
Voice-agent speech-to-speech is what this guide focuses on: user speaks, system responds with speech, optimized for low latency, barge-in, and multi-turn dialogue. The internals are almost always STT → LLM → TTS. The goal is making that pipeline behave like a single audio-in/audio-out system.
Voice conversion speech-to-speech is different: you feed recorded audio and get back that same content in a different voice, accent, or language. Tools like Resemble and ElevenLabs' voice conversion feature own this space. It's closer to dubbing than conversation.
A well-optimized composed pipeline can feel as seamless as native audio-in/audio-out if streaming and cancellation are done right. The user doesn't know or care what's happening inside.

Production checklist

Before evaluating any API, there's a short list of capabilities that separate a voice agent from an IVR:
  • Streaming input. Audio arrives continuously via WebSocket or WebRTC. No waiting for the user to finish before processing starts.
  • Fast endpointing. VAD plus end-of-turn detection to handle interruptions and decide when to respond.
  • Streaming output. Audio chunks arrive as they're synthesized, not after the full response is buffered. This is what makes sub-300ms time-to-first-audio possible.
  • Cancellation. Barge-in requires stopping TTS playback, canceling in-flight TTS generation, canceling LLM generation, and resetting stream state. If any of those steps is missing, the agent talks over users or "finishes the old thought" after being interrupted.
  • Observability. Per-turn traces with STT, LLM, and TTS timings broken out separately. Without this, you're guessing which component is causing latency spikes.
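Of the items above, cancellation is the one teams most often get wrong, because it's four separate actions that all have to fire. A minimal sketch of the barge-in path; `player`, `ttsStream`, and `llmStream` are placeholders for your own pipeline objects, and only AbortController is a real API:

```javascript
// Minimal barge-in handler: all four steps must run, in order.
class TurnController {
  constructor() {
    this.abort = new AbortController(); // shared signal for LLM + TTS requests
  }

  // Call this the moment VAD detects the user speaking over the agent.
  onBargeIn(player, ttsStream, llmStream) {
    player.stop();       // 1. stop audio already buffered client-side
    ttsStream.close();   // 2. cancel in-flight TTS generation
    llmStream.close();   // 3. cancel LLM token generation
    this.abort.abort();  //    (or cancel both via the shared signal)
    this.abort = new AbortController(); // 4. reset stream state for the next turn
  }
}
```

Skipping step 1 gives you an agent that keeps playing buffered audio over the user; skipping step 3 gives you one that "finishes the old thought" after the interruption.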

Reference architecture

An example streaming STT → LLM → TTS pipeline that hits those targets looks like this:
Audio in (WebSocket / WebRTC)
  └─ VAD + endpointing
      └─ Streaming STT (partials)
          └─ LLM (streaming tokens)
              └─ Text chunking (sentence-level for TTS)
                  └─ Streaming TTS (audio chunks)
                      └─ Client audio playback
                          └─ Cancellation path (on barge-in: back to VAD)
The difference between a fast pipeline and a slow one is usually in the middle steps. If the LLM waits for a full response before sending to TTS, you've added 1–3 seconds. If the TTS waits for a full sentence before generating, you've added another 200–500ms. Chunking LLM output at sentence boundaries and streaming TTS against those chunks is how you get first-audio under 300ms.
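The sentence-level chunking step can be sketched as a small accumulator that flushes to TTS the moment a sentence closes, instead of waiting for the full LLM response. The boundary regex and the `sendToTTS` callback are illustrative, not any particular vendor's API:

```javascript
// Accumulate streamed LLM tokens; flush each complete sentence to TTS
// as soon as it closes, so first-audio doesn't wait on the full response.
function makeSentenceChunker(sendToTTS) {
  let buffer = '';
  const boundary = /([.!?]["')\]]?)\s+/; // naive sentence-end heuristic
  return {
    push(token) {
      buffer += token;
      let m;
      while ((m = boundary.exec(buffer)) !== null) {
        const end = m.index + m[1].length;
        sendToTTS(buffer.slice(0, end).trim()); // first sentence goes out immediately
        buffer = buffer.slice(end);
      }
    },
    flush() { // call when the LLM stream ends
      if (buffer.trim()) sendToTTS(buffer.trim());
      buffer = '';
    },
  };
}
```

Production chunkers handle abbreviations, numbers, and markup that a regex like this mishandles, but the shape is the same: TTS starts on the first sentence while the LLM is still generating the rest.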
WebSocket vs. WebRTC: WebSocket is the right default for most product teams. It's easier to deploy, debug, and monitor, and it handles streaming STT and TTS cleanly. WebRTC makes sense when you need media-grade jitter handling, NAT traversal, or extremely tight real-time constraints. Telephony is the clearest case. The Softcery architecture writeup covers the tradeoffs in detail if you want a deeper look.
When P90 spikes, which layer do you blame? This is the operational question the architecture diagram doesn't answer. In practice: if end-of-turn latency spikes but TTFA is stable, the problem is almost always VAD. A conservative endpointing threshold is waiting too long to declare the user has stopped speaking. If TTFA spikes intermittently, the culprit is usually the LLM tier (token generation time under load) or the text chunking step failing to send the first sentence to TTS quickly enough. If audio sounds choppy or cuts off mid-stream, that's TTS chunk delivery. Either a slow connection or the TTS provider buffering before sending. Isolating this requires per-turn traces that break out STT time, LLM time-to-first-token, and TTS time-to-first-chunk separately. Without that breakdown, a 900ms P90 tells you something is wrong but not what to fix.
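The per-turn trace doesn't require a vendor SDK; a plain timing record per turn is enough to attribute a spike to a layer. A sketch, where the mark and field names are our own rather than any standard schema:

```javascript
// One record per conversational turn. Marks are wall-clock timestamps;
// the derived deltas are what you actually alert on.
function turnTrace() {
  const marks = {};
  return {
    mark(name) { marks[name] = performance.now(); },
    report() {
      return {
        // user stopped speaking -> endpointer declared end of turn (VAD layer)
        endpointingMs: marks.turnDetected - marks.speechEnd,
        // end of turn -> first LLM token (includes STT finalization)
        llmTtftMs: marks.firstToken - marks.turnDetected,
        // first sentence sent to TTS -> first audio chunk back (TTS layer)
        ttsTtfaMs: marks.firstAudio - marks.ttsSent,
        // the number users feel: speech end -> first audio out
        totalMs: marks.firstAudio - marks.speechEnd,
      };
    },
  };
}
```

With these four fields per turn, the diagnosis rules above become queries: a stable `ttsTtfaMs` alongside a spiking `endpointingMs` points at the VAD, not the TTS provider.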

The 6 Best Speech-to-Speech APIs in 2026

1. Inworld AI — Realtime API + TTS

Best for: Developers building speech-to-speech AI applications
Building speech-to-speech the traditional way means wiring together an STT provider, an LLM, a TTS model, chunking logic, VAD, cancellation handling, and observability. Inworld's Realtime API collapses that entire pipeline into one endpoint: stream audio in, stream high-quality audio back. It is effectively a drop-in voice agent built on Inworld's #1-ranked TTS model and managed for you.
Realtime API
Connect via WebSocket or WebRTC and audio flows in both directions in real time. VAD, end-of-turn detection, streaming, interruption handling, and cancellation are all handled natively server-side. The Realtime API leverages Inworld's high-accuracy streaming speech-to-text model, which captures the profile, context, and state of your users to contextualize responses. This is paired with Inworld's chart-topping speech generation models, namely Inworld TTS-1.5-Max, the fastest and highest-quality model globally (ranked #1 on the Artificial Analysis Speech Arena leaderboard with an ELO of 1,162).
In addition, unlike OpenAI's Realtime API, you're not locked to a single LLM. Inworld's implementation is model-agnostic, leveraging the Inworld Router to route requests across OpenAI, Anthropic, Google, and 200+ models through a single API, with built-in failover, A/B testing, and intelligent model selection without any code changes.
Code Example:
Connect via WebSocket for streaming conversation:
import WebSocket from 'ws'; // Node: npm install ws

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

// One handler for all server events: session setup and agent audio out
ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === 'session.created') {
    ws.send(JSON.stringify({ type: 'session.update', session: { ... } }));
  }
  if (msg.type === 'response.output_audio.delta') {
    playAudio(msg.delta); // base64 PCM16, play chunks as they arrive
  }
});

// Send audio chunks as they arrive from the microphone
function sendAudio(audioChunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: audioChunk // base64-encoded PCM16
  }));
}
For most voice agents, this is where to start. One endpoint replaces weeks of pipeline work.
TTS quality and latency
  • TTS-1.5-Max: sub-250ms P90 time-to-first-audio
  • TTS-1.5-Mini: sub-130ms P90 (trades some quality for speed)
  • Pricing: $10/1M characters (Max), $5/1M characters (Mini). At 100M chars/month, that's $1,000 vs. $20,600 on ElevenLabs.
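The cost comparison above is straightforward arithmetic, worth making explicit because it's the whole case for price-per-character pricing at scale:

```javascript
// Monthly TTS cost = (characters / 1,000,000) * price per 1M characters.
const monthlyCost = (chars, pricePerMillion) => (chars / 1e6) * pricePerMillion;

monthlyCost(100e6, 10);  // Inworld TTS Max at $10/1M: $1,000
monthlyCost(100e6, 206); // ElevenLabs Multilingual v2 at $206/1M: $20,600
```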
Voice cloning
Zero-shot cloning from 5–15 seconds of audio is free and included for all users. No tier gating. Our voice cloning docs include persona recording scripts to improve clone quality.
Enterprise controls
SOC 2 Type II, HIPAA with BAAs, GDPR, zero-retention mode, EU and India data residency, full on-premise deployment. Enterprise pricing via inworld.ai/pricing.
Where it falls short
15 supported languages. For any application requiring broad multilingual coverage today, ElevenLabs (29–74 languages) or Google (75+) is the more honest answer. The audio markup emotion tags ([happy], [whisper], etc.) are production-supported for single use at the start of a generation but experimental for multi-tag mid-text sequences.

2. Deepgram — Flux STT + Aura TTS

Best for: Teams building telephony or call center voice agents where STT accuracy and domain-specific pronunciation matter more than TTS naturalness.
Deepgram's speech-to-speech story is built around two models. Flux is a conversational STT model with integrated end-of-turn detection (~260ms cited), designed so that VAD and endpointing are part of the model rather than bolted on afterward. Aura-2 TTS streams over WebSocket and includes domain-specific pronunciation models for healthcare, finance, and legal. Drug names, financial instruments, legal citations: the terms that general-purpose TTS reliably mispronounces.
Deepgram's Aura-2 doesn't appear in the top 17 of Artificial Analysis' Speech Arena. For a call center transcribing medical dictation, that's a reasonable trade. For a consumer product where voice quality shapes user trust, it isn't.
Pricing: $30/1M characters for TTS. STT pricing varies by model tier.
Where it falls short: In blind preference tests, Aura-2 consistently loses to Inworld and OpenAI TTS on naturalness. Acceptable for internal tooling; a real problem for consumer-facing products. If your vocabulary set is standard English, Deepgram's vertical optimization is solving a problem you don't have.

3. Hume AI — EVI (Empathic Voice Interface)

Best for: AI companions, therapy, coaching, and social applications where emotional responsiveness is the core product.
Hume's EVI is architecturally distinct from everything else here. It's an LLM-backbone TTS that analyzes the user's tone of voice and adjusts response delivery accordingly. You don't write SSML tags or prompt a voice style. The model infers sarcasm, urgency, and warmth from conversational context and responds in kind. The EVI overview describes end-of-turn detection using prosody rather than just silence, which reduces false triggers on natural speech pauses.
The natural language voice control is genuinely differentiating: you describe the voice you want in plain English. "Sound hesitant, like someone delivering bad news." No tags required.
EVI 3 targets under 300ms end-to-end. Pricing is $7.60/1M characters, cheaper than Inworld Max. 11 languages at launch, with expansion announced.
Where it falls short: Hume doesn't appear in Artificial Analysis' top 17. For applications where voice quality on blind preference tests is the differentiator, the leaderboard data isn't there yet. For transactional voice agents (customer service, scheduling, any workflow where the goal is speed and accuracy rather than emotional resonance) Hume's architecture is doing work the use case doesn't require.

4. OpenAI Realtime API

Best for: Teams already on OpenAI's platform who want a working voice agent fast and are willing to pay for the simplicity.
The OpenAI Realtime API was one of the first native audio-in/audio-out endpoints on the market. There's no STT transcript, no text passed to an LLM, no TTS render. The same model handles the full loop. That difference has two concrete effects: it eliminates transcription latency in a composed pipeline, and it means the model can respond to tone, pacing, and affect that text transcription would have stripped out. VAD, function calling, and interruption handling all live in the same API surface, which cuts setup time significantly.
Commit to the Realtime API and you're on GPT-4o. You can't swap to Claude Sonnet if it outperforms on your specific domain, and you can't route to a cheaper model during off-peak hours. At scale, that single-vendor dependency pushes per-minute costs well above what a composed stack would run. Inworld's Realtime API offers the same audio-in/audio-out simplicity without the model lock-in. OpenAI also offers a separate TTS endpoint, gpt-4o-mini-tts, that accepts natural language style instructions ("sound skeptical and measured") rather than SSML or markup tags. It's worth knowing about for batch or non-realtime use cases. It's a standalone product, separate from the Realtime API.
The Realtime API has been GA since August 2025. 50+ language support.
Where it falls short: Model lock-in is the main cost. OpenAI does not offer on-premise deployment or free voice cloning. OpenAI doesn't publish P90 end-to-end latency for the Realtime API the way Inworld does, which makes it harder to set production targets before you've already built against it.

5. ElevenLabs — Voice Conversion and Content Production

Best for: Content production workflows: audiobooks, podcast dubbing, voiceover, any application where the input is a recording and the output is a different voice or language.
ElevenLabs' positioning in 2026 is content creation first, voice agents second. ElevenLabs has a 10,000+ community voice library, a dubbing product that preserves speaker voice across languages, voice isolation for noisy source audio, and sound effects generation. For a real-time voice agent, you don't need any of that.
On Artificial Analysis, ElevenLabs Multilingual v2 sits at #5 (ELO 1,105), Flash v2.5 at #12. The quality is competitive. But at $103–206/1M characters versus Inworld's $10/1M for a higher-ranked model, the price-performance gap is hard to justify for high-volume voice agent applications.
74 languages in v3 (currently alpha) is a genuine differentiator if broad multilingual coverage is a hard requirement today.
Where it falls short: Building a voice agent on ElevenLabs means wiring interruption handling yourself. ElevenLabs has no barge-in model for conversational agents. The WebSocket TTS endpoint streams audio, but there's no session state, no cancellation primitive tied to a conversation context, and no orchestration layer that tracks what the agent is mid-sentence on when a user talks over it. You end up building all of that. At $103–206/1M characters, you're paying a content-creation premium for an infrastructure gap you still have to close. For voice agents specifically, that combination is hard to justify.

6. Cartesia — Sonic Streaming TTS

Best for: Ultra-low latency voice agents in telephony, IVR, and real-time voice interfaces on constrained hardware.
Cartesia Sonic 3 ranks 20th on Artificial Analysis (ELO 1,054) at roughly $46.70/1M characters. The headline number is 40ms time-to-first-audio, about 3x faster than Inworld's Mini model (sub-130ms P90). The underlying architecture is a State Space Model rather than a transformer, which scales linearly with sequence length rather than quadratically — a structural throughput advantage at high concurrency that matters for telephony and IVR deployments handling large numbers of simultaneous sessions.
Where it falls short: If your LLM is adding 700–800ms on top of TTS generation, the difference between 40ms and 130ms is largely absorbed before users hear it. Optimize the full pipeline before treating TTS speed as the primary variable. On languages, Cartesia advertises 40+ but has 15 fully deployed — the same number Inworld supports today. Verify your target languages against the actual supported list, not the roadmap.

Summary Table

Tool | Role in Stack | Transport | Barge-in | Price | Best For
Inworld AI | Realtime API | WebSocket | Built-in (Realtime API) / Graph-level (Runtime) | $5–10/1M for TTS; usage-based for LLMs | Voice agents, consumer AI, cost-sensitive scale
Deepgram | STT + TTS | WebSocket | Flux integrated turn-taking | $30/1M | Telephony, domain-specific vocabulary
Hume EVI | Full STS service | WebSocket | Prosody-based turn detection | $7.60/1M | Companions, coaching, emotional AI
OpenAI Realtime | Native audio-in/out | WebSocket | Built-in VAD + cancellation | Usage-based | OpenAI-native stacks, rapid prototyping
ElevenLabs | TTS / voice conversion | REST + WebSocket | None (conversion, not agent) | $103–206/1M | Content production, dubbing, broad languages
Cartesia | TTS | WebSocket | cancel: true context ID | ~$47/1M | Ultra-low latency, IVR, high concurrency
Build voice agents faster with Inworld AI → Start free today

How to choose

If you're optimizing for quality at scale: Inworld TTS-1.5-Max holds #1 on Artificial Analysis at $10/1M characters. Running 100M characters per month costs $1,000. The same volume on ElevenLabs Multilingual v2 costs $20,600 for a lower-ranked model. The case for Inworld at scale is just math.
If you're optimizing for latency: Cartesia's 40ms TTFA is the fastest in the market. But raw TTFA isn't the only latency that matters. End-of-turn detection and LLM token time matter just as much. If your endpointing adds 500ms and your LLM adds 800ms, the difference between a 40ms and a 250ms TTS barely registers. Optimize the whole pipeline before obsessing over the TTS layer.
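That absorption effect is easy to make concrete. A sketch with illustrative numbers (the component figures are examples, not measurements of any vendor):

```javascript
// End-to-end voice latency is roughly additive across stages, so a TTS
// improvement only matters in proportion to the whole budget.
const totalLatency = (stages) => Object.values(stages).reduce((a, b) => a + b, 0);

const slowTts = { endpointing: 500, llmTtft: 800, ttsTtfa: 250 }; // ms, illustrative
const fastTts = { endpointing: 500, llmTtft: 800, ttsTtfa: 40 };

totalLatency(slowTts); // 1550 ms
totalLatency(fastTts); // 1340 ms: ~14% faster end-to-end, despite a 6x faster TTS
```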
If you need broad language coverage today: ElevenLabs (29–74 languages) or Google Cloud TTS (75+). Inworld's 15 languages cover the major commercial markets, but if you're shipping to Southeast Asia or Eastern Europe, the coverage constraint is real and the roadmap isn't a substitute for a working product. Google Cloud TTS is worth knowing here: WaveNet voices start at $16/1M characters with 1M free characters per month, Standard voices are $4/1M with 4M free monthly, and the newer Chirp 3 HD tier runs $30/1M. For teams already on GCP, it integrates natively with Dialogflow and Cloud Functions without any glue code. The Gemini 2.5 TTS models add natural language prompt-based voice control (similar to OpenAI's instruction field) across 75+ languages. It doesn't rank on Artificial Analysis, but at that language breadth and with hyperscaler infrastructure behind it, it's the practical default for multilingual enterprise deployments.
If you're building emotional AI: Hume EVI's architecture is built for this. Inworld's model delivers strong expressiveness (30% improvement in TTS-1.5 over TTS-1, per Inworld's internal metrics), but emotional context adaptation is central to Hume's product in a way it isn't for Inworld.
If you want to not manage a stack: Inworld's Realtime API is now the cleaner answer here. Send audio in via WebSocket or REST, get audio back, with model flexibility intact. OpenAI's Realtime API gets you to the same place faster if you're already on GPT-4o, but you're locked in from day one.

Why Inworld Is the Default Choice for Voice Agents

Most teams spend weeks wiring together VAD, cancellation, text chunking, and observability before they've built any actual product. The Inworld Realtime API skips all of that. Connect via WebSocket or WebRTC, send audio in, and get production-quality audio back with interruption handling and streaming already sorted. Teams that need more control over the pipeline can use the Agent Runtime, but for most use cases the Realtime API is the faster path.

FAQs

What is a speech-to-speech API?

A speech-to-speech API accepts audio input, processes it, and returns audio output, typically by chaining speech-to-text, a language model, and text-to-speech together. The user experience is conversational: you speak, the system responds in speech. In production, what determines quality is not just model capability but streaming architecture, cancellation behavior, and how well the pipeline handles interruptions. Some platforms offer pre-built endpoints that handle the full pipeline. Inworld's Realtime API, OpenAI's Realtime API, and Hume EVI all fall into this category. Others require you to assemble and wire the components yourself.

How do I choose the right speech-to-speech API?

Start with P90 end-to-end latency, not median or inference-only benchmarks. Then look at whether the API supports streaming output natively (not buffered) and whether barge-in cancellation is documented clearly. For quality evaluation, independent leaderboards like Artificial Analysis Speech Arena are more reliable than vendor-published comparisons. Cost per million characters matters significantly at scale: a 20x price gap between ElevenLabs and Inworld compounds quickly. Language requirements and compliance needs (HIPAA, on-premise) will narrow the field fast.
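Computing that P90 from your own per-turn measurements is a few lines; a nearest-rank sketch, with sample values invented for illustration:

```javascript
// Nearest-rank percentile over per-turn latency samples (ms).
// Median hides tail behavior; P90 is what users actually complain about.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const turns = [320, 310, 1450, 300, 340, 900, 330, 315, 325, 335];
percentile(turns, 50); // 325 ms: the median looks fine
percentile(turns, 90); // 900 ms: the tail tells the real story
```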

Is Inworld AI better than ElevenLabs for voice agents?

For real-time voice agents at scale, yes. Inworld TTS-1.5-Max ranks #1 on Artificial Analysis (ELO 1,162); ElevenLabs Multilingual v2 ranks #5 (ELO 1,105) at $206/1M characters versus Inworld's $10. The Agent Runtime adds orchestration, observability, and cancellation handling that ElevenLabs doesn't offer for voice agent pipelines. Where ElevenLabs is the right choice: applications requiring 29–74 languages, content production workflows (audiobooks, dubbing, podcast generation), or teams that need a 10,000+ voice library without building custom clones. ElevenLabs was built for content creation; Inworld was built for real-time voice agents. The right answer depends on which product you're building.

How does speech-to-speech relate to TTS?

TTS (text-to-speech) is one component of a speech-to-speech system, the step that converts the language model's text output into audio. A speech-to-speech system also requires STT (speech-to-text) to transcribe the user's voice, an LLM to generate a response, and orchestration logic to stream, cancel, and sequence those components in real time. TTS quality is a major input to overall speech-to-speech quality, but a great TTS API plugged into a poorly structured pipeline will still produce a slow, broken experience. Inworld addresses this by pairing its TTS API with a free Agent Runtime that handles the orchestration layer.

If I'm already using OpenAI for LLM, should I use the OpenAI Realtime API?

Model lock-in is the real cost here, and it's worth naming before the convenience argument takes over. The Realtime API commits you to GPT-4o; you can't swap in Claude, Gemini, or other frontier models anywhere in your stack. For teams scaling to millions of interactions, the Inworld Realtime API will typically deliver better quality at lower cost, and switching from OpenAI Realtime to Inworld Realtime is largely a matter of changing the router reference.

How quickly can a voice agent go live with Inworld?

The Realtime API is the fastest path. Connect via WebSocket or WebRTC, make a call, send audio in, get audio back. No pipeline assembly, no template scaffolding required.

How easy is it to switch from the OpenAI Realtime API to the Inworld Realtime API?

If you're already using the OpenAI Realtime API, you can switch to Inworld with minimal code changes. The event schema, session structure, and client/server events are compatible. With the Inworld Realtime API you can also leverage your Inworld Router to have a single voice agent dynamically handle many different user cohorts. A full migration guide is available here: https://docs.inworld.ai/docs/realtime/openai-migration.

What's the best ElevenLabs alternative for voice agents?

$10 versus $206 per million characters, for a higher-ranked model. That's the headline. Inworld TTS-1.5-Max sits at #1 on Artificial Analysis; ElevenLabs Multilingual v2 is at #5 and costs 20x more. The Agent Runtime replaces the orchestration work you'd otherwise build yourself, and the compliance stack (SOC 2, HIPAA, on-premise deployment) covers use cases ElevenLabs can't serve. The only reason to stay on ElevenLabs for a voice agent is language coverage: if you need more than 15 languages today, ElevenLabs or Google Cloud TTS are the more realistic options. For English-primary or limited-language voice agents at any meaningful scale, the switch to Inworld is straightforward.
Copyright © 2021-2026 Inworld AI