A speech-to-speech (S2S) model converts spoken input directly into spoken output without requiring separate transcription or synthesis steps. The best S2S model for a given application depends on one architectural question: do you need a single end-to-end model that handles everything, or do you need the flexibility to control each stage of the pipeline independently?
This guide compares the leading S2S models across both architectures: native multimodal models that process audio-in to audio-out in a single pass, and optimized pipeline systems that chain specialized models (STT, LLM, TTS) with tight integration to minimize latency. We evaluate voice quality, latency, language support, flexibility, and production readiness.
Two S2S Architectures
Every speech-to-speech system falls into one of two camps. The distinction matters because it determines what you can and cannot control.
| Architecture | How It Works | Latency Profile | Trade-off |
|---|---|---|---|
| Native multimodal | Single model processes audio tokens directly. No intermediate text stage. | 160-320ms end-to-end | Lowest latency, but you cannot swap the LLM, tune the voice independently, or insert business logic between stages. |
| Optimized pipeline | Specialized STT, LLM, and TTS models chained with streaming handoffs. | 300-800ms end-to-end (depending on components) | Full control over each stage. Use any LLM, any voice, any language. Higher latency floor, but each component is independently upgradeable. |
Native multimodal wins on raw speed. Optimized pipelines win on flexibility. Most production deployments in 2026 use pipelines because teams need to control which LLM handles reasoning, which voice the user hears, and what business logic runs between transcription and response.
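The control-surface difference is easiest to see in code. The sketch below uses trivial stub functions in place of real STT, LLM, and TTS calls; it shows the shape of a pipeline, not any provider's actual API:

```python
def stt(audio: bytes) -> str:
    # Stub transcription: decodes bytes as text purely for illustration.
    return audio.decode("utf-8")

def llm(prompt: str) -> str:
    # Stub reasoning stage: in a real pipeline this is any model you choose.
    return f"You said: {prompt}"

def tts(text: str) -> bytes:
    # Stub synthesis: a real TTS stage returns audio frames, not text bytes.
    return text.encode("utf-8")

def pipeline_s2s(audio_in: bytes) -> bytes:
    """Optimized pipeline: three independently swappable stages.

    Each handoff is a point where you can swap a component or insert
    business logic. That is exactly the control a native multimodal
    model gives up by mapping audio tokens to audio tokens in one pass.
    """
    text = stt(audio_in)   # stage 1: transcription
    reply = llm(text)      # stage 2: reasoning
    return tts(reply)      # stage 3: synthesis

print(pipeline_s2s(b"hello"))  # b'You said: hello'
```

A native multimodal system collapses all three stages into a single model call, which is why it is faster and why nothing between input and output can be intercepted or replaced.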
Model Comparison
Models are evaluated on voice quality, response latency, architecture type, language support, and production readiness. Quality assessments draw from published benchmarks, independent evaluations, and production deployment data.
1. Inworld S2S
Architecture: Optimized pipeline (Inworld STT + any LLM via Inworld Router + Inworld TTS)
Pros:
- Top-ranked TTS component: Inworld TTS 1.5 Max holds the #1 position on the Artificial Analysis Speech Arena (Elo 1,240, March 2026), with Inworld occupying 3 of the top 5 positions (#1, #3, #4). This is the highest voice quality of any model in this comparison
- Full LLM flexibility: Route through 200+ models via Inworld Router. Swap between GPT-4o, Claude, Gemini, Llama, or any supported model without changing your voice pipeline
- Sub-200ms TTS latency: P90 time-to-first-audio under 200ms on the TTS stage alone. Total pipeline latency of 500-800ms depending on LLM selection
- Streaming-native: WebSocket-based streaming across all three stages. Audio begins generating before the LLM finishes its response
- Voice cloning from 10s of audio: Custom voice profiles for branded experiences without per-voice licensing fees
- Unified billing: Single API, single bill for STT + LLM + TTS. No multi-vendor integration overhead
Cons:
- Pipeline latency floor: Three-stage architecture means 500-800ms minimum end-to-end, roughly 200-400ms slower than native multimodal models
- S2S product launched Q1 2026: Newer unified offering, though individual components (TTS, STT, Router) have longer production track records
Best for: Teams that need the highest voice quality available combined with full control over which LLM handles reasoning. The architecture trade-off (higher latency floor for complete flexibility) makes sense for applications where voice fidelity and LLM choice matter more than shaving 200ms.
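The streaming behavior described above (audio generation starting before the LLM has finished its response) can be sketched with Python generators. These stubs are illustrative stand-ins, not Inworld's actual API:

```python
from typing import Iterator

def llm_stream(prompt: str) -> Iterator[str]:
    # Stub streaming LLM: yields the reply one sentence at a time,
    # the way a real model streams tokens or sentence chunks.
    for sentence in ["Sure.", "Here is the first step.", "Anything else?"]:
        yield sentence

def tts_chunk(sentence: str) -> bytes:
    # Stub TTS: a real stage returns audio; this tags the text instead.
    return f"<audio:{sentence}>".encode("utf-8")

def streaming_pipeline(prompt: str) -> Iterator[bytes]:
    """Yield audio for each sentence as soon as the LLM produces it,
    so playback of sentence 1 overlaps generation of sentence 2."""
    for sentence in llm_stream(prompt):
        yield tts_chunk(sentence)

chunks = streaming_pipeline("hi")
print(next(chunks))  # first audio chunk is available immediately
```

This overlap is what keeps a three-stage pipeline's perceived latency close to its time-to-first-audio rather than the sum of all three stages.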
2. GPT-4o (OpenAI Realtime API)
Architecture: Native multimodal
Pros:
- True audio-in, audio-out: Processes speech natively without intermediate text conversion. Preserves tone, emphasis, and prosody in both directions
- ~320ms average end-to-end latency: Among the fastest production S2S systems. Minimum latency around 232ms
- Function calling support: Can invoke tools and APIs mid-conversation, enabling agent-style workflows
- WebRTC and WebSocket support: Flexible connection options for browser, server, and VoIP integrations
- Strongest reasoning: GPT-4o's language capabilities carry directly into voice interactions
Cons:
- No LLM flexibility: You get GPT-4o. Cannot swap to Claude, Gemini, or open-source models for different use cases
- Limited voice customization: Prompt-based voice styling, but no voice cloning or custom voice creation
- Higher cost at scale: Audio token pricing ($100/1M input tokens, $200/1M output tokens at time of writing) adds up in high-volume conversational applications
- Voice quality trails dedicated TTS: Ranks outside the top 5 on the Artificial Analysis Speech Arena (March 2026), behind Inworld (#1, Elo 1,240) and ElevenLabs Eleven v3 (#2, Elo 1,197)
Best for: Applications where response speed and reasoning depth matter most, and where GPT-4o is already the preferred LLM. Strong for voice agents that need tool calling and complex multi-turn reasoning with minimal latency.
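Function calling over the Realtime API is configured by sending JSON events over the WebSocket. The sketch below only builds a `session.update` payload and never opens a connection; the event shape follows OpenAI's published documentation at the time of writing, and the `get_weather` tool is a hypothetical example, so verify field names against current docs before relying on them:

```python
import json

def build_session_update(voice: str, instructions: str, tools: list) -> str:
    # Event shape based on OpenAI's published Realtime API docs at the
    # time of writing; check current documentation for exact field names.
    event = {
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": instructions,
            "tools": tools,
        },
    }
    return json.dumps(event)

# Hypothetical tool definition for illustration; the name and schema
# are not taken from any real deployment.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

payload = build_session_update(
    voice="alloy",
    instructions="You are a concise voice agent.",
    tools=[weather_tool],
)
print(json.loads(payload)["type"])  # session.update
```

In a live session this string would be sent over the WebSocket after connecting, and the model could then invoke `get_weather` mid-conversation.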
3. Gemini Live API (Google)
Architecture: Native multimodal
Pros:
- True multimodal: Processes audio, images, video, and text simultaneously. Can "see" what a user is looking at while talking
- 70+ language support: Broadest native language coverage of any S2S model
- Barge-in handling: Supports natural interruptions mid-response
- Proactive audio: Can initiate responses based on visual or contextual cues without waiting for a user prompt
- Competitive latency: ~400ms end-to-end in production configurations
Cons:
- Google ecosystem dependency: Tightest integration within Google Cloud. Less flexible for multi-cloud or on-prem deployments
- Voice quality behind dedicated TTS: Ranks well below dedicated TTS providers on the Artificial Analysis Speech Arena. Functional for conversational use but noticeably lower fidelity than top-ranked voice models
- Limited voice customization: No voice cloning. Preset voice options
- Newer production track record: Live API is still maturing relative to OpenAI's Realtime API
Best for: Multimodal applications that combine voice with vision (AR/VR, camera-based assistants, screen-sharing agents). The ability to process audio and images simultaneously is unique among production S2S systems.
4. Moshi (Kyutai)
Architecture: Native multimodal (open-source)
Pros:
- ~200ms end-to-end latency: The fastest S2S model in this comparison, below the roughly 250ms threshold at which humans perceive a response as instantaneous
- Full-duplex conversation: Listens and speaks simultaneously. Handles overlapping speech, back-channels ("uh-huh"), and natural interruptions better than any other model tested
- Open-source (CC-BY 4.0): Full weights and code available. No per-minute API costs. Self-host on a single GPU
- Runs on-device: Optimized builds for Mac (MLX), iPhone 15 Pro, and server GPUs. No cloud dependency required
- Superior turn-taking: 89% interruption accuracy vs. 62% for legacy pipeline systems in published benchmarks
Cons:
- English-only: No multilingual support as of March 2026
- 7B parameter model: Reasoning capabilities are limited compared to GPT-4o or Claude. Not suitable for complex multi-turn agent workflows
- No managed API: Self-hosted only. You handle infrastructure, scaling, and reliability
- Limited voice variety: Small set of available voices. No voice cloning
- No tool/function calling: Cannot invoke external APIs mid-conversation
Best for: On-device and edge deployments where cloud latency is unacceptable. Research teams and startups building custom S2S experiences who want full model access and zero API costs. The full-duplex capability is genuinely best-in-class for natural conversation flow.
5. ElevenLabs Conversational AI
Architecture: Optimized pipeline (third-party STT + configurable LLM + ElevenLabs TTS)
Pros:
- Best voice cloning in market: Industry-leading voice replication from short audio samples. Widest voice library with 1,000+ pre-built options
- #2 voice quality: Eleven v3 ranks #2 on the Artificial Analysis Speech Arena (Elo 1,197, March 2026), a significant quality jump over prior versions and strong on expressive and emotional content
- LLM flexibility: Supports GPT-4o, Claude, Gemini, and custom models as the reasoning layer
- Mature developer ecosystem: Extensive SDKs, documentation, and community tooling
- 30+ language TTS support
Cons:
- Higher cost: Professional plans start around $200/month, and per-character pricing is significantly more expensive than Inworld TTS at scale
- STT is third-party: ElevenLabs does not own the transcription layer. Adds integration complexity and a dependency on external providers
- Pipeline latency: Similar 500-800ms range as other pipeline approaches, without the unified infrastructure advantage
- 43 Elo points below Inworld TTS on voice quality (Elo 1,197 vs. 1,240, March 2026)
Best for: Content creation, dubbing, and applications where voice variety and cloning quality are the primary requirements. Strong choice when voice identity is the product differentiator.
6. Deepgram Voice Agent API
Architecture: Optimized pipeline (Deepgram Nova STT + configurable LLM + Deepgram Aura TTS)
Pros:
- Best-in-class STT accuracy: Nova-3 consistently leads independent benchmarks on real-world audio (noisy environments, accents, overlapping speech)
- Sub-300ms STT latency: Among the fastest transcription engines available
- Competitive TTS pricing: ~$0.030/1K characters for Aura-2
- Enterprise features: On-prem deployment, domain-specific models, custom vocabulary
- Strong telephony integration: Purpose-built for call center and contact center workflows
Cons:
- TTS quality gap: Aura-2 is functional but not competitive with Inworld TTS or ElevenLabs on naturalness and expressiveness
- Narrower voice selection: Fewer pre-built voices, no voice cloning
- STT-first company: TTS is a newer product line. Less mature than the transcription offering
Best for: Contact center and telephony applications where transcription accuracy is the top priority. The STT quality is genuinely the best available for noisy, real-world audio environments.
Comparison Table
| Model | Architecture | End-to-End Latency | Voice Quality | LLM Flexibility | Voice Cloning | Languages | Deployment |
|---|---|---|---|---|---|---|---|
| Inworld S2S | Pipeline | 500-800ms | #1 (Elo 1,240) | 200+ models | Yes (10s audio) | Multi | Cloud API |
| GPT-4o Realtime | Native multimodal | ~320ms | Outside top 5 | GPT-4o only | No | 50+ | Cloud API |
| Gemini Live | Native multimodal | ~400ms | Outside top 5 | Gemini only | No | 70+ | Cloud API |
| Moshi | Native multimodal | ~200ms | MOS 4.3/5 | Moshi only (7B) | No | English | Self-hosted |
| ElevenLabs | Pipeline | 500-800ms | #2 (Elo 1,197) | Multi-LLM | Yes (best-in-class) | 30+ | Cloud API |
| Deepgram | Pipeline | 500-800ms | Functional | Multi-LLM | No | 36+ | Cloud + on-prem |
Voice quality rankings from the Artificial Analysis Speech Arena (March 2026). Latency figures from published benchmarks and production data.
How to Choose
Lowest possible latency + on-device: Moshi. 200ms, runs on a single GPU or iPhone 15 Pro, open-source. Limited to English and 7B reasoning, but nothing else matches the speed.
Best voice quality + LLM flexibility: Inworld S2S. #1 ranked TTS with access to 200+ LLMs through a single pipeline. You pay 200-400ms more latency for full control over every component.
Fastest cloud S2S with strong reasoning: GPT-4o Realtime API. ~320ms with GPT-4o's full reasoning capabilities. Accept the vendor lock-in and limited voice customization.
Multimodal (voice + vision): Gemini Live. The only production S2S system that processes images and audio simultaneously. Essential for AR/VR and camera-based applications.
Voice cloning and variety: ElevenLabs. Largest voice library, best cloning quality. Costs more at scale, but voice identity is the product.
Contact center / telephony: Deepgram. Best STT accuracy on noisy real-world audio. Purpose-built for the enterprise voice agent workflow.
The Architecture Decision
The native vs. pipeline choice comes down to one question: do you need to control the LLM?
If your application is a general-purpose voice assistant and you are committed to a single LLM provider, native multimodal (GPT-4o or Gemini) delivers the lowest latency with the simplest integration.
If you need to choose your LLM based on the task (route complex queries to Claude, simple ones to a fast open-source model), customize the voice, or insert business logic between transcription and response, a pipeline is the only viable path. The latency cost is real but manageable for most conversational applications, where human perception tolerates up to 800ms without friction.
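Task-based routing of the kind described here can be as simple as a rule applied to the transcribed turn. A minimal sketch, with placeholder model identifiers and an illustrative heuristic rather than any specific router's API:

```python
STRONG_MODEL = "strong-reasoning-model"  # placeholder identifier
FAST_MODEL = "fast-cheap-model"          # placeholder identifier

def route(transcript: str) -> str:
    """Pick a reasoning model from the transcribed user turn.

    Illustrative heuristic, not a production policy: long or
    multi-step queries go to the stronger model, everything else
    to the fast, cheap one.
    """
    lowered = transcript.lower()
    multi_step = any(
        cue in lowered for cue in ("compare", "step by step", "explain why")
    )
    long_query = len(transcript.split()) > 30
    return STRONG_MODEL if multi_step or long_query else FAST_MODEL

print(route("what time is it"))                  # fast-cheap-model
print(route("compare these two billing plans"))  # strong-reasoning-model
```

Because this decision sits between the STT and LLM stages, it is only possible in a pipeline architecture; a native multimodal model exposes no seam where such logic could run.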
Inworld's pipeline approach is worth particular attention here because it consolidates all three stages under one provider. Most pipeline S2S systems require stitching together STT from one vendor, LLM from another, and TTS from a third. Inworld provides all three plus the routing layer, which removes the integration overhead that makes pipelines brittle in production.
FAQ
What is a speech-to-speech model?
A speech-to-speech model converts spoken audio input into spoken audio output. Some models do this natively in a single pass (GPT-4o, Moshi), while others chain specialized models for transcription, reasoning, and voice synthesis. Both approaches produce real-time conversational voice interactions.
Which S2S model has the lowest latency?
Moshi by Kyutai achieves ~200ms end-to-end latency, the fastest of any S2S model. Among cloud APIs, GPT-4o Realtime averages ~320ms. Pipeline-based systems like Inworld S2S and ElevenLabs typically range from 500-800ms depending on the LLM selected.
Can I use my own LLM with a speech-to-speech system?
Only with pipeline-based S2S systems. Inworld S2S routes through 200+ models via its Router, and ElevenLabs Conversational AI supports multiple LLM providers. Native multimodal models (GPT-4o, Gemini, Moshi) use their own built-in reasoning and cannot be swapped.
Is native multimodal always better than a pipeline?
No. Native multimodal is faster but less flexible. Pipeline systems let you control the LLM, customize the voice, and insert business logic between stages. Most production deployments in 2026 use pipelines because teams need that control. The 200-400ms latency difference is imperceptible in many conversational applications.
Which S2S model has the best voice quality?
Inworld TTS 1.5 Max, used in the Inworld S2S pipeline, holds the #1 ranking on the Artificial Analysis Speech Arena (Elo 1,240, March 2026). ElevenLabs Eleven v3 ranks #2 (Elo 1,197). Native multimodal models like GPT-4o and Gemini rank outside the top 5, producing lower-fidelity voice output than dedicated TTS models.
Published by Inworld AI. Voice quality rankings from the Artificial Analysis Speech Arena (March 2026). Latency benchmarks from published documentation and production measurements. Pricing reflects published rates as of March 2026 and may change.