A speech-to-speech (S2S) model converts spoken input directly into spoken output without requiring separate transcription or synthesis steps. The best S2S model for a given application depends on one architectural question: do you need a single end-to-end model that handles everything, or do you need the flexibility to control each stage of the pipeline independently?
This guide compares the leading S2S models across both architectures: native multimodal models that process audio-in to audio-out in a single pass, and optimized pipeline systems that chain specialized models (STT, LLM, TTS) with tight integration to minimize latency. We evaluate voice quality, latency, language support, flexibility, and production readiness.
Two S2S Architectures
Every speech-to-speech system falls into one of two camps. The distinction matters because it determines what you can and cannot control.
| Architecture | How It Works | Latency Profile | Trade-off |
|---|---|---|---|
| Native multimodal | Single model processes audio tokens directly. No intermediate text stage. | 160-320ms end-to-end | Lowest latency, but you cannot swap the LLM, tune the voice independently, or insert business logic between stages. |
| Optimized pipeline | Specialized STT, LLM, and TTS models chained with streaming handoffs. | 300-800ms end-to-end (depending on components) | Full control over each stage. Use any LLM, any voice, any language. Higher latency floor, but each component is independently upgradeable. |
Native multimodal wins on raw speed. Optimized pipelines win on flexibility. Most production deployments in 2026 use pipelines because teams need to control which LLM handles reasoning, which voice the user hears, and what business logic runs between transcription and response.
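The latency trade-off above can be expressed as a simple budget: with streaming handoffs, a pipeline's end-to-end floor is roughly the sum of each stage's time-to-first-output, while a native model has a single inference step. A minimal sketch, using illustrative stage latencies rather than measured figures:

```python
# Rough latency-budget sketch for the pipeline architecture.
# All figures are illustrative assumptions, not benchmarks.

def pipeline_floor_ms(stt_ms: float, llm_ttft_ms: float, tts_ttfa_ms: float) -> float:
    """End-to-end floor for a streaming STT -> LLM -> TTS pipeline.

    With streaming handoffs, each stage only needs to emit its *first*
    output before the next stage can start, so the floor is the sum of
    per-stage first-output latencies, not total processing times.
    """
    return stt_ms + llm_ttft_ms + tts_ttfa_ms

# Example: fast STT (~100ms), LLM time-to-first-token (~250ms),
# TTS time-to-first-audio (~150ms) gives a ~500ms floor, at the low
# end of the 500-800ms pipeline range cited above.
print(pipeline_floor_ms(100, 250, 150))  # 500.0
```

This is why swapping the LLM changes a pipeline's latency profile: the LLM's time-to-first-token is usually the largest term in the sum.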
Model Comparison
Models are evaluated on voice quality, response latency, architecture type, language support, and production readiness. Quality assessments draw from published benchmarks, independent evaluations, and production deployment data.
1. Inworld Realtime API
Architecture: Optimized pipeline (Inworld STT + any LLM via Inworld Router + Inworld TTS)
Pros:
- Top-ranked TTS component: Inworld TTS 1.5 Max holds the #1 position on the Artificial Analysis Speech Arena (Elo 1,236, March 2026), with Inworld occupying 3 of the top 5 positions (#1, #3, #5). The highest voice quality of any model in the comparison
- Full LLM flexibility: Route through hundreds of models via Inworld Router. Swap between OpenAI, Anthropic, Google, or any supported model without changing your voice pipeline
- Sub-200ms TTS latency: P90 time-to-first-audio under 200ms on the TTS stage alone. Total pipeline latency of 500-800ms depending on LLM selection
- Streaming-native: WebSocket-based streaming across all three stages. Audio begins generating before the LLM finishes its response
- Voice cloning from 5-15 seconds of audio: Custom voice profiles for branded experiences without per-voice licensing fees
- Unified billing: Single API, single bill for STT + LLM + TTS. No multi-vendor integration overhead
Cons:
- Pipeline latency floor: Three-stage architecture means 500-800ms minimum end-to-end, roughly 200-400ms slower than native multimodal models
- S2S product launched Q1 2026: Newer unified offering, though individual components (TTS, STT, Router) have longer production track records
Best for: Teams that need the highest voice quality available combined with full control over which LLM handles reasoning. The architecture trade-off (higher latency floor for complete flexibility) makes sense for applications where voice fidelity and LLM choice matter more than shaving 200ms.
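The streaming handoff described above, where audio begins generating before the LLM finishes its response, can be illustrated with async generators. The stage functions below are hypothetical stand-ins, not Inworld API calls:

```python
import asyncio

# Minimal sketch of a streaming LLM -> TTS handoff.
# llm_tokens and tts_chunks are hypothetical stand-ins, not real API calls.

async def llm_tokens(prompt: str):
    """Pretend LLM that streams a response token by token."""
    for token in ["Hello", " there", "!"]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token

async def tts_chunks(token_stream):
    """Pretend TTS that emits an audio chunk per incoming token.

    Because it consumes the token stream incrementally, the first audio
    chunk is produced before the LLM has finished its full response.
    """
    async for token in token_stream:
        yield f"<audio:{token.strip()}>"

async def main():
    return [chunk async for chunk in tts_chunks(llm_tokens("hi"))]

print(asyncio.run(main()))  # ['<audio:Hello>', '<audio:there>', '<audio:!>']
```

In a real pipeline the same shape applies over WebSockets: each stage forwards partial output downstream instead of waiting for a complete result.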
2. OpenAI Realtime API
Architecture: Native multimodal
Pros:
- True audio-in, audio-out: Processes speech natively without intermediate text conversion. Preserves tone, emphasis, and prosody in both directions
- ~320ms average end-to-end latency: Among the fastest production speech-to-speech systems. Minimum latency around 232ms
- Function calling support: Can invoke tools and APIs mid-conversation, with MCP and SIP integration, enabling agent-style workflows
- WebRTC and WebSocket support: Flexible connection options for browser, server, and VoIP integrations
- Strong reasoning: OpenAI's language capabilities carry directly into voice interactions. The API is now GA
Cons:
- No LLM flexibility: Locked to OpenAI's models. Cannot swap to Claude, Gemini, or open-source models for different use cases
- Limited voice customization: 9 built-in voices with prompt-based voice styling, but no voice cloning or custom voice creation
- Higher cost at scale: Audio token pricing adds up in high-volume conversational applications
- Voice quality trails dedicated TTS models: Ranks outside the top 5 on the Artificial Analysis Speech Arena, behind Inworld (#1, Elo 1,236) and other dedicated TTS providers
Best for: Applications where response speed and reasoning depth matter most, and where OpenAI is already the preferred LLM provider. Strong for voice agents that need tool calling and complex multi-turn reasoning with minimal latency.
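The agent-style workflow mentioned above comes down to dispatching model-emitted tool calls to local functions. A minimal local sketch; the registry and call format here are illustrative, not the OpenAI Realtime API's actual wire format:

```python
# Sketch of agent-style tool dispatch during a voice conversation.
# The tool registry and call shape are illustrative assumptions,
# not the Realtime API's actual event schema.

TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "book_table": lambda name, size: f"Booked for {size} under {name}",
}

def dispatch(call: dict) -> str:
    """Route a model-emitted tool call to the matching local function."""
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

print(dispatch({"name": "get_weather", "arguments": {"city": "Oslo"}}))
# -> Sunny in Oslo
```

In production, the model emits the call mid-conversation, the application runs the function, and the result is fed back so the spoken response can incorporate it.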
3. Gemini Live API (Google)
Architecture: Native multimodal (Gemini 3.1 Flash Live)
Pros:
- True multimodal: Processes audio, images, video, and text simultaneously. Can "see" what a user is looking at while talking
- 70+ language support: Broadest native language coverage of any speech-to-speech system
- Barge-in handling: Supports natural interruptions mid-response
- Proactive audio: Can initiate responses based on visual or contextual cues without waiting for a user prompt
- Competitive latency: ~400ms end-to-end in production configurations
Cons:
- Google ecosystem dependency: Tightest integration within Google Cloud. Less flexible for multi-cloud or on-prem deployments
- Voice quality behind dedicated TTS: Ranks well below dedicated TTS providers on the Artificial Analysis Speech Arena. Functional for conversational use but noticeably lower fidelity than top-ranked voice models
- Limited voice customization: Preset voices only, with no voice cloning
- Newer production track record: Live API is still maturing relative to OpenAI's Realtime API
Best for: Multimodal applications that combine voice with vision (AR/VR, camera-based assistants, screen-sharing agents). The ability to process audio and images simultaneously is unique among production S2S systems.
4. Moshi (Kyutai)
Architecture: Native multimodal (open-source)
Pros:
- ~200ms end-to-end latency: The fastest S2S model in this comparison. Below the 250ms threshold humans perceive as instantaneous
- Full-duplex conversation: Listens and speaks simultaneously. Handles overlapping speech, back-channels ("uh-huh"), and natural interruptions better than any other model tested
- Open-source (CC-BY 4.0): Full weights and code available. No per-minute API costs. Self-host on a single GPU
- Runs on-device: Optimized builds for Mac (MLX), iPhone 15 Pro, and server GPUs. No cloud dependency required
- Superior turn-taking: 89% interruption accuracy vs. 62% for legacy pipeline systems in published benchmarks
Cons:
- English-only: No multilingual support as of March 2026
- 7B parameter model: Reasoning capabilities are limited compared to frontier models. Not suitable for complex multi-turn agent workflows
- No managed API: Self-hosted only. You handle infrastructure, scaling, and reliability
- Limited voice variety: Small set of available voices. No voice cloning
- No tool/function calling: Cannot invoke external APIs mid-conversation
Best for: On-device and edge deployments where cloud latency is unacceptable. Research teams and startups building custom S2S experiences who want full model access and zero API costs. The full-duplex capability is genuinely best-in-class for natural conversation flow.
Comparison Table
| Model | Architecture | End-to-End Latency | Voice Quality | LLM Flexibility | Voice Cloning | Languages | Deployment |
|---|---|---|---|---|---|---|---|
| Inworld Realtime API | Pipeline | 500-800ms | #1 (Elo 1,236) | Hundreds | Yes (5-15s audio) | 15 | Cloud API |
| OpenAI Realtime | Native multimodal | ~320ms | Outside top 5 | OpenAI only | No | 50+ | Cloud API |
| Gemini Live | Native multimodal | ~400ms | Outside top 5 | Gemini only | No | 70+ | Cloud API |
| Moshi | Native multimodal | ~200ms | MOS 4.3/5 | Moshi only (7B) | No | English | Self-hosted |
Voice quality rankings from the Artificial Analysis Speech Arena (March 2026). Latency figures from published benchmarks and production data.
How to Choose
Lowest possible latency + on-device: Moshi. 200ms, runs on a single GPU or iPhone 15 Pro, open-source. Limited to English and 7B reasoning, but nothing else matches the speed.
Best voice quality + LLM flexibility: Inworld Realtime API. #1 ranked TTS with access to hundreds of LLMs through a single pipeline. You pay 200-400ms more latency for full control over every component.
Fastest cloud speech-to-speech with strong reasoning: OpenAI Realtime API. ~320ms with OpenAI's full reasoning capabilities. Accept the vendor lock-in and the limitation to 9 voices.
Multimodal (voice + vision): Gemini Live. The only production S2S system that processes images and audio simultaneously. Essential for AR/VR and camera-based applications.
The Architecture Decision
The native vs. pipeline choice comes down to one question: do you need to control the LLM?
If your application is a general-purpose voice assistant and you are committed to a single LLM provider, native multimodal (OpenAI Realtime or Gemini Live) delivers the lowest latency with the simplest integration.
If you need to choose your LLM based on the task (route complex queries to Claude, simple ones to a fast open-source model), customize the voice, or insert business logic between transcription and response, a pipeline is the only viable path. The latency cost is real but manageable for most conversational applications, where human perception tolerates up to 800ms without friction.
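The task-based routing described here can be sketched as a simple dispatch function. The model names and the word-count heuristic below are assumptions for illustration; real routers use richer signals such as intent classification or cost budgets:

```python
# Illustrative LLM routing sketch: send complex queries to a stronger
# model and short, simple ones to a faster one. Model names and the
# word-count complexity proxy are assumptions, not a real router's logic.

FAST_MODEL = "small-open-source-model"
STRONG_MODEL = "frontier-model"

def route(query: str, threshold_words: int = 12) -> str:
    """Pick a model based on a crude complexity proxy (query length)."""
    is_complex = len(query.split()) > threshold_words
    return STRONG_MODEL if is_complex else FAST_MODEL

print(route("What time is it?"))  # -> small-open-source-model
print(route("Compare the tax implications of forming an LLC versus "
            "an S-corp for a two-person consulting business"))
# -> frontier-model
```

This kind of per-request decision is exactly what a native multimodal model cannot expose: the reasoning stage is fused into the audio model and cannot be swapped per query.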
Inworld's pipeline approach is worth particular attention here because it consolidates all three stages under one provider. Most pipeline S2S systems require stitching together STT from one vendor, LLM from another, and TTS from a third. Inworld provides all three plus the routing layer, which removes the integration overhead that makes pipelines brittle in production.
FAQ
What is a speech-to-speech model?
A speech-to-speech model converts spoken audio input into spoken audio output. Some models do this natively in a single pass (OpenAI Realtime, Gemini Live, Moshi), while others chain specialized models for transcription, reasoning, and voice synthesis. Both approaches produce real-time conversational voice interactions.
Which speech-to-speech system has the lowest latency?
Moshi by Kyutai achieves ~200ms end-to-end latency, the fastest of any speech-to-speech model. Among cloud APIs, OpenAI's Realtime API averages ~320ms. Pipeline-based systems like the Inworld Realtime API typically range from 500-800ms depending on the LLM selected.
Can I use my own LLM with a speech-to-speech system?
Only with pipeline-based systems. The Inworld Realtime API routes through hundreds of models via its Router, giving you full LLM flexibility. Native multimodal models (OpenAI Realtime, Gemini, Moshi) use their own built-in reasoning and cannot be swapped.
Is native multimodal always better than a pipeline?
No. Native multimodal is faster but less flexible. Pipeline systems let you control the LLM, customize the voice, and insert business logic between stages. Most production deployments in 2026 use pipelines because teams need that control. The 200-400ms latency difference is imperceptible in many conversational applications.
Which speech-to-speech system has the best voice quality?
Inworld TTS 1.5 Max, used in the Inworld Realtime API pipeline, holds the #1 ranking on the Artificial Analysis Speech Arena (Elo 1,236). Native multimodal models like OpenAI Realtime and Gemini Live rank outside the top 5, producing lower-fidelity voice output than dedicated TTS models.
Published by Inworld AI. Voice quality rankings from the Artificial Analysis Speech Arena (March 2026). Latency benchmarks from published documentation and production measurements. Pricing reflects published rates as of March 2026 and may change.