A speech-to-speech (S2S) model converts spoken input directly into spoken output without requiring separate transcription or synthesis steps. The best S2S model for a given application depends on one architectural question: do you need a single end-to-end model that handles everything, or do you need the flexibility to control each stage of the pipeline independently?
This guide compares the leading S2S models across both architectures: native multimodal models that process audio-in to audio-out in a single pass, and optimized pipeline systems that chain specialized models (STT, LLM, TTS) with tight integration to minimize latency. We evaluate voice quality, latency, language support, flexibility, and production readiness.
Two S2S Architectures
Every speech-to-speech system falls into one of two camps. The distinction matters because it determines what you can and cannot control.
| Architecture | How It Works | Latency Profile | Trade-off |
|---|---|---|---|
| Native multimodal | Single model processes audio tokens directly. No intermediate text stage. | 160-320ms end-to-end | Lowest latency, but you cannot swap the LLM, tune the voice independently, or insert business logic between stages. |
| Optimized pipeline | Specialized STT, LLM, and TTS models chained with streaming handoffs. | 300-800ms end-to-end (depending on components) | Full control over each stage. Use any LLM, any voice, any language. Higher latency floor, but each component is independently upgradeable. |
Native multimodal wins on raw speed. Optimized pipelines win on flexibility. Most production deployments in 2026 use pipelines because teams need to control which LLM handles reasoning, which voice the user hears, and what business logic runs between transcription and response.
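The control-surface difference is easiest to see in code. The sketch below uses trivial stub functions in place of real STT, LLM, and TTS calls; it shows the shape of a pipeline, not any provider's actual API:

```python
def stt(audio: bytes) -> str:
    # Stub transcription: decodes bytes as text purely for illustration.
    return audio.decode("utf-8")

def llm(prompt: str) -> str:
    # Stub reasoning stage: in a real pipeline this is any model you choose.
    return f"You said: {prompt}"

def tts(text: str) -> bytes:
    # Stub synthesis: a real TTS stage returns audio frames, not text bytes.
    return text.encode("utf-8")

def pipeline_s2s(audio_in: bytes) -> bytes:
    """Optimized pipeline: three independently swappable stages.

    Each handoff is a point where you can swap a component or insert
    business logic. That is exactly the control a native multimodal
    model gives up by mapping audio tokens to audio tokens in one pass.
    """
    text = stt(audio_in)   # stage 1: transcription
    reply = llm(text)      # stage 2: reasoning
    return tts(reply)      # stage 3: synthesis

print(pipeline_s2s(b"hello"))  # b'You said: hello'
```

A native multimodal system collapses all three stages into a single model call, which is why it is faster and why nothing between input and output can be intercepted or replaced.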
Model Comparison
Models are evaluated on voice quality, response latency, architecture type, language support, and production readiness. Quality assessments draw from published benchmarks, independent evaluations, and production deployment data.
1. Inworld S2S
Architecture: Optimized pipeline (Inworld STT + any LLM via Inworld Router + Inworld TTS)
Pros:
- Top-ranked TTS component: Inworld TTS 1.5 Max holds the #1 position on the Artificial Analysis Speech Arena (Elo 1,240, March 2026), with Inworld occupying 3 of the top 5 positions (#1, #3, #4). This is the highest voice quality of any model in this comparison
- Full LLM flexibility: Route through 200+ models via Inworld Router. Swap between GPT-4o, Claude, Gemini, Llama, or any supported model without changing your voice pipeline
- Sub-200ms TTS latency: P90 time-to-first-audio under 200ms on the TTS stage alone. Total pipeline latency of 500-800ms depending on LLM selection
- Streaming-native: WebSocket-based streaming across all three stages. Audio begins generating before the LLM finishes its response
- Voice cloning from 10s of audio: Custom voice profiles for branded experiences without per-voice licensing fees
- Unified billing: Single API, single bill for STT + LLM + TTS. No multi-vendor integration overhead
Cons:
- Pipeline latency floor: Three-stage architecture means 500-800ms minimum end-to-end, roughly 200-400ms slower than native multimodal models
- S2S product launched Q1 2026: Newer unified offering, though individual components (TTS, STT, Router) have longer production track records
Best for: Teams that need the highest voice quality available combined with full control over which LLM handles reasoning. The architecture trade-off (higher latency floor for complete flexibility) makes sense for applications where voice fidelity and LLM choice matter more than shaving 200ms.
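The streaming behavior described above (audio generation starting before the LLM has finished its response) can be sketched with Python generators. These stubs are illustrative stand-ins, not Inworld's actual API:

```python
from typing import Iterator

def llm_stream(prompt: str) -> Iterator[str]:
    # Stub streaming LLM: yields the reply one sentence at a time,
    # the way a real model streams tokens or sentence chunks.
    for sentence in ["Sure.", "Here is the first step.", "Anything else?"]:
        yield sentence

def tts_chunk(sentence: str) -> bytes:
    # Stub TTS: a real stage returns audio; this tags the text instead.
    return f"<audio:{sentence}>".encode("utf-8")

def streaming_pipeline(prompt: str) -> Iterator[bytes]:
    """Yield audio for each sentence as soon as the LLM produces it,
    so playback of sentence 1 overlaps generation of sentence 2."""
    for sentence in llm_stream(prompt):
        yield tts_chunk(sentence)

chunks = streaming_pipeline("hi")
print(next(chunks))  # first audio chunk is available immediately
```

This overlap is what keeps a three-stage pipeline's perceived latency close to its time-to-first-audio rather than the sum of all three stages.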
2. GPT-4o (OpenAI Realtime API)
Architecture: Native multimodal
Pros:
- True audio-in, audio-out: Processes speech natively without intermediate text conversion. Preserves tone, emphasis, and prosody in both directions
- ~320ms average end-to-end latency: Among the fastest production S2S systems. Minimum latency around 232ms
- Function calling support: Can invoke tools and APIs mid-conversation, enabling agent-style workflows
- WebRTC and WebSocket support: Flexible connection options for browser, server, and VoIP integrations
- Strongest reasoning: GPT-4o's language capabilities carry directly into voice interactions
Cons:
- No LLM flexibility: You get GPT-4o. Cannot swap to Claude, Gemini, or open-source models for different use cases
- Limited voice customization: Prompt-based voice styling, but no voice cloning or custom voice creation
- Higher cost at scale: Audio token pricing ($100/1M input tokens, $200/1M output tokens at time of writing) adds up in high-volume conversational applications
- Voice quality trails dedicated TTS: Ranks outside the top 5 on the Artificial Analysis Speech Arena (March 2026), behind Inworld (#1, Elo 1,240) and ElevenLabs Eleven v3 (#2, Elo 1,197)
Best for: Applications where response speed and reasoning depth matter most, and where GPT-4o is already the preferred LLM. Strong for voice agents that need tool calling and complex multi-turn reasoning with minimal latency.
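Function calling over the Realtime API is configured by sending JSON events over the WebSocket. The sketch below only builds a `session.update` payload and never opens a connection; the event shape follows OpenAI's published documentation at the time of writing, and the `get_weather` tool is a hypothetical example, so verify field names against current docs before relying on them:

```python
import json

def build_session_update(voice: str, instructions: str, tools: list) -> str:
    # Event shape based on OpenAI's published Realtime API docs at the
    # time of writing; check current documentation for exact field names.
    event = {
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": instructions,
            "tools": tools,
        },
    }
    return json.dumps(event)

# Hypothetical tool definition for illustration; the name and schema
# are not taken from any real deployment.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

payload = build_session_update(
    voice="alloy",
    instructions="You are a concise voice agent.",
    tools=[weather_tool],
)
print(json.loads(payload)["type"])  # session.update
```

In a live session this string would be sent over the WebSocket after connecting, and the model could then invoke `get_weather` mid-conversation.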
3. Gemini Live API (Google)
Architecture: Native multimodal
Pros:
- True multimodal: Processes audio, images, video, and text simultaneously. Can "see" what a user is looking at while talking
- 70+ language support: Broadest native language coverage of any S2S model
- Barge-in handling: Supports natural interruptions mid-response
- Proactive audio: Can initiate responses based on visual or contextual cues without waiting for a user prompt
- Competitive latency: ~400ms end-to-end in production configurations
Cons:
- Google ecosystem dependency: Tightest integration within Google Cloud. Less flexible for multi-cloud or on-prem deployments
- Voice quality behind dedicated TTS: Ranks well below dedicated TTS providers on the Artificial Analysis Speech Arena. Functional for conversational use but noticeably lower fidelity than top-ranked voice models
- Limited voice customization: No voice cloning. Preset voice options
- Newer production track record: Live API is still maturing relative to OpenAI's Realtime API
Best for: Multimodal applications that combine voice with vision (AR/VR, camera-based assistants, screen-sharing agents). The ability to process audio and images simultaneously is unique among production S2S systems.
4. Moshi (Kyutai)
Architecture: Native multimodal (open-source)
Pros:
- ~200ms end-to-end latency: The fastest S2S model in this comparison, below the roughly 250ms threshold at which humans perceive a response as instantaneous
- Full-duplex conversation: Listens and speaks simultaneously. Handles overlapping speech, back-channels ("uh-huh"), and natural interruptions better than any other model tested
- Open-source (CC-BY 4.0): Full weights and code available. No per-minute API costs. Self-host on a single GPU
- Runs on-device: Optimized builds for Mac (MLX), iPhone 15 Pro, and server GPUs. No cloud dependency required
- Superior turn-taking: 89% interruption accuracy vs. 62% for legacy pipeline systems in published benchmarks
Cons:
- English-only: No multilingual support as of March 2026
- 7B parameter model: Reasoning capabilities are limited compared to GPT-4o or Claude. Not suitable for complex multi-turn agent workflows
- No managed API: Self-hosted only. You handle infrastructure, scaling, and reliability
- Limited voice variety: Small set of available voices. No voice cloning
- No tool/function calling: Cannot invoke external APIs mid-conversation
Best for: On-device and edge deployments where cloud latency is unacceptable. Research teams and startups building custom S2S experiences who want full model access and zero API costs. The full-duplex capability is genuinely best-in-class for natural conversation flow.
5. ElevenLabs Conversational AI
Architecture: Optimized pipeline (third-party STT + configurable LLM + ElevenLabs TTS)
Pros:
- Best voice cloning in market: Industry-leading voice replication from short audio samples. Widest voice library with 1,000+ pre-built options
- #2 voice quality: Eleven v3 ranks #2 on the Artificial Analysis Speech Arena (Elo 1,197, March 2026), a significant quality jump over prior versions and strong on expressive and emotional content
- LLM flexibility: Supports GPT-4o, Claude, Gemini, and custom models as the reasoning layer
- Mature developer ecosystem: Extensive SDKs, documentation, and community tooling
- 30+ language TTS support
Cons:
- Higher cost: Professional plans start around $200/month, and per-character pricing is significantly more expensive than Inworld TTS at scale
- STT is third-party: ElevenLabs does not own the transcription layer. Adds integration complexity and a dependency on external providers
- Pipeline latency: Similar 500-800ms range as other pipeline approaches, without the unified infrastructure advantage
- 43 Elo points below Inworld TTS on voice quality (Elo 1,197 vs. 1,240, March 2026)
Best for: Content creation, dubbing, and applications where voice variety and cloning quality are the primary requirements. Strong choice when voice identity is the product differentiator.
6. Deepgram Voice Agent API
Architecture: Optimized pipeline (Deepgram Nova STT + configurable LLM + Deepgram Aura TTS)
Pros:
- Best-in-class STT accuracy: Nova-3 consistently leads independent benchmarks on real-world audio (noisy environments, accents, overlapping speech)
- Sub-300ms STT latency: Among the fastest transcription engines available
- Competitive TTS pricing: ~$0.030/1K characters for Aura-2
- Enterprise features: On-prem deployment, domain-specific models, custom vocabulary
- Strong telephony integration: Purpose-built for call center and contact center workflows
Cons:
- TTS quality gap: Aura-2 is functional but not competitive with Inworld TTS or ElevenLabs on naturalness and expressiveness
- Narrower voice selection: Fewer pre-built voices, no voice cloning
- STT-first company: TTS is a newer product line. Less mature than the transcription offering
Best for: Contact center and telephony applications where transcription accuracy is the top priority. The STT quality is genuinely the best available for noisy, real-world audio environments.
Comparison Table
| Model | Architecture | End-to-End Latency | Voice Quality | LLM Flexibility | Voice Cloning | Languages | Deployment |
|---|---|---|---|---|---|---|---|
| Inworld S2S | Pipeline | 500-800ms | #1 (Elo 1,240) | 200+ models | Yes (10s audio) | Multi | Cloud API |
| GPT-4o Realtime | Native multimodal | ~320ms | Outside top 5 | GPT-4o only | No | 50+ | Cloud API |
| Gemini Live | Native multimodal | ~400ms | Outside top 5 | Gemini only | No | 70+ | Cloud API |
| Moshi | Native multimodal | ~200ms | MOS 4.3/5 | Moshi only (7B) | No | English | Self-hosted |
| ElevenLabs | Pipeline | 500-800ms | #2 (Elo 1,197) | Multi-LLM | Yes (best-in-class) | 30+ | Cloud API |
| Deepgram | Pipeline | 500-800ms | Functional | Multi-LLM | No | 36+ | Cloud + on-prem |
Voice quality rankings from the Artificial Analysis Speech Arena (March 2026). Latency figures from published benchmarks and production data.
How to Choose
Lowest possible latency + on-device: Moshi. 200ms, runs on a single GPU or iPhone 15 Pro, open-source. Limited to English and 7B reasoning, but nothing else matches the speed.
Best voice quality + LLM flexibility: Inworld S2S. #1 ranked TTS with access to 200+ LLMs through a single pipeline. You pay 200-400ms more latency for full control over every component.
Fastest cloud S2S with strong reasoning: GPT-4o Realtime API. ~320ms with GPT-4o's full reasoning capabilities. Accept the vendor lock-in and limited voice customization.
Multimodal (voice + vision): Gemini Live. The only production S2S system that processes images and audio simultaneously. Essential for AR/VR and camera-based applications.
Voice cloning and variety: ElevenLabs. Largest voice library, best cloning quality. Costs more at scale, but voice identity is the product.
Contact center / telephony: Deepgram. Best STT accuracy on noisy real-world audio. Purpose-built for the enterprise voice agent workflow.
The Architecture Decision
The native vs. pipeline choice comes down to one question: do you need to control the LLM?
If your application is a general-purpose voice assistant and you are committed to a single LLM provider, native multimodal (GPT-4o or Gemini) delivers the lowest latency with the simplest integration.
If you need to choose your LLM based on the task (route complex queries to Claude, simple ones to a fast open-source model), customize the voice, or insert business logic between transcription and response, a pipeline is the only viable path. The latency cost is real but manageable for most conversational applications, where human perception tolerates up to 800ms without friction.
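Task-based routing of the kind described here can be as simple as a rule applied to the transcribed turn. A minimal sketch, with placeholder model identifiers and an illustrative heuristic rather than any specific router's API:

```python
STRONG_MODEL = "strong-reasoning-model"  # placeholder identifier
FAST_MODEL = "fast-cheap-model"          # placeholder identifier

def route(transcript: str) -> str:
    """Pick a reasoning model from the transcribed user turn.

    Illustrative heuristic, not a production policy: long or
    multi-step queries go to the stronger model, everything else
    to the fast, cheap one.
    """
    lowered = transcript.lower()
    multi_step = any(
        cue in lowered for cue in ("compare", "step by step", "explain why")
    )
    long_query = len(transcript.split()) > 30
    return STRONG_MODEL if multi_step or long_query else FAST_MODEL

print(route("what time is it"))                  # fast-cheap-model
print(route("compare these two billing plans"))  # strong-reasoning-model
```

Because this decision sits between the STT and LLM stages, it is only possible in a pipeline architecture; a native multimodal model exposes no seam where such logic could run.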
Inworld's pipeline approach is worth particular attention here because it consolidates all three stages under one provider. Most pipeline S2S systems require stitching together STT from one vendor, LLM from another, and TTS from a third. Inworld provides all three plus the routing layer, which removes the integration overhead that makes pipelines brittle in production.
FAQ
What is a speech-to-speech model?
A speech-to-speech model converts spoken audio input into spoken audio output. Some models do this natively in a single pass (GPT-4o, Moshi), while others chain specialized models for transcription, reasoning, and voice synthesis. Both approaches produce real-time conversational voice interactions.
Which S2S model has the lowest latency?
Moshi by Kyutai achieves ~200ms end-to-end latency, the fastest of any S2S model. Among cloud APIs, GPT-4o Realtime averages ~320ms. Pipeline-based systems like Inworld S2S and ElevenLabs typically range from 500-800ms depending on the LLM selected.
Can I use my own LLM with a speech-to-speech system?
Only with pipeline-based S2S systems. Inworld S2S routes through 200+ models via its Router, and ElevenLabs Conversational AI supports multiple LLM providers. Native multimodal models (GPT-4o, Gemini, Moshi) use their own built-in reasoning and cannot be swapped.
Is native multimodal always better than a pipeline?
No. Native multimodal is faster but less flexible. Pipeline systems let you control the LLM, customize the voice, and insert business logic between stages. Most production deployments in 2026 use pipelines because teams need that control. The 200-400ms latency difference is imperceptible in many conversational applications.
Which S2S model has the best voice quality?
Inworld TTS 1.5 Max, used in the Inworld S2S pipeline, holds the #1 ranking on the Artificial Analysis Speech Arena (Elo 1,240, March 2026). ElevenLabs Eleven v3 ranks #2 (Elo 1,197). Native multimodal models like GPT-4o and Gemini rank outside the top 5, producing lower-fidelity voice output than dedicated TTS models.
Published by Inworld AI. Voice quality rankings from the Artificial Analysis Speech Arena (March 2026). Latency benchmarks from published documentation and production measurements. Pricing reflects published rates as of March 2026 and may change.