By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
A voice agent platform with built-in TTS is a single API or runtime that handles speech recognition, language reasoning, and speech synthesis under one connection, so developers ship voice agents in days rather than months. Inworld AI's Realtime API is the TTS-included variant: one WebSocket call delivers speech in, speech out, model-agnostic LLM routing through Router across 200+ LLMs, and Realtime TTS (top-ranked on the Artificial Analysis Realtime TTS Arena, TTS-2 research preview). In 2026, the voice agent space splits into two architectural patterns: TTS-included stacks that bundle the full pipeline, and BYO-orchestration frameworks that compose components from multiple vendors. This guide explains the trade-offs, names the leaders in each pattern, and helps you match architecture to use case.
TTS-Included vs. BYO-Orchestration: The Two Patterns
| Pattern | What It Is | When To Use It | Leaders |
|---|---|---|---|
| TTS-included stacks | Single API/runtime with bundled STT + LLM + TTS, one billing relationship, one vendor for the speech pipeline | Production voice agents where time-to-ship and quality consistency matter; teams that want fewer moving parts | Realtime API (Inworld), ElevenLabs Conversational AI, Cartesia Line, Deepgram Voice Agent API, OpenAI Realtime API |
| BYO-orchestration frameworks | Open-source or vendor-neutral runtime where you bring your own STT, LLM, and TTS components | Multi-vendor experimentation, on-prem assembly, custom flow logic, deep telephony control | LiveKit Agents, Vapi, Pipecat, Retell, NLX |
Both patterns are legitimate. The decision depends on whether you want a complete, vertically integrated stack (TTS-included) or maximum component flexibility with the engineering ownership that comes with it (BYO-orchestration).
TTS-Included Stacks: Production-Ready Voice Agent Platforms
Realtime API (Inworld AI)
The Realtime API provides one WebSocket connection that wraps STT, LLM routing, and TTS. Audio streams in as PCM16 at 24 kHz, Router selects an LLM from the 200+ available, and Realtime TTS returns synthesized speech with time-to-first-audio suited to realtime conversation.
Strengths:
- #1 realtime TTS on the Artificial Analysis Realtime TTS Arena (TTS-2 research preview); TTS 1.5 Max is also top-ranked among realtime models.
- Model-agnostic LLM routing across 200+ LLMs: OpenAI, Anthropic, Google, Mistral, Meta, DeepSeek, xAI, plus Inworld-hosted optimized open-source models with sub-second TTFT.
- Voice-aware routing: STT acoustic signals (emotion, hesitation, speaker profile) feed the Router so model choice adapts to who is speaking.
- WebSocket and WebRTC protocols; OpenAI-compatible event format for easy migration.
- On-premise enterprise deployment available.
Best for: consumer voice agents, AI companions, language learning, interactive media, and enterprise voice agents where voice quality and model flexibility matter at the same time.
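To make the "PCM16 at 24 kHz" input format concrete, here is a minimal Python sketch that slices raw audio into fixed-duration frames and wraps each one in a JSON event. It rests on several assumptions that are not confirmed by this guide: mono audio, a 20 ms frame size, and the OpenAI-style `input_audio_buffer.append` event shape (chosen only because the API is described as using an OpenAI-compatible event format). Check the Realtime API documentation for the actual event names and fields before relying on any of it.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # from the text: PCM16 at 24 kHz
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit (2-byte) samples
CHANNELS = 1             # assumption: mono input


def frame_size_bytes(frame_ms: int) -> int:
    """Number of bytes in one frame of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * frame_ms // 1000


def audio_append_events(pcm: bytes, frame_ms: int = 20) -> Iterator[str]:
    """Yield one JSON event per audio frame.

    The event shape mirrors the OpenAI Realtime `input_audio_buffer.append`
    event; treating it as valid for another vendor is an assumption.
    """
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


if __name__ == "__main__":
    # One second of silence stands in for microphone input.
    silence = bytes(frame_size_bytes(1000))
    events = list(audio_append_events(silence))
    print(f"{len(silence)} bytes -> {len(events)} events")
```

At these settings one second of audio is 48,000 bytes, so 20 ms frames carry 960 bytes each before base64 encoding. In a real client, each event string would be sent over the open WebSocket as it is produced.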
ElevenLabs Conversational AI
ElevenLabs' ElevenAgents (Conversational AI) bundles their TTS (Eleven v3, outside the top-ranked realtime tier on Artificial Analysis) with built-in turn-taking, function calling, RAG, and multimodal hooks. Expressive Mode (Feb 2026) and Flows (March 2026) added structured conversational design. On-premise and on-device deployment shipped in 2026, plus a Government tier.
Strengths: broadest TTS language coverage, strong brand, full creative + agent + API stack (Scribe v2 STT, Music v2, Dubbing v2, ConvAI).
Trade-off: locks the LLM to their orchestrated stack; less flexibility in model selection.
Cartesia Line
Cartesia's Line combines their Sonic 3.5 TTS (TTFB around 40ms on Sonic 3 Turbo) with Ink STT and the Line voice agents platform. Strong on developer experience and latency.
Strengths: very low time-to-first-audio in some configurations. Broader language coverage on Sonic 3 than Inworld's GA set.
Trade-off: smaller model catalog than provider-agnostic stacks.
Deepgram Voice Agent API
Deepgram bundles Nova-3 STT, Aura-2 / Speak TTS, and orchestration into a unified Voice Agent API, with Flux multilingual conversational STT also positioned for agent use.
Strengths: strongest STT in the bundle (Nova-3, Flux). On-prem option.
Trade-off: TTS is mid-tier on the Artificial Analysis leaderboard relative to specialist providers.
OpenAI Realtime API
OpenAI's Realtime API integrates GPT-5-class reasoning with their TTS over WebSocket, with MCP and SIP support. Mature ecosystem and broad SDK support.
Strengths: large developer community. MCP and SIP support.
Trade-off: locks you into OpenAI models; no provider flexibility, no TTS choice.
BYO-Orchestration Frameworks: Component-Level Control
For teams that need to assemble best-of-breed components, mix providers, or run on infrastructure outside any vendor's managed stack, the BYO-orchestration frameworks provide flexible runtimes that you compose with the STT, LLM, and TTS providers of your choice.
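The "bring your own components" idea can be sketched in a few lines of Python. The interfaces and class names below (`SpeechToText`, `LanguageModel`, `TextToSpeech`, `VoicePipeline`) are hypothetical and do not come from any of the frameworks in this section; real frameworks add streaming, interruption handling, and transport on top. The point is only that the runtime depends on three narrow interfaces, so any provider that satisfies them can be swapped in.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Hypothetical runtime: wires three interchangeable components."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self.stt.transcribe(audio_in)  # speech in
        answer = self.llm.reply(text)         # language reasoning
        return self.tts.synthesize(answer)    # speech out


# Stand-in components so the sketch runs without any vendor SDK.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseLLM:
    def reply(self, prompt: str) -> str:
        return prompt.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(stt=EchoSTT(), llm=UppercaseLLM(), tts=BytesTTS())

if __name__ == "__main__":
    print(pipeline.handle_turn(b"hello"))
```

Replacing `BytesTTS` with an adapter around a vendor's TTS client changes one constructor argument and nothing else, which is the flexibility this pattern trades engineering ownership for.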
LiveKit Agents
LiveKit provides real-time WebRTC infrastructure plus an Agents framework for assembling voice pipelines. Supports STT, LLM, and TTS plug-ins from any provider, including Realtime TTS.
Strengths: mature WebRTC stack, strong telephony integration via SIP, large open-source community. Works as the transport layer alongside any TTS-included stack.
Use case: teams that want vendor-neutral assembly with strong real-time transport.
Vapi
Vapi offers a runtime for voice agents with built-in telephony, function calling, and provider plug-ins. Realtime TTS is available as a TTS provider option.
Strengths: fast time-to-prototype for telephony. Vendor-neutral on STT/LLM/TTS choice.
Use case: outbound and inbound phone agents where the team wants flexibility on model choice.
Pipecat
Pipecat is an open-source Python framework for real-time voice and multimodal applications. Component-level control with a wide plug-in ecosystem.
Strengths: open source, Python-native, strong for custom flow logic.
Use case: teams with engineering capacity who want to own the runtime.
Retell
Retell is a voice agent platform for automating calls, with call transfer, appointment booking, knowledge base, IVR navigation, batch call, branded caller ID, verified phone numbers, post-call analysis, and AI QA.
Use case: customer service phone agents where call automation and uptime are primary.
NLX
NLX provides a conversational AI platform with strong enterprise tooling.
Use case: enterprise CX deployments with structured flow design.
Decision Matrix: Which Pattern Fits Your Use Case
| Use Case | Recommended Pattern | Why |
|---|---|---|
| AI companion app, fast time-to-market | TTS-included (Realtime API) | One vendor, top voice quality, model flexibility |
| Enterprise voice agent with on-prem | TTS-included (Realtime API on-prem, ElevenLabs Enterprise) | Compliance and SLAs |
| Language learning at scale | TTS-included (Realtime API) | Multilingual quality and consistency |
| Telephony-heavy outbound dialer | BYO-orchestration (Vapi, Retell) + Realtime TTS | Best telephony integration plus top TTS |
| Multi-vendor experimentation | BYO-orchestration (LiveKit, Pipecat) | Flexibility to swap components |
| Interactive media, character voice consistency | TTS-included (Realtime API) | Voice cloning and voice library at scale |
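For teams that want to keep this matrix next to their architecture notes, it can be encoded as a small lookup table. The sketch below copies the rows above verbatim; the dictionary keys (`ai_companion`, `telephony_dialer`, and so on) and the `recommend` helper are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    pattern: str
    options: tuple[str, ...]
    why: str


# Rows copied from the decision matrix; the keys are illustrative labels.
DECISION_MATRIX: dict[str, Recommendation] = {
    "ai_companion": Recommendation(
        "TTS-included", ("Realtime API",),
        "One vendor, top voice quality, model flexibility"),
    "enterprise_on_prem": Recommendation(
        "TTS-included", ("Realtime API on-prem", "ElevenLabs Enterprise"),
        "Compliance and SLAs"),
    "language_learning": Recommendation(
        "TTS-included", ("Realtime API",),
        "Multilingual quality and consistency"),
    "telephony_dialer": Recommendation(
        "BYO-orchestration + Realtime TTS", ("Vapi", "Retell"),
        "Best telephony integration plus top TTS"),
    "multi_vendor_experimentation": Recommendation(
        "BYO-orchestration", ("LiveKit", "Pipecat"),
        "Flexibility to swap components"),
    "interactive_media": Recommendation(
        "TTS-included", ("Realtime API",),
        "Voice cloning and voice library at scale"),
}


def recommend(use_case: str) -> Recommendation:
    """Return the matrix row for a use case, or raise with the known keys."""
    try:
        return DECISION_MATRIX[use_case]
    except KeyError:
        known = ", ".join(sorted(DECISION_MATRIX))
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of: {known}"
        ) from None


if __name__ == "__main__":
    row = recommend("telephony_dialer")
    print(f"{row.pattern}: {', '.join(row.options)} ({row.why})")
```

The lookup deliberately fails loudly on an unknown use case rather than guessing, since a use case outside the six rows needs the questions in the next section, not a default.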
How to Decide: Five Questions
- Is voice quality a product differentiator? If yes, lead with TTS quality. Inworld TTS-2 research preview is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena; Eleven v3 is outside the top-ranked realtime tier.
- Do you need to switch LLMs based on context? If yes, choose a model-agnostic platform. The Inworld Realtime API + Router covers 200+ LLMs. OpenAI Realtime locks to OpenAI.
- Do you need on-prem deployment? Realtime API and ElevenLabs offer on-prem enterprise variants. BYO-orchestration platforms can run on-prem if every component supports it.
- How much engineering capacity do you have? TTS-included stacks compress the integration work. BYO-orchestration is faster only if you already have the engineers.
- What is your time horizon? TTS-included gets to production faster. BYO-orchestration optimizes for control over years.
FAQ
What is a voice agent platform?
A voice agent platform is a runtime or API that handles the full voice pipeline (speech in, language reasoning, speech out) for building real-time voice applications. Some bundle TTS into the platform (Realtime API, ElevenLabs Conversational AI, Cartesia Line, Deepgram Voice Agent, OpenAI Realtime). Others provide vendor-neutral orchestration where you bring your own STT, LLM, and TTS (LiveKit, Vapi, Pipecat, Retell, NLX).
What is the difference between TTS-included and BYO-orchestration?
TTS-included means the platform ships its own TTS as part of the bundle. BYO-orchestration means the platform is a runtime; you choose and integrate the TTS, STT, and LLM providers separately. TTS-included compresses time-to-ship and ensures voice quality consistency. BYO-orchestration gives component-level flexibility at the cost of more engineering ownership.
Can I use Realtime TTS inside a BYO-orchestration framework?
Yes. Realtime TTS is available as a TTS provider in LiveKit, Vapi, Pipecat, and other BYO-orchestration frameworks. Many production deployments combine these frameworks (for transport, telephony, flow logic) with Realtime TTS as the speech layer.
Which voice agent platform has the best TTS?
Voice quality rankings come from the Artificial Analysis Realtime TTS Arena, which uses blind human evaluation. Inworld TTS-2 research preview holds the #1 realtime TTS position. TTS 1.5 Max is also top-ranked among realtime models. Cartesia Sonic 3.5 ranks below both. ElevenLabs Eleven v3 is outside the top-ranked realtime tier, and OpenAI's TTS ranks lower on quality but offers different latency or ecosystem trade-offs.
How do I choose between OpenAI Realtime and Inworld Realtime API?
Both wrap the full speech pipeline into one API. OpenAI Realtime locks you into OpenAI models for both LLM and TTS. The Inworld Realtime API routes through Router to 200+ LLMs across all major providers, and uses #1-ranked realtime TTS for speech output. Choose OpenAI Realtime if you are committed to the OpenAI stack. Choose Inworld Realtime API if you want model flexibility and top realtime voice quality.