By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
A voice agent platform with built-in TTS is a single API or runtime that handles speech recognition, language reasoning, and speech synthesis under one connection, so developers ship voice agents in days rather than months. Inworld AI's Realtime API follows the TTS-included pattern: one WebSocket connection carries speech in and speech out, with model-agnostic LLM routing through the Realtime Router and synthesis from Realtime TTS (#1 on the Artificial Analysis Speech Arena, holding three of the top five spots). In 2026, the voice agent space splits into two architectural patterns: TTS-included stacks that bundle the full pipeline, and BYO-orchestration frameworks that compose components from multiple vendors. This guide explains the trade-offs, names the leaders in each pattern, and helps you match architecture to use case.
TTS-Included vs. BYO-Orchestration: The Two Patterns
| Pattern | What It Is | When To Use It | Leaders |
|---|---|---|---|
| TTS-included stacks | Single API/runtime with bundled STT + LLM + TTS, one billing relationship, one vendor for the speech pipeline | Production voice agents where time-to-ship and quality consistency matter; teams that want fewer moving parts | Realtime API (Inworld), ElevenLabs Conversational AI, Cartesia Line, Deepgram Voice Agent API, OpenAI Realtime API |
| BYO-orchestration frameworks | Open-source or vendor-neutral runtime where you bring your own STT, LLM, and TTS components | Multi-vendor experimentation, on-prem assembly, custom flow logic, deep telephony control | LiveKit Agents, Vapi, Pipecat, Retell, NLX |
Both patterns are legitimate. The decision depends on whether you want a complete, vertically integrated stack (TTS-included) or maximum component flexibility with the engineering ownership that comes with it (BYO-orchestration).
TTS-Included Stacks: Production-Ready Voice Agent Platforms
Realtime API (Inworld AI)
The Realtime API provides one WebSocket connection that wraps STT, LLM routing, and TTS. Audio streams in as PCM16 at 24 kHz, the Realtime Router selects the right LLM (hundreds available across all major providers), and Realtime TTS returns synthesized speech with sub-200ms time-to-first-audio.
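To make the PCM16 at 24 kHz input format concrete, here is a minimal sketch of the byte arithmetic involved in slicing raw audio into fixed-duration frames before sending them over a connection. The mono channel count, the 20 ms frame length, and the helper names `frame_bytes` and `split_frames` are assumptions chosen for illustration; they are not taken from the Realtime API documentation, and no network call is made.

```python
SAMPLE_RATE_HZ = 24_000   # from the text: 24 kHz
BYTES_PER_SAMPLE = 2      # PCM16 means 16-bit (2-byte) samples
CHANNELS = 1              # assumption: mono audio


def frame_bytes(frame_ms: int) -> int:
    """Number of bytes in one frame of the given duration."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_frames(pcm: bytes, frame_ms: int = 20):
    """Yield consecutive frames; the last one may be shorter."""
    size = frame_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence stands in for captured microphone audio.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
frames = list(split_frames(one_second, frame_ms=20))
print(frame_bytes(20), len(frames))  # 960 bytes per frame, 50 frames
```

Under these assumptions, one second of audio is 48,000 bytes, which gives a rough sense of the upstream bandwidth a single session needs before any transport overhead.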
Strengths:
- #1-ranked TTS quality on Artificial Analysis (three of the top five spots).
- Model-agnostic LLM routing: choose any model from OpenAI, Anthropic, Google, Mistral, Meta, DeepSeek, xAI through one API.
- Voice-aware routing: STT acoustic signals (emotion, hesitation, speaker profile) feed the Router so model choice adapts to who is speaking.
- WebSocket and WebRTC protocols; OpenAI-compatible event format for easy migration.
- On-premise enterprise deployment available.
Best for: consumer voice agents, AI companions, language learning, interactive media, and enterprise voice agents where voice quality and model flexibility matter at the same time.
ElevenLabs Conversational AI
ElevenLabs' Conversational AI bundles their TTS (Eleven v3, ELO ~1,179, #2 on Artificial Analysis) with built-in turn-taking, function calling, RAG, and multimodal hooks. April 2026 brought on-premise enterprise deployment.
Strengths: broadest TTS language coverage (70+), strong brand, expanding feature set.
Trade-off: locks the LLM to their orchestrated stack; less flexibility in model selection.
Cartesia Line
Cartesia's Line combines their Sonic 3 TTS (sub-100ms TTFB) with Ink STT and an Agents platform launched April 2026. Strong on developer experience and latency.
Strengths: sub-100ms first-audio in some configurations. 42+ languages on Sonic 3.
Trade-off: smaller model catalog than provider-agnostic stacks.
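First-audio latency figures like the ones quoted in this guide are stated for particular configurations, so it may be worth measuring them in your own environment. The following is a minimal sketch of one way to time the first non-empty audio chunk from any streaming source. The function names and `fake_tts_stream` are hypothetical stand-ins, not part of any vendor SDK, and the timer starts when the stream is first consumed rather than when the user stops speaking.

```python
import time
from typing import Iterable, Iterator, Optional


def time_to_first_audio(chunks: Iterable[bytes]) -> Optional[float]:
    """Seconds until the first non-empty chunk arrives, or None if none does."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive style chunks
            return time.perf_counter() - start
    return None


def fake_tts_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)        # simulated synthesis and network delay
    yield b"\x00\x00" * 480    # 20 ms of PCM16 silence at 24 kHz
    yield b"\x00\x00" * 480


ttfa = time_to_first_audio(fake_tts_stream(0.05))
print(f"time to first audio: {ttfa * 1000:.0f} ms")
```

In a real test you would replace `fake_tts_stream` with the iterator your chosen client exposes and repeat the measurement many times, since a single sample says little about tail latency.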
Deepgram Voice Agent API
Deepgram bundles Nova-3 STT, Aura TTS, and orchestration into a unified API. April 2026 added GPT-5.5 and Gemini 3.1 Flash Lite as supported LLMs.
Strengths: strongest STT in the bundle (Nova-3). On-prem option.
Trade-off: TTS is mid-tier on the Artificial Analysis leaderboard relative to specialist providers.
OpenAI Realtime API
OpenAI's Realtime API integrates GPT-5.5 with their TTS over WebSocket. Mature ecosystem and broad SDK support.
Strengths: large developer community. SIP support added.
Trade-off: locks you into OpenAI models; no provider flexibility, no TTS choice.
BYO-Orchestration Frameworks: Component-Level Control
For teams that need to assemble best-of-breed components, mix providers, or run on infrastructure outside any vendor's managed stack, BYO-orchestration frameworks provide flexible runtimes that you compose with the STT, LLM, and TTS providers of your choice.
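The composition idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not the API of LiveKit, Vapi, Pipecat, or any other framework: the `SpeechToText`, `LanguageModel`, and `TextToSpeech` interfaces, the `VoicePipeline` class, and the three stub providers are all hypothetical names invented here to show how a runtime can stay neutral about which vendor sits behind each stage.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Runs one conversational turn through three swappable stages."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Stub providers so the sketch runs without any vendor account.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(stt=EchoSTT(), llm=UppercaseLLM(), tts=BytesTTS())
result = pipeline.handle_turn(b"hello")
print(result)  # b'HELLO'
```

Swapping a provider means passing a different object that satisfies the same interface. That is the flexibility this pattern offers, and every adapter you write is part of the engineering ownership it implies.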
LiveKit Agents
LiveKit provides real-time WebRTC infrastructure plus an Agents framework for assembling voice pipelines. Supports STT, LLM, and TTS plug-ins from any provider, including Realtime TTS.
Strengths: mature WebRTC stack, strong telephony integration via SIP, large open-source community. Works as the transport layer alongside any TTS-included stack.
Use case: teams that want vendor-neutral assembly with strong real-time transport.
Vapi
Vapi offers a runtime for voice agents with built-in telephony, function calling, and provider plug-ins. Realtime TTS is available as a TTS provider option.
Strengths: fast time-to-prototype for telephony. Vendor-neutral on STT/LLM/TTS choice.
Use case: outbound and inbound phone agents where the team wants flexibility on model choice.
Pipecat
Pipecat is an open-source Python framework for real-time voice and multimodal applications. Component-level control with a wide plug-in ecosystem.
Strengths: open source, Python-native, strong for custom flow logic.
Use case: teams with engineering capacity who want to own the runtime.
Retell
Retell focuses on telephony-native voice agents with built-in compliance features.
Use case: customer service phone agents where compliance and uptime are primary.
NLX
NLX provides a conversational AI platform with strong enterprise tooling.
Use case: enterprise CX deployments with structured flow design.
Decision Matrix: Which Pattern Fits Your Use Case
| Use Case | Recommended Pattern | Why |
|---|---|---|
| AI companion app, fast time-to-market | TTS-included (Realtime API) | One vendor, top voice quality, model flexibility |
| Enterprise voice agent with on-prem | TTS-included (Realtime API on-prem, ElevenLabs Enterprise) | Compliance and SLAs |
| Language learning at scale | TTS-included (Realtime API) | Multilingual quality and consistency |
| Telephony-heavy outbound dialer | BYO-orchestration (Vapi, Retell) + Realtime TTS | Best telephony integration plus top TTS |
| Multi-vendor experimentation | BYO-orchestration (LiveKit, Pipecat) | Flexibility to swap components |
| Interactive media, character voice consistency | TTS-included (Realtime API) | Voice cloning and voice library at scale |
How to Decide: Five Questions
- Is voice quality a product differentiator? If yes, lead with TTS quality. Realtime TTS is #1 on Artificial Analysis. Eleven v3 is #2.
- Do you need to switch LLMs based on context? If yes, choose a model-agnostic platform. Realtime API + Realtime Router routes to hundreds of models. OpenAI Realtime locks to OpenAI.
- Do you need on-prem deployment? Realtime API and ElevenLabs offer on-prem enterprise variants. BYO-orchestration platforms can run on-prem if every component supports it.
- How much engineering capacity do you have? TTS-included stacks compress the integration work. BYO-orchestration is faster only if you already have the engineers.
- What is your time horizon? TTS-included gets to production faster. BYO-orchestration optimizes for control over years.
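The five questions can be written down as a small helper for team discussions. This is a deliberately simplified sketch: the `Answers` and `Recommendation` types, the `recommend` function, and in particular the rule that BYO-orchestration is suggested only when a team has both engineering capacity and a long-term control goal are assumptions made for illustration, not a formula from this guide.

```python
from dataclasses import dataclass, field


@dataclass
class Answers:
    voice_quality_is_differentiator: bool
    needs_llm_switching: bool
    needs_on_prem: bool
    has_engineering_capacity: bool
    optimizing_for_long_term_control: bool


@dataclass
class Recommendation:
    pattern: str
    checks: list[str] = field(default_factory=list)


def recommend(a: Answers) -> Recommendation:
    # Assumption: BYO only when capacity and a long horizon are both present.
    byo = a.has_engineering_capacity and a.optimizing_for_long_term_control
    rec = Recommendation("BYO-orchestration" if byo else "TTS-included")
    if a.voice_quality_is_differentiator:
        rec.checks.append("compare TTS quality rankings first")
    if a.needs_llm_switching:
        rec.checks.append("pick a model-agnostic platform")
    if a.needs_on_prem:
        rec.checks.append(
            "confirm every component supports on-prem" if byo
            else "confirm the vendor offers an on-prem variant"
        )
    return rec


startup = recommend(Answers(
    voice_quality_is_differentiator=True,
    needs_llm_switching=True,
    needs_on_prem=False,
    has_engineering_capacity=False,
    optimizing_for_long_term_control=False,
))
print(startup.pattern, startup.checks)
```

The returned checks are reminders of what to verify with a vendor or framework, not a ranking of specific products.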
FAQ
What is a voice agent platform?
A voice agent platform is a runtime or API that handles the full voice pipeline (speech in, language reasoning, speech out) for building real-time voice applications. Some bundle TTS into the platform (Realtime API, ElevenLabs Conversational AI, Cartesia Line, Deepgram Voice Agent, OpenAI Realtime). Others provide vendor-neutral orchestration where you bring your own STT, LLM, and TTS (LiveKit, Vapi, Pipecat, Retell, NLX).
What is the difference between TTS-included and BYO-orchestration?
TTS-included means the platform ships its own TTS as part of the bundle. BYO-orchestration means the platform is a runtime; you choose and integrate the TTS, STT, and LLM providers separately. TTS-included compresses time-to-ship and ensures voice quality consistency. BYO-orchestration gives component-level flexibility at the cost of more engineering ownership.
Can I use Realtime TTS inside a BYO-orchestration framework?
Yes. Realtime TTS is available as a TTS provider in LiveKit, Vapi, Pipecat, and other BYO-orchestration frameworks. Many production deployments combine these frameworks (for transport, telephony, flow logic) with Realtime TTS as the speech layer.
Which voice agent platform has the best TTS?
Voice quality rankings come from the Artificial Analysis Speech Arena, which uses blind human evaluation. Realtime TTS holds the #1 position with three of the top five spots. ElevenLabs Eleven v3 ranks #2. Cartesia Sonic 3 and OpenAI's TTS rank lower on quality but offer different latency or ecosystem trade-offs.
How do I choose between OpenAI Realtime and Inworld Realtime API?
Both wrap the full speech pipeline into one API. OpenAI Realtime locks you into OpenAI models for both LLM and TTS. The Realtime API routes through the Realtime Router to hundreds of LLMs across all major providers, and uses #1-ranked Realtime TTS for speech output. Choose OpenAI Realtime if you are committed to the OpenAI stack. Choose Realtime API if you want model flexibility and top voice quality.