Voice Agent Platforms with Built-In TTS: 2026 Architecture Guide

By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026

A voice agent platform with built-in TTS is a single API or runtime that handles speech recognition, language reasoning, and speech synthesis over one connection, so developers can ship voice agents in days rather than months. Inworld AI's Realtime API is a TTS-included platform: a single WebSocket connection carries speech in and speech out, with model-agnostic LLM routing through Router across 220+ LLMs and Realtime TTS, billed by Inworld as the #1 realtime TTS (TTS-2 research preview, with 8-dimension natural-language steering). In 2026, the voice agent space splits into two architectural patterns: TTS-included stacks that bundle the full pipeline, and BYO-orchestration frameworks that compose components from multiple vendors. This guide explains the trade-offs, names the leaders in each pattern, and helps you match architecture to use case.

TTS-Included vs. BYO-Orchestration: The Two Patterns

Pattern	What It Is	When To Use It	Leaders
TTS-included stacks	Single API/runtime with bundled STT + LLM + TTS, one billing relationship, one vendor for the speech pipeline	Production voice agents where time-to-ship and quality consistency matter; teams that want fewer moving parts	Realtime API (Inworld), ElevenLabs Conversational AI, Cartesia Line, Deepgram Voice Agent API, OpenAI Realtime API
BYO-orchestration frameworks	Open-source or vendor-neutral runtime where you bring your own STT, LLM, and TTS components	Multi-vendor experimentation, on-prem assembly, custom flow logic, deep telephony control	LiveKit Agents, Vapi, Pipecat, Retell, NLX

Both patterns are legitimate. The decision depends on whether you want a complete, vertically integrated stack (TTS-included) or maximum component flexibility, along with the engineering ownership that comes with it (BYO-orchestration).

TTS-Included Stacks: Production-Ready Voice Agent Platforms

Realtime API (Inworld AI)

The Realtime API provides one WebSocket connection that wraps STT, LLM routing, and TTS. Audio streams in as PCM16 at 24 kHz, Router selects an LLM from the 220+ available, and Realtime TTS streams synthesized speech back with low time-to-first-audio.
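To make the audio format concrete, here is a minimal sketch of how a client might slice PCM16 audio at 24 kHz into fixed-size frames and wrap each one in a JSON event. The event name `input_audio_buffer.append` and its base64 `audio` field come from the OpenAI Realtime event format; using them here assumes the OpenAI-compatible event format listed under Strengths applies unchanged. The 20 ms frame length and all helper names are illustrative assumptions, not documented Inworld values.

```python
import base64
import json
import math
import struct

SAMPLE_RATE_HZ = 24_000   # from the text: PCM16 at 24 kHz
BYTES_PER_SAMPLE = 2      # PCM16 means 16-bit signed samples
FRAME_MS = 20             # assumption: 20 ms frames, a common streaming choice


def frame_size_bytes(frame_ms: int = FRAME_MS) -> int:
    """Number of bytes in one mono frame of the given duration."""
    return SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


def pcm16_frames(pcm: bytes, frame_ms: int = FRAME_MS):
    """Yield consecutive frames; the last one may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def append_event(frame: bytes) -> str:
    """Serialize one frame as an OpenAI-style audio append event (assumed shape)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(frame).decode("ascii"),
    })


def sine_pcm16(freq_hz: float, seconds: float) -> bytes:
    """Generate a quiet test tone as little-endian PCM16, standing in for microphone input."""
    n = int(SAMPLE_RATE_HZ * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)
```

At 24 kHz and 2 bytes per sample, a 20 ms frame is 480 samples, or 960 bytes. Each string returned by `append_event` would be sent as one WebSocket text message; the connection setup itself is omitted because the endpoint and authentication details are not given here.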

Strengths:

Realtime TTS built for expressive, low-latency speech (TTS-2 research preview, with 8-dimension natural-language steering and a sub-200ms median TTFT).
Model-agnostic LLM routing across 220+ LLMs: OpenAI, Anthropic, Google, Mistral, Meta, DeepSeek, xAI, plus Inworld-hosted optimized open-source models with sub-second TTFT.
Conditional routing with fallback and A/B testing across models, plus unified billing and metering.
WebSocket and WebRTC protocols; OpenAI-compatible event format for easy migration.
On-premise enterprise deployment available.
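The routing ideas in this list (fallback across models and A/B testing) can be illustrated with a small, provider-neutral sketch. This is not Router's API: the `Route` type, the function names, and the model identifiers are all hypothetical, and `call_model` stands in for whatever client actually reaches an LLM.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    model: str     # hypothetical model identifier
    weight: float  # relative share of A/B traffic


def pick_variant(session_id: str, routes: Sequence[Route]) -> Route:
    """Assign a session to a variant deterministically by hashing its id."""
    total = sum(r.weight for r in routes)
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for route in routes:
        cumulative += route.weight
        if point < cumulative:
            return route
    return routes[-1]


def complete_with_fallback(
    prompt: str,
    ordered_models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first that succeeds."""
    errors = []
    for model in ordered_models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Hashing the session id keeps a given user on the same variant for the whole conversation, which matters for voice agents because switching models mid-session can change tone and behavior.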

Best for: consumer voice agents, AI companions, language learning, interactive media, and enterprise voice agents where voice quality and model flexibility matter at the same time.

ElevenLabs Conversational AI

ElevenLabs' ElevenAgents (Conversational AI) bundles their TTS (Eleven v3, strong on expressive range and language breadth but not recommended by ElevenLabs for realtime use) with built-in turn-taking, function calling, RAG, and multimodal hooks. Expressive Mode (Feb 2026) and Flows (March 2026) added structured conversational design. On-premise and on-device deployment shipped in 2026, plus a Government tier.

Strengths: broadest TTS language coverage, strong brand, full creative + agent + API stack (Scribe v2 STT, Music v2, Dubbing v2, ConvAI). Trade-off: the LLM is tied to ElevenLabs' orchestrated stack, so model selection is less flexible.

Cartesia Line

Cartesia's Line voice agent platform combines Sonic 3.5 TTS (TTFB around 40ms on Sonic 3 Turbo) with Ink STT. It is strong on developer experience and latency.

Strengths: very low time-to-first-audio in some configurations, and broader language coverage on Sonic 3 than Inworld's GA set. Trade-off: smaller model catalog than provider-agnostic stacks.

Deepgram Voice Agent API

Deepgram bundles Nova-3 STT, Aura-2 / Speak TTS, and orchestration into a unified Voice Agent API, with Flux multilingual conversational STT also positioned for agent use.

Strengths: strongest STT in the bundle (Nova-3, Flux). On-prem option. Trade-off: the bundled TTS is a general-purpose voice rather than a specialist realtime model.

OpenAI Realtime API

OpenAI's Realtime API integrates GPT-5-class reasoning with their TTS over WebSocket, with MCP and SIP support. Mature ecosystem and broad SDK support.

Strengths: large developer community. MCP and SIP support. Trade-off: locks you into OpenAI models; no provider flexibility, no TTS choice.

BYO-Orchestration Frameworks: Component-Level Control

For teams that need to assemble best-of-breed components, mix providers, or run on infrastructure outside any vendor's managed stack, the BYO-orchestration frameworks provide flexible runtimes that you compose with the STT, LLM, and TTS providers of your choice.
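The composition idea can be sketched as three small interfaces and a pipeline that wires them together. The sketch is a deliberately simplified, turn-based illustration: the protocol names, method names, and stub components are hypothetical and do not match the API of LiveKit Agents, Vapi, Pipecat, or any other framework, which typically stream audio rather than process whole turns.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Runs one conversational turn through STT, then LLM, then TTS."""

    def __init__(self, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> None:
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)
        answer = self.llm.reply(transcript)
        return self.tts.synthesize(answer)


# Stub components so the sketch runs without any vendor SDK.
class Utf8Stt:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretends the "audio" is already text


class UppercaseLlm:
    def reply(self, text: str) -> str:
        return text.upper()


class Utf8Tts:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretends the bytes are synthesized audio


pipeline = VoicePipeline(Utf8Stt(), UppercaseLlm(), Utf8Tts())
```

Because each stage depends only on an interface, swapping one vendor for another means replacing a single constructor argument. That is the flexibility this pattern offers, and writing and maintaining those adapters is the engineering ownership that comes with it.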

LiveKit Agents

LiveKit provides real-time WebRTC infrastructure plus an Agents framework for assembling voice pipelines. Supports STT, LLM, and TTS plug-ins from any provider, including Realtime TTS.

Strengths: mature WebRTC stack, strong telephony integration via SIP, large open-source community. Works as the transport layer alongside any TTS-included stack. Use case: teams that want vendor-neutral assembly with strong real-time transport.

Vapi

Vapi offers a runtime for voice agents with built-in telephony, function calling, and provider plug-ins. Realtime TTS is available as a TTS provider option.

Strengths: fast time-to-prototype for telephony. Vendor-neutral on STT/LLM/TTS choice. Use case: outbound and inbound phone agents where the team wants flexibility on model choice.

Pipecat

Pipecat is an open-source Python framework for real-time voice and multimodal applications. Component-level control with a wide plug-in ecosystem.

Strengths: open source, Python-native, strong for custom flow logic. Use case: teams with engineering capacity who want to own the runtime.

Retell

Retell is a voice agent platform for automating calls, with call transfer, appointment booking, knowledge base, IVR navigation, batch calling, branded caller ID, verified phone numbers, post-call analysis, and AI QA.

Use case: customer service phone agents where call automation and uptime are primary.

NLX

NLX provides a conversational AI platform with strong enterprise tooling.

Use case: enterprise CX deployments with structured flow design.

Decision Matrix: Which Pattern Fits Your Use Case

Use Case	Recommended Pattern	Why
AI companion app, fast time-to-market	TTS-included (Realtime API)	One vendor, top voice quality, model flexibility
Enterprise voice agent with on-prem	TTS-included (Realtime API on-prem, ElevenLabs Enterprise)	Compliance and SLAs
Language learning at scale	TTS-included (Realtime API)	Multilingual quality and consistency
Telephony-heavy outbound dialer	BYO-orchestration (Vapi, Retell) + Realtime TTS	Best telephony integration plus top TTS
Multi-vendor experimentation	BYO-orchestration (LiveKit, Pipecat)	Flexibility to swap components
Interactive media, character voice consistency	TTS-included (Realtime API)	Voice cloning and voice library at scale

How to Decide: Five Questions

Is voice quality a product differentiator? If yes, lead with TTS quality and test it on your own scripts. Inworld TTS-2 research preview is built for expressive, low-latency realtime speech, with 8-dimension natural-language steering; Eleven v3 has broader language coverage but is not recommended by ElevenLabs for realtime use.
Do you need to switch LLMs based on context? If yes, choose a model-agnostic platform. The Inworld Realtime API + Router covers 220+ LLMs. OpenAI Realtime locks to OpenAI.
Do you need on-prem deployment? Realtime API and ElevenLabs offer on-prem enterprise variants. BYO-orchestration platforms can run on-prem if every component supports it.
How much engineering capacity do you have? TTS-included stacks compress the integration work. BYO-orchestration is faster only if you already have the engineers.
What is your time horizon? TTS-included gets to production faster. BYO-orchestration optimizes for control over years.
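For the first question, testing on your own scripts usually includes measuring latency yourself rather than relying on published figures. Below is a minimal sketch of measuring median time-to-first-audio across a set of scripts. It assumes the TTS client can be wrapped as a function that takes text and yields audio chunks; `synthesize` is a hypothetical stand-in for that wrapper, not a real SDK call.

```python
import statistics
import time
from typing import Callable, Iterable, Sequence


def time_to_first_audio(
    synthesize: Callable[[str], Iterable[bytes]],
    text: str,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from issuing the request until the first non-empty audio chunk."""
    start = clock()
    for chunk in synthesize(text):
        if chunk:
            return clock() - start
    raise ValueError("stream ended before any audio arrived")


def median_ttfa_ms(
    synthesize: Callable[[str], Iterable[bytes]],
    scripts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median time-to-first-audio in milliseconds over all scripts."""
    samples = [time_to_first_audio(synthesize, s, clock) * 1000 for s in scripts]
    return statistics.median(samples)
```

The clock starts before `synthesize` is called, so the measurement includes connection and request overhead as well as model latency. Running it from the region where your users are gives numbers closer to what they will experience.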

FAQ

What is a voice agent platform?

A voice agent platform is a runtime or API that handles the full voice pipeline (speech in, language reasoning, speech out) for building real-time voice applications. Some bundle TTS into the platform (Realtime API, ElevenLabs Conversational AI, Cartesia Line, Deepgram Voice Agent, OpenAI Realtime). Others provide vendor-neutral orchestration where you bring your own STT, LLM, and TTS (LiveKit, Vapi, Pipecat, Retell, NLX).

What is the difference between TTS-included and BYO-orchestration?

TTS-included means the platform ships its own TTS as part of the bundle. BYO-orchestration means the platform is a runtime; you choose and integrate the TTS, STT, and LLM providers separately. TTS-included compresses time-to-ship and ensures voice quality consistency. BYO-orchestration gives component-level flexibility at the cost of more engineering ownership.

Can I use Realtime TTS inside a BYO-orchestration framework?

Yes. Realtime TTS is available as a TTS provider in LiveKit, Vapi, Pipecat, and other BYO-orchestration frameworks. Many production deployments combine these frameworks (for transport, telephony, flow logic) with Realtime TTS as the speech layer.

Which voice agent platform has the best TTS?

Judge TTS quality on your own scripts with side-by-side audio samples and blind user preference tests. Inworld TTS-2 research preview is built for expressive, low-latency realtime speech, with 8-dimension natural-language steering and a sub-200ms median TTFT. ElevenLabs Eleven v3 has the broadest language coverage but is not recommended by ElevenLabs for realtime use. Cartesia Sonic 3.5 is strong on first-audio latency. OpenAI's TTS trades some expressive range for ecosystem convenience.
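A blind preference test can be organized with very little code. The sketch below shows one way to randomize which system plays as sample A or B for each script and then tally listener votes. The function names and data shapes are assumptions for illustration; generating the audio and collecting the votes are left to your own tooling.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def blind_pairs(
    scripts: Sequence[str],
    systems: Tuple[str, str],
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Randomize A/B order per script so listeners cannot infer the system from position."""
    rng = random.Random(seed)
    trials = []
    for script in scripts:
        order = list(systems)
        rng.shuffle(order)
        trials.append({"script": script, "A": order[0], "B": order[1]})
    return trials


def tally(trials: Sequence[Dict[str, str]], votes: Sequence[str]) -> Counter:
    """votes[i] is 'A' or 'B' for trials[i]; returns wins per system."""
    wins: Counter = Counter()
    for trial, vote in zip(trials, votes):
        wins[trial[vote]] += 1
    return wins


def win_rate(wins: Counter, system: str) -> float:
    """Fraction of all votes won by the given system."""
    total = sum(wins.values())
    return wins[system] / total if total else 0.0
```

Keep the mapping from A and B back to system names hidden from listeners until voting is finished, and use enough scripts and listeners that a small difference in win rate is not mistaken for a real preference.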

How do I choose between OpenAI Realtime and Inworld Realtime API?

Both wrap the full speech pipeline into one API. OpenAI Realtime locks you into OpenAI models for both LLM and TTS. The Inworld Realtime API routes through Router to 220+ LLMs across all major providers, and uses Realtime TTS built for expressive, low-latency realtime speech. Choose OpenAI Realtime if you are committed to the OpenAI stack. Choose Inworld Realtime API if you want model flexibility and expressive realtime voice quality.