Vapi, Pipecat, and LiveKit are the three most popular frameworks for building voice agents in 2026, and they solve fundamentally different problems.
Realtime TTS from Inworld AI (ranked #1 on the Artificial Analysis leaderboard with an ELO around 1,208) plugs into any of them as the TTS layer. Or you can skip the framework entirely and use the Realtime API for the full voice pipeline over a single connection. This guide breaks down each framework honestly so you can pick the right one for your stack.
Which voice agent framework should I use?
The answer depends on where you want to spend your engineering time.
Vapi: managed orchestration for speed
Vapi is a managed voice agent platform. You configure your STT, LLM, and TTS providers through Vapi's API or dashboard, and Vapi handles the realtime orchestration loop: listen, think, speak. The value proposition is speed to first call. You can have a working voice agent on a phone number in hours, not days.
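To make the configure-rather-than-code model concrete, here is a minimal sketch of what an assistant definition could look like as data. The field names (`transcriber`, `model`, `voice`, `firstMessage`) and every provider, model, and voice value are illustrative assumptions, not a verified copy of Vapi's schema. Check Vapi's API reference before sending anything like this.

```python
import json


def build_assistant_config(stt: dict, llm: dict, tts: dict, first_message: str) -> dict:
    """Bundle the three provider choices into one assistant definition.

    Field names are hypothetical placeholders for illustration only.
    """
    return {
        "name": "support-line-demo",
        "firstMessage": first_message,
        "transcriber": stt,  # speech-to-text provider settings
        "model": llm,        # LLM provider settings
        "voice": tts,        # text-to-speech provider settings
    }


config = build_assistant_config(
    stt={"provider": "example-stt", "model": "example-stt-model"},
    llm={"provider": "example-llm", "model": "example-llm-model"},
    tts={"provider": "example-tts", "voiceId": "example-voice"},
    first_message="Hi, how can I help?",
)

# The whole agent is plain data: swapping a vendor means editing one entry.
print(json.dumps(config, indent=2))
```

The point of the sketch is the shape of the work: you edit provider entries, and the platform owns the loop that connects them.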
Strengths:
- Fastest path from zero to a working voice agent. Developer onboarding is genuinely good.
- Built-in telephony integration with Twilio and Vonage. Phone number provisioning is a first-class feature.
- Flow Studio provides a visual, no-code builder for multi-step conversational workflows.
- Broad provider compatibility. You pick your STT, LLM, and TTS vendors through configuration.
Trade-offs:
- Closed source. You cannot inspect or modify the orchestration layer.
- Cost stacks quickly. The platform fee is the floor, not the ceiling. STT, LLM, TTS, and telephony costs layer on top, and the total per-minute cost depends entirely on which providers you select.
- You do not control the voice pipeline directly. Latency optimization, custom VAD behavior, and pipeline-level debugging are limited to what Vapi exposes.
- Vendor lock-in risk. Your agent logic lives in Vapi's configuration format.
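The cost-stacking point is easier to see with arithmetic. The sketch below sums per-minute line items into a total. Every rate in it is a made-up placeholder, not a quoted price from Vapi or any provider, so substitute the numbers from your own vendors' pricing pages.

```python
from dataclasses import dataclass


@dataclass
class PerMinuteRates:
    """Per-minute cost of each layer in USD. All values are hypothetical."""
    platform: float
    stt: float
    llm: float
    tts: float
    telephony: float

    def total(self) -> float:
        # The platform fee is only one of five terms in the sum.
        return self.platform + self.stt + self.llm + self.tts + self.telephony


def monthly_cost(rates: PerMinuteRates, minutes_per_month: int) -> float:
    """Total monthly spend for a given call volume."""
    return rates.total() * minutes_per_month


# Placeholder rates for illustration only.
rates = PerMinuteRates(platform=0.05, stt=0.01, llm=0.02, tts=0.02, telephony=0.01)

print(f"Per minute: ${rates.total():.2f}")
print(f"Platform share of total: {rates.platform / rates.total():.0%}")
print(f"Monthly at 10,000 minutes: ${monthly_cost(rates, 10_000):,.2f}")
```

Running the same calculation with two or three candidate provider combinations shows quickly how much of the bill the platform fee actually accounts for.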
When to pick Vapi: You need a voice agent on a phone line this week, your team is small, and you would rather configure than code. Vapi is the right call for prototyping and for telephony-heavy use cases where the managed overhead saves more time than it costs.
Pipecat: composable pipelines for full control
Pipecat is an open-source Python framework created by Daily. It reached v1.0 in April 2026. The core abstraction is a pipeline of frame processors: audio frames flow in, get processed through STT, LLM, and TTS stages, and audio flows back out. You assemble the pipeline yourself, which means you control every processing step.
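If the frame processor idea is new to you, a toy model helps. The sketch below uses plain Python async generators and does not import Pipecat. The `Frame` class, the stage functions, and `build_pipeline` are all hypothetical names, and the "STT", "LLM", and "TTS" stages are string transformations standing in for real services.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, List


@dataclass
class Frame:
    kind: str     # "audio_in", "transcript", "reply_text", or "audio_out"
    payload: str


Stage = Callable[[AsyncIterator[Frame]], AsyncIterator[Frame]]


async def source(utterances: List[str]) -> AsyncIterator[Frame]:
    for utterance in utterances:
        yield Frame("audio_in", utterance)


async def stt_stage(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    async for frame in frames:
        # Stand-in for transcription: only audio_in frames are transformed.
        yield Frame("transcript", frame.payload.lower()) if frame.kind == "audio_in" else frame


async def llm_stage(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    async for frame in frames:
        # Stand-in for the LLM call.
        yield Frame("reply_text", f"You said: {frame.payload}") if frame.kind == "transcript" else frame


async def tts_stage(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    async for frame in frames:
        # Stand-in for synthesis.
        yield Frame("audio_out", f"<audio for '{frame.payload}'>") if frame.kind == "reply_text" else frame


def build_pipeline(stream: AsyncIterator[Frame], stages: List[Stage]) -> AsyncIterator[Frame]:
    # Each stage wraps the previous stream, so frames flow through in order.
    for stage in stages:
        stream = stage(stream)
    return stream


async def run(utterances: List[str]) -> List[Frame]:
    pipeline = build_pipeline(source(utterances), [stt_stage, llm_stage, tts_stage])
    return [frame async for frame in pipeline]


frames = asyncio.run(run(["HELLO"]))
print(frames)
```

Because every stage has the same input and output type, replacing `tts_stage` with a different implementation leaves the rest of the list untouched, which is the property the real framework is built around.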
Strengths:
- Full pipeline visibility and control. You can inspect, modify, or replace any stage.
- 60+ service integrations out of the box. Swap TTS providers without touching the rest of your pipeline.
- Pipecat Flows for managing complex conversational state machines.
- Subagents for distributed multi-agent architectures where specialists hand off conversations.
- Strong developer tooling: Whisker for real-time pipeline debugging, Tail for live monitoring, and OpenTelemetry integration.
- No platform lock-in. BSD-2 license, your code runs anywhere.
Trade-offs:
- Python only for the core framework. Client SDKs exist for JS, React, iOS, and Android, but your agent logic is Python.
- You own the infrastructure. Pipecat does not host your agents. You need to run them on your own servers or a cloud provider.
- Transport is separate. Pipecat handles the AI pipeline; for WebRTC transport, most teams use Daily (which makes sense given the lineage). You can also bring your own transport layer.
- Steeper learning curve than a managed platform. Expect a day or two to get comfortable with the frame processing model.
When to pick Pipecat: You want to own every layer of your voice pipeline, your team writes Python, and you are comfortable managing your own infrastructure. Pipecat is the right choice when pipeline-level control matters more than deployment speed.
LiveKit: WebRTC infrastructure with an agent layer
LiveKit is an open-source WebRTC media server with an Agents SDK built on top. The core differentiator is the room model: your agent joins a LiveKit room as a participant alongside users, which means multi-participant scenarios (group calls, video + voice, screen sharing) are native rather than bolted on.
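The room model is easiest to picture as a data structure. The following is a toy sketch in plain Python, not the LiveKit SDK. `Room`, `Participant`, and `audience_for` are hypothetical names chosen only to show that the agent is one more participant rather than a special endpoint.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Participant:
    identity: str
    is_agent: bool = False
    tracks: List[str] = field(default_factory=list)  # e.g. "audio", "video"


@dataclass
class Room:
    name: str
    participants: Dict[str, Participant] = field(default_factory=dict)

    def join(self, participant: Participant) -> None:
        self.participants[participant.identity] = participant

    def audience_for(self, identity: str) -> List[str]:
        """Everyone who would receive media published by `identity`."""
        return [p.identity for p in self.participants.values() if p.identity != identity]


room = Room("support-call")
room.join(Participant("alice", tracks=["audio", "video"]))
room.join(Participant("bob", tracks=["audio"]))
room.join(Participant("agent", is_agent=True, tracks=["audio"]))

# The agent's audio reaches every human in the room, and the agent hears all of them.
print(room.audience_for("agent"))
print(room.audience_for("alice"))
```

Adding a second user or a second agent is another `join` call in this model, which is why group scenarios need no special handling.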
Strengths:
- WebRTC infrastructure is battle-tested and production-grade. LiveKit handles the hard parts of realtime media transport.
- Self-hostable end to end. The media server, Agents SDK, and SIP bridge are all open source under Apache 2.0.
- Native multi-participant support. Your agent can interact in rooms with multiple users, share video, and handle complex turn-taking.
- Built-in semantic turn detection, adaptive interruption handling, and MCP tool support.
- Python and TypeScript SDKs for agent logic.
- LiveKit Cloud available if you do not want to manage infrastructure, with pricing starting at $0.01 per agent session minute.
Trade-offs:
- More infrastructure to manage if you self-host. You are running a WebRTC media server, not just a Python script.
- The Agents SDK is one layer in a larger system. Understanding LiveKit rooms, participants, and tracks is a prerequisite.
- The ecosystem is WebRTC-native, which is ideal for browser-based and multi-participant use cases but adds complexity if all you need is a simple phone agent.
- Smaller plugin ecosystem than Pipecat for AI service integrations, though the gap is closing.
When to pick LiveKit: You need WebRTC-native infrastructure, multi-participant rooms, or the ability to self-host everything. LiveKit is the right choice when your agent lives inside a broader realtime communication system rather than operating as a standalone phone bot.
Can I use Realtime TTS with these frameworks?
Yes. All three frameworks are orchestration layers that depend on underlying TTS models. Realtime TTS exposes a standard REST API that slots into any of them as the TTS provider.
With Pipecat, you write a custom service class that calls the Realtime TTS API in the TTS stage of your pipeline. Pipecat's plugin architecture is designed for exactly this pattern. The TTS endpoint accepts text and returns streaming audio, which maps directly to Pipecat's frame processing model:
# Pipecat pipeline with Realtime TTS
# Uses the REST streaming endpoint: POST /tts/v1/voice:stream
# Note: the Pipecat import paths and frame class below may differ between
# Pipecat releases; check them against your installed version.
import base64
import json

import aiohttp
from pipecat.frames.frames import AudioRawFrame
from pipecat.services.ai_services import TTSService

SAMPLE_RATE_HZ = 24000  # must match the rate requested in audioConfig


class RealtimeTTSService(TTSService):
    def __init__(self, api_key: str, voice_id: str = "Sarah", model_id: str = "inworld-tts-1.5-max"):
        super().__init__()
        self._api_key = api_key
        self._voice_id = voice_id
        self._model_id = model_id

    async def run_tts(self, text: str):
        """Stream synthesized speech for `text` as raw PCM audio frames."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                "https://api.inworld.ai/tts/v1/voice:stream",
                headers={
                    "Authorization": f"Basic {self._api_key}",
                    "Content-Type": "application/json",
                },
                json={
                    "text": text,
                    "voiceId": self._voice_id,
                    "modelId": self._model_id,
                    "audioConfig": {
                        "audioEncoding": "PCM",
                        "sampleRateHertz": SAMPLE_RATE_HZ,
                    },
                },
            ) as resp:
                resp.raise_for_status()
                # The response is read line by line; each non-empty line is
                # a JSON object carrying one base64-encoded audio chunk.
                async for line in resp.content:
                    line = line.strip()
                    if not line:
                        continue
                    chunk = json.loads(line)
                    audio_bytes = base64.b64decode(
                        chunk["result"]["audioContent"]
                    )
                    yield AudioRawFrame(
                        audio=audio_bytes,
                        sample_rate=SAMPLE_RATE_HZ,
                        num_channels=1,
                    )
With LiveKit, the Agents SDK voice pipeline accepts custom TTS implementations. Point the TTS stage at the Realtime TTS endpoint using the same REST streaming pattern shown above.
With Vapi, you configure custom providers in the assistant settings. Point the TTS provider to the Realtime TTS API endpoint, and Vapi will use it within its managed pipeline.
The integration pattern is the same in all cases: your framework handles orchestration, transport, and turn-taking. Realtime TTS handles voice synthesis. Swap it in for #1 ranked TTS quality without changing your framework choice.
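Since the pattern is the same everywhere, the response-parsing step can live in one framework-agnostic helper that a Pipecat service, a LiveKit TTS adapter, or a test harness can all call. This is a minimal sketch. The helper name `decode_audio_lines` is hypothetical, and the response shape (one JSON object per line, with base64 audio under `result.audioContent`) is assumed from the Pipecat example in this guide rather than taken from an API specification.

```python
import base64
import json
from typing import Iterable, Iterator


def decode_audio_lines(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Turn newline-delimited JSON chunks into raw audio bytes.

    Blank lines and chunks without audio content are skipped.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        audio_b64 = chunk.get("result", {}).get("audioContent")
        if audio_b64:
            yield base64.b64decode(audio_b64)


def fake_line(pcm: bytes) -> bytes:
    """Build one response line in the assumed shape, for local testing."""
    body = {"result": {"audioContent": base64.b64encode(pcm).decode("ascii")}}
    return json.dumps(body).encode("utf-8") + b"\n"


# Simulated stream: two audio chunks separated by a blank keep-alive line.
stream = [fake_line(b"\x00\x01"), b"\n", fake_line(b"\x02\x03")]
pcm = b"".join(decode_audio_lines(stream))
print(len(pcm), "bytes of PCM decoded")
```

Keeping the parser separate from the HTTP client also means you can unit test it without network access or API credentials.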
Framework vs API: when do you need a framework at all?
Frameworks solve orchestration: connecting STT to LLM to TTS, managing turn-taking, handling interruptions, and piping audio through a transport layer. If you are building a voice agent from individual components, a framework saves you from writing that glue code yourself.
But there is an alternative. The Realtime API handles STT, reasoning, TTS, and tool calling through a single WebSocket or WebRTC connection. One API call covers the full voice pipeline: audio goes in, audio comes out, with the reasoning and voice generation handled server-side. The system includes semantic VAD with configurable eagerness, barge-in and interruption handling, and simultaneous text and audio streaming.
The honest answer: use a framework when you need transport-level control, multi-participant rooms, native telephony, or custom pipeline stages that go beyond what an API exposes. Use the Realtime API when you want the simplest possible integration with the lowest latency and do not need to manage your own voice pipeline.
You can also mix approaches. Start with the Realtime API for the core voice loop, and use a framework for transport and session management if your deployment requires it.
How to evaluate voice quality across any framework
Whichever framework you choose, the TTS model determines how your agent sounds. Framework choice affects latency and orchestration; TTS model choice affects whether your users want to keep talking.
A few things worth checking:
- Independent benchmarks. The Artificial Analysis TTS Arena ranks models by blind human preference votes. Realtime TTS 1.5 Max currently sits at #1 with an ELO around 1,208. ELO scores shift with new votes, so check the live leaderboard.
- End-to-end latency, not just TTFA. Time-to-first-audio is useful but incomplete. What your users experience is the full round-trip from finishing their sentence to hearing the agent respond. Ask your TTS provider for median end-to-end numbers, not cherry-picked inference metrics.
- Interruption recovery. How the agent handles barge-in (the user talks over the agent) matters more than raw latency for conversational quality. Test this with your actual framework integration, not in isolation.
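To compare end-to-end latency with TTFA on your own stack, log three timestamps per turn and compute medians over many turns. The sketch below assumes you can capture when the user stopped speaking, when the TTS request was sent, and when the first agent audio played. The `Turn` class and the sample values are hypothetical, not measurements from any provider.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Turn:
    """Timestamps for one conversational turn, in milliseconds."""
    user_stopped: int   # user finished their sentence
    tts_requested: int  # text was sent to the TTS provider
    first_audio: int    # first agent audio reached the user


def end_to_end_ms(turn: Turn) -> int:
    # What the user experiences: silence between their speech and the reply.
    return turn.first_audio - turn.user_stopped


def ttfa_ms(turn: Turn) -> int:
    # Time-to-first-audio covers only the TTS leg of the round trip.
    return turn.first_audio - turn.tts_requested


# Hypothetical sample data for illustration.
turns: List[Turn] = [
    Turn(user_stopped=0, tts_requested=450, first_audio=600),
    Turn(user_stopped=1000, tts_requested=1500, first_audio=1700),
    Turn(user_stopped=3000, tts_requested=3300, first_audio=3500),
]

median_e2e = median(end_to_end_ms(t) for t in turns)
median_ttfa = median(ttfa_ms(t) for t in turns)
print(f"median end-to-end: {median_e2e} ms, median TTFA: {median_ttfa} ms")
```

In this made-up sample the TTS leg is a fraction of what the user waits for, which is why a TTFA figure on its own says little about conversational feel.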
TL;DR: picking the right stack
- Need a phone agent this week? Vapi. Configure your providers, point it at a phone number, ship it.
- Want full pipeline control in Python? Pipecat. Assemble your own STT + LLM + TTS stack, own every frame.
- Building WebRTC-native or multi-participant? LiveKit. The room model and self-hosting are hard to replicate.
- Want the simplest path to top-quality voice with minimal infrastructure? Skip the framework. Use the Realtime API directly.
Regardless of which path you take, the TTS model matters. Realtime TTS delivers #1 ranked voice quality and integrates with any framework as a standard REST TTS provider. Or skip the framework entirely and use the Realtime API for the full voice pipeline over a single connection.
FAQs
What is the difference between Vapi, Pipecat, and LiveKit?
Vapi is a managed voice agent platform with pay-per-minute pricing and built-in telephony. Pipecat is an open-source Python framework by Daily with a composable pipeline architecture where you assemble your own STT, LLM, and TTS stack. LiveKit is an open-source WebRTC infrastructure project with an Agents SDK for building voice and video AI applications, available self-hosted or as a managed cloud service. All three are orchestration layers that need underlying models (TTS, STT, LLM) to function.
Which voice agent framework is best for production?
It depends on your production requirements. Vapi optimizes for fast deployment and managed infrastructure. Pipecat gives you the most pipeline-level control for tuning latency and quality. LiveKit gives you the most infrastructure-level control, especially if you self-host. For production voice quality specifically, the TTS model matters more than the framework. Realtime TTS ranks #1 on Artificial Analysis regardless of which framework you use.
Can I switch frameworks later without rebuilding?
Partially. Your LLM prompts, tool definitions, and business logic are portable. Your orchestration code and transport integration are not. Pipecat and LiveKit are both open source, so migration between them is a code refactor, not a vendor negotiation. Migrating off Vapi requires rebuilding the orchestration layer since it is closed source.
How much does it cost to run a voice agent?
Framework costs are only part of the picture. Vapi charges a per-minute platform fee plus passthrough costs for your chosen providers. Pipecat and LiveKit are free to use, but you pay for the AI services (TTS, STT, LLM) and your hosting infrastructure. LiveKit Cloud adds $0.01/min for agent sessions. The dominant cost in any stack is usually the LLM and TTS providers, not the framework itself. See current pricing for Realtime TTS, STT, and Router rates.
Do I need WebRTC for a voice agent?
Not necessarily. WebRTC gives you low-latency, browser-native audio transport with NAT traversal and encryption built in. It is the best choice for browser-based and multi-participant use cases. For server-to-server pipelines or telephony, WebSocket-based transports work fine. The Realtime API supports both WebSocket and WebRTC connections.
What is the best TTS for voice agents in 2026?
Realtime TTS 1.5 Max ranks #1 on the Artificial Analysis TTS leaderboard with an ELO around 1,208 and sub-200ms median end-to-end latency. It works with any framework through a standard REST API. The Mini variant runs at around 120ms median for latency-sensitive use cases. ELO scores fluctuate, so always check the live leaderboard for the latest rankings.
Is the Realtime API a competitor to these frameworks?
Not exactly. The Realtime API replaces the need for a framework by handling the full voice pipeline (STT, reasoning, TTS, tool calling) through a single connection. Frameworks remain the better choice when you need custom pipeline stages, self-hosted transport, multi-participant rooms, or native telephony. Think of it as a vertical integration option: same models, no orchestration required.