Can I use Inworld with Vapi, Pipecat, or LiveKit?

Yes. Realtime TTS exposes a standard REST API that integrates with any of these frameworks as a TTS provider. Pipecat and LiveKit have plugin architectures where you can swap in any TTS provider. Vapi supports custom provider configurations. Alternatively, the Realtime API provides a fully integrated voice pipeline without needing a separate framework.

Is Pipecat free to use?

Pipecat is open-source under a BSD license and free to use. Your costs come from the AI services you plug into the pipeline (TTS, STT, LLM providers) and your own hosting infrastructure. Daily offers managed transport if you do not want to handle WebRTC yourself.

Is LiveKit free to use?

The LiveKit server and Agents SDK are open-source under Apache 2.0 and free to self-host. LiveKit Cloud is a managed option with pricing starting at $0.01 per agent session minute. Self-hosting eliminates per-minute platform fees but requires you to manage your own infrastructure.

What is the latency difference between using a framework and a direct API?

Frameworks add orchestration overhead on top of the underlying model latency. A cascaded pipeline (separate STT, LLM, TTS calls) typically adds 100-300ms of total overhead depending on the framework and transport. A direct realtime API that handles the full pipeline in one connection eliminates most of that orchestration cost.

Vapi vs Pipecat vs LiveKit: Which Voice Agent Framework in 2026?

Q: Which voice agent framework should I use?

Choose Vapi if you want the fastest path to a working voice agent with minimal infrastructure work. Choose Pipecat if you want full pipeline control in Python with no platform lock-in. Choose LiveKit if you need native WebRTC room infrastructure, multi-participant support, or want to self-host everything. All three are orchestration layers that need underlying TTS, STT, and LLM models to function.

Q: Do I need a framework to build a voice agent?

No. Frameworks handle orchestration, turn-taking, and transport, but you can build directly on a realtime API instead. The Realtime API handles STT, reasoning, TTS, and tool calling through a single WebSocket or WebRTC connection, which eliminates the need to stitch together separate services through a framework.

Q: What is the best TTS to use with Pipecat or LiveKit?

Both frameworks are provider-agnostic, so the best TTS depends on your priorities. The Inworld TTS-2 research preview is built for expressive, low-latency realtime speech, with 8-dimension natural-language steering and a sub-200ms median TTFT, making it a strong default when voice quality is the priority.

Vapi, Pipecat, and LiveKit are the three most popular frameworks for building voice agents in 2026, and they solve fundamentally different problems. Realtime TTS from Inworld AI (TTS-2 research preview is the #1 realtime TTS) plugs into any of them as the TTS layer. Or you can skip the framework entirely and use the Realtime API for the full voice pipeline over a single connection. This guide breaks down each framework honestly so you can pick the right one for your stack.

Which voice agent framework should I use?

The answer depends on where you want to spend your engineering time.

Vapi: managed orchestration for speed

Vapi is a managed voice agent platform. You configure your STT, LLM, and TTS providers through Vapi's API or dashboard, and Vapi handles the realtime orchestration loop: listen, think, speak. The value proposition is speed to first call. You can have a working voice agent on a phone number in hours, not days.

Strengths:

Fastest path from zero to a working voice agent. Developer onboarding is genuinely good.
Built-in telephony integration with Twilio and Vonage. Phone number provisioning is a first-class feature.
Flow Studio provides a visual, no-code builder for multi-step conversational workflows.
Broad provider compatibility. You pick your STT, LLM, and TTS vendors through configuration.

Trade-offs:

Closed source. You cannot inspect or modify the orchestration layer.
Cost stacks quickly. The platform fee is the floor, not the ceiling. STT, LLM, TTS, and telephony costs layer on top, and the total per-minute cost depends entirely on which providers you select.
You do not control the voice pipeline directly. Latency optimization, custom VAD behavior, and pipeline-level debugging are limited to what Vapi exposes.
Vendor lock-in risk. Your agent logic lives in Vapi's configuration format.

When to pick Vapi: You need a voice agent on a phone line this week, your team is small, and you would rather configure than code. Vapi is the right call for prototyping and for telephony-heavy use cases where the managed overhead saves more time than it costs.

Pipecat: composable pipelines for full control

Pipecat is an open-source Python framework created by Daily. It reached v1.0 in April 2026. The core abstraction is a pipeline of frame processors: audio frames flow in, get processed through STT, LLM, and TTS stages, and audio flows back out. You assemble the pipeline yourself, which means you control every processing step.
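
To make the frame-processor idea concrete, here is a minimal sketch in plain Python. It deliberately does not use Pipecat's own classes: `Frame`, `fake_stt`, `fake_llm`, `fake_tts`, and `run_pipeline` are hypothetical names, and the stage bodies are stand-ins for real services. It only illustrates the shape of the abstraction, where each stage consumes a stream of frames and emits a new one.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, List


@dataclass
class Frame:
    kind: str     # e.g. "audio_in", "text", "reply", "audio_out"
    payload: str  # a string stands in for real audio or text data


# A stage takes a stream of frames and returns a new stream of frames.
Stage = Callable[[AsyncIterator[Frame]], AsyncIterator[Frame]]


async def fake_stt(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    # Stand-in for speech-to-text: audio in, transcript out.
    async for frame in frames:
        yield Frame("text", f"transcript({frame.payload})")


async def fake_llm(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    # Stand-in for the LLM: transcript in, reply text out.
    async for frame in frames:
        yield Frame("reply", f"answer({frame.payload})")


async def fake_tts(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    # Stand-in for text-to-speech: reply text in, audio out.
    async for frame in frames:
        yield Frame("audio_out", f"speech({frame.payload})")


async def source(items: List[str]) -> AsyncIterator[Frame]:
    # Stand-in for the transport delivering user audio.
    for item in items:
        yield Frame("audio_in", item)


async def run_pipeline(inputs: List[str], stages: List[Stage]) -> List[Frame]:
    # Chain the stages so each one consumes the previous stage's output.
    stream = source(inputs)
    for stage in stages:
        stream = stage(stream)
    return [frame async for frame in stream]


def demo() -> List[Frame]:
    return asyncio.run(run_pipeline(["hello"], [fake_stt, fake_llm, fake_tts]))
```

Because the stages only agree on the frame type, replacing `fake_tts` with another implementation leaves the rest of the chain untouched. That is the property the real framework is built around.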

Strengths:

Full pipeline visibility and control. You can inspect, modify, or replace any stage.
60+ service integrations out of the box. Swap TTS providers without touching the rest of your pipeline.
Pipecat Flows for managing complex conversational state machines.
Subagents for distributed multi-agent architectures where specialists hand off conversations.
Strong developer tooling: Whisker for real-time pipeline debugging, Tail for live monitoring, and OpenTelemetry integration.
No platform lock-in. Pipecat uses the BSD 2-Clause license, so your code runs anywhere.

Trade-offs:

Python only for the core framework. Client SDKs exist for JS, React, iOS, and Android, but your agent logic is Python.
You own the infrastructure. Pipecat does not host your agents. You need to run them on your own servers or a cloud provider.
Transport is separate. Pipecat handles the AI pipeline; for WebRTC transport, most teams use Daily (which makes sense given the lineage). You can also bring your own transport layer.
Steeper learning curve than a managed platform. Expect a day or two to get comfortable with the frame processing model.

When to pick Pipecat: You want to own every layer of your voice pipeline, your team writes Python, and you are comfortable managing your own infrastructure. Pipecat is the right choice when pipeline-level control matters more than deployment speed.

LiveKit: WebRTC infrastructure with an agent layer

LiveKit is an open-source WebRTC media server with an Agents SDK built on top. The core differentiator is the room model: your agent joins a LiveKit room as a participant alongside users, which means multi-participant scenarios (group calls, video + voice, screen sharing) are native rather than bolted on.

Strengths:

WebRTC infrastructure is battle-tested and production-grade. LiveKit handles the hard parts of realtime media transport.
Self-hostable end to end. The media server, Agents SDK, and SIP bridge are all open source under Apache 2.0.
Native multi-participant support. Your agent can interact in rooms with multiple users, share video, and handle complex turn-taking.
Built-in semantic turn detection, adaptive interruption handling, and MCP tool support.
Python and TypeScript SDKs for agent logic.
LiveKit Cloud available if you do not want to manage infrastructure, with pricing starting at $0.01 per agent session minute.

Trade-offs:

More infrastructure to manage if you self-host. You are running a WebRTC media server, not just a Python script.
The Agents SDK is one layer in a larger system. Understanding LiveKit rooms, participants, and tracks is a prerequisite.
The ecosystem is WebRTC-native, which is ideal for browser-based and multi-participant use cases but adds complexity if all you need is a simple phone agent.
Smaller plugin ecosystem than Pipecat for AI service integrations, though the gap is closing.

When to pick LiveKit: You need WebRTC-native infrastructure, multi-participant rooms, or the ability to self-host everything. LiveKit is the right choice when your agent lives inside a broader realtime communication system rather than operating as a standalone phone bot.

Can I use Realtime TTS with these frameworks?

Yes. All three frameworks are orchestration layers that depend on underlying TTS models. Realtime TTS exposes a standard REST API that slots into any of them as the TTS provider.

With Pipecat, you write a custom service class that calls the Realtime TTS API in the TTS stage of your pipeline. Pipecat's plugin architecture is designed for exactly this pattern. The TTS endpoint accepts text and returns streaming audio, which maps directly to Pipecat's frame processing model:

# Pipecat pipeline with Realtime TTS
# Uses the REST streaming endpoint: POST /tts/v1/voice:stream

import base64
import json

import aiohttp
# Import paths vary across Pipecat releases; adjust these to your installed version.
from pipecat.frames.frames import AudioRawFrame
from pipecat.services.ai_services import TTSService


class RealtimeTTSService(TTSService):
    """Custom Pipecat TTS stage that streams audio from the Realtime TTS REST endpoint."""

    def __init__(self, api_key: str, voice_id: str = "Sarah", model_id: str = "inworld-tts-1.5-max"):
        super().__init__()
        self._api_key = api_key
        self._voice_id = voice_id
        self._model_id = model_id

    async def run_tts(self, text: str):
        # A session per call keeps the example short; reuse one session in a real service.
        async with aiohttp.ClientSession() as session:
            async with session.post(
                "https://api.inworld.ai/tts/v1/voice:stream",
                headers={
                    "Authorization": f"Basic {self._api_key}",
                    "Content-Type": "application/json",
                },
                json={
                    "text": text,
                    "voiceId": self._voice_id,
                    "modelId": self._model_id,
                    "audioConfig": {
                        "audioEncoding": "PCM",
                        "sampleRateHertz": 24000,
                    },
                },
            ) as resp:
                resp.raise_for_status()
                # The body is read as newline-delimited JSON, one chunk per line.
                async for line in resp.content:
                    line = line.strip()
                    if not line:
                        continue
                    chunk = json.loads(line)
                    # Each chunk carries base64-encoded PCM audio.
                    audio_bytes = base64.b64decode(
                        chunk["result"]["audioContent"]
                    )
                    yield AudioRawFrame(
                        audio=audio_bytes,
                        sample_rate=24000,
                        num_channels=1,
                    )

With LiveKit, the Agents SDK voice pipeline accepts custom TTS implementations. Point the TTS stage at the Realtime TTS endpoint using the same REST streaming pattern shown above.

With Vapi, you configure custom providers in the assistant settings. Point the TTS provider to the Realtime TTS API endpoint, and Vapi will use it within its managed pipeline.

The integration pattern is the same in all cases: your framework handles orchestration, transport, and turn-taking. Realtime TTS handles voice synthesis. Swap it in for expressive, low-latency realtime voice quality without changing your framework choice.
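
Since the pattern is shared, the response-parsing step can live in one framework-neutral helper. The sketch below assumes the streaming response is newline-delimited JSON with base64 audio under `result.audioContent`, which is the shape the Pipecat example in this article relies on; confirm it against the current API reference before depending on it. `decode_audio_chunks` is a hypothetical name, not part of any SDK.

```python
import base64
import json
from typing import Iterable, Iterator


def decode_audio_chunks(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Turn raw response lines into decoded audio byte chunks.

    Assumes one JSON object per line, with base64 audio at
    result.audioContent (an assumption taken from the example above).
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank keep-alive or separator lines
        chunk = json.loads(line)
        encoded = chunk.get("result", {}).get("audioContent")
        if encoded:
            yield base64.b64decode(encoded)
```

A Pipecat service, a LiveKit custom TTS class, or a plain script can each wrap the bytes this yields in whatever audio frame type the framework expects.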

Framework vs API: when do you need a framework at all?

Frameworks solve orchestration: connecting STT to LLM to TTS, managing turn-taking, handling interruptions, and piping audio through a transport layer. If you are building a voice agent from individual components, a framework saves you from writing that glue code yourself.

But there is an alternative. The Realtime API handles STT, reasoning, TTS, and tool calling through a single WebSocket or WebRTC connection. One connection covers the full voice pipeline: audio goes in, audio comes out, with the reasoning and voice generation handled server-side. The system includes semantic VAD with configurable eagerness, barge-in and interruption handling, and simultaneous text and audio streaming.

The honest answer: use a framework when you need transport-level control, multi-participant rooms, native telephony, or custom pipeline stages that go beyond what an API exposes. Use the Realtime API when you want the simplest possible integration with the lowest latency and do not need to manage your own voice pipeline.

You can also mix approaches. Start with the Realtime API for the core voice loop, and use a framework for transport and session management if your deployment requires it.

How to evaluate voice quality across any framework

Whichever framework you choose, the TTS model determines how your agent sounds. Framework choice affects latency and orchestration; TTS model choice affects whether your users want to keep talking.

A few things worth checking:

Independent evaluation. Judge voice quality on evidence you can hear: side-by-side audio samples on your own scripts, blind preference tests with real users, and published latency figures. The Inworld TTS-2 research preview is built for expressive, low-latency realtime speech, with 8-dimension natural-language steering and a sub-200ms median TTFT. Run your own samples before committing.
End-to-end latency, not just TTFA. Time-to-first-audio is useful but incomplete. What your users experience is the full round-trip from finishing their sentence to hearing the agent respond. Ask your TTS provider for median end-to-end numbers, not cherry-picked inference metrics.
Interruption recovery. How the agent handles barge-in (the user talks over the agent) matters more than raw latency for conversational quality. Test this with your actual framework integration, not in isolation.
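
One way to get end-to-end numbers for your own stack is to log two timestamps per turn and summarize them. The sketch below is a minimal example under that assumption. The function names, the millisecond timestamp format, and the nearest-rank p95 are illustrative choices, not part of any framework.

```python
import math
from statistics import median
from typing import Dict, List, Tuple

# One turn = (user_stopped_speaking_ms, agent_first_audio_ms), both taken
# from the same clock, as measured at the client.
Turn = Tuple[int, int]


def end_to_end_latencies_ms(turns: List[Turn]) -> List[int]:
    """Latency per turn: when the agent's audio starts minus when the user stopped."""
    return [agent_start - user_end for user_end, agent_start in turns]


def summarize(turns: List[Turn]) -> Dict[str, float]:
    """Median and nearest-rank 95th percentile of end-to-end latency."""
    latencies = sorted(end_to_end_latencies_ms(turns))
    if not latencies:
        raise ValueError("need at least one turn")
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {"median_ms": median(latencies), "p95_ms": latencies[p95_index]}
```

Measuring at the client, inside your actual framework integration, keeps transport and orchestration overhead in the number instead of reporting model inference alone.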

TL;DR: picking the right stack

Need a phone agent this week? Vapi. Configure your providers, point it at a phone number, ship it.
Want full pipeline control in Python? Pipecat. Assemble your own STT + LLM + TTS stack, own every frame.
Building WebRTC-native or multi-participant? LiveKit. The room model and self-hosting are hard to replicate.
Want the simplest path to top-quality voice with minimal infrastructure? Skip the framework. Use the Realtime API directly.
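
The same decision can be written down as a small function, which is handy if you want to record the reasoning in a design doc. This is only a sketch of the list above: `Requirements` and `recommend_stack` are hypothetical names, and the order of the checks is an assumption about which constraint wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_phone_agent_fast: bool = False
    wants_python_pipeline_control: bool = False
    needs_webrtc_or_multi_participant: bool = False


def recommend_stack(req: Requirements) -> str:
    """Map requirements to the options in the TL;DR list.

    Assumed priority: infrastructure constraints first, then pipeline
    control, then speed to a first phone call.
    """
    if req.needs_webrtc_or_multi_participant:
        return "LiveKit"
    if req.wants_python_pipeline_control:
        return "Pipecat"
    if req.needs_phone_agent_fast:
        return "Vapi"
    # No framework-specific requirement: go direct.
    return "Realtime API"
```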

Regardless of which path you take, the TTS model matters. Realtime TTS is built for expressive, low-latency realtime speech and integrates with any framework as a standard REST TTS provider. Or skip the framework entirely and use the Realtime API for the full voice pipeline over a single connection.

FAQs

What is the difference between Vapi, Pipecat, and LiveKit?

Vapi is a managed voice agent platform with pay-per-minute pricing and built-in telephony. Pipecat is an open-source Python framework by Daily with a composable pipeline architecture where you assemble your own STT, LLM, and TTS stack. LiveKit is an open-source WebRTC infrastructure project with an Agents SDK for building voice and video AI applications, available self-hosted or as a managed cloud service. All three are orchestration layers that need underlying models (TTS, STT, LLM) to function.

Which voice agent framework is best for production?

It depends on your production requirements. Vapi optimizes for fast deployment and managed infrastructure. Pipecat gives you the most pipeline-level control for tuning latency and quality. LiveKit gives you the most infrastructure-level control, especially if you self-host. For production voice quality specifically, the TTS model matters more than the framework. Realtime TTS is built for expressive, low-latency realtime speech and works with any framework.

Can I switch frameworks later without rebuilding?

Partially. Your LLM prompts, tool definitions, and business logic are portable. Your orchestration code and transport integration are not. Pipecat and LiveKit are both open source, so migration between them is a code refactor, not a vendor negotiation. Migrating off Vapi requires rebuilding the orchestration layer since it is closed source.
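
One way to keep the portable part portable is to put a thin interface between your business logic and the framework-specific code. The sketch below assumes a synchronous, byte-streaming TTS interface for brevity. `TTSProvider`, `EchoTTS`, `build_reply`, and `speak_reply` are hypothetical names, not types from Vapi, Pipecat, or LiveKit.

```python
from typing import Iterator, Protocol


class TTSProvider(Protocol):
    """The only thing business logic needs to know about speech synthesis."""

    def synthesize(self, text: str) -> Iterator[bytes]:
        ...


class EchoTTS:
    """Test double: returns the text as bytes instead of calling a real service."""

    def synthesize(self, text: str) -> Iterator[bytes]:
        yield text.encode("utf-8")


def build_reply(user_text: str) -> str:
    # Portable business logic: no framework imports here.
    return f"You said: {user_text}"


def speak_reply(user_text: str, tts: TTSProvider) -> bytes:
    # Framework adapters call this and forward the audio to their transport.
    return b"".join(tts.synthesize(build_reply(user_text)))
```

With this split, a migration mostly means writing a new adapter that satisfies `TTSProvider` and wiring it into the new framework's transport. `build_reply` and its tests stay as they are.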

How much does it cost to run a voice agent?

Framework costs are only part of the picture. Vapi charges a per-minute platform fee plus passthrough costs for your chosen providers. Pipecat and LiveKit are free to use, but you pay for the AI services (TTS, STT, LLM) and your hosting infrastructure. LiveKit Cloud pricing starts at $0.01 per agent session minute. The dominant cost in any stack is usually the LLM and TTS providers, not the framework itself. See current pricing for Realtime TTS, STT, and Router rates.
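
To compare stacks on equal terms, add up every per-minute line item, not just the platform fee. The sketch below shows that arithmetic. `PerMinuteRates` and `monthly_cost` are hypothetical names, and any rates you pass in are your own inputs. The only figure in this article is the $0.01 per agent session minute starting price for LiveKit Cloud, so take everything else from your providers' current pricing pages.

```python
from dataclasses import dataclass


@dataclass
class PerMinuteRates:
    """Per-minute cost of each layer, in dollars."""
    platform: float        # framework or managed platform fee
    stt: float
    llm: float
    tts: float
    telephony: float = 0.0

    def total(self) -> float:
        return self.platform + self.stt + self.llm + self.tts + self.telephony


def monthly_cost(rates: PerMinuteRates, minutes_per_month: int) -> float:
    """Total monthly spend for a given call volume, rounded to cents."""
    return round(rates.total() * minutes_per_month, 2)


# Illustrative only: every rate except the 0.01 platform fee is a placeholder.
example = PerMinuteRates(platform=0.01, stt=0.01, llm=0.02, tts=0.01)
```

Hosting is left out because it rarely scales per minute. For self-hosted Pipecat or LiveKit, add it as a separate fixed monthly line.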

Do I need WebRTC for a voice agent?

Not necessarily. WebRTC gives you low-latency, browser-native audio transport with NAT traversal and encryption built in. It is the best choice for browser-based and multi-participant use cases. For server-to-server pipelines or telephony, WebSocket-based transports work fine. The Realtime API supports both WebSocket and WebRTC connections.

What is the best TTS for voice agents in 2026?

The Inworld TTS-2 research preview is built for expressive, low-latency realtime speech, with 8-dimension natural-language steering and a sub-200ms median TTFT. All three variants (TTS-2, 1.5 Max, 1.5 Mini) work with any framework through a standard REST API. The Mini variant is optimized for latency-sensitive use cases. Judge voice quality on your own scripts with side-by-side audio samples.

Is the Realtime API a competitor to these frameworks?

Not exactly. The Realtime API replaces the need for a framework by handling the full voice pipeline (STT, reasoning, TTS, tool calling) through a single connection. Frameworks remain the better choice when you need custom pipeline stages, self-hosted transport, multi-participant rooms, or native telephony. Think of it as a vertical integration option: same models, no orchestration required.
