By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
A phone agent is a voice AI that answers and places real telephone calls, handling tasks like customer service, sales qualification, appointment scheduling, and outbound follow-up. Inworld AI's Realtime TTS is ranked #1 on the Artificial Analysis Speech Arena, where it holds three of the top five spots, with sub-200ms time-to-first-audio that holds up over PSTN-quality audio. In 2026, phone agents have moved from research curiosities to production deployments handling millions of minutes per month, and enterprise customer service teams increasingly route inbound calls to AI before escalating to humans.
This guide ranks the TTS APIs and voice agent stacks engineered for telephony, and explains the constraints (codec quality, jitter, barge-in, function calling) that separate "works in the demo" from "survives a real call center."
What Makes a TTS API Work for Phone Agents?
Phone calls add constraints that browser-based voice apps do not have:
- Narrowband codecs. PSTN audio runs at 8 kHz mono, often through G.711 μ-law compression. TTS that sounds great in a browser may sound robotic over a phone. Realtime TTS supports MULAW and ALAW at 8 kHz natively.
- Jitter and packet loss. Real telephony networks drop packets. Streaming TTS must tolerate retransmits and resync without audible glitches.
- Barge-in. Callers interrupt. The agent must detect the interruption, stop speaking, and handle the new utterance immediately. This requires semantic VAD on the STT side and instant TTS interruption on the playback side.
- Latency under load. A live call has no buffer. Sub-200ms time-to-first-audio is the difference between a natural conversation and an obviously-AI experience that makes callers hang up.
- Function calling at speed. The agent must call lookup tools (CRM, scheduling, booking) without dropping the audio stream. Production-grade phone agent stacks handle tool calls inside the speech loop natively.
- Concurrent sessions. A real call center handles 1,000+ concurrent calls. The voice stack must scale horizontally without per-session bottlenecks.
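The narrowband-codec constraint above is easy to see in code. Here is a minimal pure-Python G.711 μ-law encoder/decoder, for illustration only (production stacks use native codec implementations): every 16-bit PCM sample is compressed to 8 bits, which is why TTS that sounds rich at 44.1 kHz in a browser can sound flattened over PSTN.

```python
BIAS = 0x84   # G.711 mu-law bias
CLIP = 32635  # clip threshold before companding

def linear_to_ulaw(sample: int) -> int:
    """Encode one 16-bit signed PCM sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    if sample < 0:
        sample = -sample
    sample = min(sample, CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (sample & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample
```

Quantization error grows with the segment, so loud samples lose the most detail; that lost resolution is the "robotic over a phone" effect the bullet describes.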
Quick Ranking: TTS APIs for Phone Agents
| Provider | Telephony Quality | Latency (TTFB) | Languages | SIP/PSTN Support | Concurrent Sessions |
|---|---|---|---|---|---|
| Realtime TTS | #1 on Artificial Analysis Speech Arena, native MULAW/ALAW 8kHz | Sub-200ms | 15 | Via Twilio, Telnyx, LiveKit, Vapi | 1,000+ verified |
| ElevenLabs (Eleven v3) | #2 on Artificial Analysis | ~250-400ms | 70+ | Via Conversational AI + partners | Production-scale |
| Cartesia Sonic 3 | Mid-tier ranking | Sub-100ms (Turbo) | 42+ | Via Line + partners | Production-scale |
| Deepgram Aura-2 | Mid-tier, bundled with Voice Agent API | ~200ms | English-focused | Native via Voice Agent API | Production-scale |
| OpenAI TTS | Mid-tier | ~300ms | 57+ | Via Realtime API + SIP | Production-scale |
Why Realtime TTS Wins on Telephony
Three things set Realtime TTS apart for phone agents:
- Voice quality survives the codec. PSTN compression flattens expressive nuance. Realtime TTS is engineered to preserve naturalness through G.711, retaining the prosody and emotional shading that make callers stay engaged.
- Sub-200ms time-to-first-audio. Below the threshold of human conversational latency. Combined with streaming STT and LLM inference, total round-trip stays under 800ms.
- Integration with voice-aware routing. Inside the Realtime API, STT acoustic signals (caller emotion, hesitation, speaker profile) reach the Realtime Router, which selects the right LLM for each turn, and the TTS adapts pacing and emotion accordingly. A frustrated caller routes differently than a routine inquiry.
Code Example: Phone Agent with Twilio + Realtime API
```python
# Server-side Twilio webhook that bridges a call to the Realtime API.
from fastapi import FastAPI, WebSocket
from fastapi.responses import Response
from twilio.twiml.voice_response import VoiceResponse, Connect
import json
import websockets

app = FastAPI()

@app.post("/voice")
async def incoming_call():
    # Return TwiML telling Twilio to open a media stream to our WebSocket.
    response = VoiceResponse()
    connect = Connect()
    connect.stream(url="wss://your-server.example.com/media")
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")

@app.websocket("/media")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    inworld_url = (
        "wss://api.inworld.ai/api/v1/realtime/session"
        "?key=<session-id>&protocol=realtime"
    )
    async with websockets.connect(
        inworld_url,
        extra_headers={"Authorization": "Basic <your-api-key>"},
    ) as inworld_ws:
        # Configure the session: 8 kHz mu-law in and out, semantic VAD.
        await inworld_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "gpt-5.5",
                "instructions": "You are a friendly support agent.",
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcmu", "rate": 8000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "medium"
                        }
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "format": {"type": "audio/pcmu", "rate": 8000},
                        "speed": 1.0
                    }
                }
            }
        }))
        # Bridge Twilio media frames <-> Inworld events here.
        # See docs.inworld.ai/realtime for full event schema.
```
For production telephony deployments, use LiveKit, Vapi, or Telnyx as the SIP layer. All three integrate the Realtime API as a first-class voice provider.
What to Look For in a Phone Agent Stack
- Sub-200ms TTS time-to-first-audio. Anything slower turns into an obviously-AI experience.
- Native 8 kHz MULAW/ALAW support. Avoid resampling pipelines that introduce latency and quality loss.
- Semantic VAD on STT. Energy-based VAD cuts off callers mid-sentence. Semantic VAD waits for natural turn boundaries.
- Instant interruption handling. When the caller starts speaking, the agent stops within 50ms.
- Function calling inside the audio loop. CRM, booking, payment lookups must not pause the conversation.
- Concurrent-session scale. 1,000+ simultaneous calls without per-session bottleneck.
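The interruption-handling item in this checklist can be sketched as a small gate: when the STT side signals caller speech, drop queued agent audio and flush what Twilio has already buffered (Twilio's bidirectional Media Streams accept a `clear` message for exactly this). The class names here are illustrative, not part of any vendor API.

```python
import json
from collections import deque

class BargeInGate:
    """Minimal barge-in sketch: queue outbound TTS audio, flush it the
    instant caller speech is detected. Real stacks do this inside the
    media loop under the ~50ms budget described above."""

    def __init__(self, stream_sid: str):
        self.stream_sid = stream_sid
        self.queue: deque[str] = deque()
        self.agent_has_floor = True

    def enqueue_tts(self, b64_payload: str) -> None:
        # Audio arriving after an interruption is discarded, not played.
        if self.agent_has_floor:
            self.queue.append(b64_payload)

    def on_caller_speech_start(self) -> str:
        """Drop pending agent audio and return the Twilio 'clear' frame
        that flushes audio already buffered on Twilio's side."""
        self.queue.clear()
        self.agent_has_floor = False
        return json.dumps({"event": "clear", "streamSid": self.stream_sid})

    def on_agent_turn_start(self) -> None:
        self.agent_has_floor = True
```

The key design point: clearing the local queue alone is not enough, because Twilio may hold hundreds of milliseconds of already-sent audio; the `clear` frame is what makes the agent fall silent immediately.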
FAQ
What is the best TTS for AI phone agents?
Realtime TTS ranks #1 on the Artificial Analysis Speech Arena with three of the top five spots, supports native 8 kHz MULAW/ALAW for PSTN audio, and delivers sub-200ms time-to-first-audio. It integrates with telephony platforms (Twilio, Telnyx, LiveKit, Vapi) and the Realtime API for an end-to-end pipeline.
How do I integrate AI voice with Twilio?
Use Twilio's Media Streams to bridge inbound calls to a WebSocket endpoint that proxies to the Realtime API. The Realtime API accepts MULAW at 8 kHz directly, so the bridge is a thin pass-through. See the code example above for the FastAPI structure.
What latency is acceptable for a phone agent?
Total round-trip (caller speaks → agent responds) should stay under 1 second, ideally under 800ms. Beyond 1.5 seconds, callers perceive the system as broken and hang up. Realtime TTS contributes sub-200ms of that budget; Realtime STT contributes 100-300ms; the LLM contributes 200-500ms time-to-first-token.
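The component ranges above can be summed into a rough per-turn budget. The values below are midpoints of the stated ranges plus an assumed network-transit allowance; they are illustrative, not measurements.

```python
# Rough per-turn latency budget for a phone agent, in milliseconds.
budget_ms = {
    "stt_finalize": 200,     # Realtime STT: 100-300ms range
    "llm_first_token": 350,  # 200-500ms time-to-first-token range
    "tts_first_audio": 200,  # sub-200ms target
    "network_transit": 100,  # PSTN + media-server hops (assumed)
}
total = sum(budget_ms.values())
print(total)  # 850
```

At the midpoints the total lands just over the 800ms ideal, which is why phone agents have so little slack: a single slow component pushes the turn toward the 1.5-second point where callers hang up.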
How many concurrent calls can a voice AI handle?
Production deployments routinely handle 1,000+ concurrent calls. Capacity scales with the underlying infrastructure. The Realtime API and BYO-orchestration platforms (LiveKit, Vapi, Telnyx) are all built for telephony scale.
Can phone agents call APIs during the conversation?
Yes. Modern voice agent stacks support function calling inside the speech loop. The agent can look up customer records, check inventory, book appointments, or process payments without dropping the audio. The Realtime API supports tool calling natively as part of the WebSocket event stream.
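In practice this means handling a tool-call event on the same WebSocket that carries audio: run the tool, send the result back, and let the model resume speaking. The sketch below assumes a Realtime-style event shape (`call_id`, JSON-string `arguments`, `function_call_output`); the tool names and event fields are hypothetical, not Inworld's documented schema.

```python
import json

# Hypothetical tool registry keyed by function name.
TOOLS = {
    "lookup_customer": lambda args: {"name": "Ada", "plan": "pro"},
    "check_inventory": lambda args: {"sku": args["sku"], "in_stock": True},
}

def handle_tool_call(event: str) -> str:
    """Run the requested tool and build the result event to send back
    over the same WebSocket, so audio streaming never pauses."""
    msg = json.loads(event)
    result = TOOLS[msg["name"]](json.loads(msg["arguments"]))
    return json.dumps({
        "type": "conversation.item.create",  # assumed event name
        "item": {
            "type": "function_call_output",
            "call_id": msg["call_id"],
            "output": json.dumps(result),
        },
    })
```

Because the lookup runs off the audio path, a slow CRM call delays only the content of the next utterance, not the stream itself; production agents typically cover the gap with a brief filler ("one moment while I check that").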