By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
A phone agent is a voice AI that answers and places real telephone calls, handling tasks like customer service, sales qualification, appointment scheduling, and outbound follow-up. Inworld AI's Realtime TTS-2 preview is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena, with TTS 1.5 Max also top-tier in the realtime category. The Realtime API pairs that voice with model-agnostic LLM routing across 200+ LLMs and native MULAW/ALAW at 8 kHz for PSTN audio. In 2026, phone agents have moved from research curiosities into production deployments handling millions of minutes per month, and enterprise customer service teams increasingly route inbound calls to AI before escalating to humans.
This guide ranks the TTS APIs and voice agent stacks engineered for telephony, and explains the constraints (codec quality, jitter, barge-in, function calling) that separate "works in the demo" from "survives a real call center."
What Makes a TTS API Work for Phone Agents?
Phone calls add constraints that browser-based voice apps do not have:
- Narrowband codecs. PSTN audio runs at 8 kHz mono, often through G.711 μ-law compression. TTS that sounds great in a browser may sound robotic over a phone. Realtime TTS supports MULAW and ALAW at 8 kHz natively.
- Jitter and packet loss. Real telephony networks drop packets. Streaming TTS must tolerate retransmits and resync without audible glitches.
- Barge-in. Callers interrupt. The agent must detect the interruption, stop speaking, and handle the new utterance immediately. This requires semantic VAD on the STT side and instant TTS interruption on the playback side.
- Latency under load. A live call has no buffer. Realtime time-to-first-audio is the difference between a natural conversation and an obviously-AI experience that ends with the caller hanging up.
- Function calling at speed. The agent must call lookup tools (CRM, scheduling, booking) without dropping the audio stream. Production-grade phone agent stacks handle tool calls natively inside the speech loop.
- Concurrent sessions. A real call center handles 1,000+ concurrent calls. The voice stack must scale horizontally without per-session bottlenecks.
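To make the narrowband-codec constraint concrete, here is a minimal sketch of G.711 μ-law companding in pure Python. It is illustrative only: when the TTS emits MULAW at 8 kHz natively, as described above, no such conversion step is needed. The function names are hypothetical, and the standard library's `audioop` module, which used to provide this conversion, was removed in Python 3.13.

```python
import struct

BIAS = 0x84   # 132, added before the segment search in G.711 mu-law
CLIP = 32635  # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The segment (exponent) is the position of the highest set bit, offset by 7.
    exponent = magnitude.bit_length() - 8
    # Keep the four bits just below the leading bit as the mantissa.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte back to an approximate 16-bit PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


def encode_pcm16(pcm: bytes) -> bytes:
    """Encode little-endian 16-bit PCM bytes as mu-law, one byte per sample."""
    return bytes(linear_to_ulaw(s) for (s,) in struct.iter_unpack("<h", pcm))
```

A sample of 1000 round-trips to 988: each sample is squeezed from 16 bits to 8, and that quantization is part of why a voice tuned for wideband playback can lose detail on a phone line.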
Quick Ranking: TTS APIs for Phone Agents
| Provider | Telephony Quality | Latency (TTFB) | Languages | SIP/PSTN Support | Concurrent Sessions |
|---|---|---|---|---|---|
| Inworld Realtime TTS | #1 realtime model on Artificial Analysis; native MULAW/ALAW 8kHz | Realtime | 15 GA (90+ experimental on TTS-2) | Via Twilio, Telnyx, LiveKit, Vapi | 1,000+ verified |
| ElevenLabs (Eleven v3 / Flash) | Below top-tier realtime on Artificial Analysis; Flash claims ~75ms TTFB | ~75-400ms by model | Broadest among voice vendors | Via ElevenAgents + partners | Production-scale |
| Cartesia Sonic 3.5 | Top-tier realtime on Artificial Analysis | ~40ms TTFB on Sonic 3 Turbo | 42+ | Via Line + partners | Production-scale |
| Deepgram Aura-2 | Mid-tier, bundled with Voice Agent API | ~200ms | English-focused | Native via Voice Agent API | Production-scale |
| OpenAI TTS | Mid-tier | ~300ms | 57+ | Via Realtime API + SIP | Production-scale |
Why Inworld Realtime TTS Stands Out on Telephony
Three things separate Inworld Realtime TTS for phone agents:
- Voice quality survives the codec. PSTN compression flattens expressive nuance. Realtime TTS is engineered to preserve naturalness through G.711, retaining the prosody and emotional shading that keep callers engaged. No single competitor matches the combination of #1-ranked realtime TTS, first-party inference for the LLM layer, and a model-agnostic Realtime API.
- Realtime time-to-first-audio. TTFB stays inside the human conversational latency range, and TTS 1.5 Mini is optimized for the lowest TTFB.
- Integration with voice-aware routing. Inside the Realtime API, STT acoustic signals (caller emotion, hesitation, speaker profile) reach Router, which selects the right LLM across 200+ available, and the TTS adapts pacing and emotion accordingly. A frustrated caller routes differently than a routine inquiry.
Code Example: Phone Agent with Twilio + Realtime API
# Server-side Twilio webhook that bridges a call to the Realtime API.
import json

import websockets
from fastapi import FastAPI, Response, WebSocket
from twilio.twiml.voice_response import Connect, VoiceResponse

app = FastAPI()


@app.post("/voice")
async def incoming_call():
    # Ask Twilio to open a bidirectional media stream to our WebSocket endpoint.
    response = VoiceResponse()
    connect = Connect()
    connect.stream(url="wss://your-server.example.com/media")
    response.append(connect)
    # Twilio expects TwiML as XML; FastAPI would JSON-encode a bare string.
    return Response(content=str(response), media_type="application/xml")


@app.websocket("/media")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    inworld_url = (
        "wss://api.inworld.ai/api/v1/realtime/session"
        "?key=<session-id>&protocol=realtime"
    )
    # Note: websockets 14+ names this keyword `additional_headers`.
    async with websockets.connect(
        inworld_url,
        extra_headers={"Authorization": "Basic <your-api-key>"},
    ) as inworld_ws:
        # Configure the session: 8 kHz mu-law in both directions, semantic VAD.
        await inworld_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "openai/gpt-5.5",
                "instructions": "You are a friendly support agent.",
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcmu", "rate": 8000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "medium"
                        }
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "format": {"type": "audio/pcmu", "rate": 8000},
                        "speed": 1.0
                    }
                }
            }
        }))
        # Bridge Twilio media frames <-> Inworld events here.
        # See docs.inworld.ai/realtime for full event schema.
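The bridging step left as a comment in the example could be sketched roughly as follows. The Twilio side uses the Media Streams message shapes (`media` events with a base64 `payload`, and a `streamSid` on outbound frames). The Realtime API event names and the `audio` and `delta` fields are assumptions made for illustration only, so check docs.inworld.ai/realtime for the real schema before relying on them.

```python
import json
from typing import Optional

# ASSUMPTION: placeholder Realtime API event names, not confirmed by the docs.
APPEND_EVENT = "input_audio_buffer.append"
AUDIO_DELTA_EVENT = "response.output_audio.delta"


def twilio_to_realtime(message: str) -> Optional[str]:
    """Translate one Twilio Media Streams message into a Realtime API event."""
    frame = json.loads(message)
    if frame.get("event") != "media":
        return None  # connected/start/stop/mark messages carry no audio
    # Twilio's payload is already base64-encoded 8 kHz mu-law: pass it through.
    return json.dumps({"type": APPEND_EVENT, "audio": frame["media"]["payload"]})


def realtime_to_twilio(message: str, stream_sid: str) -> Optional[str]:
    """Translate one Realtime API audio event into a Twilio media message."""
    event = json.loads(message)
    if event.get("type") != AUDIO_DELTA_EVENT:
        return None  # ignore non-audio events in this sketch
    # ASSUMPTION: the audio chunk arrives base64-encoded in a "delta" field.
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": event["delta"]},
    })
```

Inside `media_stream`, two concurrent loops would call these helpers: one reading from `twilio_ws` and writing to `inworld_ws`, the other going the opposite way. The `streamSid` comes from Twilio's `start` message at the beginning of the stream.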
For production telephony deployments, use LiveKit, Vapi, or Telnyx as the SIP layer. All three integrate the Realtime API as a first-class voice provider.
What to Look For in a Phone Agent Stack
- Realtime TTS time-to-first-audio. Anything slower turns into an obviously-AI experience.
- Native 8 kHz MULAW/ALAW support. Avoid resampling pipelines that introduce latency and quality loss.
- Semantic VAD on STT. Energy-based VAD cuts off callers mid-sentence. Semantic VAD waits for natural turn boundaries.
- Instant interruption handling. When the caller starts speaking, the agent should stop within 50ms.
- Function calling inside the audio loop. CRM, booking, and payment lookups must not pause the conversation.
- Concurrent-session scale. 1,000+ simultaneous calls without per-session bottlenecks.
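As a rough illustration of interruption handling on a Twilio bridge, the sketch below reacts to a "caller started speaking" event by flushing audio Twilio has already buffered and asking the model to stop generating. Twilio's `clear` message is part of Media Streams. The two Realtime API event names are assumptions for illustration, not confirmed API.

```python
import json

# ASSUMPTION: placeholder Realtime API event names, not confirmed by the docs.
SPEECH_STARTED_EVENT = "input_audio_buffer.speech_started"
CANCEL_EVENT = "response.cancel"


def handle_barge_in(event: dict, stream_sid: str) -> list[tuple[str, str]]:
    """Return (destination, message) pairs to send when the caller interrupts."""
    if event.get("type") != SPEECH_STARTED_EVENT:
        return []
    return [
        # Drop any agent audio Twilio has queued but not yet played.
        ("twilio", json.dumps({"event": "clear", "streamSid": stream_sid})),
        # Ask the model to stop producing the interrupted response.
        ("realtime", json.dumps({"type": CANCEL_EVENT})),
    ]
```

Sending the `clear` first matters in this design: already-buffered audio keeps playing until it is flushed, even if generation stops immediately.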
FAQ
What is the best TTS for AI phone agents?
Inworld Realtime TTS-2 preview is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena, with TTS 1.5 Max also top-tier in the realtime category. It supports native 8 kHz MULAW/ALAW for PSTN audio and integrates with telephony platforms (Twilio, Telnyx, LiveKit, Vapi) and the Realtime API for an end-to-end pipeline.
How do I integrate AI voice with Twilio?
Use Twilio's Media Streams to bridge inbound calls to a WebSocket endpoint that proxies to the Realtime API. The Realtime API accepts MULAW at 8 kHz directly, so the bridge is a thin pass-through. See the code example above for the FastAPI structure.
What latency is acceptable for a phone agent?
Total round-trip (caller speaks, agent responds) should stay under 1 second, ideally under 800ms. Beyond 1.5 seconds, callers perceive the system as broken and hang up. Realtime TTS contributes its time-to-first-audio to that budget; Realtime STT contributes 100-300ms; the LLM contributes the rest. Measure and optimize each stage in your own pipeline.
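The budget arithmetic can be sketched in a few lines. The thresholds (800ms, 1 second, 1.5 seconds) and the 100-300ms STT range come from the answer above. The class name, the category labels, and the LLM and TTS figures in the usage note are hypothetical placeholders to replace with your own measurements.

```python
from dataclasses import dataclass

IDEAL_MS = 800    # "ideally under 800ms"
TARGET_MS = 1000  # "should stay under 1 second"
BROKEN_MS = 1500  # beyond this, callers perceive the system as broken


@dataclass
class LatencyBudget:
    stt_ms: float
    llm_ms: float
    tts_ms: float

    def total_ms(self) -> float:
        """Round-trip latency as the sum of the three pipeline stages."""
        return self.stt_ms + self.llm_ms + self.tts_ms


def classify(total_ms: float) -> str:
    """Map a round-trip latency onto the thresholds quoted above."""
    if total_ms < IDEAL_MS:
        return "ideal"
    if total_ms < TARGET_MS:
        return "acceptable"
    if total_ms <= BROKEN_MS:
        return "over budget"
    return "broken"


def llm_headroom_ms(stt_ms: float, tts_ms: float) -> float:
    """Time left for the LLM if the turn must finish within TARGET_MS."""
    return TARGET_MS - stt_ms - tts_ms
```

For example, with STT at 200ms and placeholder figures of 450ms for the LLM and 100ms for TTS, the total is 750ms and `classify` returns "ideal". With STT at its worst case of 300ms and the same TTS figure, the LLM has 600ms of headroom.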
How many concurrent calls can a voice AI handle?
Production deployments routinely handle 1,000+ concurrent calls. Capacity scales with the underlying infrastructure. The Realtime API and BYO-orchestration platforms (LiveKit, Vapi, Telnyx) are all built for telephony scale.
Can phone agents call APIs during the conversation?
Yes. Modern voice agent stacks support function calling inside the speech loop. The agent can look up customer records, check inventory, book appointments, or process payments without dropping the audio. The Realtime API supports tool calling natively as part of the WebSocket event stream.
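One way to picture "function calling inside the speech loop" is an asyncio sketch in which the tool runs as a background task while audio frames keep flowing. It does not use the Realtime API's event schema. `lookup_customer`, `stream_audio`, and the frame contents are hypothetical stand-ins, and the sleeps simulate network and playback time.

```python
import asyncio


async def lookup_customer(phone: str) -> dict:
    """Hypothetical CRM lookup; the sleep stands in for network latency."""
    await asyncio.sleep(0.05)
    return {"phone": phone, "tier": "gold"}


async def stream_audio(frames: list[str], sent: list[str]) -> None:
    """Stand-in for the playback loop: emit frames at a steady pace."""
    for frame in frames:
        sent.append(frame)
        await asyncio.sleep(0.01)


async def handle_turn(phone: str) -> tuple[list[str], dict]:
    sent: list[str] = []
    # Start the tool call without awaiting it, so audio is not blocked.
    tool_task = asyncio.create_task(lookup_customer(phone))
    # Keep the caller engaged with filler audio while the lookup runs.
    await stream_audio(["One", "moment", "please"], sent)
    # Collect the tool result once the filler audio has been sent.
    result = await tool_task
    return sent, result
```

Running `asyncio.run(handle_turn("+15550100"))` returns the three filler frames along with the lookup result. Because the task is created before streaming starts, the lookup and the audio overlap instead of running back to back.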