Last updated: May 28, 2026
Inworld AI Realtime TTS is the #1 realtime TTS on the
Artificial Analysis Speech Arena, bundled with Realtime STT, the Router across 200+ LLMs, and the Realtime API under one auth. Cartesia Sonic 3.5 is a top-tier dedicated realtime TTS, paired with Ink STT and the Line agent platform. Both are credible 2026 choices for voice agent builders. This page compares the two on quality, latency, steering, bundle breadth, agent runtime ergonomics, and the customer scale behind each, so the decision is grounded in what actually ships in production.
How does Inworld Realtime TTS compare to Cartesia Sonic 3.5 at a glance?
Which is #1 on the Artificial Analysis Speech Arena?
Inworld is the #1 realtime TTS on the Artificial Analysis Speech Arena. The leaderboard runs blind A/B preference tests where listeners pick the more natural audio without seeing the model name, which makes it the most-cited independent quality benchmark in the category.
Cartesia Sonic 3.5 also ranks among the top realtime models and is genuinely strong on listener preference. The honest read: both are top-tier realtime TTS. Inworld holds the #1 realtime spot today, but quality variance between top entries is small and votes move week to week. Always check the live leaderboard before making a final call.
How do they compare on latency?
Cartesia advertises approximately 40ms time-to-first-byte on Sonic 3.5 Turbo, which is among the fastest TTFB numbers in the market. That is a real differentiator if every millisecond of audio start matters to the product.
Inworld targets realtime latency on TTS-2 and TTS 1.5 Max with sub-200ms TTFT, and approximately 120ms on the smaller TTS 1.5 Mini. Both sit below the ~250ms threshold where humans start perceiving a gap in conversation, but on raw TTFB Cartesia has the edge.
Our take: if your product is a latency benchmark first and a voice second, weight Cartesia. If your product is a long-form companion, character chat, or roleplay agent where the listener spends minutes with the voice, weight Inworld for steering and cross-lingual identity.
How does TTS-2 natural-language steering work?
TTS-2 (research preview, model ID inworld-tts-2) accepts bracketed instructions at the start of text and supports 8 steering dimensions:
- Emotion (
[say sadly], [say excitedly])
- Articulation (
[say with force])
- Intonation (
[in a questioning tone])
- Volume (
[whisper in a hushed style], [shout])
- Pitch (
[say in a deep voice])
- Range (
[monotone], [wide pitch range])
- Speed (
[say quickly])
- Vocal style (
[as a wise mentor])
Plus inline non-verbals ([laugh], [sigh], [breathe]). TTS-2 also exposes a deliveryMode field (STABLE, BALANCED, CREATIVE) that controls expressiveness independently of temperature.
Cartesia Sonic 3.5 is a strong baseline voice model with consistent expressive output, but does not expose an equivalent 8-dimension natural-language steering surface. If your application needs an actor's-direction interface on top of TTS (companion that shifts moods, roleplay agent that whispers, language tutor that emphasizes pronunciation), TTS-2 is the more direct fit.
What is cross-lingual voice identity and why does it matter?
TTS-2 preserves voice identity across 100+ languages (15 GA + 90+ experimental). The same speaker can switch languages mid-utterance without re-cloning. For a multilingual companion or language-learning app, that means one voice the user bonds with works in every language the user studies.
Cartesia Sonic 3.5 covers a broad multilingual range. If raw language count is the deciding factor and identity continuity per voice is secondary, the comparison is closer. If the product is a single-voice persona that crosses languages, TTS-2 is purpose-built for it.
How does the Realtime API compare to Cartesia Line?
Inworld Realtime API is a model-agnostic voice pipeline. One WebSocket carries STT, LLM, and TTS. The LLM layer is the Router, which gives access to 200+ models across two tracks:
- 1P track (Realtime Inference): Inworld-hosted open-source models with realtime-grade inference (optimized Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5).
- 3P track: OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, DeepInfra (
gpt-oss-120b via deepinfra/openai/gpt-oss-120b).
The Realtime API is OpenAI Realtime protocol compatible (swap the base URL). Voice activity detection runs on Inworld's own Silero VAD plus Smart Turn detector inside server_vad.
Cartesia Line is a focused agent platform built around Sonic and Ink. It is well-engineered for the Cartesia stack and ships strong agent ergonomics for teams who want a tightly integrated Sonic + Ink runtime.
The architectural choice: Inworld lets you decouple LLM from voice (swap anthropic/claude-sonnet-4-6 for deepinfra/openai/gpt-oss-120b without re-integrating the pipeline). Cartesia Line is more opinionated and integrated end-to-end.
What does Realtime STT add to the Inworld stack?
Realtime STT routes across multiple providers under one auth: Inworld STT (inworld/inworld-stt-1), Groq Whisper, AssemblyAI streaming models, and the newly added Soniox soniox/stt-rt-v4. Inworld STT supports configurable turn-taking (endOfTurnConfidenceThreshold, inactivityTimeoutSeconds) and voice profiling.
Cartesia Ink is a focused streaming STT, with an Ink-Whisper variant. Both providers ship credible STT. Inworld's value is the routing layer that lets you swap STT providers on a single config field if a workload exposes a model's weakness.
What is the Router and how does it fit into voice?
The Realtime Router routes to 200+ LLMs across 1P and 3P tracks via an OpenAI Chat Completions compatible endpoint (POST /v1/chat/completions). For voice agent builders that translates to:
- Pass
model: "openai/gpt-5.5" today, model: "deepseek/deepseek-v4-pro" tomorrow.
- Set
extra_body.sort to latency, price, or intelligence to let the Router pick.
- Use
extra_body.models for fallback chains so a 503 from one provider does not break the call.
- Use
user for sticky routing so a session stays on the same model.
Cartesia does not provide a 1st-party multi-provider LLM router. If your voice agent depends on swapping LLMs without changing your integration, that capability lives in the Inworld stack.
What does the code look like for each?
A minimal Inworld Realtime TTS streaming call:
curl -X POST "https://api.inworld.ai/tts/v1/voice:stream" \
-H "Authorization: Basic $INWORLD_KEY" \
-H "Content-Type: application/json" \
-d '{
"text": "Welcome back. Where did we leave off?",
"voiceId": "Sarah",
"modelId": "inworld-tts-2",
"deliveryMode": "BALANCED",
"audioConfig": {
"audioEncoding": "MP3",
"sampleRateHertz": 24000
}
}'
The streaming response is NDJSON: each line is {"result": {"audioContent": "base64..."}}, decoded per chunk.
A minimal Cartesia Sonic call uses Cartesia's SDK with
model_id="sonic-3-5" and a voice ID. The shape is similar but the field names and streaming format are Cartesia-specific. See
Cartesia's documentation for current details.
Field-name discipline matters: Inworld REST TTS uses voiceId / modelId. The Realtime WebSocket uses voice / model. The Router uses model. Do not mix them across APIs.
Which voice agent stacks ship at consumer scale today?
Consumer-scale customers running on the Inworld stack include:
- Wishroll / Status, one of the fastest consumer apps to reach 1M users.
- Bible Chat, which migrated from a previous TTS provider and now scales voice features to millions of users.
- Talkpal, which uses Realtime TTS for multilingual language learning at 5M+ users.
- Janitor, a large-scale companion app running on the Router.
- Latitude, running interactive media voice on the Realtime API.
Cartesia powers a range of voice agent customers across enterprise and developer tools and has strong scale inside Line. Both providers are real production stacks. The relevant question for your team is which set of anchor workloads looks most like your own.
When should you choose Cartesia Sonic 3.5?
Cartesia is the stronger fit when:
- Sub-50ms TTFB is the single deciding metric. Sonic 3.5 Turbo's ~40ms TTFB is the fastest in the market today.
- Your team wants a tightly integrated agent runtime around one stack. Sonic + Ink + Line are designed to work together end-to-end.
- You do not need multi-provider LLM routing. If you have already standardized on a single LLM provider, the Router's flexibility is less valuable to you.
- On-device TTS is a requirement. Cartesia ships an on-device option for edge inference that Inworld does not.
When should you choose Inworld Realtime TTS?
Inworld is the stronger fit when:
- You want the #1 realtime TTS. Top of the Artificial Analysis Speech Arena in the realtime category, with TTS-2 in research preview.
- You need natural-language steering. TTS-2's 8 dimensions plus
deliveryMode give an actor's-direction interface that companion, roleplay, and tutoring apps use directly.
- You need cross-lingual voice identity. One voice across 100+ languages without re-cloning.
- You want one auth and one billing surface for TTS + STT + LLM + Realtime API. Bundle integration removes orchestration overhead.
- LLM flexibility is a hard requirement. The Router lets you switch LLMs, A/B across providers, and fall back on outages without re-integrating the voice pipeline.
How do you get started with Inworld AI?
- Try the TTS Playground: hear TTS-2, 1.5 Max, and 1.5 Mini with your own text, then add steering tags to direct delivery.
- Read the docs: TTS, STT, Router, and Realtime API references in one place.
- Explore the Realtime API: WebSocket and WebRTC voice pipelines with STT + LLM + TTS over one connection.
- See current pricing.
- Talk to an architect for on-premise deployment, custom voices, or enterprise terms.
Rankings from the Artificial Analysis Speech Arena as of May 2026. Cartesia specifications from Cartesia's public documentation. Always verify current capabilities directly with each provider.