What is the core difference between Inworld Realtime TTS and Cartesia Sonic 3.5?

Inworld AI Realtime TTS is a first-party realtime voice model bundled with Realtime STT, the Router (220+ LLMs across 1P and 3P tracks), and the Realtime API under one auth and one billing surface. Cartesia Sonic 3.5 is a dedicated TTS model paired with Ink STT and the Line agent platform. The choice is between an integrated 1P + 3P stack with steerable voices (Inworld) and a focused TTS + dedicated agent platform (Cartesia).

Which model has better TTS quality?

Both are strong realtime voice models, and quality is use-case specific. Inworld's Realtime TTS-2 (research preview) and Realtime TTS 1.5 Max are first-party realtime models tuned for expressive, steerable speech at consumer scale. Cartesia Sonic 3.5 is a top-tier realtime model with clean, natural output. The most reliable test is a blind A/B comparison on your own text with your own users.

What can Inworld TTS-2 do that Sonic 3.5 does not?

TTS-2 supports natural-language steering across 8 dimensions (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) plus non-verbals, with a deliveryMode field (STABLE, BALANCED, CREATIVE) that controls expressiveness independently of temperature. TTS-2 also preserves cross-lingual voice identity, so the same speaker can switch languages mid-utterance without re-cloning. Cartesia Sonic 3.5 ships strong voice quality but does not expose the same 8-dimension natural-language steering surface.

How do the agent platforms compare: Realtime API vs Line?

Inworld Realtime API is a model-agnostic pipeline: pick any LLM from the Router (1P open-source models on Inworld GPUs plus 3P providers like OpenAI, Anthropic, Google, DeepSeek, Mistral) and pair it with Realtime STT and Realtime TTS over one WebSocket. Auth is one Basic key. Cartesia Line is a focused agent platform built around Sonic and Ink with strong ergonomics for the Cartesia stack. If you want to keep LLM choice flexible across providers, Inworld gives you that out of the box. If you want a tightly integrated Sonic + Ink agent runtime, Line is purpose-built for it.

Which providers run at consumer scale today?

On the Inworld stack: Wishroll / Status is one of the fastest consumer apps to reach 1M users, Bible Chat scales voice to millions of users after migrating from a previous TTS provider, and Talkpal runs multilingual language learning across 5M+ users. Cartesia powers a range of voice agent customers across enterprise and developer tools and is heavily used inside Line. Both are real production stacks.

Inworld Realtime TTS vs Cartesia Sonic 3.5 (2026)

Last updated: May 28, 2026

Inworld AI Realtime TTS is a first-party realtime voice model tuned for expressive, natural speech at consumer scale, bundled with Realtime STT, the Router across 220+ LLMs, and the Realtime API under one auth. Inworld's Realtime TTS-2 is the #1 realtime TTS. Cartesia Sonic 3.5 is a top-tier dedicated realtime TTS, paired with Ink STT and the Line agent platform. Both are credible 2026 choices for voice agent builders. This page compares the two on quality, latency, steering, bundle breadth, agent runtime ergonomics, and the customer scale behind each, so the decision is grounded in what actually ships in production.

How does Inworld Realtime TTS compare to Cartesia Sonic 3.5 at a glance?

Cartesia latency and language figures from Cartesia's published documentation (May 2026).
Inworld latency, language, and steering figures from Inworld docs and the live pricing page.

How do Inworld Realtime TTS and Cartesia compare on quality?

Voice quality is use-case specific, and the most reliable test is a blind A/B comparison on your own text with your own users. Both are genuinely strong realtime voice models.

Inworld's Realtime TTS-2 (research preview) and Realtime TTS 1.5 Max are first-party realtime models built for expressive, steerable speech at consumer scale, with natural-language steering across emotion, pace, and intonation on TTS-2. Cartesia Sonic 3.5 is a top-tier realtime model with clean, natural output and market-leading time-to-first-byte. The honest read: quality variance between top realtime models is small, so run representative text through each, test the edge cases that matter for your app, and measure user preference directly.
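One way to set up that kind of blind comparison is sketched below. It is a minimal illustration, not a method either provider prescribes: the function names, the "A"/"B" labels, and the trial structure are all hypothetical, and the audio rendering step is left out entirely. The sketch only randomizes which provider sits behind each label and maps listener choices back afterwards.

```python
import random
from collections import Counter

PROVIDERS = ["inworld", "cartesia"]  # labels only; no API calls are made here


def make_blind_trials(texts, seed=0):
    """Return one trial per text with the two providers in a hidden, random order."""
    rng = random.Random(seed)  # seeded so a study can be reproduced
    trials = []
    for text in texts:
        order = PROVIDERS[:]
        rng.shuffle(order)
        # Listeners only ever see the labels "A" and "B".
        trials.append({"text": text, "A": order[0], "B": order[1]})
    return trials


def tally_preferences(trials, choices):
    """Map each listener choice ("A" or "B") back to the provider behind it."""
    counts = Counter()
    for trial, choice in zip(trials, choices):
        counts[trial[choice]] += 1
    return counts
```

With a small sample the tally is only a rough signal, so it is worth collecting enough trials per text type (short prompts, long narration, edge cases) before drawing a conclusion.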

How do they compare on latency?

Cartesia advertises approximately 40ms time-to-first-byte on Sonic 3.5 Turbo, which is among the fastest TTFB numbers in the market. That is a real differentiator if every millisecond of audio start matters to the product.

Inworld targets realtime latency on TTS-2 and TTS 1.5 Max with sub-200ms TTFT, and approximately 120ms on the smaller TTS 1.5 Mini. Both sit below the ~250ms threshold where humans start perceiving a gap in conversation, but on raw TTFB Cartesia has the edge.
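Published latency figures depend on region, network path, and payload, so it can help to measure time-to-first-byte from your own environment. The helper below is a provider-agnostic sketch: `measure_ttfb` is a hypothetical name, and it assumes you can wrap each provider's streaming response as a lazy iterable of byte chunks whose request is sent when iteration starts.

```python
import time


def measure_ttfb(chunks, clock=time.perf_counter):
    """Consume an iterable of audio chunks.

    Returns (seconds until the first non-empty chunk, total bytes received).
    The first value is None if no audio arrived at all.
    """
    start = clock()
    ttfb = None
    total = 0
    for chunk in chunks:
        if chunk and ttfb is None:
            ttfb = clock() - start  # first audible data has arrived
        total += len(chunk)
    return ttfb, total
```

If the request is issued before the iterable is created, start your own timer at send time instead, otherwise the measured value will understate the real TTFB.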

Our take: if your product is a latency benchmark first and a voice second, weight Cartesia. If your product is a long-form companion, education, or social app where the listener spends minutes with the voice, weight Inworld for steering and cross-lingual identity.

How does TTS-2 natural-language steering work?

TTS-2 (research preview, model ID inworld-tts-2) accepts bracketed instructions at the start of text and supports 8 steering dimensions:

Emotion ([say sadly], [say excitedly])
Articulation ([say with force])
Intonation ([in a questioning tone])
Volume ([whisper in a hushed style], [shout])
Pitch ([say in a deep voice])
Range ([monotone], [wide pitch range])
Speed ([say quickly])
Vocal style ([as a wise mentor])

Plus inline non-verbals ([laugh], [sigh], [breathe]). TTS-2 also exposes a deliveryMode field (STABLE, BALANCED, CREATIVE) that controls expressiveness independently of temperature.
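As a rough sketch of how a steering instruction and deliveryMode might be combined in one request body: the helper below reuses the field names from the REST curl example on this page (text, voiceId, modelId, deliveryMode). The function name, the single space after the bracketed tag, and the client-side validation are assumptions made for illustration, not documented behavior.

```python
DELIVERY_MODES = {"STABLE", "BALANCED", "CREATIVE"}


def build_tts2_request(text, voice_id, instruction=None, delivery_mode="BALANCED"):
    """Build a JSON-serializable body for a TTS-2 request (hypothetical helper)."""
    if delivery_mode not in DELIVERY_MODES:
        raise ValueError(f"unknown deliveryMode: {delivery_mode!r}")
    if instruction:
        # Bracketed instruction goes at the start of the text;
        # the separating space is an assumption.
        text = f"[{instruction}] {text}"
    return {
        "text": text,
        "voiceId": voice_id,
        "modelId": "inworld-tts-2",
        "deliveryMode": delivery_mode,
    }
```

For example, `build_tts2_request("I missed you.", "Sarah", instruction="say sadly", delivery_mode="STABLE")` produces a body whose text field is `[say sadly] I missed you.`.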

Cartesia Sonic 3.5 is a strong baseline voice model with consistent expressive output, but it does not expose an equivalent 8-dimension natural-language steering surface. If your application needs an actor's-direction interface on top of TTS (a companion that shifts moods, a narrator that whispers, a language tutor that emphasizes pronunciation), TTS-2 is the more direct fit.

What is cross-lingual voice identity and why does it matter?

TTS-2 preserves voice identity across 100+ languages (15 GA + 90+ experimental). The same speaker can switch languages mid-utterance without re-cloning. For a multilingual companion or language-learning app, that means one voice the user bonds with works in every language the user studies.

Cartesia Sonic 3.5 covers a broad multilingual range. If raw language count is the deciding factor and identity continuity per voice is secondary, the comparison is closer. If the product is a single-voice persona that crosses languages, TTS-2 is purpose-built for it.

How does the Realtime API compare to Cartesia Line?

Inworld Realtime API is a model-agnostic voice pipeline. One WebSocket carries STT, LLM, and TTS. The LLM layer is the Router, which routes to 220+ models across two tracks:

1P track (Realtime Inference): Inworld-hosted open-source models with realtime-grade inference (optimized Gemma 4, DeepSeek V3.2/V4, GLM-5.1/5.2).
3P track: OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, DeepInfra (gpt-oss-120b via deepinfra/openai/gpt-oss-120b).

The Realtime API is compatible with the OpenAI Realtime protocol (swap the base URL). Voice activity detection runs on Inworld's deployment of Silero VAD plus a Smart Turn detector inside server_vad.

Cartesia Line is a focused agent platform built around Sonic and Ink. It is well-engineered for the Cartesia stack and ships strong agent ergonomics for teams who want a tightly integrated Sonic + Ink runtime.

The architectural choice: Inworld lets you decouple LLM from voice (swap anthropic/claude-sonnet-4-6 for deepinfra/openai/gpt-oss-120b without re-integrating the pipeline). Cartesia Line is more opinionated and integrated end-to-end.

What does Realtime STT add to the Inworld stack?

Realtime STT routes across multiple providers under one auth: Inworld STT (inworld/inworld-stt-1), Groq Whisper, AssemblyAI streaming models, and the newly added Soniox soniox/stt-rt-v4. Inworld STT supports configurable turn-taking (endOfTurnConfidenceThreshold, inactivityTimeoutSeconds) and voice profiling.

Cartesia Ink is a focused streaming STT, with an Ink-Whisper variant. Both providers ship credible STT. Inworld's value is the routing layer, which lets you swap STT providers by changing a single config field if a workload exposes a model's weakness.

What is the Router and how does it fit into voice?

The Realtime Router routes to 220+ LLMs across 1P and 3P tracks via an OpenAI Chat Completions compatible endpoint (POST /v1/chat/completions). For voice agent builders that translates to:

Pass model: "openai/gpt-5.5" today, model: "deepseek/deepseek-v4-pro" tomorrow.
Set extra_body.sort to latency, price, or intelligence to let the Router pick.
Use extra_body.models for fallback chains so a 503 from one provider does not break the call.
Use user for sticky routing so a session stays on the same model.
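The options above can be sketched as a single request body for POST /v1/chat/completions. This is an illustration under stated assumptions: the OpenAI Python SDK merges `extra_body` fields into the top level of the JSON body, so `sort` and `models` are shown as top-level keys. The value shapes (a string for `sort`, a list of model IDs for `models`) and the helper name are assumptions, so check the Router reference before relying on them.

```python
SORT_OPTIONS = {"latency", "price", "intelligence"}


def build_router_body(messages, model, sort=None, fallbacks=None, user=None):
    """Assemble a Chat Completions style body with optional Router fields."""
    body = {"model": model, "messages": list(messages)}
    if sort is not None:
        if sort not in SORT_OPTIONS:
            raise ValueError(f"unknown sort option: {sort!r}")
        body["sort"] = sort  # assumed shape: a plain string
    if fallbacks:
        body["models"] = list(fallbacks)  # assumed shape: ordered list of model IDs
    if user:
        body["user"] = user  # sticky routing key for the session
    return body
```

Switching the primary model is then a one-argument change, for example from `"openai/gpt-5.5"` to `"deepseek/deepseek-v4-pro"`, while the rest of the voice pipeline stays untouched.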

Cartesia does not provide a first-party multi-provider LLM router. If your voice agent depends on swapping LLMs without changing your integration, that capability lives in the Inworld stack.

What does the code look like for each?

A minimal Inworld Realtime TTS streaming call:

curl -X POST "https://api.inworld.ai/tts/v1/voice:stream" \
  -H "Authorization: Basic $INWORLD_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome back. Where did we leave off?",
    "voiceId": "Sarah",
    "modelId": "inworld-tts-2",
    "deliveryMode": "BALANCED",
    "audioConfig": {
      "audioEncoding": "MP3",
      "sampleRateHertz": 24000
    }
  }'

The streaming response is NDJSON: each line is {"result": {"audioContent": "base64..."}}, decoded per chunk.
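A minimal sketch of decoding that stream on the client, assuming each line has exactly the shape shown above. The function name is hypothetical, and the input can be any iterable of lines, such as the line iterator of an HTTP client's streaming response.

```python
import base64
import json


def decode_audio_stream(lines):
    """Yield raw audio bytes from an iterable of NDJSON lines (str or bytes)."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        message = json.loads(line)
        audio_b64 = message.get("result", {}).get("audioContent")
        if audio_b64:
            yield base64.b64decode(audio_b64)
```

Each yielded chunk can be handed to a player or buffer as it arrives. Whether chunks can simply be concatenated into one file depends on the chosen audioEncoding, so treat that as something to verify.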

A minimal Cartesia Sonic call uses Cartesia's SDK with model_id="sonic-3-5" and a voice ID. The shape is similar but the field names and streaming format are Cartesia-specific. See Cartesia's documentation for current details.

Field-name discipline matters: Inworld REST TTS uses voiceId / modelId. The Realtime WebSocket uses voice / model. The Router uses model. Do not mix them across APIs.
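A small guard can catch mixed field names before a request leaves your code. The sketch below encodes only the names listed above. The API labels (`rest_tts`, `realtime_ws`, `router`) and the function name are hypothetical, and it checks top-level keys only.

```python
FIELD_NAMES = {
    "rest_tts": {"voice": "voiceId", "model": "modelId"},
    "realtime_ws": {"voice": "voice", "model": "model"},
    "router": {"model": "model"},
}


def misplaced_fields(api, payload):
    """Return top-level payload keys that belong to another API's naming scheme."""
    allowed = set(FIELD_NAMES[api].values())
    known = {name for names in FIELD_NAMES.values() for name in names.values()}
    return sorted(key for key in payload if key in known and key not in allowed)
```

For instance, passing a REST TTS payload that uses `voice` instead of `voiceId` returns `["voice"]`, which is easier to debug than a rejected request.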

Which voice agent stacks ship at consumer scale today?

Consumer-scale customers running on the Inworld stack include:

Wishroll / Status, one of the fastest consumer apps to reach 1M users.
Bible Chat, which migrated from a previous TTS provider and now scales voice features to millions of users.
Talkpal, which uses Realtime TTS for multilingual language learning at 5M+ users.

Cartesia powers a range of voice agent customers across enterprise and developer tools and has strong scale inside Line. Both providers are real production stacks. The relevant question for your team is which set of anchor workloads looks most like your own.

When should you choose Cartesia Sonic 3.5?

Cartesia is the stronger fit when:

Sub-50ms TTFB is the single deciding metric. Sonic 3.5 Turbo's advertised ~40ms TTFB is among the fastest published figures in the market today.
Your team wants a tightly integrated agent runtime around one stack. Sonic + Ink + Line are designed to work together end-to-end.
You do not need multi-provider LLM routing. If you have already standardized on a single LLM provider, the Router's flexibility is less valuable to you.
On-device TTS is a requirement. Cartesia ships an on-device option for edge inference that Inworld does not.

When should you choose Inworld Realtime TTS?

Inworld is the stronger fit when:

You want expressive, steerable realtime voice. TTS-2 (research preview) and 1.5 Max are first-party realtime models tuned for expressive speech at consumer scale, with sub-200ms TTFT. Judge quality with a blind A/B on your own text.
You need natural-language steering. TTS-2's 8 dimensions plus deliveryMode give an actor's-direction interface that companion, social, and tutoring apps use directly.
You need cross-lingual voice identity. One voice across 100+ languages without re-cloning.
You want one auth and one billing surface for TTS + STT + LLM + Realtime API. Bundle integration removes orchestration overhead.
LLM flexibility is a hard requirement. The Router lets you switch LLMs, A/B across providers, and fall back on outages without re-integrating the voice pipeline.

How do you get started with Inworld AI?

Try the TTS Playground: hear TTS-2, 1.5 Max, and 1.5 Mini with your own text, then add steering tags to direct delivery.
Read the docs: TTS, STT, Router, and Realtime API references in one place.
Explore the Realtime API: WebSocket and WebRTC voice pipelines with STT + LLM + TTS over one connection.
See current pricing.
Talk to an architect for on-premise deployment, custom voices, or enterprise terms.

Cartesia specifications from Cartesia's public documentation as of May 2026. Latency and language figures represent published metrics from each provider. Always verify current capabilities directly with each provider.
