Natural-language TTS steering: how to control emotion, tone, and style in 2026

Q: What dimensions of voice can Inworld TTS-2 control?

Inworld TTS-2 (research preview) exposes 8 steering dimensions: emotion, articulation, intonation, volume, pitch, range, speed, and vocal style. It also accepts inline non-verbals such as [laugh], [sigh], [breathe], [cough], [yawn], and [clear_throat]. Steering instructions are written in English regardless of the target speech language.

Q: What is the deliveryMode field in TTS-2?

deliveryMode is an optional field on the /tts/v1/voice request body for the inworld-tts-2 model. It accepts STABLE (most consistent across requests, narrower emotional range), BALANCED (default, mix of consistency and expressiveness), or CREATIVE (maximum emotional range, less predictable). It is independent of the temperature parameter and lets you trade consistency against expressive variance on each call.

Q: How does Inworld TTS-2 steering compare with ElevenLabs v3 audio tags?

Both approaches embed bracketed instructions in the script. ElevenLabs v3 ships audio tags such as [whispers], [excited], and [sighs] and supports 70+ languages. Inworld TTS-2 generalizes the idea to 8 explicit steering dimensions, adds a deliveryMode field for run-to-run variance control, and is designed for realtime applications (sub-200ms TTFT median). ElevenLabs explicitly does not recommend v3 for realtime use cases.

Q: Can I use steering tags with Inworld TTS 1.5?

No. Natural-language steering tags are only supported on inworld-tts-2 (research preview). On Realtime TTS 1.5 Max and 1.5 Mini, they are read aloud literally. TTS 1.5 does support a limited set of experimental emotion markups like [happy], [sad], and [angry], but for natural-language steering use TTS-2.

Q: Does steering preserve voice identity across emotions and languages?

Yes. TTS-2 keeps a single voice identity stable while applying steering instructions, and it preserves that identity across more than 100 languages (15 GA plus 90+ experimental). The same voice can switch emotions, pacing, or even languages mid-utterance without sounding like a different speaker.

Inworld AI Realtime TTS-2 (research preview, launched May 5, 2026) treats voice direction the same way modern image and video models treat prompts: you write in English, and the model adjusts how it speaks. Natural-language steering on inworld-tts-2 exposes 8 dimensions of vocal control plus a new deliveryMode field, so the same voice can sound urgent, exhausted, or amused without retraining or post-processing. This page explains how steering works, what the 8 dimensions cover, how to call the API today, and how the approach compares with ElevenLabs v3 audio tags, Cartesia Sonic 3.5, and Hume Octave.

TTS-2 is research preview, not GA. Inworld's Realtime TTS-2 is the #1 realtime TTS. Steering and deliveryMode are available on every TTS-2 request via the standard POST /tts/v1/voice endpoint.

What is natural-language voice steering?

Natural-language voice steering is a control method for TTS models where you place plain-English instructions next to the text you want spoken, and the model adjusts delivery in a single forward pass. Instead of fine-tuning a new voice for "angry CEO" or running expensive post-processing to add a sigh, the prompt itself carries the direction.

On Realtime TTS-2, instructions are written in square brackets at the start of the text, in English, even when the target speech is in another language. Non-verbal sounds can appear inline anywhere inside the text. The model treats the brackets as direction, not as text to read.
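
The placement rule can be sketched as a small helper. This is a minimal illustration rather than Inworld's own code: the function name is hypothetical, and the single space between the tag and the script is an assumption about formatting, not something the API is known to require.

```python
from typing import Optional


def build_steered_text(script: str, instruction: Optional[str] = None) -> str:
    """Prefix a script with one bracketed, English steering instruction.

    The instruction stays in English even when the script is not.
    The space after the closing bracket is an assumption.
    """
    if not instruction:
        return script
    # Tolerate callers who pass "[say sadly]" instead of "say sadly".
    cleaned = instruction.strip().strip("[]").strip()
    return f"[{cleaned}] {script}"


# English direction, French speech; the non-verbal stays inline.
print(build_steered_text("Bonjour [laugh] tout le monde !", "say excitedly"))
```

Keeping the instruction as a separate argument until the last moment makes it easy to change the direction without touching the script.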

A useful mental model: TTS-2 separates what is said (the script) from how it is said (the steering tags) and from who is saying it (the voice ID). Each of those three axes can change independently.

What 8 steering dimensions does Inworld TTS-2 expose?

TTS-2 exposes 8 explicit steering dimensions: emotion, articulation, intonation, volume, pitch, range, speed, and vocal style. Each has a vocabulary of natural-language phrases that the model understands, and each maps to concrete acoustic properties of the resulting audio.

In addition to the 8 dimensions, TTS-2 supports inline non-verbal tags: [laugh], [breathe], [sigh], [cough], [yawn], [clear_throat]. These can appear anywhere in the text, not just at the start, and multiple non-verbals can be combined.

Three constraints matter in practice. Instructions are English-only, even when target speech is non-English. Steering tags belong at the start of the text; only non-verbals can be inline. And the instruction must match the script: pairing [say sadly] with a celebratory sentence degrades the output rather than improving it.
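
The placement constraint lends itself to a pre-flight check. The sketch below is one possible lint, not an official validator: it treats every bracketed tag before the first spoken word as steering, and flags any later tag that is not one of the six non-verbals listed above. The function name and the message format are hypothetical, and it cannot judge whether an instruction matches the mood of the script.

```python
import re
from typing import List

# Non-verbal tags listed in this article.
NON_VERBALS = {"laugh", "sigh", "breathe", "cough", "yawn", "clear_throat"}

TAG = re.compile(r"\[([^\[\]]+)\]")
# Zero or more bracketed tags (with surrounding whitespace) at the very start.
LEADING_TAGS = re.compile(r"\s*(?:\[[^\[\]]+\]\s*)*")


def lint_steering(text: str) -> List[str]:
    """Return a list of problems; an empty list means placement looks fine."""
    body_start = LEADING_TAGS.match(text).end()
    problems = []
    for match in TAG.finditer(text, body_start):
        tag = match.group(1).strip()
        if tag not in NON_VERBALS:
            problems.append(
                f"inline tag [{tag}] at offset {match.start()} is not a known non-verbal"
            )
    return problems
```

Running a check like this in CI against prompt templates catches a steering tag that drifted into the middle of a line before it reaches the model.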

How does the deliveryMode field work?

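deliveryMode is a new optional field on the POST /tts/v1/voice request body, available on the inworld-tts-2 model. It controls how much variance the model is allowed to introduce across requests with the same input. It accepts three values: STABLE (most consistent across requests, narrower emotional range), BALANCED (the default, a mix of consistency and expressiveness), and CREATIVE (maximum emotional range, less predictable).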

deliveryMode is independent of temperature (default 1.0, valid range 0 to 2 exclusive). You can pair STABLE with a low temperature for a deterministic-feeling UI voice, or CREATIVE with a higher temperature for performance work. In practice, most production conversational apps should leave BALANCED as the default and only override per request when the use case calls for it.
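
A request body that pairs the two knobs might be assembled as follows. Treat this as a sketch under stated assumptions: voiceId, modelId, and deliveryMode are the field names this article gives, while the "text" key, the top-level placement of "temperature", and the reading of "0 to 2 exclusive" as excluding both ends are assumptions to check against the API reference.

```python
DELIVERY_MODES = ("STABLE", "BALANCED", "CREATIVE")


def build_tts2_body(
    text: str,
    voice_id: str,
    delivery_mode: str = "BALANCED",
    temperature: float = 1.0,
) -> dict:
    """Build a JSON-serializable body for an inworld-tts-2 request."""
    if delivery_mode not in DELIVERY_MODES:
        raise ValueError(f"deliveryMode must be one of {DELIVERY_MODES}")
    # Assumption: both bounds are exclusive.
    if not 0 < temperature < 2:
        raise ValueError("temperature must be between 0 and 2 (exclusive)")
    return {
        "text": text,  # assumed key name
        "voiceId": voice_id,
        "modelId": "inworld-tts-2",
        "deliveryMode": delivery_mode,
        "temperature": temperature,  # assumed placement
    }


# A steady UI voice: STABLE plus a low temperature.
ui_body = build_tts2_body("[say calmly] Your order has shipped.", "voice-123", "STABLE", 0.6)
```

Validating the mode locally keeps a typo such as "STABEL" from turning into a server-side error at runtime.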

What does a real TTS-2 steering call look like?

A working request to the streaming TTS endpoint carries the steering tag in the text and deliveryMode in the request body. The streaming endpoint returns NDJSON; each line carries a base64 audioContent chunk that must be decoded and concatenated.
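
The decode-and-concatenate step could look like the sketch below. It takes any iterable of lines (for example, an HTTP response iterated line by line) so it can be tested without a network call. Where audioContent sits inside each JSON object is an assumption: the helper looks under a "result" wrapper if one is present and at the top level otherwise.

```python
import base64
import json
from typing import Iterable, Union


def decode_ndjson_audio(lines: Iterable[Union[bytes, str]]) -> bytes:
    """Decode base64 audioContent chunks from NDJSON lines and join them."""
    audio = bytearray()
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        obj = json.loads(line)
        # Assumption: the chunk may be wrapped in a "result" object.
        payload = obj.get("result", obj)
        if not isinstance(payload, dict):
            continue
        chunk = payload.get("audioContent")
        if chunk:
            audio.extend(base64.b64decode(chunk))
    return bytes(audio)
```

In a live integration the same function would consume the response body of the streaming call; whether the joined bytes need a container header depends on the audio encoding requested.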

Three things to verify before shipping. First, server-side authorization is Basic, not Bearer; browser clients instead use a Bearer JWT that your server mints via POST /auth/v1/tokens/token:generate. Second, the streaming endpoint returns NDJSON with a base64 audioContent value per line, not raw bytes. Third, voiceId and modelId are REST TTS field names; the Realtime WebSocket session uses voice and model instead. Mixing those across APIs is the most common cause of silent failures.
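
Two of these checks can be encoded directly in client code. The sketch below builds (but does not send) a server-side streaming request with a Basic header, and renames the two REST field names for a Realtime session. Several details are assumptions: base_url is a caller-supplied placeholder, the API key is assumed to already be the credential string that follows "Basic", and the Realtime payload is assumed to differ only in the two renamed keys.

```python
import json
import urllib.request


def build_stream_request(base_url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Prepare a POST to the streaming TTS endpoint with Basic auth (server-side only)."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tts/v1/voice:stream",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",  # Basic, not Bearer
            "Content-Type": "application/json",
        },
        method="POST",
    )


def to_realtime_fields(rest_body: dict) -> dict:
    """Rename REST field names to the Realtime session's names."""
    renames = {"voiceId": "voice", "modelId": "model"}
    return {renames.get(key, key): value for key, value in rest_body.items()}
```

Routing every payload through one translation function is a cheap way to avoid the mixed-field-name failure described above.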

Why doesn't this work on Realtime TTS 1.5?

Steering tags are an inworld-tts-2 feature. On inworld-tts-1.5-max and inworld-tts-1.5-mini, the model treats [say sadly] as text and reads the brackets aloud, which produces audio like "open bracket say sadly close bracket you actually finished it." TTS 1.5 still supports a small set of experimental emotion markups ([happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]) at the start of text, but the natural-language steering surface and deliveryMode field are TTS-2 only.

The practical rule: any prompt fragment that goes near the TTS layer should know which model it is targeting. A prompt template that works against TTS-2 will sound broken on TTS 1.5, and vice versa.
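
One way to apply that rule is to keep the script and the direction separate until the target model is known. The following is a hedged sketch with a hypothetical function name; it simply drops the instruction for models outside the steering set and makes no attempt to map it onto the experimental TTS 1.5 markups.

```python
from typing import Optional

# Models that interpret natural-language steering tags, per this article.
STEERING_MODELS = {"inworld-tts-2"}


def text_for_model(model_id: str, script: str, instruction: Optional[str] = None) -> str:
    """Attach the steering tag only when the target model will interpret it."""
    if instruction and model_id in STEERING_MODELS:
        return f"[{instruction}] {script}"
    # On other models the brackets would be read aloud, so send the script alone.
    return script


print(text_for_model("inworld-tts-2", "You actually finished it.", "say sadly"))
print(text_for_model("inworld-tts-1.5-max", "You actually finished it.", "say sadly"))
```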

How does this compare with ElevenLabs v3 audio tags?

ElevenLabs Eleven v3 (GA, March 14, 2026) introduced audio tags as their answer to in-script direction. The two systems are genuinely close in spirit, and v3 is a credible competitor on raw expressive range. The differences are about scope and realtime suitability.

Where ElevenLabs v3 is genuinely stronger: language breadth (70+ vs 15 GA), audio tag expressiveness for pre-rendered narration, and a deeper voice library. If the use case is audiobook production, film dubbing, or long-form scripted content where latency does not matter, v3 is well-suited. If the use case is a voice agent that needs to react in under 200ms with the same expressive control, TTS-2 is built for it.

How does this compare with Cartesia Sonic 3.5 and Hume Octave?

Steering is becoming a category-wide pattern, but each vendor approaches it differently.

Hume is the empathy specialist. Their EVI platform focuses on the input side, detecting the user's emotion and conditioning the model's response to it, which is a different angle from output-side steering. For applications where the most important question is "how does the user actually feel right now," Hume is the right tool. For applications where the most important question is "how should the agent sound on this specific line," TTS-2's 8-dimension steering is more direct.

When should I reach for steering vs voice cloning?

Steering and voice cloning solve different problems and often work together.

Voice cloning answers the question of who the speaker is. On TTS-2, instant voice cloning takes 5 to 15 seconds of reference audio; the cloned voiceId then plugs into any TTS-2 request. Professional voice cloning is delivered as a paid add-on through the Growth tier and above, not as a self-serve feature.

Steering answers the question of how that voice delivers a specific line. The same voiceId can run through [say with rising tone], [whisper in a hushed style], [as a wise mentor], or [say angrily] across thousands of requests without retraining and without identity drift.

A typical production pattern: clone the voice once, store the voiceId, then drive performance per request through steering and deliveryMode. That keeps the brand voice consistent while letting the agent react to context turn by turn.
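
That pattern can be sketched as a lookup from conversational intent to direction. Everything named here is illustrative: the intent labels, the placeholder voice ID, and the "text" key are hypothetical, and the instructions are the examples quoted earlier in this section.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Direction:
    instruction: str
    delivery_mode: str = "BALANCED"


# Placeholder: the voiceId returned when the brand voice was cloned.
BRAND_VOICE_ID = "your-cloned-voice-id"

# Hypothetical intent labels mapped to per-line direction.
DIRECTIONS: Dict[str, Direction] = {
    "bad_news": Direction("say sadly", "STABLE"),
    "secret": Direction("whisper in a hushed style"),
    "advice": Direction("as a wise mentor", "CREATIVE"),
}


def request_for_turn(script: str, intent: str) -> dict:
    """Same voice every turn; only the steering and deliveryMode change."""
    direction = DIRECTIONS.get(intent)
    return {
        "text": f"[{direction.instruction}] {script}" if direction else script,
        "voiceId": BRAND_VOICE_ID,
        "modelId": "inworld-tts-2",
        "deliveryMode": direction.delivery_mode if direction else "BALANCED",
    }
```

Because the voice ID is fixed in one place, swapping the brand voice later is a one-line change that leaves the direction table untouched.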

What are the limits I should design around?

Six constraints to plan for.

Steering instructions are English-only, regardless of the language being spoken. A French utterance still takes its direction from an English tag at the start.

Tags belong at the start of the text for the 8 steering dimensions. Only non-verbal sounds like [laugh] or [sigh] can appear inline.

Match the tag to the script. Pairing [say sadly] with a happy line, or [whisper] with [shout], degrades output rather than blending the effects.

TTS-2 is in research preview, not GA, and its SLAs differ from the GA TTS 1.5 line. Builders should design failover to inworld-tts-1.5-max for production workloads that cannot tolerate preview-stage variability.
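
A failover wrapper might look like the sketch below. The synthesize callable is a hypothetical stand-in for whatever client function actually sends the request, and the broad except clause should be narrowed to that client's real error types. The fallback strips every bracketed tag because TTS 1.5 would otherwise read it aloud; the sketch also assumes the same voice ID is valid on both models.

```python
import re
from typing import Callable

BRACKET_TAG = re.compile(r"\s*\[[^\[\]]*\]")


def synthesize_with_failover(
    synthesize: Callable[..., bytes],
    script: str,
    instruction: str,
    voice_id: str,
) -> bytes:
    """Try inworld-tts-2 with steering; fall back to inworld-tts-1.5-max without tags."""
    try:
        return synthesize(
            model_id="inworld-tts-2",
            text=f"[{instruction}] {script}",
            voice_id=voice_id,
        )
    except Exception:  # narrow this to your client's error types
        # Remove inline non-verbals too, so no bracket is spoken literally.
        plain = BRACKET_TAG.sub("", script).strip()
        return synthesize(
            model_id="inworld-tts-1.5-max",
            text=plain,
            voice_id=voice_id,
        )
```

The fallback line loses its direction, which is usually an acceptable trade against returning no audio at all.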

Per-request limit is 2,000 characters. Long scripts should be chunked at sentence breaks (recommended 500 to 1,600 chars per chunk) and streamed through /tts/v1/voice:stream.
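
Sentence-level chunking can be sketched in a few lines. This version packs whole sentences greedily up to a target size of 1,600 characters, which keeps every chunk under the 2,000-character limit; it does not enforce the 500-character lower bound, and a single sentence longer than the target is cut at a fixed width as a last resort. The regular expression is a simple heuristic, not a full sentence segmenter.

```python
import re
from typing import List

SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def chunk_script(text: str, target: int = 1600) -> List[str]:
    """Split text into chunks of at most `target` characters at sentence breaks."""
    chunks: List[str] = []
    current = ""
    for sentence in SENTENCE_BREAK.split(text.strip()):
        if not sentence:
            continue
        # Last resort: a single oversized sentence is cut at fixed width.
        while len(sentence) > target:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:target])
            sentence = sentence[target:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= target:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is sent as its own request, so a steering tag that should apply to the whole script would need to be prefixed to every chunk and counted against the character budget.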

Data residency options include EU and India on Enterprise plans. Check the pricing page for the current region list before planning a multi-region deployment.

FAQ

What is natural-language voice steering?

Natural-language voice steering is the ability to control how a TTS model delivers a line by writing plain-English instructions next to the text. With Inworld AI Realtime TTS-2 (research preview), an instruction like [say sadly] or [whisper in a hushed style] placed at the start of text changes emotion, pitch, pace, or style without changing voice identity.

What dimensions of voice can Inworld TTS-2 control?

8 steering dimensions: emotion, articulation, intonation, volume, pitch, range, speed, and vocal style. Plus inline non-verbals: [laugh], [sigh], [breathe], [cough], [yawn], [clear_throat]. Instructions are written in English regardless of the target speech language.

What is the deliveryMode field in TTS-2?

deliveryMode is an optional field on the /tts/v1/voice request body for the inworld-tts-2 model. It takes STABLE, BALANCED (default), or CREATIVE, controlling how much variance the model is allowed across requests with the same input. It is independent of temperature.

How does Inworld TTS-2 steering compare with ElevenLabs v3 audio tags?

Both embed bracketed instructions in the script. ElevenLabs v3 has broader language coverage (70+) and a deeper audio tag library, but is not recommended for realtime. Inworld TTS-2 generalizes the idea into 8 explicit dimensions, adds deliveryMode, and runs at sub-200ms TTFT median for realtime applications.

Can I use steering tags with Inworld TTS 1.5?

No. On inworld-tts-1.5-max and inworld-tts-1.5-mini, bracketed instructions are read aloud literally. Use inworld-tts-2 for natural-language steering.

Does steering preserve voice identity across emotions and languages?

Yes. TTS-2 keeps a single voice identity stable under steering and preserves that identity across 100+ languages (15 GA plus 90+ experimental).