Published 05.28.2026

Natural-language TTS steering: how to control emotion, tone, and style in 2026

Inworld AI Realtime TTS-2 (research preview, launched May 5, 2026) treats voice direction the same way modern image and video models treat prompts: you write in English, and the model adjusts how it speaks. Natural-language steering on inworld-tts-2 exposes 8 dimensions of vocal control plus a new deliveryMode field, so the same voice can sound urgent, exhausted, or amused without retraining or post-processing. This page explains how steering works, what the 8 dimensions cover, how to call the API today, and how the approach compares with ElevenLabs v3 audio tags, Cartesia Sonic 3.5, and Hume Octave.
TTS-2 is in research preview, not GA. The model currently ranks #1 among realtime TTS models on the Artificial Analysis Realtime TTS Arena (around 1,208 ELO as of late May 2026). Steering and deliveryMode are available on every TTS-2 request via the standard POST /tts/v1/voice endpoint.

What is natural-language voice steering?

Natural-language voice steering is a control method for TTS models where you place plain-English instructions next to the text you want spoken, and the model adjusts delivery in a single forward pass. Instead of fine-tuning a new voice for "angry CEO" or running expensive post-processing to add a sigh, the prompt itself carries the direction.
On Realtime TTS-2, instructions are written in square brackets at the start of the text, in English, even when the target speech is in another language. Non-verbal sounds can appear inline anywhere inside the text. The model treats the brackets as direction, not as text to read.
A useful mental model: TTS-2 separates what is said (the script) from how it is said (the steering tags) and from who is saying it (the voice ID). Each of those three axes can change independently.
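The split between script and steering can be sketched in a few lines of Python. This is a minimal illustration, not part of any Inworld SDK: the helper name steer is hypothetical, and it only does the string composition described above (one English instruction in square brackets, placed before the script).

```python
from typing import Optional


def steer(script: str, instruction: Optional[str] = None) -> str:
    """Prefix a script with one bracketed English steering instruction.

    Hypothetical helper: it only builds the text string; the voice ID
    travels separately in the request.
    """
    if not instruction:
        return script
    # Accept the instruction with or without its own brackets.
    cleaned = instruction.strip().strip("[]").strip()
    return f"[{cleaned}] {script}"


# Same script, different direction; the voice ID is untouched by either call.
sad_line = steer("You actually finished it.", "say sadly")
hushed_line = steer("Vous avez vraiment fini.", "whisper in a hushed style")
```

Keeping the instruction as a separate argument means the script can be stored, translated, or logged without any direction baked into it.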

What 8 steering dimensions does Inworld TTS-2 expose?

TTS-2 exposes 8 explicit steering dimensions: emotion, articulation, intonation, volume, pitch, range, speed, and vocal style. Each has a vocabulary of natural-language phrases that the model understands, and each maps to concrete acoustic properties of the resulting audio.
In addition to the 8 dimensions, TTS-2 supports inline non-verbal tags: [laugh], [breathe], [sigh], [cough], [yawn], [clear_throat]. These can appear anywhere in the text, not just at the start, and multiple non-verbals can be combined.
Three constraints matter in practice. Instructions are English-only, even when target speech is non-English. Steering tags belong at the start of the text; only non-verbals can be inline. And the instruction must match the script: pairing [say sadly] with a celebratory sentence degrades the output rather than improving it.
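The placement rule is easy to check before a request is sent. The sketch below is a hypothetical pre-flight check, not an Inworld API: it skips the run of bracketed tags at the start of the text and reports any later tag that is not one of the six non-verbals listed above. It cannot judge whether an instruction matches the script or whether it is written in English; those still need a human or an upstream prompt.

```python
import re
from typing import List

# Non-verbal tags named in this article; anything else inline is suspect.
NON_VERBALS = {"laugh", "breathe", "sigh", "cough", "yawn", "clear_throat"}
TAG = re.compile(r"\[([^\[\]]+)\]")


def misplaced_tags(text: str) -> List[str]:
    """Return bracketed tags that appear inline but are not non-verbals."""
    rest = text.lstrip()
    # Skip the run of tags at the very start; steering is allowed there.
    while True:
        match = TAG.match(rest)
        if not match:
            break
        rest = rest[match.end():].lstrip()
    # After the first spoken word, only known non-verbals are acceptable.
    return [tag for tag in TAG.findall(rest) if tag.strip() not in NON_VERBALS]
```

An empty list means the text respects the placement rule; a non-empty list names the tags to move to the start or remove.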

How does the deliveryMode field work?

deliveryMode is a new optional field on the POST /tts/v1/voice request body, available on the inworld-tts-2 model. It takes STABLE, BALANCED (the default), or CREATIVE, and controls how much variance the model is allowed to introduce across requests with the same input.
deliveryMode is independent of temperature (default 1.0, valid range 0 to 2 exclusive). You can pair STABLE with a low temperature for a deterministic-feeling UI voice, or CREATIVE with a higher temperature for performance work. In practice, most production conversational apps should leave deliveryMode at its BALANCED default and override it per request only when the use case calls for it.
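A request body that pairs the two settings might be assembled like this. Treat it as a sketch under stated assumptions: voiceId, modelId, and deliveryMode are the field names this article gives, while text and temperature as top-level keys are assumptions, as is reading "0 to 2 exclusive" as excluding both endpoints. The function name is hypothetical.

```python
from typing import Any, Dict, Optional

# Values listed in this article's FAQ; BALANCED is the server-side default.
DELIVERY_MODES = ("STABLE", "BALANCED", "CREATIVE")


def build_tts_payload(
    text: str,
    voice_id: str,
    delivery_mode: Optional[str] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for an inworld-tts-2 request."""
    # "text" as the key for the script is an assumption, not confirmed here.
    payload: Dict[str, Any] = {
        "text": text,
        "voiceId": voice_id,
        "modelId": "inworld-tts-2",
    }
    if delivery_mode is not None:
        if delivery_mode not in DELIVERY_MODES:
            raise ValueError(f"deliveryMode must be one of {DELIVERY_MODES}")
        payload["deliveryMode"] = delivery_mode
    if temperature is not None:
        # Assumption: both endpoints of the 0-to-2 range are excluded.
        if not 0 < temperature < 2:
            raise ValueError("temperature must be between 0 and 2 (exclusive)")
        payload["temperature"] = temperature  # assumed field name
    return payload


# A steady UI voice: STABLE plus a low temperature.
ui_voice = build_tts_payload("[say calmly] Your order shipped.", "my-voice", "STABLE", 0.6)
```

Omitting both optional arguments leaves the defaults to the service, which matches the advice to override only per request.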

What does a real TTS-2 steering call look like?

A working request goes to the streaming TTS endpoint and carries both the steering tag and deliveryMode. The streaming endpoint returns NDJSON; each line carries a base64 audioContent chunk that must be decoded and concatenated.
Three things to verify before shipping. Authorization is Basic, not Bearer, for server-side calls; browsers instead use a Bearer JWT minted server-side via POST /auth/v1/tokens/token:generate. The streaming endpoint returns NDJSON with base64 audioContent per line, not raw bytes. And voiceId plus modelId are REST TTS field names; the Realtime WebSocket session uses voice and model instead. Mixing those across APIs is the most common cause of silent failures.
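The response handling can be sketched independently of any HTTP client. The example below assumes each NDJSON line is a JSON object with audioContent either at the top level or nested under a result key; that nesting is an assumption, so check it against a real response. It also assumes the API credential is already in the form expected after the word Basic. Both function names are hypothetical, and the HTTP call itself is left out.

```python
import base64
import json
from typing import Dict, Iterable, Union


def basic_auth_header(credential: str) -> Dict[str, str]:
    """Server-side header. Assumes the credential is already Basic-ready."""
    return {"Authorization": f"Basic {credential}"}


def decode_ndjson_audio(lines: Iterable[Union[bytes, str]]) -> bytes:
    """Decode and concatenate base64 audioContent chunks from NDJSON lines."""
    audio = bytearray()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        obj = json.loads(line)
        # Assumption: the chunk sits at the top level or under "result".
        holder = obj.get("result", obj)
        chunk = holder.get("audioContent")
        if chunk:
            audio.extend(base64.b64decode(chunk))
    return bytes(audio)
```

With a streaming HTTP client, the iterable would be the response's line iterator, so audio can be decoded as it arrives rather than after the full body is buffered.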

Why doesn't this work on Realtime TTS 1.5?

Steering tags are an inworld-tts-2 feature. On inworld-tts-1.5-max and inworld-tts-1.5-mini, the model treats [say sadly] as text and reads the brackets aloud, which produces audio like "open bracket say sadly close bracket you actually finished it." TTS 1.5 still supports a small set of experimental emotion markups ([happy], [sad], [angry], [surprised], [fearful], [disgusted], [laughing], [whispering]) at the start of text, but the natural-language steering surface and deliveryMode field are TTS-2 only.
The practical rule: any prompt fragment that goes near the TTS layer should know which model it is targeting. A prompt template that works against TTS-2 will sound broken on TTS 1.5, and vice versa.
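One way to make a prompt fragment model-aware is to adapt the text just before it reaches the TTS layer. The sketch below is a hypothetical helper, not an Inworld feature. For any model other than inworld-tts-2 it keeps a single leading tag only if it is one of the experimental TTS 1.5 markups listed above, and drops every other bracketed tag so nothing is read aloud. Dropping inline non-verbals on TTS 1.5 is a conservative assumption; this article does not say how 1.5 handles them.

```python
import re

TTS2_MODEL = "inworld-tts-2"
# Experimental emotion markups this article lists for TTS 1.5.
TTS15_MARKUPS = {
    "happy", "sad", "angry", "surprised",
    "fearful", "disgusted", "laughing", "whispering",
}
TAG = re.compile(r"\[([^\[\]]+)\]\s*")


def adapt_text_for_model(text: str, model_id: str) -> str:
    """Return text that is safe to send to the given TTS model."""
    if model_id == TTS2_MODEL:
        return text  # TTS-2 interprets the tags itself
    body = text.lstrip()
    keep = ""
    first = TAG.match(body)
    if first and first.group(1).strip() in TTS15_MARKUPS:
        keep = f"[{first.group(1).strip()}] "
        body = body[first.end():]
    # Drop every remaining bracketed tag so it is not spoken literally.
    spoken = TAG.sub("", body)
    return keep + " ".join(spoken.split())
```

Stripping tags loses the direction but keeps the audio clean, which is usually the better failure mode when a request has to run on a model without steering.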

How does this compare with ElevenLabs v3 audio tags?

ElevenLabs Eleven v3 (GA, March 14, 2026) introduced audio tags as its answer to in-script direction. The two systems are close in spirit, and v3 is a credible competitor on raw expressive range. The differences are about scope, realtime suitability, and the public ranking.
Where ElevenLabs v3 is genuinely stronger: language breadth (70+ vs 15 GA), audio tag expressiveness for pre-rendered narration, and a deeper voice library. If the use case is audiobook production, film dubbing, or long-form scripted content where latency does not matter, v3 is well-suited. If the use case is a voice agent that needs to react in under 200ms with the same expressive control, TTS-2 is built for it.

How does this compare with Cartesia Sonic 3.5 and Hume Octave?

Steering is becoming a category-wide pattern, but each vendor approaches it differently.
Hume is the empathy specialist. Its EVI platform focuses on the input side, detecting the user's emotion and conditioning the model's response to it, which is a different angle from output-side steering. For applications where the most important question is "how does the user actually feel right now," Hume is the right tool. For applications where the most important question is "how should the agent sound on this specific line," TTS-2's 8-dimension steering is more direct.

When should I reach for steering vs voice cloning?

Steering and voice cloning solve different problems and often work together.
Voice cloning answers the question of who the speaker is. On TTS-2, instant voice cloning takes 5 to 15 seconds of reference audio; the cloned voiceId then plugs into any TTS-2 request. Professional voice cloning is delivered as a paid add-on through the Growth tier and above, not as a self-serve feature.
Steering answers the question of how that voice delivers a specific line. The same voiceId can run through [say with rising tone], [whisper in a hushed style], [as a wise mentor], or [say angrily] across thousands of requests without retraining and without identity drift.
A typical production pattern: clone the voice once, store the voiceId, then drive performance per request through steering and deliveryMode. That keeps the brand voice consistent while letting the agent react to context turn by turn.

What are the limits I should design around?

Six constraints to plan for.
Steering instructions are English-only, regardless of the language being spoken. A French utterance still takes its direction from an English tag at the start.
Tags belong at the start of the text for the 8 steering dimensions. Only non-verbal sounds like [laugh] or [sigh] can appear inline.
Match the tag to the script. Pairing [say sadly] with a happy line, or [whisper] with [shout], degrades output rather than blending the effects.
TTS-2 is in research preview, not GA, and its SLAs differ from those of the GA TTS 1.5 line. Builders should design failover to inworld-tts-1.5-max for production workloads that cannot tolerate preview-stage variability.
Per-request limit is 2,000 characters. Long scripts should be chunked at sentence breaks (recommended 500 to 1,600 chars per chunk) and streamed through /tts/v1/voice:stream.
Data residency options include EU and India on Enterprise plans. Check the pricing page for the current region list before planning a multi-region deployment.
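The chunking constraint above can be sketched as a small sentence-aware splitter. This is an illustration, not an official utility: the function name and the regular expression for sentence breaks are assumptions, and the default target of 1,600 characters simply takes the top of the recommended range so every chunk stays under the 2,000-character limit.

```python
import re
from typing import List

HARD_LIMIT = 2000  # per-request character limit stated above


def chunk_script(text: str, target: int = 1600) -> List[str]:
    """Split text at sentence breaks into chunks of at most `target` chars."""
    if not 0 < target <= HARD_LIMIT:
        raise ValueError(f"target must be between 1 and {HARD_LIMIT}")
    # Naive sentence split: whitespace that follows ., !, or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Last resort: hard-split a single sentence longer than the target.
        while len(sentence) > target:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:target])
            sentence = sentence[target:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= target:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go out as its own streaming request. Because steering tags sit at the start of the text, a tag on the original script presumably lands only in the first chunk; re-applying it to later chunks, and counting it against the limit, is left to the caller.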

FAQ

What is natural-language voice steering?
Natural-language voice steering is the ability to control how a TTS model delivers a line by writing plain-English instructions next to the text. With Inworld AI Realtime TTS-2 (research preview), an instruction like [say sadly] or [whisper in a hushed style] placed at the start of text changes emotion, pitch, pace, or style without changing voice identity.
What dimensions of voice can Inworld TTS-2 control?
8 steering dimensions: emotion, articulation, intonation, volume, pitch, range, speed, and vocal style. Plus inline non-verbals: [laugh], [sigh], [breathe], [cough], [yawn], [clear_throat]. Instructions are written in English regardless of the target speech language.
What is the deliveryMode field in TTS-2?
deliveryMode is an optional field on the /tts/v1/voice request body for the inworld-tts-2 model. It takes STABLE, BALANCED (default), or CREATIVE, controlling how much variance the model is allowed across requests with the same input. It is independent of temperature.
How does Inworld TTS-2 steering compare with ElevenLabs v3 audio tags?
Both embed bracketed instructions in the script. ElevenLabs v3 has broader language coverage (70+) and a deeper audio tag library, but is not recommended for realtime. Inworld TTS-2 generalizes the idea into 8 explicit dimensions, adds deliveryMode, and runs at sub-200ms TTFT median for realtime applications.
Can I use steering tags with Inworld TTS 1.5?
No. On inworld-tts-1.5-max and inworld-tts-1.5-mini, bracketed instructions are read aloud literally. Use inworld-tts-2 for natural-language steering.
Does steering preserve voice identity across emotions and languages?
Yes. TTS-2 keeps a single voice identity stable under steering and preserves that identity across 100+ languages (15 GA plus 90+ experimental).
Copyright © 2021-2026 Inworld AI