Introduction
Latency can make or break real-time voice experiences, whether for interactive game characters, voice agents, or call centers. But measuring latency isn’t as simple as it seems, since differences in provider behavior, streaming, and buffering can skew results.
In this post, we’ll guide you through:
- How to fairly measure latency across TTS providers
- Using our interactive demo to visualize latency
- Techniques to optimize latency in Inworld TTS
 
Measuring Latency Across Providers
Latency comparisons between TTS systems can be misleading unless you control for several key factors:
- Initial Silence: Some providers include a pause before playback begins, adding hundreds of milliseconds. Always measure from the moment text is submitted to the first audible sound.
- Streaming vs. Full Synthesis: Streaming APIs begin outputting audio before the full waveform is synthesized, while standard HTTP APIs wait until synthesis is complete. For a fair comparison, ensure all providers use the same streaming mode.
- Text Normalization and Preprocessing: Differences in text normalization (e.g., handling punctuation or abbreviations) can affect both latency and quality. Confirm that identical normalization settings are applied across systems.
- Post-Processing and Additional Features: Post-processing transformations such as pitch shift, speed adjustments, timestamp alignment, or reformatting the audio output can add extra latency.
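The "initial silence" pitfall above can be controlled for programmatically: instead of timing to the first byte received, scan the decoded audio for the first sample above a noise floor. A minimal sketch, assuming 16-bit signed little-endian mono PCM (adapt the format and threshold to your provider's output):

```python
import struct

def first_audible_offset_ms(pcm: bytes, sample_rate: int = 48000,
                            threshold: int = 500) -> float:
    """Offset (ms) of the first sample above `threshold` in 16-bit
    little-endian mono PCM, or -1.0 if the buffer is entirely silent."""
    n = len(pcm) // 2
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return i * 1000.0 / sample_rate
    return -1.0

# Example: 0.5 s of leading silence followed by 0.1 s of audio.
silence = struct.pack("<24000h", *([0] * 24000))   # 0.5 s at 48 kHz
burst = struct.pack("<4800h", *([8000] * 4800))    # 0.1 s of audio
print(first_audible_offset_ms(silence + burst))    # → 500.0
```

Adding this offset to the time-to-first-chunk gives a time-to-first-audible-sound figure that is comparable across providers regardless of how much leading silence they emit.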
 
Interactive Demo: Visualizing Latency
To make these differences more tangible, we’ve built a demo tool that lets you:
- Compare Inworld TTS latency against other providers on two metrics:
  - Time to first chunk received
  - Time to first audible sound
- Listen to generations from multiple providers, including Inworld, side by side.
 
Optimizing Latency in Inworld TTS
Choose the Right Streaming Mode
Depending on your use case, either the streaming or non-streaming API may deliver better performance.
- Use Non-Streaming for Lowest End-to-End Latency: When the full audio is generated in one shot, the total synthesis time is often shorter than streaming chunk-by-chunk. Choose this option if your priority is minimizing total response time rather than supporting live or interactive playback.
- Use Streaming for Real-Time Interactions: When building voice agents or conversational experiences, streaming lets speech play as it’s generated so users hear responses immediately. This makes the experience feel faster, more responsive, and natural.
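The trade-off between the two modes comes down to two numbers: time to first chunk versus total synthesis time. A sketch that times both against a simulated synthesis stream (the chunk count and per-chunk delay are made-up illustration values, not Inworld figures):

```python
import time
from typing import Iterator, Tuple

def time_stream(chunks: Iterator[bytes]) -> Tuple[float, float]:
    """Return (time_to_first_chunk, total_time) in seconds."""
    start = time.monotonic()
    first = None
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
    return first, time.monotonic() - start

def fake_tts_stream(n_chunks: int = 5, chunk_delay: float = 0.02):
    """Simulated streaming TTS: yields chunks as they are 'synthesized'."""
    for _ in range(n_chunks):
        time.sleep(chunk_delay)  # stand-in for per-chunk synthesis time
        yield b"\x00" * 960

ttfc, total = time_stream(fake_tts_stream())
# Streaming: playback can begin after roughly one chunk delay,
# while the full synthesis spans all chunks.
print(f"first chunk: {ttfc*1000:.0f} ms, total: {total*1000:.0f} ms")
```

For conversational use cases, the first number is what the user feels; for batch or non-interactive synthesis, the second is what matters.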
 
Tip: Streaming over a WebSocket (see below) provides the fastest real-time interactions.
Try WebSocket
Persistent WebSocket connections reduce overhead compared to repeated HTTP requests.
Benefits:
- Faster audio start.
- Context persistence across multiple requests.
- Seamless handling of interruptions.
 
Use case: Interactive characters or voice agents that require low latency and dynamic responses.
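The connection-setup saving is easy to estimate: each fresh HTTPS request pays TCP and TLS handshakes, while a persistent WebSocket pays them once per session. A back-of-the-envelope model (the 100 ms handshake cost and 20-turn conversation are illustrative assumptions, and real HTTP clients with keep-alive also reuse connections, so treat the non-persistent figure as an upper bound):

```python
def total_setup_overhead_ms(n_requests: int, handshake_ms: float,
                            persistent: bool) -> float:
    """Connection-setup overhead for n requests.

    persistent=True models one WebSocket handshake reused for every
    request; persistent=False models a fresh connection each time.
    """
    return handshake_ms * (1 if persistent else n_requests)

# 20 turns of a conversation, ~100 ms TCP+TLS setup per new connection.
http_cost = total_setup_overhead_ms(20, 100.0, persistent=False)  # 2000.0
ws_cost = total_setup_overhead_ms(20, 100.0, persistent=True)     # 100.0
print(http_cost - ws_cost)  # → 1900.0 ms saved over the session
```

On top of the handshake saving, the open socket is what enables context persistence and low-latency interruption handling between turns.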
Disable Text Normalization
Text normalization is a preprocessing step in TTS models that converts written text (e.g., “$5,” “Dr.,” or “3/10”) into a spoken-friendly form (“five dollars,” “doctor,” “March tenth”). Skipping text normalization can cut response time:
- English: 30–40ms on average (up to 300ms for complex text).
- Other languages: up to 1 second saved.
 
Tip: Prompt your LLM to generate speech-ready text and turn off text normalization. Alternatively, apply basic regex rules to normalize the most common cases, such as numbers and currency.
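A minimal sketch of the regex approach, covering a few common cases (the rule set here is an illustration, not an exhaustive normalizer; digits are left as digits, which most TTS engines read correctly):

```python
import re

# Minimal pre-normalization rules; anything beyond these common cases
# is left to the TTS engine or to an LLM prompted for speech-ready text.
_RULES = [
    (re.compile(r"\$(\d+(?:\.\d{2})?)"), r"\1 dollars"),  # "$5" -> "5 dollars"
    (re.compile(r"(\d+)%"), r"\1 percent"),               # "30%" -> "30 percent"
    (re.compile(r"\bDr\.(?=\s)"), "Doctor"),              # "Dr. Smith"
]

def pre_normalize(text: str) -> str:
    for pattern, repl in _RULES:
        text = pattern.sub(repl, text)
    return text

print(pre_normalize("Dr. Smith charged $5 and saved 30% of the time."))
# → Doctor Smith charged 5 dollars and saved 30 percent of the time.
```

Running a pass like this client-side lets you disable the model's built-in normalization while still handling the inputs your application actually produces.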
Minimize Postprocessing
Postprocessing generally adds latency, since it is applied after audio is generated. Examples include pitch shift, speed adjustments, timestamp alignment, and reformatting the audio output. The magnitude of the added latency depends on the combination of features you use and the specifics of your environment or setup.
- Native 48kHz audio: High-quality output is generated by default, which is ideal for fidelity, but in low-bandwidth or unstable network conditions, this may not be optimal for real-time applications. We provide multiple audio formats to help you balance quality vs. transfer time. In some cases, reducing transfer time outweighs the cost of compression.
- Parallel processing: Many features, such as timestamp alignment and audio compression, are processed in parallel. In practice, this often means that combining these features introduces minimal or no additional latency.
 
Tip: Experiment with audio formats and consider disabling optional features if latency is critical. This allows you to find the optimal latency-quality trade-off for your specific use case.
Test Inworld TTS Latency Today
All features are available via our API and TTS Playground.