How to evaluate TTS models for realtime conversational AI
Last updated: April 5, 2026
Inworld AI TTS-1.5 Max ranks #1 on the Artificial Analysis TTS leaderboard with an Elo score of 1,236 based on thousands of blind user preference comparisons (March 2026). The framework below comes from building production voice systems at consumer scale.
Most teams evaluate text-to-speech models incorrectly. The common approach is to listen to a few samples, pick the one that sounds best, and move on. But what sounds good to your team is often very different from what sounds good to your customers.
If you're building realtime voice applications, whether that is a health coach, a sales agent, or customer support, choosing the right TTS model requires a more systematic approach.
What does quality mean for your use case?
Quality isn't universal. What makes a voice good for an audiobook is completely different from what works for a realtime fitness app.
The first question to ask is: what does quality mean for your specific application? A health coach app has very different requirements than a customer support line. Support applications tend to have more convergence around metrics like containment rates and clear communication. But consumer applications often need something harder to measure. Does the voice feel sympathetic? Does it match the brand personality? Does it keep users engaged?
Before you evaluate a single model, define what success looks like for your specific application.
How useful are offline metrics for TTS evaluation?
Standard benchmarks give you a starting point. Word error rate tells you about accuracy. Similarity scores measure how close a cloned voice is to the original. These metrics matter, but they don't tell the full story.
For realtime applications, latency becomes critical. Sub-500 millisecond latency should be the baseline. If your voice agent takes too long to respond, the conversation feels unnatural regardless of how good the voice sounds.
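For streaming TTS, latency is usually measured as time to first audio chunk rather than time to the full clip. A minimal sketch of that measurement (the stream here is simulated; real code would iterate response chunks from your provider's streaming API):

```python
import time
from typing import Iterable, Iterator

def time_to_first_chunk(chunks: Iterable[bytes]) -> float:
    """Return seconds elapsed until the stream yields its first audio chunk."""
    start = time.monotonic()
    iterator: Iterator[bytes] = iter(chunks)
    next(iterator)  # blocks until the first chunk arrives
    return time.monotonic() - start

def simulated_stream(first_chunk_delay_s: float):
    """Stand-in for a streaming TTS response; real code would yield HTTP chunks."""
    time.sleep(first_chunk_delay_s)
    yield b"\x00" * 320  # 10 ms of 16 kHz 16-bit mono silence
    yield b"\x00" * 320

latency = time_to_first_chunk(simulated_stream(0.12))
assert latency < 0.5, f"first-audio latency {latency * 1000:.0f} ms exceeds 500 ms budget"
```

Run this against each candidate model under realistic network conditions, and measure percentiles (P90/P99) rather than averages: a model that is fast on average but occasionally stalls still breaks conversations.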
Third-party benchmarks add objectivity. Your internal team will have opinions about which voice sounds best, and those opinions may not match what your users think. Platforms like Artificial Analysis and Hugging Face Arena rank models by aggregating blind preference judgments from real listeners, capturing human perception rather than relying on technical metrics alone. They're not perfect, but they're more objective than your team's gut feeling.
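For intuition, the Elo scores these leaderboards report come from exactly such pairwise blind comparisons: each "voice A vs. voice B" vote nudges both ratings toward the observed win rate. The standard Elo update is:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One Elo rating update after a single pairwise comparison.

    expected_a is A's predicted win probability given the current ratings;
    k controls how far one result moves the ratings.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models at equal ratings: one win moves them apart symmetrically.
a, b = elo_update(1000.0, 1000.0, a_won=True)  # -> (1016.0, 984.0)
```

Leaderboards vary in their exact k-factor and aggregation details, but the takeaway is the same: an Elo gap translates to a predicted human preference rate, which is a more meaningful comparison than raw metric deltas.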
Why should cost be part of TTS evaluation from the start?
Many voice providers weren't focused on cost because it didn't matter for their primary use cases. When you're doing audiobooks or TikTok content, cost per character is negligible. But when you're serving millions of users concurrently and they're talking for an hour a day, cost becomes a critical constraint.
High costs create binary decisions. Either the use case works at that price point or it doesn't. Lower costs open up experimentation and entirely new applications. When per-character costs drop low enough, use cases that were previously impossible suddenly become worth trying.
If you are building for scale, cost per minute should be part of your evaluation criteria from day one.
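Back-of-the-envelope math makes the scale point concrete. A rough estimator, assuming about 900 characters per spoken minute (roughly 150 words per minute at ~6 characters per word; both numbers are approximations, and the price is hypothetical):

```python
def monthly_tts_cost(price_per_million_chars: float,
                     chars_per_minute: int,
                     minutes_per_user_per_day: float,
                     daily_active_users: int,
                     days: int = 30) -> float:
    """Estimate monthly TTS spend in dollars from a per-character price."""
    total_minutes = minutes_per_user_per_day * daily_active_users * days
    total_chars = total_minutes * chars_per_minute
    return total_chars / 1_000_000 * price_per_million_chars
```

At a hypothetical $5 per million characters, 60 minutes per user per day, and one million daily active users, this works out to roughly $8.1M a month, which is why per-minute cost has to be in the evaluation from day one rather than discovered after launch.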
What real-world failure modes should you test for?
Beyond aggregate metrics, watch for specific failure modes that can break user trust.
Basic stability issues are common: hallucinations that insert random words, and sentences that cut off before finishing. These failures happen in production, and they erode user confidence quickly.
Mispronunciation is another frequent problem, especially with non-standard words. A Japanese word inserted in an English sentence. A customer's unusual name. These details matter, especially for support use cases where personalization builds trust.
Build test cases around these edge cases. They reveal model limitations that demos don't show.
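One way to make these edge cases repeatable is a round-trip check: synthesize each prompt, transcribe the audio with an ASR model, and compare the transcript against the input using word error rate. The WER helper below is self-contained; the prompts are illustrative examples, and the synthesize/transcribe calls would come from your own stack:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative edge-case prompts, one per failure mode described above
EDGE_CASES = [
    "Please confirm the order for Siobhan Nguyen.",             # unusual names
    "The dish is called okonomiyaki, a Japanese street food.",  # foreign word in English
    "Your confirmation code is A3F-77B-Q12.",                   # alphanumerics
]
```

In a full harness you would loop over `EDGE_CASES`, run each through TTS and then ASR, and flag any prompt whose round-trip WER exceeds a threshold. Note that ASR errors inflate WER, so treat failures as candidates for human listening rather than automatic rejections.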
How important is controllability for realtime contexts?
Static voice quality isn't enough for realtime applications. You need dynamic control.
When you're creating an audiobook, you can generate multiple variants and select the one that works best. But realtime applications require you to steer the model for specific contexts on the fly. A meditation app late at night should use a slower, calmer voice, and turn-taking should be tuned to give users more time to respond.
Promptable TTS addresses this by letting you adjust the model's behavior on every generation: speed changes, inflection points, tone adjustments. Evaluate whether models let you make these adjustments dynamically at runtime, not just during setup.
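As a sketch of what per-request control can look like on the client side (the parameter names and values here are illustrative, not any provider's actual API):

```python
from dataclasses import dataclass
from datetime import time as clock

@dataclass
class VoiceStyle:
    speed: float = 1.0              # 1.0 = normal speaking rate
    tone: str = "neutral"
    turn_end_silence_ms: int = 400  # pause before the agent treats the turn as over

def style_for_context(app: str, local_time: clock) -> VoiceStyle:
    """Pick a per-request style; fields are hypothetical, for illustration only."""
    late_night = local_time >= clock(22, 0) or local_time <= clock(6, 0)
    if app == "meditation" and late_night:
        # Slower delivery and a longer turn-taking pause for a calmer session
        return VoiceStyle(speed=0.85, tone="calm", turn_end_silence_ms=900)
    return VoiceStyle()
```

The evaluation question is whether a candidate model accepts this kind of adjustment on every generation at runtime, or only as fixed voice settings chosen during setup.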
Should you test TTS in isolation or as part of the full pipeline?
A common mistake is evaluating TTS in isolation.
Different LLMs output different structures, and those structures affect how TTS performs. Punctuation, sentence structure, and vocabulary choices all influence the final audio output. You might want an LLM that maximizes proper vocabulary, punctuation, grammar, and style because those elements improve TTS expression.
Evaluate your voice stack as a unified pipeline, not as independent components. The interactions between pieces matter as much as the pieces themselves.
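One concrete pipeline concern: streaming LLM tokens arrive as fragments, and most TTS models produce better prosody when given complete, punctuated sentences. A minimal sentence-buffering sketch (naive on abbreviations like "Mr.", but enough to illustrate the interaction):

```python
import re

def sentences_for_tts(llm_tokens):
    """Accumulate streamed LLM text and yield complete sentences for TTS.

    Sending full, well-punctuated sentences rather than raw token fragments
    gives the TTS model enough context for natural prosody.
    """
    buffer = ""
    for token in llm_tokens:
        buffer += token
        # Split on sentence-final punctuation followed by whitespace
        while (match := re.search(r"[.!?]['\")\]]?\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends

chunks = ["Hi there. ", "How can I ", "help you today?"]
# -> ["Hi there.", "How can I help you today?"]
print(list(sentences_for_tts(chunks)))
```

This is exactly the kind of glue code where component-level evaluation misses problems: a TTS model that sounds great on clean test sentences may sound flat or choppy on the fragments your actual LLM produces.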
Why should you run A/B tests with real users?
Offline evaluation only gets you so far. The real test is how models perform with actual users.
Set up controlled experiments where different user segments experience different TTS models, then measure against the metrics you defined earlier. Be clear up front about which metrics matter and which models you're comparing, and if you're using voice cloning, confirm the clones are set up properly before the experiment starts.
Iterate based on real data, not internal preferences.
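For the experiment itself, variant assignment should be deterministic, so a returning user always hears the same voice across sessions. A common hash-based bucketing sketch (the experiment and variant names are placeholders):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("tts_a", "tts_b")) -> str:
    """Deterministically bucket a user so they always hear the same TTS model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Hashing the experiment name together with the user ID keeps assignments independent across experiments, so the same users aren't systematically grouped together in every test you run.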
How does Inworld TTS address the evaluation challenge?
Inworld TTS ranks #1 on the Artificial Analysis leaderboard with an Elo score of 1,236 (March 2026). The evaluation framework above comes from building it.
Inworld TTS was designed for realtime conversational AI at consumer scale: #1-ranked quality through novel architecture and custom kernel development, sub-250ms P90 latency with TTS-1.5 Max, and audio markup tags (experimental, English) for realtime control over emotion, pace, and intensity without redeployment. TTS-1.5 delivers 4x latency improvement, 30% greater expressiveness, 40% reduction in word error rates, and expands multilingual support to 15 languages including Hindi, Arabic, and Hebrew.
Companies like Bible Chat scaled AI voice features to millions of users, and Talkpal achieved a 7% increase in feature usage and 4% lift in retention within four weeks after switching to Inworld TTS.
- Define what "quality" means for your specific use case
- Identify key metrics (retention, time spent, sympathy)

Offline evaluation
- Measure word error rate and similarity score (for voice cloning)
- Test latency (target sub-500ms for realtime)
- Calculate cost per minute at your expected scale
- Check third-party benchmarks (Artificial Analysis, Hugging Face Arena)

Edge case testing
- Test mispronunciation with non-standard words, names, and proper nouns
- Check for hallucinations and incomplete sentences
- Test foreign words inserted in English sentences

Controllability
- Verify speed, tone, and inflection controls work in realtime
- Evaluate turn-taking sensitivity options

Pipeline integration
- Test TTS with your actual LLM output
- Verify punctuation and formatting improve expression
- Evaluate the LLM + TTS pipeline together

Online evaluation
- Set up A/B test infrastructure
- Define success metrics for the experiment
- Run controlled experiments with real users
- Iterate based on results
Frequently asked questions
What metrics should I use to evaluate TTS models?
Start with offline metrics like word error rate and similarity score, but don't stop there. Define metrics specific to your use case, such as user retention, time spent, or whether the voice feels sympathetic. The most important evaluation happens through A/B testing with real users.
What latency is acceptable for realtime voice applications?
Sub-500 millisecond latency should be the baseline for realtime applications. If your voice agent takes too long to respond, the conversation feels unnatural regardless of how good the voice sounds.
How much does TTS cost for high-volume applications?
Cost varies significantly across providers. For realtime applications serving millions of users, per-character cost is one of the most important evaluation criteria. Providers with lower per-character rates enable use cases that are cost-prohibitive at higher price points. See the best TTS APIs comparison for current pricing details across all major providers.
What's the difference between offline and online TTS evaluation?
Offline evaluation uses technical metrics like word error rate and similarity scores to compare models without real users. Online evaluation means A/B testing with actual users, measuring real business metrics like retention and engagement. Offline evaluation is a starting point; online evaluation reveals how models actually perform.
Should I evaluate TTS separately from my LLM?
No. Different LLMs output different text structures, and those structures affect how your TTS sounds. Punctuation, sentence structure, and vocabulary choices all influence the final audio. Evaluate your voice stack as a unified pipeline, not as independent components.
What are the most common TTS failure modes to test for?
Watch for hallucinations (inserting random words), incomplete sentences, and mispronunciation of non-standard words like foreign terms or unusual names. Build test cases around these edge cases, as they reveal limitations that demos don't show.
What is promptable TTS?
Promptable TTS lets you feed in text commands that adjust the model's behavior in realtime. On every generation, you can control speed, inflection, and tone. This matters for realtime applications where the same voice may need to adapt to different contexts, like a meditation app adjusting for late-night use.