How to evaluate TTS models for realtime conversational AI
Last updated: April 5, 2026
Inworld AI TTS-1.5 Max ranks #1 on the Artificial Analysis TTS leaderboard with an Elo score of 1,236 based on thousands of blind user preference comparisons (March 2026). The framework below comes from building production voice systems at consumer scale.
Most teams evaluate text-to-speech models incorrectly. The common approach is to listen to a few samples, pick the one that sounds best, and move on. But what sounds good to your team is often very different from what sounds good to your customers.
If you're building realtime voice applications, whether that is a health coach, a sales agent, or customer support, choosing the right TTS model requires a more systematic approach.
What does quality mean for your use case?
Quality isn't universal. What makes a voice good for an audiobook is completely different from what works for a realtime fitness app.
The first question to ask is: what does quality mean for your specific application? A health coach app has very different requirements than a customer support line. Support applications tend to have more convergence around metrics like containment rates and clear communication. But consumer applications often need something harder to measure. Does the voice feel sympathetic? Does it match the brand personality? Does it keep users engaged?
Before you evaluate a single model, define what success looks like for your specific application.
How useful are offline metrics for TTS evaluation?
Standard benchmarks give you a starting point. Word error rate tells you about accuracy. Similarity scores measure how close a cloned voice is to the original. These metrics matter, but they don't tell the full story.
For realtime applications, latency becomes critical. Sub-500 millisecond latency should be the baseline. If your voice agent takes too long to respond, the conversation feels unnatural regardless of how good the voice sounds.
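For streaming TTS, latency is usually measured as time to first audio chunk rather than time to the full clip. A minimal sketch of that measurement (the stream here is simulated; real code would iterate response chunks from your provider's streaming API):

```python
import time
from typing import Iterable, Iterator

def time_to_first_chunk(chunks: Iterable[bytes]) -> float:
    """Return seconds elapsed until the stream yields its first audio chunk."""
    start = time.monotonic()
    iterator: Iterator[bytes] = iter(chunks)
    next(iterator)  # blocks until the first chunk arrives
    return time.monotonic() - start

def simulated_stream(first_chunk_delay_s: float):
    """Stand-in for a streaming TTS response; real code would yield HTTP chunks."""
    time.sleep(first_chunk_delay_s)
    yield b"\x00" * 320  # 10 ms of 16 kHz 16-bit mono silence
    yield b"\x00" * 320

latency = time_to_first_chunk(simulated_stream(0.12))
assert latency < 0.5, f"first-audio latency {latency * 1000:.0f} ms exceeds 500 ms budget"
```

Run this against each candidate model under realistic network conditions, and measure percentiles (P90/P99) rather than averages: a model that is fast on average but occasionally stalls still breaks conversations.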
Third-party benchmarks add objectivity. Your internal team will have opinions about which voice sounds best, and those opinions may not match what your users think. Platforms like Artificial Analysis and Hugging Face Arena rank models by aggregating blind preference judgments from real listeners, capturing human perception rather than relying on technical metrics alone. They're not perfect, but they're more objective than your team's gut feeling.
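For intuition, the Elo scores these leaderboards report come from exactly such pairwise blind comparisons: each "voice A vs. voice B" vote nudges both ratings toward the observed win rate. The standard Elo update is:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One Elo rating update after a single pairwise comparison.

    expected_a is A's predicted win probability given the current ratings;
    k controls how far one result moves the ratings.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models at equal ratings: one win moves them apart symmetrically.
a, b = elo_update(1000.0, 1000.0, a_won=True)  # -> (1016.0, 984.0)
```

Leaderboards vary in their exact k-factor and aggregation details, but the takeaway is the same: an Elo gap translates to a predicted human preference rate, which is a more meaningful comparison than raw metric deltas.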
Why should cost be part of TTS evaluation from the start?
Many voice providers weren't focused on cost because it didn't matter for their primary use cases. When you're doing audiobooks or TikTok content, cost per character is negligible. But when you're serving millions of users concurrently and they're talking for an hour a day, cost becomes a critical constraint.
High costs create binary decisions. Either the use case works at that price point or it doesn't. Lower costs open up experimentation and entirely new applications. When per-character costs drop low enough, use cases that were previously impossible suddenly become worth trying.
If you are building for scale, cost per minute should be part of your evaluation criteria from day one.
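Back-of-the-envelope math makes the scale point concrete. A rough estimator, assuming about 900 characters per spoken minute (roughly 150 words per minute at ~6 characters per word; both numbers are approximations, and the price is hypothetical):

```python
def monthly_tts_cost(price_per_million_chars: float,
                     chars_per_minute: int,
                     minutes_per_user_per_day: float,
                     daily_active_users: int,
                     days: int = 30) -> float:
    """Estimate monthly TTS spend in dollars from a per-character price."""
    total_minutes = minutes_per_user_per_day * daily_active_users * days
    total_chars = total_minutes * chars_per_minute
    return total_chars / 1_000_000 * price_per_million_chars
```

At a hypothetical $5 per million characters, 60 minutes per user per day, and one million daily active users, this works out to roughly $8.1M a month, which is why per-minute cost has to be in the evaluation from day one rather than discovered after launch.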
What real-world failure modes should you test for?
Beyond aggregate metrics, watch for specific failure modes that can break user trust.
Basic stability issues are common: hallucinations that insert random words, and sentences that cut off before finishing. These failures happen in production, and they erode user confidence quickly.
Mispronunciation is another frequent problem, especially with non-standard words. A Japanese word inserted in an English sentence. A customer's unusual name. These details matter, especially for support use cases where personalization builds trust.
Build test cases around these edge cases. They reveal model limitations that demos don't show.
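One way to make these edge cases repeatable is a round-trip check: synthesize each prompt, transcribe the audio with an ASR model, and compare the transcript against the input using word error rate. The WER helper below is self-contained; the prompts are illustrative examples, and the synthesize/transcribe calls would come from your own stack:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative edge-case prompts, one per failure mode described above
EDGE_CASES = [
    "Please confirm the order for Siobhan Nguyen.",             # unusual names
    "The dish is called okonomiyaki, a Japanese street food.",  # foreign word in English
    "Your confirmation code is A3F-77B-Q12.",                   # alphanumerics
]
```

In a full harness you would loop over `EDGE_CASES`, run each through TTS and then ASR, and flag any prompt whose round-trip WER exceeds a threshold. Note that ASR errors inflate WER, so treat failures as candidates for human listening rather than automatic rejections.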
How important is controllability for realtime contexts?
Static voice quality isn't enough for realtime applications. You need dynamic control.
When you're creating an audiobook, you can generate multiple variants and select the one that works best. But realtime applications require you to steer the model for specific contexts on the fly. A meditation app late at night should use a slower, calmer voice, and turn-taking should be tuned to give users more time to respond.
Promptable TTS addresses this by letting you adjust the model's behavior on every generation: speed changes, inflection points, tone adjustments. Evaluate whether models let you make these adjustments dynamically at runtime, not just during setup.
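As a sketch of what per-request control can look like on the client side (the parameter names and values here are illustrative, not any provider's actual API):

```python
from dataclasses import dataclass
from datetime import time as clock

@dataclass
class VoiceStyle:
    speed: float = 1.0              # 1.0 = normal speaking rate
    tone: str = "neutral"
    turn_end_silence_ms: int = 400  # pause before the agent treats the turn as over

def style_for_context(app: str, local_time: clock) -> VoiceStyle:
    """Pick a per-request style; fields are hypothetical, for illustration only."""
    late_night = local_time >= clock(22, 0) or local_time <= clock(6, 0)
    if app == "meditation" and late_night:
        # Slower delivery and a longer turn-taking pause for a calmer session
        return VoiceStyle(speed=0.85, tone="calm", turn_end_silence_ms=900)
    return VoiceStyle()
```

The evaluation question is whether a candidate model accepts this kind of adjustment on every generation at runtime, or only as fixed voice settings chosen during setup.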
Should you test TTS in isolation or as part of the full pipeline?
A common mistake is evaluating TTS in isolation.
Different LLMs output different structures, and those structures affect how TTS performs. Punctuation, sentence structure, and vocabulary choices all influence the final audio output. You might want an LLM that maximizes proper vocabulary, punctuation, grammar, and style because those elements improve TTS expression.
Evaluate your voice stack as a unified pipeline, not as independent components. The interactions between pieces matter as much as the pieces themselves.
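One concrete pipeline concern: streaming LLM tokens arrive as fragments, and most TTS models produce better prosody when given complete, punctuated sentences. A minimal sentence-buffering sketch (naive on abbreviations like "Mr.", but enough to illustrate the interaction):

```python
import re

def sentences_for_tts(llm_tokens):
    """Accumulate streamed LLM text and yield complete sentences for TTS.

    Sending full, well-punctuated sentences rather than raw token fragments
    gives the TTS model enough context for natural prosody.
    """
    buffer = ""
    for token in llm_tokens:
        buffer += token
        # Split on sentence-final punctuation followed by whitespace
        while (match := re.search(r"[.!?]['\")\]]?\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends

chunks = ["Hi there. ", "How can I ", "help you today?"]
# -> ["Hi there.", "How can I help you today?"]
print(list(sentences_for_tts(chunks)))
```

This is exactly the kind of glue code where component-level evaluation misses problems: a TTS model that sounds great on clean test sentences may sound flat or choppy on the fragments your actual LLM produces.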
Why should you run A/B tests with real users?
Offline evaluation only gets you so far. The real test is how models perform with actual users.
Set up controlled experiments where different user segments experience different TTS models, then measure against the metrics you defined earlier. Be clear up front about which metrics matter and which models you're comparing, and if you're using voice cloning, confirm the clones are set up properly before the experiment starts.
Iterate based on real data, not internal preferences.
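For the experiment itself, variant assignment should be deterministic, so a returning user always hears the same voice across sessions. A common hash-based bucketing sketch (the experiment and variant names are placeholders):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("tts_a", "tts_b")) -> str:
    """Deterministically bucket a user so they always hear the same TTS model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Hashing the experiment name together with the user ID keeps assignments independent across experiments, so the same users aren't systematically grouped together in every test you run.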
How does Inworld TTS address the evaluation challenge?
Inworld TTS ranks #1 on the Artificial Analysis leaderboard with an Elo score of 1,236 (March 2026). The evaluation framework above comes from building it.
Inworld TTS was designed for realtime conversational AI at consumer scale: #1-ranked quality through novel architecture and custom kernel development, sub-250ms P90 latency with TTS-1.5 Max, and audio markup tags (experimental, English) for realtime control over emotion, pace, and intensity without redeployment. TTS-1.5 delivers 4x latency improvement, 30% greater expressiveness, 40% reduction in word error rates, and expands multilingual support to 15 languages including Hindi, Arabic, and Hebrew.
Companies like Bible Chat scaled AI voice features to millions of users, and Talkpal achieved a 7% increase in feature usage and 4% lift in retention within four weeks after switching to Inworld TTS.
- Define what "quality" means for your specific use case
- Identify key metrics (retention, time spent, sympathy)

Offline evaluation
- Measure word error rate and similarity score (for voice cloning)
- Test latency (target sub-500ms for realtime)
- Calculate cost per minute at your expected scale
- Check third-party benchmarks (Artificial Analysis, Hugging Face Arena)

Edge case testing
- Test mispronunciation with non-standard words, names, and proper nouns
- Check for hallucinations and incomplete sentences
- Test foreign words inserted in English sentences

Controllability
- Verify speed, tone, and inflection controls work in realtime
- Evaluate turn-taking sensitivity options

Pipeline integration
- Test TTS with your actual LLM output
- Verify punctuation and formatting improve expression
- Evaluate the LLM + TTS pipeline together

Online evaluation
- Set up A/B test infrastructure
- Define success metrics for the experiment
- Run controlled experiments with real users
- Iterate based on results
Frequently asked questions
What metrics should I use to evaluate TTS models?
Start with offline metrics like word error rate and similarity score, but don't stop there. Define metrics specific to your use case, such as user retention, time spent, or whether the voice feels sympathetic. The most important evaluation happens through A/B testing with real users.
What latency is acceptable for realtime voice applications?
Sub-500 millisecond latency should be the baseline for realtime applications. If your voice agent takes too long to respond, the conversation feels unnatural regardless of how good the voice sounds.
How much does TTS cost for high-volume applications?
Cost varies significantly across providers. For realtime applications serving millions of users, per-character cost is one of the most important evaluation criteria. Providers with lower per-character rates enable use cases that are cost-prohibitive at higher price points. See the best TTS APIs comparison for current pricing details across all major providers.
What's the difference between offline and online TTS evaluation?
Offline evaluation uses technical metrics like word error rate and similarity scores to compare models without real users. Online evaluation means A/B testing with actual users, measuring real business metrics like retention and engagement. Offline evaluation is a starting point; online evaluation reveals how models actually perform.
Should I evaluate TTS separately from my LLM?
No. Different LLMs output different text structures, and those structures affect how your TTS sounds. Punctuation, sentence structure, and vocabulary choices all influence the final audio. Evaluate your voice stack as a unified pipeline, not as independent components.
What are the most common TTS failure modes to test for?
Watch for hallucinations (inserting random words), incomplete sentences, and mispronunciation of non-standard words like foreign terms or unusual names. Build test cases around these edge cases, as they reveal limitations that demos don't show.
What is promptable TTS?
Promptable TTS lets you feed in text commands that adjust the model's behavior in realtime. On every generation, you can control speed, inflection, and tone. This matters for realtime applications where the same voice may need to adapt to different contexts, like a meditation app adjusting for late-night use.