
Aleksey Tikhonov, Head of ML/AI Research
When evaluating TTS systems, we usually focus on the obvious stuff: efficiency and accuracy. But once you nail those basics, things get interesting. What makes one voice more engaging than another? We decided to check it out.
This post will show how, while working on the new Inworld TTS models, we developed metrics for emotionality and expressiveness, and then used them to create voices that better match user needs.
How Do We Evaluate TTS?
First, let's compare the basic things: cost, speed, scalability, and ease of use. This approach assumes that the TTS works flawlessly.
When it doesn't (and, well, it often doesn't), there are well-established metrics to evaluate accuracy issues. We look at generation errors, pronunciation deviations, tone inconsistencies, unnatural sounds, and so on.
But what comes next?
Let's say we have a fast, cheap TTS that generates speech without errors. This is where things get interesting. Depending on our task, we might have some specific requirements. For a virtual AI companion, empathy and emotionality might be important, while for an AI support specialist, calmness and professionalism matter more.
How do we define such properties? Can we learn to work with them objectively? We started with emotionality, since it's the most researched aspect of expressive speech.
Emotionality
Evaluating the emotionality of text or speech is a fairly well-studied problem.
On one hand, theory tells us that the degree of emotionality in speech directly or indirectly affects various acoustic parameters: fundamental frequency (F0, pitch), loudness, speaking rate, and so on [1], [2].
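To make this concrete, here's a minimal numpy-only sketch of how two of those acoustic parameters (F0 and loudness) can be extracted from a waveform frame. Real pipelines use dedicated pitch trackers (pYIN, CREPE, and the like); the autocorrelation approach below is purely illustrative:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate fundamental frequency (F0) of one frame via autocorrelation.
    A crude stand-in for real pitch trackers; fine for clean signals."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)  # lag search window for fmin..fmax
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def rms_loudness(frame):
    """Root-mean-square energy, a simple loudness proxy."""
    return float(np.sqrt(np.mean(frame ** 2)))

sr = 16000
t = np.arange(2048) / sr
frame = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # synthetic 220 Hz tone
print(estimate_f0(frame, sr), rms_loudness(frame))  # ~220 Hz, ~0.35
```

Tracking these values frame by frame over an utterance gives the raw prosodic contours that any emotionality metric has to start from.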
On the other hand, there are plenty of ready-made solutions out there. Currently, HuggingFace hosts more than 20 models for emotion classification from speech audio.
We tried a bunch of them, and, unfortunately, most don't work quite the way we'd like: some seem overfitted to specific domains, while others are simply unstable on our tests. So we built our own "arousal" metric that evaluates overall speech emotionality regardless of the specific emotion.
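To give a feel for what an emotion-agnostic arousal score looks like, here is a toy version that z-scores a few prosodic statistics against reference values and averages them. Everything here (the feature choice, the reference means and standard deviations) is invented for the example and is not our actual metric:

```python
import numpy as np

def arousal_score(f0_track, energy_track, syllable_rate,
                  ref_means=(150.0, 0.1, 4.0), ref_stds=(50.0, 0.05, 1.5)):
    """Toy arousal proxy: higher pitch, energy, and speaking rate all push
    the score up, regardless of which emotion is present.
    Reference statistics are made-up population values."""
    feats = np.array([np.mean(f0_track), np.mean(energy_track), syllable_rate])
    z = (feats - np.array(ref_means)) / np.array(ref_stds)
    return float(np.mean(z))  # average of z-scored prosodic features

# Example: calm vs. excited delivery (synthetic feature tracks)
calm = arousal_score(np.full(100, 140.0), np.full(100, 0.08), 3.5)
excited = arousal_score(np.full(100, 210.0), np.full(100, 0.16), 5.5)
print(calm < excited)  # True: excited speech scores higher
```

The key property this sketch shares with the real metric is that it says nothing about *which* emotion is present, only how aroused the delivery sounds.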
Here are a few generation examples from the same speakers but with different levels of emotionality:
Emotionality vs Expressiveness
Now, if you think about it, emotionality isn't everything we want from expressive speech. Speech can be emotionally neutral yet very expressive.
Think of your favorite audiobook narrated by a professional voice actor. Such a narrator uses various techniques (speed, volume, tone, rhythm, pauses, semantic emphasis) to make the flow of speech less monotonous and to highlight key words and phrases.
Can we automatically evaluate expressiveness beyond emotionality? It turns out we can. We managed to develop an "expressivity" metric that, with reasonable accuracy, determines the level of expressiveness in speech. While arousal captures intensity, expressivity measures variation and dynamic range across prosodic features.
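That distinction between intensity and variation can be sketched in code. The toy score below measures only how much the prosody *moves* (coefficients of variation plus pitch dynamic range), so a loud-but-monotone voice scores low while a lively one scores high; the specific features and combination are illustrative, not our actual metric:

```python
import numpy as np

def expressivity_score(f0_track, energy_track):
    """Toy expressivity proxy: how much the prosody varies,
    not how high it is on average."""
    f0 = np.asarray(f0_track, dtype=float)
    en = np.asarray(energy_track, dtype=float)
    cv_f0 = f0.std() / f0.mean()          # relative pitch variation
    cv_en = en.std() / en.mean()          # relative loudness variation
    f0_range = (np.percentile(f0, 95) - np.percentile(f0, 5)) / f0.mean()
    return float(cv_f0 + cv_en + f0_range)

# Monotone vs. lively delivery with the same average pitch and loudness
rng = np.random.default_rng(0)
flat = expressivity_score(200 + rng.normal(0, 2, 500),
                          0.1 + rng.normal(0, 0.002, 500))
lively = expressivity_score(200 + rng.normal(0, 40, 500),
                            0.1 + rng.normal(0, 0.03, 500))
print(flat < lively)  # True: same averages, very different variation
```

Because both tracks have identical means, an arousal-style metric would rate the two deliveries similarly; only the variation-based score separates them.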
Here are a few examples where we can hear high expressiveness with roughly similar emotionality:
[Audio examples: Philip, Elizabeth, Agnes — each at low and high expressiveness]
From Metrics to Training
With these metrics in hand, we can train models in a controlled way using reinforcement learning approaches, either increasing or decreasing emotionality and expressiveness depending on our preferences. We can also use them for data collection and for training controlled generation models with explicit tags.
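To illustrate the explicit-tag idea, here is a hypothetical sketch: metric scores on training clips are bucketed into discrete control tags that get prepended to the transcript, so the model learns to condition on them at inference time. The tag names and thresholds below are invented for the example:

```python
def expressivity_tag(score, thresholds=(0.3, 0.7)):
    """Map a continuous expressivity score to a discrete control tag.
    Thresholds and tag syntax are illustrative only."""
    lo, hi = thresholds
    if score < lo:
        return "[expressivity:low]"
    if score < hi:
        return "[expressivity:mid]"
    return "[expressivity:high]"

def make_training_example(text, score):
    # Prepend the tag so the TTS model can condition on it
    return f"{expressivity_tag(score)} {text}"

print(make_training_example("Welcome back!", 0.85))
# [expressivity:high] Welcome back!
```

At inference, the user (or an upstream LLM) simply picks the tag, and the model produces speech in the requested style.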
In our recent TTS 1.5 release, we used this approach to achieve more natural results that better match user needs. For instance, our AI companion voices now score 30% higher on expressivity while maintaining natural speech patterns.
Next Steps
Our current results are just the beginning. There's plenty of work ahead and quite a few tricky nuances to figure out: different languages and cultures use different intonation patterns to express emotions; non-standard voices (voices of non-human characters like robots, aliens, monsters, and cartoon characters) have unusual acoustic characteristics and may require special attention.
Either way, we're planning to share the code for our metrics with the community, along with dedicated benchmarks for developing and calibrating new ones.
If you're working on similar problems, we'd love to hear from you. The intersection of perceptual quality and measurable metrics is a fascinating space, and there's room for everyone to contribute.