We examine what expressive speech synthesis AI is, how it works, why it’s so hard to do well, and how to benchmark good speech synthesis.

Attempts to synthesize speech started over 200 years ago in St. Petersburg. In 1779, Christian Kratzenstein, a German-born professor, examined the physiological differences between five long vowels and created a machine that could produce them artificially through vibrating reeds and acoustic resonators.
Speech synthesis has come a long way since then, evolving from the rudimentary speaking machines that inventors like Alexander Graham Bell created in the 1800s, to the electrical synthesizers that wowed crowds at the 1939 New York World’s Fair, and finally to text-to-speech (TTS) systems – which saw a breakthrough at the Electrotechnical Laboratory in Japan in the 1960s.
Throughout that 200-year period, the challenge of artificially reproducing speech has intrigued and daunted generations of inventors and computer scientists, partly because it's a particularly difficult problem to solve. Human speech is an intricate blend of emotion and prosody (e.g., pauses, non-linguistic exclamations, breath intakes, speaking style, contextual emphases, and much more). That expressiveness makes building a speech synthesis model that approximates the complexity of human speech one of the most difficult machine learning tasks to attempt – and one that’s riddled with technical obstacles.
However, the application of new machine learning techniques to today’s sophisticated deep learning models has led to significant progress in the last several years. That progress has translated into realistic text-to-speech models of exceptional quality, able to produce speech that resembles human speech patterns more closely than was ever possible before.
But as applications of speech synthesis have multiplied alongside advancements in large language models like GPT-3 and GPT-4, the demand for different kinds of text-to-speech models to fit various use cases has increased.
Research efforts in the last five years have focused on developing efficient or real-time models for expressive speech synthesis. This work has become a central theme in speech synthesis research because expressive speech is critical for potential applications in customer service, video games, dubbing, audiobooks, content production, and more.
In this post, we’ll examine what expressive speech synthesis is, why it’s so hard to do well, and how to benchmark good speech synthesis.
Interested in speech synthesis models and APIs? Test our model here.
Expressive speech synthesis, also often referred to as expressive text-to-speech (TTS), represents a significant advancement in speech synthesis technology. Unlike traditional TTS systems that focus solely on converting text into speech, expressive speech synthesis aims to give synthesized speech human-like tones, emotions, and other characteristics of embodied speech like pauses, non-verbal utterances, and breath intakes.
Neural TTS models have been central to the development of expressive speech synthesis. Synthetic voices, the predecessors of neural voices, were known for their robotic intonations and stilted delivery that people typically associate with TTS. Neural text-to-speech, however, bears a much closer resemblance to real human voices by exhibiting natural flow, appropriate intonation, and nuanced characteristics such as tone, pace, delivery, pitch, and inflection.
Neural text-to-speech (NTTS) leverages machine learning techniques to train neural networks capable of learning the intricate nuances of human speech and generating outputs that resemble it. Traditionally, text-to-speech systems relied on rule-based or statistical models for speech synthesis. These systems struggled to capture the natural prosody, rhythm, and intonation of human speech because their predefined linguistic and acoustic models followed limited rule-based algorithms. This resulted in outputs that lacked the richness and authenticity of human speech.
In contrast, NTTS models are generative AI models that are trained on vast amounts of speech data to create contextual connections between texts and how those words are spoken. That allows them to produce speech outputs with natural prosody. This shift from rule-based to data-driven approaches has unlocked new possibilities in expressive speech synthesis.
Early neural TTS models faced challenges in producing expressive and emotionally rich speech. However, advancements in deep learning techniques have enabled neural TTS systems to model the complex dynamics of human speech with greater fidelity. By incorporating emotion-specific acoustic features into neural networks, NTTS systems can modify the tone and pitch of synthesized voices to convey a range of emotions.
Moreover, newer neural text-to-speech models have reduced the data requirements for training, making it easier to develop TTS systems tailored to specific languages or dialects. This has democratized expressive speech synthesis, allowing developers to create TTS solutions that cater to diverse linguistic and cultural contexts.
While realistic speech synthesis models have made significant progress in recent years, there is still more work to be done in improving the naturalness, expressiveness, and latency of models. That’s because expressive speech synthesis models must account for many factors that collectively contribute to the richness and authenticity of synthesized speech. While each factor plays a crucial role, here are some of the most important.
Prosody refers to the rhythm, intonation, stress, and pitch variations in speech. It’s one of the most critical factors in expressive speech synthesis since it conveys emotional nuances, emphasis, and naturalness. Prosody also communicates things like gender, accent, dialect, age, other social signifiers, speaker state, and personality traits – something that is particularly important to capture for entertainment-focused use cases like video games. Capturing prosodic features accurately ensures that synthesized speech sounds lifelike and engaging and the speaker remains in-character.
Key challenges to achieving good prosody include:
Some potential strategies to overcome these challenges include techniques like using additional training data, specialized architectures, multi-task learning, autoencoders, fine-tuning, and effective evaluation metrics to guide model development.
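To make the prosodic features above concrete, here is a minimal sketch of the kind of descriptors an evaluation metric or conditioning signal might compute from a frame-level pitch (f0) contour. The feature names and the 10 ms frame step are illustrative assumptions, not a specific model's inputs.

```python
import numpy as np

def prosody_features(f0_hz: np.ndarray, frame_s: float = 0.01) -> dict:
    """Summarize a frame-level pitch contour (0 = unvoiced/pause) into
    simple prosodic descriptors: pitch level, pitch range, pause ratio."""
    voiced = f0_hz[f0_hz > 0]
    return {
        "mean_f0_hz": float(voiced.mean()) if voiced.size else 0.0,
        "f0_range_hz": float(voiced.max() - voiced.min()) if voiced.size else 0.0,
        "pause_ratio": float((f0_hz == 0).mean()),   # fraction of silent frames
        "duration_s": len(f0_hz) * frame_s,
    }

# A flat 200 Hz contour with a pause in the middle (one second total):
contour = np.array([200.0] * 40 + [0.0] * 20 + [200.0] * 40)
feats = prosody_features(contour)
```

Real systems learn far richer prosody representations, but even coarse descriptors like these can guide training or flag flat, monotone output.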
Contextual awareness involves understanding the surrounding context of speech acts – including speaker intentions, discourse structure, and conversational cues. Contextually aware TTS models adapt speech output dynamically based on contextual information in the text, ensuring that synthesized speech is contextually appropriate and coherent.
Key challenges to achieving contextual awareness include:
Some potential strategies to overcome these challenges include context encoding, improving semantic parsing, and contextual vocoding.
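One way to picture context encoding is to summarize the previous utterance into a vector and attach it to every token of the current one, so the acoustic model can condition its prosody on the surrounding dialogue. The hashing-based embedding below is a stand-in for a learned encoder, purely for illustration.

```python
import numpy as np

EMBED_DIM = 16

def embed(text: str) -> np.ndarray:
    """Stand-in word embedding: hash each word into a fixed random vector."""
    vecs = []
    for word in text.lower().split():
        word_rng = np.random.default_rng(abs(hash(word)) % (2**32))
        vecs.append(word_rng.standard_normal(EMBED_DIM))
    return np.stack(vecs) if vecs else np.zeros((1, EMBED_DIM))

def encode_with_context(current: str, previous: str) -> np.ndarray:
    """Concatenate a mean-pooled summary of the previous utterance onto
    every token of the current one."""
    ctx = embed(previous).mean(axis=0)                   # (EMBED_DIM,)
    tokens = embed(current)                              # (tokens, EMBED_DIM)
    ctx_tiled = np.tile(ctx, (len(tokens), 1))
    return np.concatenate([tokens, ctx_tiled], axis=1)   # (tokens, 2*EMBED_DIM)

# "I see." should sound different after bad news than after good news;
# the context vector is what would let the model tell them apart.
enc = encode_with_context("I see.", "We lost the match!")
```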
Acoustic characteristics such as timbre, resonance, and pitch range contribute to the overall quality and realism of synthesized speech. High-quality acoustic modeling is essential for producing natural-sounding speech that closely resembles human voices.
Key challenges to achieving good acoustics include:
Some potential strategies to overcome these challenges include things like using specialized architectures tailored to acoustic modeling and objective evaluation metrics like spectral distortion measures or perceptual quality scores.
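One widely used spectral distortion measure is mel-cepstral distortion (MCD), which compares the mel-cepstral coefficients of synthesized audio against a reference recording. A minimal version, assuming the two sequences are already time-aligned frame by frame:

```python
import numpy as np

def mel_cepstral_distortion(c_ref: np.ndarray, c_syn: np.ndarray) -> float:
    """Mean mel-cepstral distortion in dB between two aligned sequences of
    mel-cepstral coefficient frames, shape (frames, coeffs). The 0th
    coefficient (overall energy) is conventionally excluded."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

ref = np.ones((100, 13))                         # dummy cepstral frames
identical = mel_cepstral_distortion(ref, ref)    # identical audio scores 0 dB
```

Lower MCD means the synthesized spectrum is closer to the reference; in practice dynamic time warping is used first to align the frames, which is omitted here.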
Emotionally expressive speech synthesis involves infusing synthesized speech with various emotional tones, such as happiness, sadness, anger, or excitement. Emotional TTS models can dynamically adjust speech parameters to convey different emotions effectively, thus making the synthesized speech sound more natural.
Key challenges to achieving emotional expressiveness include:
Some potential strategies to overcome these challenges include things like emotion-targeted training, multi-modal input, prosody modeling, and more.
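A common pattern for emotion-targeted conditioning is a learned embedding table: each emotion label maps to a vector that is added to the text encoding, steering the downstream pitch and energy predictions. The table below uses random stand-in vectors and an assumed label set, purely as a sketch.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "excited"]
EMBED_DIM = 8
rng = np.random.default_rng(42)
# Stand-in for a *learned* embedding table: one vector per emotion label.
emotion_table = rng.standard_normal((len(EMOTIONS), EMBED_DIM))

def condition_on_emotion(text_encoding: np.ndarray, emotion: str) -> np.ndarray:
    """Broadcast-add the emotion vector to every token encoding."""
    vec = emotion_table[EMOTIONS.index(emotion)]
    return text_encoding + vec

tokens = np.zeros((6, EMBED_DIM))        # 6 token encodings from a text encoder
happy = condition_on_emotion(tokens, "happy")
sad = condition_on_emotion(tokens, "sad")
```

Because the conditioning is a single vector, the same sentence can be resynthesized in any trained emotion without changing the text encoder at all.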
Speaker adaptation allows expressive TTS models to mimic the unique characteristics of individual speakers' voices. By learning from speaker-specific data, adapted models can produce speech that closely resembles the voices of specific individuals, including accent, intonation, and speech patterns.
Key challenges to achieving good speaker adaptation include:
Some potential strategies to overcome these challenges include things like transfer learning, scaling up the dataset, fine-grained adaptation techniques, and more.
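The fine-grained adaptation idea can be reduced to its core: freeze the base model and optimize only a small speaker vector until the model's predicted speaker statistics match the target voice. In this toy sketch the "model" is the identity function and the loss is a squared error with an analytic gradient, so everything else is an assumption for illustration.

```python
import numpy as np

def adapt_speaker_embedding(target_stats: np.ndarray, steps: int = 200,
                            lr: float = 0.1) -> np.ndarray:
    """Toy speaker adaptation: gradient descent on the speaker vector only,
    with the (implicit) base model held frozen."""
    spk = np.zeros_like(target_stats)       # start from an "average voice"
    for _ in range(steps):
        grad = 2.0 * (spk - target_stats)   # d/dspk of ||spk - target||^2
        spk -= lr * grad
    return spk

target = np.array([0.5, -1.2, 0.3])         # e.g. pitch/timbre statistics
adapted = adapt_speaker_embedding(target)
```

Adapting only a compact embedding rather than the full network is what makes cloning a new voice feasible from minutes, not hours, of data.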

Evaluating the quality of speech is complicated. Conventional evaluations of TTS model performance often rely on Mean Opinion Score (MOS), where human raters assess voice quality on a scale of 1 to 5. However, this method is expensive, slow, and impractical for frequent model and data adjustments during model training.
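For reference, aggregating MOS ratings is simple: report the mean over raters along with a confidence interval so small rating sets aren't over-interpreted. The ratings below are made up, and the normal-approximation interval is a common simplification.

```python
import math

def mos_summary(ratings: list[int]) -> tuple[float, float]:
    """Mean Opinion Score and the half-width of a normal-approximation
    95% confidence interval over 1-5 listener ratings."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci = 1.96 * math.sqrt(var / n)
    return mean, ci

mean, ci = mos_summary([4, 5, 4, 3, 5, 4, 4, 5])  # hypothetical ratings
```

The width of the interval is exactly why MOS is slow: shrinking it requires many raters per sample, which is what automated proxies try to avoid.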
At Inworld, we developed an automated evaluation framework that assesses prompt and generated-voice quality across five key factors, allowing for faster iteration on our model:
By adopting this comprehensive framework, our goal was to expedite and streamline the assessment process and enable the efficient refinement of TTS models with enhanced voice quality.
Despite the challenges that speech synthesis AI models face in generating expressive and lifelike speech, the rate at which models have improved and become more efficient over the last few years is exciting.
At Inworld, we've made significant advancements in reducing latency for both our Inworld Studio and standalone voices, giving our customers more options to generate natural-sounding conversations for their real-time Inworld projects or for use as a standalone API.
Inworld’s speech synthesis model offers high-fidelity voice generation with lifelike rhythm, intonation, expressiveness, and tone – while also achieving extremely low latency. Our new voices have a 250 ms median (p50) end-to-end latency for approximately 6 seconds of generated audio.
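Latency figures like the p50 above are percentiles over many requests rather than averages, which keeps a few slow outliers from skewing the headline number. A minimal nearest-rank percentile over hypothetical timing samples:

```python
import math

def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples_ms)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical per-request end-to-end timings, including one slow outlier:
latencies = [240.0, 260.0, 250.0, 900.0, 245.0]
p50 = percentile(latencies, 50)   # the median ignores the 900 ms outlier
```

Reporting p90 or p99 alongside p50 is common when tail latency matters, as it does for real-time conversation.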
The current state of our TTS model and of expressive TTS technology more broadly opens up exciting opportunities for its use in applications like gaming, audiobooks, voiceovers, customer service, content creation, podcasts, AI assistants, and more. With the pace of recent developments, expect the output of TTS to soon be nearly indistinguishable from real voices.
Want to hear more of Inworld’s TTS voices?