Research
New Speech Tech Makes Text-to-Speech Clearer Without Sounding Condescending
A team of computer scientists, linguists, and cognitive scientists, including researchers in France, has developed a new way to make synthetic speech easier to understand, particularly for second-language (L2) speakers and in noisy environments.
If you’ve ever struggled to understand a subway announcement or a voice assistant speaking too quickly, you’re not alone. While slowing down or over-enunciating speech is a common strategy, it often sounds unnatural and can feel patronizing, especially when used in text-to-speech (TTS) systems.
Instead of copying these traditional speech patterns, the researchers used a technique called reverse correlation to uncover what actually helps listeners recognize spoken words. They found that selectively lengthening specific vowels, particularly tense vowels like those in “sheep” or “fool,” significantly improves comprehension.
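For readers who want a concrete picture, the difference between lengthening only tense vowels and slowing everything can be sketched in a few lines of Python. This is an illustration only, not the researchers' implementation: the `Phone` class, the ARPAbet-style symbols, the two-vowel set, the millisecond durations, and the 1.5x stretch factor are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed, illustrative set of tense vowels in ARPAbet-style notation:
# IY as in "sheep", UW as in "fool". The real set may differ.
TENSE_VOWELS = {"IY", "UW"}


@dataclass
class Phone:
    symbol: str         # phone label (hypothetical notation for this sketch)
    duration_ms: float  # placeholder duration, not measured data


def lengthen_tense_vowels(phones, factor=1.5):
    """Stretch only the tense vowels; leave every other phone unchanged."""
    return [
        Phone(p.symbol, p.duration_ms * factor)
        if p.symbol in TENSE_VOWELS
        else Phone(p.symbol, p.duration_ms)
        for p in phones
    ]


def slow_everything(phones, factor=1.5):
    """Baseline for comparison: stretch every phone by the same factor."""
    return [Phone(p.symbol, p.duration_ms * factor) for p in phones]


# "sheep" as three phones with made-up durations.
sheep = [Phone("SH", 110.0), Phone("IY", 120.0), Phone("P", 80.0)]

selective = lengthen_tense_vowels(sheep)
slowed = slow_everything(sheep)

print([(p.symbol, p.duration_ms) for p in selective])
print(sum(p.duration_ms for p in selective), "ms vs",
      sum(p.duration_ms for p in slowed), "ms")
```

In this toy example, only the vowel in “sheep” gets longer under the selective approach, so the word as a whole grows far less than it does when every sound is stretched.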
This new “clarity mode” was implemented in Matcha-TTS, an open-source speech synthesis system. In experiments with French speakers learning English, participants made over 9 percent fewer errors when listening to clarity mode speech than when listening to fully slowed-down speech. Interestingly, many listeners believed the slowed-down version was more intelligible, even though it led to more mistakes. This suggests people may not always be aware of what supports their understanding.
Further tests showed that the benefits aren't limited to L2 speakers. Native English speakers also performed better when clarity mode was used in simulated subway noise. In these difficult conditions, they relied more on timing cues than on vowel quality, much like L2 listeners.
“Slowing down everything doesn’t always help; it can actually hurt comprehension and make the voice feel less respectful,” said lead author Paige Tuttösi. “But slowing just the right vowel at just the right time improves both understanding and the listener’s perception of the voice.”
The work was presented at a world-leading venue for speech synthesis research, held in Leeuwarden, Netherlands, this past August.
By focusing on how speech is perceived, rather than how it is produced, this research offers a new path toward more inclusive and effective voice technology. The clarity mode, the first TTS mode designed specifically with second-language speakers in mind, is now freely available in Matcha-TTS for researchers and developers.
Media Contact:
Angelica Lim, Practitioner Associate Professor, Computing Science | angelica@sfu.ca