This contribution proposes an objective, vowel-formant-based quality metric for assessing the naturalness of monophthongs synthesized by Text-to-Speech (TTS) systems. Such a metric could eliminate the need for time- and resource-intensive, perception-based mean opinion score (MOS) evaluations. We show that vowel space plots can serve as a basis for developing a more comprehensive, linguistically sound quality metric for TTS systems. In addition, our metric provides detailed insights into the quality of individual monophthongs, which can help identify areas for further optimization in TTS systems. We use a state-of-the-art neural TTS pipeline based on Tacotron and WaveGlow, trained on the LJ Speech dataset, to synthesize the utterances in our validation set and compare the two sets of audio files. Our quality metric is computed automatically, using the Montreal Forced Aligner to align text and audio, Praat to measure formant values, and a custom R script to calculate the vowel spaces and their overlap. Our results show that the vowel spaces of the original and the synthesized audio overlap to a large extent, especially for the means and the central 75% of the data, with more variation when all data points are considered. Furthermore, we show that the overlap of the vowel spaces varies considerably across monophthongs.
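The overlap computation described above (performed in the paper by a custom R script, which is not shown here) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: it treats each vowel space as the convex hull of its F1/F2 formant points, intersects the two hulls with Sutherland–Hodgman clipping, and reports the intersection area relative to the reference hull. All function names are hypothetical.

```python
# Illustrative sketch of a vowel-space overlap measure (not the authors' R script).
# A vowel space is approximated by the convex hull of (F1, F2) points in Hz.

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(poly):
    """Shoelace formula; assumes vertices given in order."""
    area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        area += x1*y2 - x2*y1
    return abs(area) / 2.0

def clip(subject, clipper):
    """Sutherland-Hodgman: clip polygon `subject` by convex CCW polygon `clipper`."""
    def inside(p, a, b):
        # Point is on the interior side (left) of directed edge a->b.
        return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0
    def intersect(p1, p2, a, b):
        # Intersection of segment p1-p2 with the infinite line through a-b.
        dx1, dy1 = p2[0]-p1[0], p2[1]-p1[1]
        dx2, dy2 = b[0]-a[0], b[1]-a[1]
        denom = dx1*dy2 - dy1*dx2
        t = ((a[0]-p1[0])*dy2 - (a[1]-p1[1])*dx2) / denom
        return (p1[0] + t*dx1, p1[1] + t*dy1)
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        input_pts, output = output, []
        for j in range(len(input_pts)):
            p1, p2 = input_pts[j], input_pts[(j + 1) % len(input_pts)]
            if inside(p2, a, b):
                if not inside(p1, a, b):
                    output.append(intersect(p1, p2, a, b))
                output.append(p2)
            elif inside(p1, a, b):
                output.append(intersect(p1, p2, a, b))
        if not output:
            return []
    return output

def overlap_ratio(reference_formants, synthesized_formants):
    """Area of the hull intersection divided by the area of the reference hull."""
    hull_ref = convex_hull(reference_formants)
    hull_syn = convex_hull(synthesized_formants)
    intersection = clip(hull_ref, hull_syn)
    area_ref = polygon_area(hull_ref)
    return polygon_area(intersection) / area_ref if area_ref else 0.0
```

For example, two unit squares of (F1, F2) points offset so that a quarter of the reference hull is covered yield an overlap ratio of 0.25, while identical point sets yield 1.0. The same idea extends to the paper's per-monophthong analysis by computing the ratio separately for each vowel's point cloud.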