21-language neural text-to-speech technology now works on smartphones

Researchers have developed a fast, high-fidelity neural text-to-speech technology for 21 languages that can synthesize one second of speech in 0.1 seconds.

This technology can synthesize one second of speech in just 0.1 seconds using a single CPU core, making it eight times faster than previous methods. (CREDIT: NICT)

The Universal Communication Research Institute of the National Institute of Information and Communications Technology (NICT) has achieved a significant breakthrough in neural text-to-speech technology, developing a system that supports 21 languages with high speed and fidelity.

This technology can synthesize one second of speech in just 0.1 seconds using a single CPU core, making it eight times faster than previous methods. It also enables fast synthesis with a latency of 0.5 seconds on a mid-range smartphone without needing a network connection.

These advanced neural text-to-speech models have been integrated into VoiceTra, NICT's multilingual speech translation application for smartphones, and are now publicly available. Future applications of this technology could include multilingual speech translation and car navigation systems through commercial licensing.

The Universal Communication Research Institute at NICT has been focused on developing multilingual speech translation technology to facilitate communication across language barriers.

Select screenshots of the VoiceTra app. (CREDIT: NICT)

This research and development (R&D) effort has led to the creation of VoiceTra, a speech translation app for smartphones, and various other commercial implementations. Text-to-speech technology, which converts translated text into human-like speech, is crucial for multilingual speech translation.

Recent advancements in neural network technology have significantly improved the quality of synthesized speech, making it nearly indistinguishable from natural speech. However, the computational demands have posed a challenge, particularly for smartphone use without a network connection.

NICT is also working on multilingual simultaneous interpretation technology, which requires continuous translation of speech without waiting for the speaker to finish. This necessitates rapid text-to-speech processing, similar to the needs of automatic speech recognition and machine translation.

Typically, text-to-speech models consist of an acoustic model, which converts input text into intermediate features, and a waveform generative model, which transforms these features into speech waveforms.
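The two-stage structure described above can be sketched in a few lines of Python. This is only a toy illustration of the data flow, not NICT's implementation: the function names, `FRAMES_PER_CHAR`, and `HOP_SIZE` are hypothetical, and simple arithmetic stands in for the neural networks.

```python
from typing import List

# Hypothetical constants chosen only for illustration.
FRAMES_PER_CHAR = 2  # feature frames produced per input character
HOP_SIZE = 4         # waveform samples generated per feature frame


def acoustic_model(text: str) -> List[List[float]]:
    """Toy stand-in for stage 1: text -> intermediate feature frames."""
    frames: List[List[float]] = []
    for ch in text:
        value = (ord(ch) % 32) / 32.0  # fake one-dimensional "feature"
        for _ in range(FRAMES_PER_CHAR):
            frames.append([value])
    return frames


def waveform_model(frames: List[List[float]]) -> List[float]:
    """Toy stand-in for stage 2: feature frames -> waveform samples."""
    samples: List[float] = []
    for frame in frames:
        samples.extend([frame[0]] * HOP_SIZE)  # upsample each frame
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages, as in a typical text-to-speech pipeline."""
    return waveform_model(acoustic_model(text))
```

Because the stages are separate, either one can be swapped for a faster model without changing the other, as long as the intermediate feature format stays the same.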

While Transformer-based neural networks are commonly used for acoustic modeling, NICT has introduced ConvNeXt, a high-speed model recently used in image recognition, into the acoustic model. This change has tripled the synthesis speed without compromising performance.

In 2021, NICT improved the HiFi-GAN model to create MS-HiFi-GAN, which doubled the synthesis speed while maintaining quality. By 2023, further advancements led to MS-FC-HiFi-GAN, which achieved four times faster synthesis than the original HiFi-GAN model.

Combining these innovations, NICT developed a new neural text-to-speech model using a Transformer encoder and ConvNeXt decoder for the acoustic model, along with the MS-FC-HiFi-GAN for waveform generation. This new model can synthesize one second of speech in just 0.1 seconds on a single CPU core, an eightfold improvement over conventional methods.
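Speed figures like these are often expressed as a real-time factor: synthesis time divided by the duration of the audio produced. The short sketch below works through the arithmetic using the numbers quoted above. The function and variable names are illustrative, and the baseline figure rests on the assumption that "eightfold" refers to the same single-core measurement.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time per second of audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figure from the article: 0.1 s of compute for 1 s of speech.
new_rtf = real_time_factor(0.1, 1.0)

# Assumption: the conventional method is 8x slower on the same measure.
baseline_rtf = new_rtf * 8

# Estimated compute time for a 10-second utterance under each figure.
new_time_for_10s = new_rtf * 10.0
baseline_time_for_10s = baseline_rtf * 10.0
```

Under these assumptions, a 10-second utterance would take about 1 second to synthesize with the new model, compared with about 8 seconds for the conventional approach.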

Additionally, the model supports fast synthesis with a latency of 0.5 seconds on a mid-range smartphone without network connectivity, reducing communication costs and eliminating the need for server-based synthesis.

Since March 2024, this technology has been implemented in VoiceTra, supporting 21 languages: Japanese, English, Chinese, Korean, Thai, French, Indonesian, Vietnamese, Spanish, Myanmar, Filipino, Brazilian Portuguese, Khmer, Nepali, Mongolian, Arabic, Italian, Ukrainian, German, Hindi, and Russian.

Looking ahead, NICT aims to promote the social implementation of this technology, particularly in smartphone applications for multilingual speech translation and car navigation systems through commercial licensing.

By making this technology widely available, NICT hopes to enhance communication across language barriers and improve user experiences in various practical applications.

Note: Materials provided above by The Brighter Side of News. Content may be edited for style and length.


Rebecca Shavit is the Good News, Psychology, Behavioral Science, and Celebrity Good News reporter for the Brighter Side of News.