Microsoft’s new language model VALL-E is an artificial intelligence tool that can copy a human voice, including its emotion and tone. It needs only a three-second recording as an acoustic prompt, and it can then deliver an entirely different message in the original speaker’s voice. (AITopics via Windows Central)
Microsoft is investing heavily in AI. Aside from OpenAI’s ChatGPT technology (which will be integrated into Bing and Office apps), it also has the recently unveiled VALL-E tool, a language model trained on 60,000 hours of English speech data. With this technology, a person can synthesize personalized speech in the voice of a different speaker.
In experiments detailed in a paper (Cornell University), VALL-E produced favorable results.
“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity,” the paper reads. “In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”
In some of the shared samples, the speech synthesized from acoustic prompts sounds almost flawless. VALL-E copied the tones and emotions of the original speakers and used them to deliver entirely different personalized speech. For instance, it produced recordings of the same sentence (“We have to reduce the number of plastic bags”) delivered in different moods or tones, such as anger, sleepiness, neutrality, amusement, and disgust.
Despite this strong performance, Microsoft probably plans to improve VALL-E further. And while the technology can be useful in various scenarios, it can also be dangerous in the hands of the wrong individuals. Thankfully, it is currently unavailable to the public, which could give the Redmond company more time to think about how and where it will offer this technology.
What’s your opinion about this? Let us know in the comment section.