Voicebox relies on a novel training method called Flow Matching, which Meta claims delivers more intelligible text-to-speech output and audio that more closely resembles the original sample. Compared with rival models, Meta says Voicebox cuts the average word error rate in cross-lingual style transfer from 10.9% to 5.2%. Because it can carry a speaking style from one language to another, the audio output sounds more authentic.
But the most impressive capability in Voicebox’s arsenal is its “zero-shot” approach, which means it does not need to be retrained on a large set of a speaker’s recordings to mimic that voice. All it needs is a two-second audio clip, from which it picks up everything from the distinct tone and pitch to personal pauses, before it starts generating fresh audio clips with a similar sound profile.
For comparison, Microsoft’s Vall-E AI model works from a three-second audio clip. Meta says its text-to-speech generation model is faster than Vall-E. Just as Microsoft paused the public release of Vall-E, citing abuse risks, Meta is taking a similar approach with Voicebox.
“We recognize that this technology brings the potential for misuse and unintended harm,” Meta says, adding that it wants to take a responsible approach to AI innovation. The company has also released a research paper documenting how it built a classifier model that can distinguish Voicebox-generated audio from an authentic clip of a real human speaking.