The AI can even maintain the emotional tone of the subject’s voice, including anger and amusement, reports Ars Technica.
Dozens of audio samples generated by the tool, known as Vall-E, are available to hear online, alongside the human voices they are copying. The examples also include instances where the AI has copied the acoustic environment of a recording. This means that if it is fed a sample of a phone call, its simulation can sound like it is being spoken over the phone.
However, unlike publicly available AI tools such as the chatbot ChatGPT, Vall-E has not been released to the public, and Microsoft is not allowing people to play with its new creation. This is perhaps because the company is aware of the dangers of the software falling into the wrong hands. If you thought spam texts were bad, imagine getting a fake voice call from a loved one asking for your bank details.
In an ethics statement on the Vall-E website, its creators state: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”
There are already cases of scammers using so-called audio deepfakes to try to steal money from businesses. In 2020, a Hong Kong bank manager was reportedly tricked by AI-generated audio into transferring $35 million to attackers.
The tech has also become common in Hollywood in recent years, which indicates its sophistication. Lucasfilm used an AI-generated voice for Darth Vader in the Disney series Obi-Wan Kenobi. Meanwhile, the use of an AI version of deceased chef Anthony Bourdain’s voice in a documentary entitled Roadrunner caused outrage among some of his fans.
The Microsoft research team said the AI model could improve text-to-speech applications, speech editing, and content creation when combined with other generative AI models such as GPT-3.
The tech utilises tools created by Facebook parent Meta, including an audio compression codec called Encodec. It was also trained on an audio library originally compiled by Meta that contains over 60,000 hours of English-language speech from more than 7,000 speakers.
Ars Technica says that to create a voice simulation, Vall-E analyses how a person sounds and uses Encodec to break that information down into components called “tokens”. It then draws on its training data to work out how that person would sound when speaking phrases outside the three-second audio sample.
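To give a rough feel for what turning audio into “tokens” means, here is a minimal sketch in Python. It is not Encodec and not Microsoft's method: the tiny codebook, the two-sample frame size and the function names are all assumptions invented for illustration. The sketch cuts a waveform into short frames and replaces each frame with the index of the closest entry in a fixed codebook, so that the audio becomes a short list of integers that a model could learn to predict.

```python
def frame_signal(samples, frame_size):
    """Split samples into consecutive, non-overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame (squared Euclidean distance)."""
    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def tokenize(samples, codebook):
    """Map a waveform to a list of integer tokens, one per frame."""
    frame_size = len(codebook[0])
    return [nearest_code(frame, codebook)
            for frame in frame_signal(samples, frame_size)]


def detokenize(tokens, codebook):
    """Rebuild an approximate waveform by concatenating the codebook entries."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples


# Hypothetical hand-written codebook; a real codec learns a far larger one from data.
CODEBOOK = [
    [0.0, 0.0],    # token 0: near silence
    [0.5, 0.5],    # token 1: positive plateau
    [-0.5, -0.5],  # token 2: negative plateau
    [0.5, -0.5],   # token 3: falling edge
]

samples = [0.05, -0.02, 0.45, 0.55, -0.6, -0.4, 0.4, -0.5]
tokens = tokenize(samples, CODEBOOK)
approximation = detokenize(tokens, CODEBOOK)
print(tokens)
print(approximation)
```

The reconstructed waveform only approximates the original, which is the trade-off any lossy codec makes. A real neural codec learns both its codebook and the features it quantises from large amounts of audio instead of using hand-picked values.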