VALL-E: 5 things to know about Microsoft’s AI model that can mimic any voice in 3 seconds – Times of India

Microsoft has shown off VALL-E, a text-to-speech AI model that can simulate a person’s voice from a short audio sample. Beyond the voice itself, it can also match the speaker’s emotional tone and the acoustics of the room. While the technology has many beneficial uses, it also raises ethical concerns. Audio samples are available on GitHub to listen to; here are five things to know about VALL-E.
What is VALL-E?
Microsoft calls VALL-E a “neural codec language model” that generates audio from text input and short samples from a target speaker. It can mimic a voice after listening to a sample as short as 3 seconds. VALL-E is not generally available yet.
Training models
Researchers say they trained VALL-E on 60,000 hours of English-language speech from more than 7,000 speakers, drawn from Meta’s LibriLight audio library. They say this is hundreds of times more data than existing systems use.
To mimic a voice, the target speaker’s voice must closely match the training data. This way, the AI can draw on its ‘training’ to imitate the target speaker reading a desired text aloud.

AI can mimic emotions
The AI model can mimic not only the pitch, huskiness and texture of a voice but also the speaker’s emotional tone and the acoustics of the room. This means that if the target recording contains a disturbance, VALL-E will reproduce the voice as if that disturbance were present.
“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis,” the team of researchers says.
Use case and threats
The AI model could be used for customised text-to-speech applications, media production or robotics. However, it poses a potential threat if misused.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating,” the company said.

For example, scammers could use VALL-E to make spam calls sound real in order to con people. Politicians or people with a sizeable public presence could also be impersonated, as has already happened with deepfakes. Applications that rely on voice commands or voice passwords could be at risk. Furthermore, VALL-E may also take jobs away from voice artists.
Ethical statement
The company has also issued an ethical statement, which says that “the experiments in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker.”
“However, when the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech,” it said.