Microsoft’s new AI bot VALL-E can replicate anyone’s voice with just a 3-second audio sample - Technology News, Firstpost

By Scott Marlette On Jan 10, 2023

Mehul Reuben Das, Jan 10, 2023 13:39:35 IST

A team of researchers at Microsoft has developed a new text-to-speech AI model called VALL-E that can closely simulate a person’s voice once it has heard it. To learn a new voice, all it needs is a three-second audio sample.

According to the researchers, once VALL-E learns a specific voice, it can synthesize audio of that person saying anything, and do it in a way that attempts to preserve the speaker’s emotional tone as well as the acoustic environment the speaker is in.


The developers of VALL-E say it could potentially be used for high-quality text-to-speech applications, for speech editing, where a recording of a person could be edited and changed from a text transcript, and for content creation in conjunction with other generative AI models like GPT-3.

Microsoft’s VALL-E builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. Basically, VALL-E analyzes how a person sounds and breaks down the voice into tokens. Then it uses the training data to match what it “knows” about how that voice would sound if it spoke other phrases.
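To make the idea of breaking audio into tokens concrete, here is a toy sketch in Python. It is not EnCodec or VALL-E: the evenly spaced codebook and the function names are assumptions made up for illustration, and a real neural codec learns its codebooks from data. The sketch only shows what “discrete codes” means: each audio sample is replaced by the index of the nearest entry in a small table.

```python
import math

def make_codebook(levels=8):
    """Hypothetical codebook: evenly spaced amplitudes in [-1, 1].
    A real codec such as EnCodec learns its codebooks instead."""
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]

def encode(samples, codebook):
    """Replace each sample with the index (token) of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - s))
        for s in samples
    ]

def decode(tokens, codebook):
    """Turn tokens back into approximate sample values."""
    return [codebook[t] for t in tokens]

# One cycle of a sine wave stands in for a short snippet of audio.
wave = [math.sin(2 * math.pi * n / 16) for n in range(16)]
codebook = make_codebook()
tokens = encode(wave, codebook)
approx = decode(tokens, codebook)
```

Here `tokens` is a list of small integers, and `approx` is a coarse reconstruction of the original wave. A model that works on such tokens predicts sequences of integers rather than raw waveform values.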

Microsoft used LibriLight, an audio library put together by Meta, to train VALL-E’s voice synthesis skills. It contains 60,000 hours of English-language speech spoken by more than 7,000 different people, most of it taken from LibriVox public domain audiobooks. For VALL-E to produce a satisfactory result, the voice in the three-second sample must closely resemble a voice in the training data.

In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. If the sample came from a telephone call, for instance, the synthetic output will imitate the acoustic and frequency qualities of a telephone call, which is a fancy way of stating that it will sound like a telephone call as well. Additionally, Microsoft’s samples (included in the “Synthesis of Diversity” section) show how VALL-E can produce different voice tones by altering the random seed used during generation.
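The role of the random seed can be illustrated with a small Python sketch. This is not VALL-E’s code: the token probabilities and the `sample_tokens` helper are hypothetical, chosen only to show the general principle that a generator which samples its output gives the same result for the same seed and, typically, a different result for a different seed.

```python
import random

def sample_tokens(probs, length, seed):
    """Draw `length` token indices from a fixed distribution,
    using a random generator initialised with `seed`."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=length)

# Hypothetical probabilities for four possible tokens.
probs = [0.1, 0.2, 0.3, 0.4]

first = sample_tokens(probs, 20, seed=0)
repeat = sample_tokens(probs, 20, seed=0)   # same seed: identical sequence
other = sample_tokens(probs, 20, seed=1)    # new seed: usually a different sequence
```

In a speech model, each sampled token sequence would be decoded into audio, so changing the seed changes how the same sentence is delivered.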
