Meta’s Voicebox Can Recreate Audio Tracks & Generate Speech From Text

June 18, 2023 By Naveen Victor

Facebook’s parent company, Meta, is introducing Voicebox. It’s the company’s state-of-the-art AI model for speech generation tasks. Those tasks include editing, sampling and stylizing, which it handles through in-context learning.

Voicebox is able to produce audio clips from text and edit pre-recorded audio to remove disturbances like car horns. It’s able to do all of this because of four capabilities: in-context text-to-speech synthesis, speech editing and noise reduction, cross-lingual style transfer and diverse speech sampling.

Through in-context text-to-speech synthesis, Voicebox can take an audio sample that’s just two seconds long and generate speech from text that matches the sample’s audio style. Speech editing and noise reduction allows the system to recreate a portion of speech that’s been drowned out by noise or ruined by misspoken words.

That segment can be re-generated without the user having to re-record it or the entire piece. Cross-lingual style transfer can generate speech from a passage of text. All it needs is a sample of a person’s speech and a passage written in English, French, German, Spanish, Polish or Portuguese. The sample speech and text don’t even have to be in the same language.

Diverse speech sampling allows Voicebox to generate speech that’s more representative of how people talk, resulting in a more natural-sounding tone. This can be done in any of the six languages mentioned earlier.

Meta believes that in the future, generative AI models like Voicebox could provide natural-sounding voices to virtual assistants and non-player characters (NPCs) in games. They could also allow visually impaired people to hear messages that have been written to them. Creators could see their workloads reduced as well, especially when creating and editing audio tracks.