Microsoft’s VALL-E AI can learn your speech patterns in 3 seconds

AI and text-to-speech are drawing a lot of attention in early 2023. Microsoft researchers have announced a new text-to-speech AI model called VALL-E that can simulate a person's voice from just a three-second audio sample. Once VALL-E learns a specific voice, it can synthesize speech in that voice and preserve the speaker's emotional tone.

VALL-E could be used for high-quality text-to-speech applications, including speech editing, where changing the text transcript of a recording would make the person say something they originally didn't. Microsoft calls VALL-E a "neural codec language model" that builds on a technology called EnCodec. VALL-E differs from other text-to-speech methods: instead of synthesizing speech by manipulating waveforms, it generates discrete audio codec codes from text and acoustic prompts. It uses EnCodec to break a person's voice down into discrete components called tokens, then uses its training data to match what it "knows" about that voice and determine how it might sound speaking other phrases.
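To make the idea of "discrete audio codes" more concrete, here is a minimal toy sketch in Python. It is not EnCodec and not VALL-E: it assumes a small random codebook and plain nearest-neighbor matching, whereas a real neural codec learns its codebooks from data. Every name in it (`make_codebook`, `encode`, `decode`, the frame size of 8) is hypothetical and chosen only for illustration. It shows how a stretch of audio samples can be turned into a short sequence of integer tokens and back into an approximate waveform.

```python
import math
import random


def make_codebook(num_codes, dim, seed=0):
    """Build a toy codebook of random vectors (a real codec learns these)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]


def frame_signal(samples, frame_size):
    """Split samples into non-overlapping frames, dropping any partial tail."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to the frame."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode(samples, codebook, frame_size):
    """Turn raw samples into a list of integer tokens, one per frame."""
    return [nearest_code(frame, codebook) for frame in frame_signal(samples, frame_size)]


def decode(tokens, codebook):
    """Turn tokens back into an approximate waveform by codebook lookup."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples


# A 0.1-second, 440 Hz test tone at an assumed 8 kHz sample rate (800 samples).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate // 10)]

codebook = make_codebook(num_codes=256, dim=8)
tokens = encode(tone, codebook, frame_size=8)
reconstructed = decode(tokens, codebook)

print(len(tone), "samples ->", len(tokens), "tokens")
```

The point of the sketch is the change of representation: once audio is a sequence of integers drawn from a fixed vocabulary, a language model can predict those integers in much the same way it predicts words. The random codebook here reconstructs the tone poorly, which is exactly the gap a trained codec is meant to close.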

VALL-E was trained on LibriLight, an audio library assembled by Meta that contains 60,000 hours of English-language speech from more than 7,000 speakers, most of it pulled from LibriVox public domain audiobooks. That volume of training data helps the model produce a good result from just a three-second sample.

Microsoft has set up a VALL-E example website where you can get a taste of the technology through dozens of audio examples of the AI model in action.