Microsoft's new VALL-E 2 text-to-speech synthesis achieves human-level performance

You're thinking about Morgan Freeman reading you bedtime stories, right?





Microsoft has introduced VALL-E 2, a new model that takes human-like speech synthesis to another level. This is not just an incremental improvement; it is a big step toward computer-generated voices that sound natural and high quality. It marks significant progress over its predecessor, VALL-E, which could already mimic human speech patterns but still lacked crucial elements such as intonation control and had a tendency to fall into monotonous, repetitive output.

The researchers fixed the token repetition issue and more

The latest model overcomes these limitations with two new techniques, Repetition Aware Sampling and Grouped Code Modeling, both aimed at improving the stability and efficiency of machine-generated speech. But what does that mean in practice? Let's dive into the details and find out.

One common sampling problem is token repetition: the model can get stuck producing repetitive sequences, which causes instability and can even lead to infinite loops during decoding. Repetition Aware Sampling addresses this by taking the decoding history into account, yielding more stable and reliable results. If you have ever heard synthesized speech that stutters or loops on the same sound, this feature is designed to fix exactly that.
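As a rough illustration of the idea, here is a minimal Python sketch; the function name, window size, and threshold are illustrative assumptions, not the paper's exact algorithm. It samples with nucleus (top-p) sampling by default, but if the chosen token has been dominating the recent decoding history, it falls back to plain random sampling from the full distribution to break the loop.

```python
import random
from collections import Counter

def repetition_aware_sample(probs, history, window=10, threshold=0.5, top_p=0.9):
    """Pick a token by nucleus sampling; if it repeats too often in the
    recent history, resample from the full distribution instead."""
    # Nucleus sampling: keep the smallest set of top tokens whose
    # cumulative probability reaches top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for tok, p in ranked:
        nucleus.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*nucleus)
    candidate = random.choices(tokens, weights=weights)[0]

    # How often has this candidate appeared in the recent decoding window?
    recent = history[-window:]
    ratio = Counter(recent)[candidate] / window if recent else 0.0

    # If it is repeating too often, fall back to random sampling over the
    # whole distribution, which makes an endless repetition loop unlikely.
    if ratio > threshold:
        all_tokens, all_weights = zip(*probs.items())
        candidate = random.choices(all_tokens, weights=all_weights)[0]
    return candidate
```

The key design point is that the fallback is only triggered by the decoding history, so normal, non-repetitive generation is left untouched.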

Next is Grouped Code Modeling, a technique focused on efficiency. By grouping codec codes together, it greatly shortens the sequence the model has to process. This speeds up inference and sidesteps the difficulties of modeling very long sequences. Imagine needing to synthesize a long speech quickly; this feature makes that possible without losing quality.
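To see why grouping shortens the sequence, here is a hedged sketch (the helper names and padding scheme are assumptions for illustration): a flat stream of codec tokens is partitioned into fixed-size groups, so an autoregressive model predicting one group per step takes far fewer steps.

```python
def group_codes(codes, group_size, pad=0):
    """Partition a flat codec-token sequence into fixed-size groups,
    padding the tail so the length divides evenly."""
    remainder = len(codes) % group_size
    if remainder:
        codes = codes + [pad] * (group_size - remainder)
    # One tuple per modeling step: the effective sequence length shrinks
    # by roughly a factor of group_size.
    return [tuple(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]

def ungroup_codes(groups):
    """Flatten the groups back into a single codec-token sequence."""
    return [code for group in groups for code in group]
```

For example, a five-token sequence with `group_size=2` becomes three groups, so the model runs three decoding steps instead of five.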

VALL-E 2 will talk just like a human

These are not merely technical terms; they empower VALL-E 2 to produce remarkably natural speech, even for intricate sentences. The model's elegance lies in its simplicity: it only needs a simple set of speech-transcription pairs for training, which makes collecting and handling data much easier.

According to the VALL-E 2 technical paper, the new model outperformed previous systems on the LibriSpeech and VCTK datasets in speech robustness, naturalness, and speaker similarity. It is the first model to reach human parity on these benchmarks, and it produces high-quality speech even for complex and repetitive sentences.

VALL-E 2 holds great promise for helping people who have difficulty speaking, though its potential uses go well beyond that. Imagine giving a voice back to someone who struggles to talk because of conditions like aphasia or amyotrophic lateral sclerosis. Still, we should not overlook the dangers of misuse, such as voice spoofing or impersonation. For practical applications of this technology, protocols for speaker approval and for detecting whether speech is real or computer-generated will be essential.

Could you, for instance, have all the e-books on your PC narrated by Morgan Freeman? You probably could. Publishing them online? That would be a totally different story, and for obvious reasons you shouldn't be able to.

What do you think about VALL-E 2 and speech synthesis? Let us know in the comments below. We learned about this from AIM.

More about the topics: AI