Microsoft has developed an artificial intelligence (AI) speech generator, VALL-E 2, that its researchers say is convincing enough that the company will not release it to the public because of the risk of misuse. The text-to-speech (TTS) generator can replicate a human voice from just a few seconds of audio, achieving what the researchers describe as "human parity."
VALL-E 2, detailed in a paper published on June 17 on the pre-print server arXiv, represents a significant milestone in neural codec language models. The researchers claim it can generate "accurate, natural speech in the exact voice of the original speaker," making it indistinguishable from human speech in many cases.
"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time," the researchers wrote. The AI engine’s success is attributed to two key features: Repetition Aware Sampling and Grouped Code Modeling.
Repetition Aware Sampling makes decoding more stable by taking into account how often recently generated tokens repeat, which keeps the model from getting stuck in loops of sounds or phrases and results in more natural-sounding speech. Grouped Code Modeling organizes the codec codes into groups so that the model processes a shorter input sequence, which speeds up speech generation.
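The two ideas can be illustrated with a minimal Python sketch. It is not Microsoft's implementation: the function names, the window size, the repetition threshold, and the fallback to sampling from the full distribution are all assumptions chosen for illustration, and the token probabilities are plain lists rather than real model output.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, max_ratio=0.5):
    """Pick the next token, falling back to wider sampling if it repeats too often.

    window and max_ratio are hypothetical settings, not values from the paper.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    ratio = recent.count(token) / len(recent) if recent else 0.0
    if ratio > max_ratio:
        # Assumed fallback: resample from the full distribution to break the loop.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Pack consecutive codec codes into groups so the model sees a shorter sequence."""
    return [tuple(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]
```

In this sketch, grouping eight codes in pairs leaves four positions for the model to process, which is the sense in which grouping shortens the sequence. The repetition check only changes behavior when the candidate token already dominates the recent history.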
Using audio samples from the LibriSpeech and VCTK speech libraries, Microsoft assessed VALL-E 2’s performance. The AI surpassed previous zero-shot TTS systems in terms of robustness, naturalness, and speaker similarity. According to the researchers, it is the first TTS system to reach human parity on these benchmarks.
Despite its impressive capabilities, Microsoft has decided not to release VALL-E 2 to the public, citing risks associated with voice cloning and deepfake technology. The company expressed concerns about potential misuse, such as spoofing voice identification or impersonating specific speakers.
"VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public," the researchers stated. They noted that the quality of VALL-E 2’s output depends on the length and quality of speech prompts and environmental factors like background noise.
However, the researchers also highlighted potential future applications for AI speech technology. VALL-E 2 could be used in education, entertainment, journalism, accessibility features, interactive voice response systems, translation, and chatbots. For such applications, ensuring ethical use and speaker approval will be crucial.
As AI technology continues to advance, the balance between innovation and ethical considerations remains paramount. Microsoft’s decision to withhold VALL-E 2 from public release underscores the importance of addressing these concerns responsibly.