Recent advances in speech synthesis have been driven by neural networks and end-to-end modeling. Microsoft previously introduced VALL-E, a neural codec language model that synthesizes high-quality personalized speech from a short recording of an unseen speaker, outperforming existing text-to-speech systems. A new paper presents VALL-E 2, which the authors report achieves human parity in zero-shot text-to-speech synthesis for the first time.

VALL-E 2 improves on its predecessor with two techniques, repetition-aware sampling and grouped code modeling, which improve the stability and performance of synthesis. The model requires only simple training data, which makes it scalable and efficient. In the reported experiments, VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity, reaching human parity on the benchmarks used, and it consistently produces high-quality speech even for complex or repetitive sentences. Demos of VALL-E 2 will be available online, and the paper is accessible on arXiv.
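The summary names repetition-aware sampling without describing it. As a rough illustration only, the sketch below shows one common reading of the idea: draw a token with nucleus (top-p) sampling, and if that token already dominates a recent window of the decoding history, redraw from the full distribution instead. The function names, parameters, and default values here are hypothetical and are not taken from Microsoft's code.

```python
import random

def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            threshold=0.1, rng=None):
    """Illustrative sketch; window and threshold defaults are assumptions."""
    if rng is None:
        rng = random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history)[-window:]
    if recent:
        # Fraction of the recent window already occupied by this token.
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Too repetitive: fall back to sampling from the full distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With an empty history the function behaves like plain nucleus sampling; the fallback only activates once the sampled token has become too frequent in the recent window, which is what lets decoding escape a loop of repeated codes.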
Source link: https://syncedreview.com/2024/06/11/microsofts-vall-e-2-first-time-human-parity-in-zero-shot-text-to-speech-achieved/amp/
#Microsoft achieves human parity in zero-shot text-to-speech with VALL-E 2
![Microsoft’s VALL-E 2: First Time Human Parity in Zero-Shot Text-to-Speech Achieved](https://i0.wp.com/webappia.com/wp-content/uploads/2024/06/image-14.png?fit=758%2C243&quality=80&ssl=1)