Researchers from Alibaba have introduced FunAudioLLM, a framework aimed at making voice interactions between humans and large language models (LLMs) more natural. It consists of two models: SenseVoice for voice understanding and CosyVoice for voice generation. SenseVoice offers multilingual speech recognition and emotion detection, while CosyVoice specializes in multilingual voice generation and cross-lingual voice cloning. Integrating these models with LLMs enables applications such as speech-to-speech translation and interactive podcasts.

Experimental results show that SenseVoice outperforms existing models such as Whisper on various benchmarks while also performing speech recognition faster. CosyVoice demonstrates high-quality speech synthesis, matching or surpassing original utterances in content consistency and speaker similarity. The researchers have open-sourced the SenseVoice and CosyVoice models on platforms such as ModelScope and Hugging Face.

While the system shows promise, the researchers acknowledge limitations, including lower performance on under-resourced languages and the need to improve emotional expression while preserving the original voice characteristics. This development follows Alibaba’s release of Tongyi Wanxiang, an image generator positioned against models such as Midjourney and DALL-E. FunAudioLLM marks a significant step forward for Alibaba’s generative models.

