Qwen Researchers Unveil Qwen3-TTS: A Multilingual TTS Suite Featuring Real-Time Latency and Precise Voice Control

Alibaba Cloud’s Qwen team has open-sourced Qwen3-TTS, a new suite of multilingual text-to-speech models. The release bundles voice cloning, voice design, and high-quality speech generation into a single package.

The Qwen3-TTS suite includes five models, each targeting a different use case. Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base are intended for voice cloning and standard text-to-speech applications. The CustomVoice models come with preset speakers, letting users choose from nine distinct voices, including a bright young Chinese female voice named Vivian and a dynamic English male voice named Ryan. The most distinctive model, Qwen3-TTS-12Hz-1.7B-VoiceDesign, creates new voices from natural language descriptions, such as asking it to generate a voice that sounds like a nervous teenage boy.
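The repository and model cards document the actual inference interface; the snippet below is only a rough sketch of how the voice-design and preset-speaker workflows might be driven from Python. The module `qwen3_tts`, the `load_model` and `generate` helpers, and the CustomVoice model identifier are hypothetical placeholders, not the real Qwen3-TTS API.

```python
# Hypothetical sketch only: `qwen3_tts`, `load_model`, and `generate` are
# placeholder names, not the actual Qwen3-TTS API. Consult the official
# repository for the real loading and inference code.
from qwen3_tts import load_model  # hypothetical import

# Voice design: synthesize speech in a voice described in natural language.
vd_model = load_model("Qwen3-TTS-12Hz-1.7B-VoiceDesign")
audio = vd_model.generate(
    text="I'm really not sure I studied enough for this exam...",
    voice_description="a nervous teenage boy, slightly shaky, fast-paced",
)

# CustomVoice: pick one of the nine preset speakers instead of describing one.
# (Model identifier below is an assumption; the article names only the Base
# and VoiceDesign checkpoints explicitly.)
cv_model = load_model("Qwen3-TTS-12Hz-1.7B-CustomVoice")
audio_preset = cv_model.generate(text="Welcome back to the show!", speaker="Ryan")
```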

All models in this suite support ten languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. This broad language support makes Qwen3-TTS a versatile tool for global applications.

Under the hood, Qwen3-TTS uses a dual-track language model that predicts acoustic tokens from text while also handling control signals. The system was trained on over five million hours of multilingual speech, which underpins its audio quality. The tokenizer operates at 12.5 frames per second, so each second of audio maps to relatively few tokens and the model can emit audio packets quickly, which is crucial for real-time applications.
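To see why the 12.5 frames-per-second rate matters for latency, the short calculation below (plain Python, independent of any Qwen3-TTS code) works out how much audio each acoustic frame covers and how many frames the language model must emit for a given utterance.

```python
# Back-of-the-envelope latency math for a 12.5 Hz speech tokenizer.
# These are just the numbers implied by the stated frame rate; no
# Qwen3-TTS code is involved.

FRAME_RATE_HZ = 12.5  # acoustic token frames per second of audio

ms_per_frame = 1000 / FRAME_RATE_HZ
print(f"Each acoustic frame covers {ms_per_frame:.0f} ms of audio")  # 80 ms

# If the decoder streams audio in small packets, a packet of N frames
# corresponds to N * 80 ms of playable audio.
for frames_per_packet in (1, 2, 5):
    packet_ms = frames_per_packet * ms_per_frame
    print(f"{frames_per_packet} frame(s) per packet -> {packet_ms:.0f} ms of audio")

# Frames the language model must produce for a 10-second utterance.
seconds = 10
print(f"A {seconds}-second clip needs ~{int(FRAME_RATE_HZ * seconds)} acoustic frames")
```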

In terms of performance, the Qwen3-TTS models post strong results across benchmarks. The Qwen3-TTS-12Hz-1.7B-Base model achieved a Word Error Rate (WER) of 1.24% for English, placing it among the best in its class, and it closely matches leading systems on Chinese. The multilingual results are particularly notable: Qwen3-TTS outperformed competing systems in six of the ten supported languages, showcasing its robustness.
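For context, WER for a TTS system is typically measured by transcribing the generated audio with a speech recognizer and comparing the transcript against the input text using word-level edit distance. A minimal, self-contained version of that comparison (not the Qwen3-TTS evaluation code) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word in a five-word reference -> WER = 0.2 (20%).
print(word_error_rate("please call me back today", "please call me back tonight"))
```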

The alignment and control features round out the suite. A multi-stage alignment process tunes the generated speech toward human preferences, and users can give detailed instructions about the style and emotion of the speech, which makes the models well suited to personalized voice applications.

This open-source release is licensed under Apache 2.0, making it accessible for developers and researchers alike. The Qwen team encourages users to explore the model weights, repository, and playground to experiment with the technology.

Overall, Qwen3-TTS represents a significant advancement in text-to-speech technology, offering a comprehensive solution for voice generation and customization across multiple languages. As AI continues to evolve, innovations like these pave the way for more interactive and personalized digital experiences.