FlashLabs Unveils Chroma 1.0: A 4B Real-Time Speech Dialogue Model Featuring Personalized Voice Cloning

Researchers at FlashLabs have introduced Chroma 1.0, a speech-to-speech dialogue model that takes audio as input and responds with audio while preserving the speaker’s voice across conversations. The model is open source and combines low-latency interaction with high-quality voice cloning from just a few seconds of reference audio.

Chroma 1.0 stands out because it works directly with speech representations instead of converting audio to text first. This approach helps retain vocal details such as emotion and tone, which are often lost in cascaded systems that chain separate steps such as speech recognition and text-to-speech synthesis. At 4 billion parameters, Chroma is compact and aims for a more natural conversational experience, with a reported 10.96% improvement in speaker similarity over a human baseline.

The model is built on two main systems: the Chroma Reasoner and a speech generation stack. The Reasoner handles understanding and generating text, while the speech stack converts this output into audio. The Reasoner uses advanced techniques to process both text and audio, ensuring that the model remains aware of the speaker’s identity and emotional cues throughout the interaction.

Chroma’s architecture includes a 1-billion-parameter backbone that generates audio codes conditioned on the dialogue context. It interleaves text and audio generation, so speech output can begin shortly after the first text is produced rather than after the full text response is complete, which reduces the wait before the user hears a reply.
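The interleaving idea can be illustrated with a minimal Python sketch. It is not Chroma's implementation: the fixed schedule of audio codes per text token, the default ratio, the sample tokens, and the function name `interleave` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple, Union

_END = object()  # sentinel marking an exhausted audio stream


def interleave(
    text_tokens: Iterable[str],
    audio_codes: Iterable[int],
    audio_per_text: int = 2,  # assumed ratio, chosen only for illustration
) -> Iterator[Tuple[str, Union[str, int]]]:
    """Yield ("text", token) and ("audio", code) items in one interleaved stream."""
    audio_iter = iter(audio_codes)
    for token in text_tokens:
        yield ("text", token)
        # Emit a few audio codes right after each text token so that
        # playback can start before the text response is finished.
        for _ in range(audio_per_text):
            code = next(audio_iter, _END)
            if code is _END:
                break
            yield ("audio", code)
    # Flush any audio codes left over once the text is exhausted.
    for code in audio_iter:
        yield ("audio", code)


# Hypothetical tokens and codes, not real model output.
stream = list(interleave(["Hi", "there"], [11, 12, 13, 14, 15]))
first_audio_index = next(i for i, (kind, _) in enumerate(stream) if kind == "audio")
print(first_audio_index)  # prints 1: audio starts right after the first text token
```

In this toy schedule the first audio code appears at position 1 of the stream, which is the property that lets a streaming system begin speaking while the rest of the text is still being generated.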

Chroma was assessed with both objective and subjective evaluations. It achieved a speaker similarity score of 0.81, surpassing many existing systems. In a comparative listening test on naturalness, however, listeners preferred the ElevenLabs model over Chroma. Even so, voice cloning remains Chroma’s main point of differentiation from other models.
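The 0.81 similarity score and the 10.96% improvement quoted earlier can be related with a short calculation. The sketch below assumes that the 10.96% figure is a relative gain and that it refers to the 0.81 score; the article does not state the baseline value, so the number computed here is only what those two figures would imply.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative gain of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0


def implied_baseline(score: float, improvement_pct: float) -> float:
    """Baseline that would yield the given relative gain (inverse of the above)."""
    return score / (1.0 + improvement_pct / 100.0)


chroma_similarity = 0.81   # speaker similarity score quoted in the article
reported_gain_pct = 10.96  # improvement quoted in the article

# Assumption: the gain is relative and applies to the 0.81 score.
baseline = implied_baseline(chroma_similarity, reported_gain_pct)
print(round(baseline, 2))  # prints 0.73
```

Under these assumptions the implied human baseline is about 0.73, and `relative_improvement(0.81, 0.73)` rounds back to 10.96.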

Latency measurements indicate that Chroma can generate responses quickly, with a time to first token of around 147 milliseconds, making it suitable for real-time dialogue applications. The model has been tested on various benchmarks, achieving competitive scores in dialogue tasks while demonstrating strong reasoning capabilities.
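Time to first token is typically measured as the delay between sending a request and receiving the first unit of output from a streaming model. The following is a minimal sketch of that measurement in Python; `fake_streaming_model` is a hypothetical stand-in that sleeps before yielding, not Chroma's API, and its delay and chunk size are arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_model(first_delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming speech model."""
    time.sleep(first_delay_s)  # simulated work before the first chunk is ready
    for _ in range(n_chunks):
        yield b"\x00" * 320    # placeholder audio chunk


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the model yields its first output
    return time.perf_counter() - start, first


ttft_s, first_chunk = time_to_first_chunk(fake_streaming_model())
print(f"time to first chunk: {ttft_s * 1000:.1f} ms")
```

Because the generator does no work until the first `next` call, the timer here covers the whole simulated delay. Measuring a real deployment would also include network and audio capture overhead, which this sketch ignores.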

Chroma 1.0 is a significant step forward in the field of speech technology. It not only offers real-time processing but also excels in voice personalization, making it a promising tool for future applications in conversational AI.