Introducing Kani-TTS-2: A 400M-Parameter Open-Source Text-to-Speech Model That Runs on 3GB of VRAM, with Voice Cloning

The team at nineninesix.ai has released Kani-TTS-2, a new open-source text-to-speech model. It is designed to deliver high-quality speech synthesis more efficiently than traditional systems. Kani-TTS-2 treats audio as a language: it represents speech as a sequence of audio tokens and predicts them one after another, which lets it produce clear, natural-sounding speech without the heavy resource demands of older models.

Available on Hugging Face, Kani-TTS-2 comes in both English and Portuguese versions. It aims to offer a cost-effective alternative to proprietary text-to-speech APIs, making it accessible for developers and businesses alike.

The architecture of Kani-TTS-2 is based on LiquidAI’s LFM2 model, which is known for its efficiency. Kani-TTS-2 generates audio autoregressively, predicting each next audio token from the tokens that precede it, rather than relying on older methods that can be slow and cumbersome. NVIDIA’s NanoCodec then converts these tokens into high-quality waveforms.
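
The two stages described above, next-token prediction followed by codec decoding, can be illustrated with a toy sketch. It is not the Kani-TTS-2 or NanoCodec API: `toy_next_token` and `toy_decode` are hypothetical stand-ins for the language model and the codec decoder, and the 8-entry token vocabulary is an assumption chosen only to keep the example small.

```python
from typing import Callable, List, Sequence


def generate_audio_tokens(
    next_token: Callable[[Sequence[int]], int],
    prompt: Sequence[int],
    max_new_tokens: int,
    end_token: int,
) -> List[int]:
    """Autoregressive loop: predict one token, append it, and repeat."""
    context = list(prompt)          # conditioning context (e.g. encoded text)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == end_token:      # the model signals that the utterance is done
            break
        context.append(token)       # the new token becomes part of the context
        generated.append(token)
    return generated


def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the language model: a fixed arithmetic rule."""
    return (context[-1] * 3 + 1) % 8


def toy_decode(audio_tokens: Sequence[int], samples_per_token: int = 4) -> List[float]:
    """Hypothetical stand-in for the codec decoder: each token in 0..7
    becomes a short run of identical samples in the range [-1, 1]."""
    waveform: List[float] = []
    for token in audio_tokens:
        level = token / 7 * 2 - 1
        waveform.extend([level] * samples_per_token)
    return waveform


audio_tokens = generate_audio_tokens(toy_next_token, prompt=[2], max_new_tokens=5, end_token=0)
waveform = toy_decode(audio_tokens)
print(audio_tokens)   # [7, 6, 3, 2, 7]
print(len(waveform))  # 20
```

In a real system the next-token function would be a neural network sampling from a learned distribution, and the decoder would be a trained codec rather than a lookup, but the control flow has the same shape: generate tokens one at a time, then decode the whole sequence into samples.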

One of the standout features of Kani-TTS-2 is its training efficiency. The English version of the model was trained on 10,000 hours of high-quality speech data in just six hours on a cluster of eight NVIDIA H100 GPUs, or 48 GPU-hours in total. Training runs this short reduce the time and cost of developing speech synthesis models.
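
To put those training figures in perspective, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch: it assumes the six hours are wall-clock time on all eight GPUs and that the data was seen once, neither of which the announcement spells out. The hourly price passed to `estimated_cost` is a placeholder, not a quoted rate.

```python
AUDIO_HOURS = 10_000      # hours of speech in the English training set
WALL_CLOCK_HOURS = 6      # reported training time
NUM_GPUS = 8              # NVIDIA H100 GPUs in the cluster

# Total compute spent, in GPU-hours.
gpu_hours = WALL_CLOCK_HOURS * NUM_GPUS

# Hours of audio consumed per GPU-hour and per wall-clock hour,
# assuming a single pass over the data.
audio_hours_per_gpu_hour = AUDIO_HOURS / gpu_hours
audio_hours_per_wall_clock_hour = AUDIO_HOURS / WALL_CLOCK_HOURS


def estimated_cost(total_gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Rental cost for a run, given a caller-supplied hourly GPU price."""
    return total_gpu_hours * price_per_gpu_hour


print(gpu_hours)                                  # 48
print(round(audio_hours_per_gpu_hour, 1))         # 208.3
print(round(audio_hours_per_wall_clock_hour, 1))  # 1666.7
print(estimated_cost(gpu_hours, 2.0))             # 96.0, with a placeholder price of 2.0 per GPU-hour
```

Substituting a real hourly rate for the placeholder gives a rough compute cost for reproducing the run, excluding data preparation and storage.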

Kani-TTS-2 also features zero-shot voice cloning, which allows it to replicate a speaker’s voice from a short audio sample without fine-tuning on that speaker. This makes it easier for developers to create voice applications tailored to specific needs.

The model is lightweight: with 400 million parameters, it can run in about 3GB of VRAM on consumer-grade GPUs such as the RTX 3060 or 4050. This makes it a practical choice for users who want to add advanced speech synthesis to their projects.
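
A rough estimate shows why a 400-million-parameter model fits comfortably in 3GB of VRAM. The sketch below counts only the memory needed for the weights at a few common numeric precisions. Which precision Kani-TTS-2 actually uses at inference time is an assumption here, and the estimate ignores activations, the attention cache, and the codec, all of which consume part of the remaining headroom.

```python
PARAMS = 400_000_000   # parameter count stated for the model
VRAM_BUDGET_GB = 3.0   # VRAM figure from the headline

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(params: int, dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(params: int, dtype: str, budget_gb: float) -> float:
    """VRAM left over for activations, caches, and other buffers."""
    return budget_gb - weight_memory_gb(params, dtype)


for dtype in BYTES_PER_PARAM:
    weights = weight_memory_gb(PARAMS, dtype)
    spare = headroom_gb(PARAMS, dtype, VRAM_BUDGET_GB)
    print(f"{dtype}: weights {weights:.1f} GB, headroom {spare:.1f} GB")
```

Even at full 32-bit precision the weights take about 1.6 GB, a little over half the budget, and at 16-bit precision they take about 0.8 GB.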

Kani-TTS-2 is released under the Apache 2.0 license, which permits commercial use. This opens up new possibilities for businesses and developers interested in integrating high-quality text-to-speech capabilities into their applications. The launch of Kani-TTS-2 marks a significant step forward for open-source speech synthesis, providing a powerful tool for creating engaging audio experiences.