Tencent Hunyuan has open-sourced HPC-Ops, a library of high-performance operators for large language model (LLM) inference on NVIDIA GPUs. By focusing on low-level CUDA kernels, HPC-Ops accelerates core operations such as Attention and Grouped GEMM, and exposes them through straightforward C and Python APIs so developers can integrate them into existing systems.
HPC-Ops has already shown impressive results in Tencent's internal services, improving queries processed per minute (QPM) by about 30% for Tencent-HY models and around 17% for DeepSeek models. These gains come from faster kernel execution, which is vital for efficient inference in real-time applications.
Built by the Tencent Hunyuan AI Infra team, the library is designed to complement existing inference frameworks rather than replace them. It provides optimized kernels behind clean APIs, so developers can use HPC-Ops alongside popular frameworks like vLLM and SGLang and gain performance without restructuring the surrounding system.
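To illustrate what this kind of non-disruptive integration can look like, here is a minimal Python sketch of dispatching to an optimized kernel behind an existing code path. The module name `hpc_ops` and the function `fused_attention` are hypothetical placeholders for illustration, not the library's actual API; the point is the pattern, in which the surrounding framework code does not change.

```python
import torch.nn.functional as F

# Hypothetical integration point: `hpc_ops` and `fused_attention` are
# illustrative placeholders, NOT the library's real module or function names.
try:
    import hpc_ops  # placeholder name for an optimized-kernel package
    _HAVE_HPC_OPS = True
except ImportError:
    _HAVE_HPC_OPS = False

def attention(q, k, v):
    """Use an optimized kernel when available; otherwise fall back to
    stock PyTorch, leaving the rest of the serving stack untouched."""
    if _HAVE_HPC_OPS:
        return hpc_ops.fused_attention(q, k, v)  # hypothetical call
    return F.scaled_dot_product_attention(q, k, v)
```

Because the fallback path is stock PyTorch, the same model code runs with or without the optimized kernels installed, which is what lets a kernel library slot in without architectural changes.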
HPC-Ops supports multiple data types, including bf16 and fp8, which are increasingly used in production for their balance of performance and accuracy. The library currently includes three main operator families: Attention kernels, Grouped GEMM, and Fused MoE, each targeting a different hot spot in the inference pipeline.
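Of the three, Grouped GEMM is the least self-explanatory: it performs a batch of independent matrix multiplications whose shapes differ per group, which is exactly the pattern MoE expert dispatch produces. The plain-PyTorch reference below (not HPC-Ops code) shows the semantics; an optimized kernel fuses this loop into a single GPU launch instead of one launch per group.

```python
import torch

def grouped_gemm_reference(a_list, b_list):
    """Reference semantics of a Grouped GEMM: many independent matmuls,
    each with its own shape. An optimized kernel fuses this whole loop
    into one launch rather than len(a_list) separate ones."""
    return [a @ b for a, b in zip(a_list, b_list)]

# Example: three hypothetical MoE "experts", each routed a different
# number of tokens. float32 is used here for portability; production
# kernels would run in bf16 or fp8 as noted above.
tokens_per_expert = [5, 2, 9]
hidden, ffn = 64, 256
a_list = [torch.randn(n, hidden) for n in tokens_per_expert]
b_list = [torch.randn(hidden, ffn) for _ in tokens_per_expert]
outputs = grouped_gemm_reference(a_list, b_list)
print([tuple(o.shape) for o in outputs])  # [(5, 256), (2, 256), (9, 256)]
```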
In terms of performance, HPC-Ops reports speedups of up to 2.22 times for Attention operations in bf16 and up to 1.88 times for Grouped GEMM in decode workloads. These gains matter most for autoregressive generation, where every output token re-runs the attention and GEMM kernels, so per-kernel latency compounds directly into user-perceived response time.
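The reported figures come from Tencent's own benchmarks; per-kernel speedups like these are conventionally measured with CUDA events, as in the generic harness below. This is standard PyTorch timing code, not HPC-Ops' benchmark suite, and it assumes a CUDA device is available.

```python
import torch

def time_kernel(fn, *args, warmup=10, iters=100):
    """Mean milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):  # warm-up hides one-time JIT/allocator costs
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters

# Usage sketch: time a baseline op and an optimized one on identical
# inputs, then take the ratio.
# baseline_ms = time_kernel(torch.nn.functional.scaled_dot_product_attention, q, k, v)
# speedup = baseline_ms / optimized_ms
```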
Looking ahead, the HPC-Ops roadmap includes sparse attention for long-context LLMs and additional quantization options, which should help developers manage compute and memory budgets while maintaining high performance in their applications.
With the open-sourcing of HPC-Ops, Tencent Hunyuan is not just sharing technology but also contributing to the broader AI community, providing tools that can help developers improve their models and applications. The project is available on GitHub, inviting collaboration and innovation from developers around the world.