Researchers from NVIDIA, Rutgers University, UC Berkeley, MIT, Nanjing University, and KAIST have unveiled a new AI architecture called STORM, designed to improve how AI systems understand videos. The model addresses a key shortcoming of existing video-based AI systems: their struggle to process videos as continuous sequences rather than isolated snapshots.
Current models often treat a video as a series of static images, which can cause details about motion and continuity to be lost. The problem worsens in longer videos, where computational demands grow and workarounds such as frame skipping further reduce accuracy. Redundant information across neighboring frames also wastes computation.
STORM changes the game by adding a dedicated temporal encoder between the image encoder and the large language model (LLM). Instead of processing each frame in isolation, STORM injects temporal information directly at the video-token level, which eliminates redundant computation and improves efficiency. The model uses a bidirectional scanning mechanism to enrich video representations, reducing the burden on the LLM to infer temporal relationships on its own.
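To make the wiring concrete, here is a minimal sketch of where such a temporal encoder could sit in the pipeline. All module names, shapes, and the tiny stand-in components are illustrative assumptions for this article, not the released STORM code.

```python
import torch
import torch.nn as nn


class StormStylePipeline(nn.Module):
    """Toy wiring of the three stages: per-frame image encoder -> temporal encoder -> LLM."""

    def __init__(self, image_encoder: nn.Module, temporal_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder        # pretrained per-frame encoder
        self.temporal_encoder = temporal_encoder  # mixes information across frames
        self.llm = llm                            # consumes the enriched video tokens

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        tokens = self.image_encoder(frames.flatten(0, 1))   # (b*t, patches, dim)
        tokens = tokens.unflatten(0, (b, t)).flatten(1, 2)   # (b, t*patches, dim)
        tokens = self.temporal_encoder(tokens)               # temporal info injected at the token level
        return self.llm(tokens)                              # the LLM no longer infers time on its own


# Tiny stand-ins so the sketch runs end to end.
dim, patches = 64, 16
image_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(patches * dim), nn.Unflatten(1, (patches, dim)))
temporal_encoder = nn.Identity()   # placeholder; a bidirectional scan is sketched below
llm = nn.Linear(dim, dim)          # placeholder for a real language model

model = StormStylePipeline(image_encoder, temporal_encoder, llm)
out = model(torch.randn(2, 8, 3, 32, 32))   # 2 clips, 8 frames each
print(out.shape)                             # torch.Size([2, 128, 64])
```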
The architecture employs Mamba layers for temporal modeling, which help the system capture dependencies across both spatial and temporal dimensions. The temporal encoder treats images and videos differently: it gathers global spatial context from images while also capturing the dynamics of video sequences, as illustrated in the sketch below.
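The sketch below shows only the bidirectional scanning pattern; the simple gated recurrence inside it is a stand-in for the actual Mamba selective state-space update, which is more involved. For images, such a scan would run over spatial patch tokens only, while for videos it would run over the flattened spatio-temporal token sequence.

```python
import torch
import torch.nn as nn


class BidirectionalScan(nn.Module):
    """Toy bidirectional scan: every token gets context from both earlier and later tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); carry a running state, one step per token
        state = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            g = torch.sigmoid(self.gate(x[:, t]))
            state = g * state + (1 - g) * torch.tanh(self.update(x[:, t]))
            outputs.append(state)
        return torch.stack(outputs, dim=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        forward_pass = self._scan(tokens)
        backward_pass = self._scan(tokens.flip(1)).flip(1)
        # Fuse both directions so each token sees past and future frames.
        return self.out(torch.cat([forward_pass, backward_pass], dim=-1))


scan = BidirectionalScan(dim=64)
video_tokens = torch.randn(2, 8 * 16, 64)   # 8 frames x 16 patches, flattened
print(scan(video_tokens).shape)             # torch.Size([2, 128, 64])
```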
STORM also uses token compression to improve computational efficiency, which allows the model to perform well even on a single GPU and makes it accessible for broader use. The researchers additionally introduced training-free token subsampling at test time, which further lightens the computational load without sacrificing essential temporal detail.
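The two token-reduction ideas can be sketched as simple tensor operations. This assumes compression is temporal average pooling and test-time subsampling simply keeps every k-th frame group; the exact schemes and factors used in the paper may differ.

```python
import torch


def temporal_average_pool(video_tokens: torch.Tensor, pool: int) -> torch.Tensor:
    """Average the tokens of every `pool` consecutive frames (applied during training and inference)."""
    b, t, p, d = video_tokens.shape                           # (batch, frames, patches, dim)
    video_tokens = video_tokens[:, : t - t % pool]            # drop the ragged tail, if any
    return video_tokens.unflatten(1, (-1, pool)).mean(dim=2)  # (batch, frames // pool, patches, dim)


def test_time_subsample(video_tokens: torch.Tensor, stride: int) -> torch.Tensor:
    """Training-free subsampling: keep every `stride`-th frame group after temporal mixing."""
    return video_tokens[:, ::stride]


tokens = torch.randn(1, 32, 196, 1024)           # 32 frames, 196 patch tokens each
pooled = temporal_average_pool(tokens, pool=4)   # 8 frame groups remain
sampled = test_time_subsample(pooled, stride=2)  # 4 frame groups reach the LLM
print(pooled.shape, sampled.shape)
```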
In experiments, STORM was built from pretrained components and trained in two stages: first aligning the image encoder with the LLM while training the temporal projector, then fine-tuning the full model on a large mixed image-and-video dataset. The results showed that STORM outperformed other models on long-video benchmarks, achieving state-of-the-art results while significantly reducing inference time.
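A hedged sketch of how such a two-stage recipe might be configured is shown below: stage 1 keeps the pretrained image encoder and LLM frozen and trains only the temporal projector for alignment, while stage 2 unfreezes everything for fine-tuning. The attribute names and learning rates are placeholders, not values reported by the authors.

```python
import torch


def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable


def configure_stage(model: torch.nn.Module, stage: int) -> torch.optim.Optimizer:
    if stage == 1:
        # Alignment: keep pretrained encoder and LLM fixed, learn the temporal projector.
        set_trainable(model.image_encoder, False)
        set_trainable(model.llm, False)
        set_trainable(model.temporal_encoder, True)
        lr = 1e-3   # placeholder learning rate
    else:
        # Fine-tuning on the mixed image/video dataset: train the model end to end.
        set_trainable(model.image_encoder, True)
        set_trainable(model.llm, True)
        set_trainable(model.temporal_encoder, True)
        lr = 2e-5   # placeholder learning rate
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)
```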
The Mamba module proved effective at compressing visual tokens while preserving essential information, yielding up to a 65.5% reduction in inference time. STORM was especially strong on tasks that require understanding a video's overall context.
This new model not only sets a high standard for video understanding but also offers a foundation for future research in token compression and multimodal alignment. Researchers believe that STORM could lead to improvements in the accuracy and efficiency of video-language models in real-world applications.