Salesforce’s AI research team has introduced a new framework called FOFPred, designed to improve motion prediction for robotics and video generation. The system pairs a large vision-language model with a diffusion transformer to forecast how objects will move over time, given images and natural-language instructions.
FOFPred takes one or more images along with a simple command, such as "move the bottle from right to left," and predicts four future optical flow frames showing how each pixel is expected to shift. Unlike traditional methods that require future images as input, FOFPred conditions only on current observations and text, making it a more efficient way to represent motion.
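To make the input-output contract concrete, here is a minimal sketch of what such a prediction call might look like. The class and method names (FOFPredModel, predict_flow) are illustrative stand-ins, not the released API.

```python
import numpy as np

# Hypothetical wrapper illustrating FOFPred's interface: current observation(s)
# plus a text instruction in, a short sequence of future optical flow frames out.
class FOFPredModel:
    def predict_flow(self, images: list[np.ndarray], instruction: str,
                     num_frames: int = 4) -> np.ndarray:
        """Return an array of shape (num_frames, H, W, 2) holding the per-pixel
        (dx, dy) displacement predicted for each future step."""
        h, w = images[0].shape[:2]
        return np.zeros((num_frames, h, w, 2), dtype=np.float32)  # stand-in for the diffusion sampling

# Example usage: one RGB observation and a natural-language command.
model = FOFPredModel()
frame = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder camera image
flows = model.predict_flow([frame], "move the bottle from right to left")
# flows[t, y, x] is the expected displacement of pixel (x, y) at future step t.
```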
The framework predicts future optical flow, which captures only the movement of pixels while ignoring static appearance. This simplifies the prediction target and is particularly useful for robot control policies and video generation tasks. Because optical flow can be rendered as RGB images, the model can integrate seamlessly with existing image diffusion components.
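The article does not spell out FOFPred's exact encoding, but a standard way to render flow as RGB is the color-wheel scheme used in flow visualization: direction becomes hue, speed becomes brightness. The sketch below shows that conversion with OpenCV; flow_to_rgb is an illustrative helper, not FOFPred's actual implementation.

```python
import numpy as np
import cv2  # OpenCV, used here for the HSV-to-RGB conversion

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Encode an (H, W, 2) optical flow field as an (H, W, 3) uint8 RGB image
    using the common color-wheel scheme: hue = direction, brightness = magnitude."""
    dx = flow[..., 0].astype(np.float32)
    dy = flow[..., 1].astype(np.float32)
    magnitude, angle = cv2.cartToPolar(dx, dy, angleInDegrees=True)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle / 2).astype(np.uint8)            # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255                                      # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255,   # brightness encodes speed
                                cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```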
The architecture of FOFPred consists of three main components: a frozen vision-language model (Qwen2.5-VL), the latent autoencoder from Flux.1, and a trainable diffusion transformer (DiT). The DiT generates future flow sequences conditioned on the visual and textual inputs. Keeping the VLM and autoencoder frozen lets the model leverage their pretrained knowledge and improves performance without requiring extensive retraining.
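Here is a heavily simplified sketch of how the three parts could be wired together. All class names, tensor shapes, and the single denoising step are illustrative stand-ins, not the released architecture.

```python
import torch
import torch.nn as nn

# Stubs standing in for the real components so the data flow is visible end to end.
class VLMEncoder(nn.Module):       # stands in for the frozen Qwen2.5-VL
    def forward(self, image, text_tokens):
        return torch.randn(image.shape[0], 77, 1024)     # placeholder conditioning tokens

class FlowDiT(nn.Module):          # stands in for the trainable diffusion transformer
    def forward(self, noisy_latents, timestep, cond):
        return torch.zeros_like(noisy_latents)           # placeholder denoising prediction

class FlowDecoder(nn.Module):      # stands in for the Flux.1 autoencoder's decoder
    def forward(self, latents):
        b, t = latents.shape[:2]
        return torch.zeros(b, t, 3, 256, 256)            # RGB-encoded flow frames

vlm, dit, decoder = VLMEncoder(), FlowDiT(), FlowDecoder()
for p in vlm.parameters():         # only the DiT is trained; the VLM and autoencoder stay frozen
    p.requires_grad_(False)

image = torch.zeros(1, 3, 256, 256)
text = torch.zeros(1, 32, dtype=torch.long)
cond = vlm(image, text)                                       # 1) encode image + instruction
latents = torch.randn(1, 4, 16, 32, 32)                       # 2) start from noise (4 future frames)
latents = latents - dit(latents, torch.tensor([999]), cond)   # 3) one illustrative denoising step
flow_rgb = decoder(latents)                                   # 4) decode latents to RGB-encoded flow
```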
FOFPred was trained on a large corpus of human-activity videos paired with captions, drawn from the Something-Something V2 and EgoDex datasets and totaling around 500,000 video-caption pairs. A notable detail of the training pipeline is the computation of relative optical flow, which factors out camera motion to produce cleaner training targets.
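The article does not detail how the relative flow is computed. One common approximation is to estimate the global, camera-induced component of the flow and subtract it, as in the sketch below; this is a simplification, and the actual pipeline may rely on homography or ego-motion estimates instead.

```python
import numpy as np

def relative_flow(raw_flow: np.ndarray) -> np.ndarray:
    """One simple way to remove camera motion: treat the per-frame median
    displacement as the camera-induced component and subtract it, leaving
    only object-relative motion. Input and output are (H, W, 2) flow fields."""
    camera_component = np.median(raw_flow.reshape(-1, 2), axis=0)  # global shift estimate
    return raw_flow - camera_component
```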
FOFPred has shown promising results in two main applications. First, it has been fine-tuned for robot control, achieving strong performance on benchmarks such as CALVIN ABCD and RoboTwin 2.0, where it outperformed existing methods and demonstrated that its flow predictions translate into effective language-conditioned control.
Second, FOFPred is also being used for motion-aware text-to-video generation. By feeding its predicted flow into the Go with the Flow video diffusion model, the pipeline can generate videos that accurately reflect the described motion, requiring only a single frame and a text prompt at inference.
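Concretely, the text-to-video path can be thought of as a two-stage pipeline, sketched below. Both function names are hypothetical stubs standing in for the actual FOFPred and Go with the Flow interfaces.

```python
import numpy as np

# Stage 1: FOFPred turns a single frame plus a prompt into future flow.
def predict_future_flow(frame: np.ndarray, prompt: str, num_frames: int = 4) -> np.ndarray:
    return np.zeros((num_frames, *frame.shape[:2], 2), dtype=np.float32)  # placeholder flow

# Stage 2: a flow-conditioned video diffusion model renders frames that follow that motion.
def generate_video_with_flow(frame: np.ndarray, prompt: str, flow: np.ndarray) -> np.ndarray:
    t = flow.shape[0]
    return np.zeros((t, *frame.shape), dtype=np.uint8)  # placeholder video frames

frame = np.zeros((256, 256, 3), dtype=np.uint8)
prompt = "the cup slides to the left"
flow = predict_future_flow(frame, prompt)               # motion plan from one frame + text
video = generate_video_with_flow(frame, prompt, flow)   # video that follows the predicted motion
```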
Overall, FOFPred represents a significant step forward in the field of AI-driven motion prediction, offering potential benefits for both robotics and video generation. The combination of language understanding and motion forecasting opens up new possibilities for creating more interactive and responsive AI systems.