---
library_name: diffusers
license: apache-2.0
pipeline_tag: image-to-video
tags:
  - optical-flow prediction
  - motion prediction
  - diffusion
---
FOFPred: Language-Driven Future Optical Flow Prediction
FOFPred is a diffusion-based model that predicts future optical flow from a single image, guided by a natural-language instruction. Given an input image and a text prompt describing a desired action (e.g., "Moving the water bottle from right to left"), FOFPred generates four sequential optical flow frames showing how objects would move to carry out that action.
Paper | Project Page | GitHub
Usage
import einops
import numpy as np
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the pipeline; trust_remote_code=True allows the custom pipeline code
# shipped in the model repository to run
pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Run inference on one input image with a fixed seed for reproducibility
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[Image.open("your_image.jpg")],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)

# Take the first sample in the batch and convert it to a NumPy array
flow_frames = results.images  # [B, F, C, H, W]
output_tensor = flow_frames[0]  # [F, C, H, W]
output_np = pipeline.image_processor.pt_to_numpy(output_tensor)  # [F, H, W, C]

# Tile the frames side by side and save them as a single image
reshaped = einops.rearrange(output_np, "f h w c -> h (f w) c")
img = Image.fromarray((reshaped * 255).astype(np.uint8))
img.save("output_combined.png")
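
The tiling step at the end of the snippet above can also be written with plain NumPy. The following is a minimal sketch, not part of the FOFPred code: `frames_to_uint8` and `tile_horizontally` are hypothetical helper names, and a random array stands in for `output_np`. It assumes frame values lie in [0, 1] and, unlike the snippet above, clips and rounds before converting to `uint8`.

```python
import numpy as np


def frames_to_uint8(frames):
    """Convert float frames of shape [F, H, W, C] to uint8.

    Assumes values are meant to lie in [0, 1]; anything outside is clipped.
    """
    return (np.clip(frames, 0.0, 1.0) * 255).round().astype(np.uint8)


def tile_horizontally(frames_u8):
    """Lay F frames side by side: [F, H, W, C] -> [H, F * W, C]."""
    f, h, w, c = frames_u8.shape
    # Move the frame axis next to the width axis, then merge the two.
    return frames_u8.transpose(1, 0, 2, 3).reshape(h, f * w, c)


# Stand-in for `output_np` (hypothetical data, not real model output).
dummy = np.random.default_rng(0).random((4, 8, 8, 3)).astype(np.float32)
frames_u8 = frames_to_uint8(dummy)
strip = tile_horizontally(frames_u8)
```

Individual frames remain available as `frames_u8[i]`, which is convenient when each frame should be saved to its own file rather than to one combined strip.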
Architecture
| Component | Model | Description |
|---|---|---|
| V-LLM | Qwen2.5-VL-3B-Instruct | Multimodal understanding of images and text |
| DiT | OmniGen2Transformer3DModel | Modification of OmniGen2Transformer to generate frame sequences |
| VAE | FLUX.1-dev AutoencoderKL | Variational autoencoder that maps between pixel space and latent space |
| Scheduler | FlowMatchEulerDiscreteScheduler | Efficient flow-matching sampler |
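
To illustrate what a flow-matching Euler sampler does in general, here is a toy sketch in NumPy. It is not the `FlowMatchEulerDiscreteScheduler` implementation and does not involve the FOFPred model: the function name, the uniform sigma schedule, and the oracle velocity are all assumptions made for illustration. The sampler starts from noise at sigma = 1 and takes Euler steps along a velocity field until it reaches sigma = 0.

```python
import numpy as np


def euler_flow_matching_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dsigma = v(x, sigma) from sigma = 1 (noise) to sigma = 0."""
    # Uniform schedule; real schedulers may space the sigmas differently.
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    x = noise.copy()
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Euler update: step size (negative here) times the velocity.
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x


# Toy data: on the straight path x_sigma = sigma * noise + (1 - sigma) * data,
# the true velocity is the constant (noise - data).
rng = np.random.default_rng(0)
data = rng.normal(size=(4, 2))
noise = rng.normal(size=(4, 2))


def oracle_velocity(x, sigma):
    # Stands in for a trained network; a real model only approximates this.
    return noise - data


sample_multi = euler_flow_matching_sample(oracle_velocity, noise, num_steps=4)
sample_single = euler_flow_matching_sample(oracle_velocity, noise, num_steps=1)
```

In this toy case the path from noise to data is a straight line, so the result matches `data` for any number of steps. A learned velocity field is only approximately straight, so the sketch says nothing definite about how many steps a real model needs.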
Citation
@article{ranasinghe2025future,
title={Future Optical Flow Prediction Improves Robot Control & Video Generation},
author={Ranasinghe, Kanchana and Zhou, Honglu and Fang, Yu and Yang, Luyu and Xue, Le and Xu, Ran and Xiong, Caiming and Savarese, Silvio and Ryoo, Michael S and Niebles, Juan Carlos},
journal={arXiv preprint arXiv:2601.10781},
year={2025}
}
Acknowledgements
License
The code and weights in this repository are released under the Apache License 2.0. (Note: Some documentation may refer to CC BY-NC 4.0).