---
license: apache-2.0
library_name: diffusers
pipeline_tag: image-to-image
tags:
- optical-flow prediction
- motion prediction
- diffusion
---

# FOFPred: Language-Driven Future Optical Flow Prediction

**FOFPred** is a diffusion-based model that predicts future optical flow from a single image, guided by natural language instructions. Given an input image and a text prompt describing a desired action (e.g., *"Moving the water bottle from right to left"*), FOFPred generates four sequential optical flow frames showing how objects in the scene would move.

## Usage

```python
import einops
import numpy as np
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the pipeline (trust_remote_code is required for the custom pipeline class)
pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Run inference
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[Image.open("your_image.jpg")],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)

flow_frames = results.images    # [B, F, C, H, W]
output_tensor = flow_frames[0]  # [F, C, H, W]
output_np = pipeline.image_processor.pt_to_numpy(output_tensor)  # [F, H, W, C]

# Tile the F predicted frames horizontally into a single strip and save
reshaped = einops.rearrange(output_np, "f h w c -> h (f w) c")
img = Image.fromarray((reshaped * 255).astype(np.uint8))
img.save("output_combined.png")
```

## Architecture

| Component | Model |
|-----------|-------|
| **V-LLM** | Qwen2.5-VL-3B-Instruct |
| **DiT** | OmniGen2Transformer3DModel |
| **VAE** | FLUX.1-dev AutoencoderKL |
| **Scheduler** | FlowMatchEulerDiscreteScheduler |

## Acknowledgements

- [OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)
- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL)
- [Flux VAE](https://huggingface.co/black-forest-labs/FLUX.1-dev)

## License

Our code and weights are released under the [CC BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/deed.en).
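If you want to inspect the four predicted flow frames individually rather than as one horizontal strip, they can also be saved frame-by-frame or as an animated GIF. A minimal sketch, using a random array as a stand-in for the `output_np` produced by `pipeline.image_processor.pt_to_numpy` in the usage example (shape `[F, H, W, C]`, values in `[0, 1]`):

```python
import numpy as np
from PIL import Image

# Stand-in for the pipeline output; in practice use
# pipeline.image_processor.pt_to_numpy(results.images[0]).
output_np = np.random.rand(4, 256, 256, 3).astype(np.float32)

# Convert each [H, W, C] float frame to an 8-bit PIL image.
frames = [Image.fromarray((frame * 255).astype(np.uint8)) for frame in output_np]

# Save individual frames...
for i, frame in enumerate(frames):
    frame.save(f"flow_frame_{i}.png")

# ...and an animated GIF cycling through the predicted motion.
frames[0].save(
    "flow_frames.gif",
    save_all=True,
    append_images=frames[1:],
    duration=250,  # milliseconds per frame
    loop=0,        # loop forever
)
```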