---
library_name: diffusers
license: apache-2.0
pipeline_tag: image-to-video
tags:
- optical-flow prediction
- motion prediction
- diffusion
---

# FOFPred: Language-Driven Future Optical Flow Prediction

**FOFPred** is a diffusion-based model that predicts future optical flow from a single image, guided by a natural language instruction. Given an input image and a text prompt describing the desired action (e.g., *"Moving the water bottle from right to left"*), FOFPred generates four sequential optical flow frames showing how objects would move to carry out that action.

[Paper](https://huggingface.co/papers/2601.10781) | [Project Page](https://fofpred.github.io) | [GitHub](https://github.com/SalesforceAIResearch/FOFPred)

## Usage

```python
import einops
import numpy as np
import torch
from diffusers import DiffusionPipeline
from PIL import Image

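# Load the pipeline. trust_remote_code=True lets diffusers run the custom
# pipeline code shipped in the model repository, so review it before enabling.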
pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Run inference
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[Image.open("your_image.jpg")],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)

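# One entry per generated sample; each holds frame_count flow frames.
flow_frames = results.images  # [B, F, C, H, W]

# Take the first sample and convert it to a channels-last NumPy array.
output_tensor = flow_frames[0]  # [F, C, H, W]
output_np = pipeline.image_processor.pt_to_numpy(output_tensor)  # [F, H, W, C]

# Place the frames side by side in a single image and save it.
reshaped = einops.rearrange(output_np, "f h w c -> h (f w) c")
img = Image.fromarray((reshaped * 255).astype(np.uint8))
img.save("output_combined.png")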
```
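
The last lines of the example tile the frames into one horizontal strip. The sketch below shows the same layout using only NumPy, which may help if `einops` is not installed. The helper names `frames_to_uint8` and `tile_frames_horizontally` are hypothetical and not part of the FOFPred code, and the sketch assumes the frames are floats in `[0, 1]`, as the `* 255` scaling above suggests.

```python
import numpy as np


def frames_to_uint8(frames: np.ndarray) -> np.ndarray:
    """Convert float frames (assumed to lie in [0, 1]) to uint8.

    Values outside [0, 1] are clipped first so the cast cannot wrap around.
    """
    return (np.clip(frames, 0.0, 1.0) * 255.0).round().astype(np.uint8)


def tile_frames_horizontally(frames: np.ndarray) -> np.ndarray:
    """Lay out [F, H, W, C] frames side by side as one [H, F*W, C] image.

    Equivalent to einops.rearrange(frames, "f h w c -> h (f w) c").
    """
    f, h, w, c = frames.shape
    # Move the frame axis next to the width axis, then merge the two.
    return frames.transpose(1, 0, 2, 3).reshape(h, f * w, c)


# Stand-in data: 4 random frames of size 8x8 with 3 channels.
rng = np.random.default_rng(0)
dummy_frames = rng.random((4, 8, 8, 3), dtype=np.float32)

strip = frames_to_uint8(tile_frames_horizontally(dummy_frames))
print(strip.shape, strip.dtype)
```

With the real pipeline output, `output_np` would take the place of `dummy_frames`, and the resulting array can be passed to `Image.fromarray` as before.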

## Architecture

| Component | Model | Description |
|-----------|-------|-------------|
| **V-LLM** | Qwen2.5-VL-3B-Instruct | Multimodal understanding of images and text |
| **DiT** | OmniGen2Transformer3DModel | Modification of OmniGen2Transformer to generate frame sequences |
| **VAE** | FLUX.1-dev AutoencoderKL | Image autoencoder that maps between pixel space and latent space |
| **Scheduler** | FlowMatchEulerDiscreteScheduler | Efficient flow-matching sampler |
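
To illustrate what a flow-matching Euler sampler does, here is a minimal sketch of the general update rule `x <- x + (sigma_next - sigma) * v(x, sigma)`. It is not the diffusers implementation and does not use the FOFPred model: `euler_flow_sample` is a hypothetical helper, and the velocity field is a toy with a constant value, chosen so the result can be checked by hand.

```python
import numpy as np


def euler_flow_sample(velocity_fn, x_start, sigmas):
    """Integrate dx/dsigma = velocity_fn(x, sigma) with explicit Euler steps.

    sigmas is a decreasing sequence, typically from 1.0 (noise) to 0.0 (data).
    """
    x = np.asarray(x_start, dtype=np.float64)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # The step size is negative because sigma decreases toward the data.
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x


# Toy straight-line path: x_sigma = (1 - sigma) * data + sigma * noise,
# whose velocity with respect to sigma is the constant (noise - data).
data = np.array([1.0, -2.0])
noise = np.array([0.3, 0.7])


def toy_velocity(x, sigma):
    return noise - data


one_step = euler_flow_sample(toy_velocity, noise, np.array([1.0, 0.0]))
four_steps = euler_flow_sample(toy_velocity, noise, np.linspace(1.0, 0.0, 5))
print(np.allclose(one_step, data), np.allclose(four_steps, data))
```

In this toy case the velocity is constant, so a single Euler step already lands on the data point. A learned velocity field is generally not constant, so the number of steps needed in practice depends on the model.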

## Citation

```bibtex
@article{ranasinghe2025future,
  title={Future Optical Flow Prediction Improves Robot Control \& Video Generation},
  author={Ranasinghe, Kanchana and Zhou, Honglu and Fang, Yu and Yang, Luyu and Xue, Le and Xu, Ran and Xiong, Caiming and Savarese, Silvio and Ryoo, Michael S and Niebles, Juan Carlos},
  journal={arXiv preprint arXiv:2601.10781},
  year={2025}
}
```

## Acknowledgements

- [OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)
- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL)
- [Flux VAE](https://huggingface.co/black-forest-labs/FLUX.1-dev)

## License

The code and weights in this repository are released under the [Apache License 2.0](https://github.com/SalesforceAIResearch/FOFPred/blob/main/LICENSE.txt). Note that some documentation may refer to CC BY-NC 4.0.