Add comprehensive model card for PlanGen with Diffusers integration
This PR adds a comprehensive model card for PlanGen: Image Generation as a Visual Planner for Robotic Manipulation.
It includes:
- A link to the paper: [Image Generation as a Visual Planner for Robotic Manipulation](https://huggingface.co/papers/2512.00532).
- A link to the GitHub repository: https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation.
- **Metadata**: `license: apache-2.0`, `pipeline_tag: image-to-video`, and `library_name: diffusers` to improve discoverability and enable the automated usage widget.
- A concise summary of the model's purpose and methodology.
- The main teaser image and a results image from the GitHub repository.
- A "Quick Start" section with environment setup, requirements installation, and a sample usage code snippet from the official GitHub repository, demonstrating integration with `diffusers`.
- Details on available weights and citation information.
Please review and merge this PR if it looks good!
@@ -0,0 +1,105 @@
---
license: apache-2.0
pipeline_tag: image-to-video
library_name: diffusers
---

# PlanGen: Image Generation as a Visual Planner for Robotic Manipulation

This repository contains the official model for the paper:
**[Image Generation as a Visual Planner for Robotic Manipulation](https://huggingface.co/papers/2512.00532)**

PlanGen explores whether pretrained image generation models, when lightly adapted using LoRA finetuning, can serve as visual planners for robotic manipulation. The framework includes text-conditioned generation and trajectory-conditioned generation, demonstrating the ability to produce smooth, coherent robot videos aligned with respective conditions. This work indicates that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision.

For more details, please refer to the [official GitHub repository](https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation).

<div align="center">
<img src='https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation/raw/main/assets/Teaser.png' width='100%' />
</div>

## Quick Start

### Configuration
#### 1. **Environment setup**
```bash
git clone https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation.git
cd Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation

conda create -n PlanGen python=3.11.10
conda activate PlanGen
```
#### 2. **Requirements installation**
```bash
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade -r requirements.txt
```

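After installation, it can help to confirm that the pinned PyTorch packages are the ones actually installed. The following is a minimal sketch and not part of the official repository: the helper name `check_pins` and the `PINNED` table are illustrative, and only the version numbers come from the install command above.

```python
from importlib import metadata

# Versions pinned by the pip command in the requirements step.
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_pins(pins=PINNED, get_version=metadata.version):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        # CUDA wheels carry a local suffix such as "+cu124"; compare the public part only.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (found, matches)
    return report


if __name__ == "__main__":
    for name, (found, matches) in check_pins().items():
        status = "ok" if matches else "MISMATCH"
        print(f"{name}: installed={found} pinned={PINNED[name]} [{status}]")
```

Running the script inside the `PlanGen` environment prints one line per package. The `get_version` parameter exists only so the check can be exercised without the packages installed.
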
### Inference
We provide a diffusers pipeline integration for our model and have uploaded the model weights to Hugging Face. The example below shows how to use the model:

```python
# Custom FLUX pipeline shipped in the cloned repository (run from the repo root).
from src.pipeline_pe_clone import FluxPipeline
import torch
from PIL import Image

pretrained_model_name_or_path = "black-forest-labs/FLUX.1-dev"
pipeline = FluxPipeline.from_pretrained(
    pretrained_model_name_or_path,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load the base LoRA, fuse it into the model weights, then drop the adapter.
# NOTE: the Weights table links to `pretrained.safetensors`; check the
# repository file list if this file name fails to load.
pipeline.load_lora_weights("yio-ye2004/lora_collection", weight_name="pretrain.safetensors")
pipeline.fuse_lora()
pipeline.unload_lora_weights()

# Load a task-specific LoRA on top of the fused base.
pipeline.load_lora_weights("yio-ye2004/lora_collection", weight_name="bridge_clean_pytorch_lora_weights.safetensors")

height = 768
width = 512

validation_image = "assets/1.png"
validation_prompt = "add a halo and wings for the cat by sksmagiceffects"
# PIL's resize() expects (width, height), so the condition image matches the
# height/width passed to the pipeline below.
condition_image = Image.open(validation_image).resize((width, height)).convert("RGB")

result = pipeline(
    prompt=validation_prompt,
    condition_image=condition_image,
    height=height,
    width=width,
    guidance_scale=3.5,
    num_inference_steps=20,
    max_sequence_length=512,
).images[0]

result.save("output.png")
```

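The load, fuse, unload sequence in the snippet above relies on a general property of LoRA: the low-rank update can be folded into the base weight, after which the adapter itself is no longer needed and a second LoRA can be loaded on top. The toy sketch below illustrates that arithmetic with plain Python lists. It is not the diffusers implementation; the function names, matrix shapes, and the `scale` parameter are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(weight, x):
    """Compute weight @ x for a vector x given as a flat list."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a), the fused weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -0.5]]
x = [2.0, 4.0]

# Unfused: base output plus the adapter's low-rank contribution.
unfused = [base + extra
           for base, extra in zip(matvec(W, x), matvec(matmul(B, A), x))]

# Fused: the adapter is folded into W, so B and A can be discarded afterwards.
W_fused = fuse_lora(W, B, A)
fused = matvec(W_fused, x)

print(unfused, fused)  # both paths give the same output
```

Because the fused weight reproduces the adapter's effect, unloading the base LoRA after fusing leaves its contribution in place while freeing the adapter slot for a task-specific LoRA.
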
## Weights
You can download the trained PlanGen checkpoints for inference. The available models are listed below; each checkpoint name also serves as its trigger word.

You need to load and fuse the `pretrained` checkpoint before loading any of the other models.

| **Model** | **Description** | **Resolution** |
| :---: | :---: | :---: |
| [pretrained](https://huggingface.co/yio-ye2004/lora_collection/blob/main/pretrained.safetensors) | Base LoRA for PlanGen | |
| [bridge_clean](https://huggingface.co/yio-ye2004/lora_collection/blob/main/bridge_clean_pytorch_lora_weights.safetensors) | PlanGen LoRA trained on `bridge_clean` | |
| [bridge_traj](https://huggingface.co/yio-ye2004/lora_collection/blob/main/bridge_traj_pytorch_lora_weights.safetensors) | PlanGen LoRA trained on `bridge_traj` | |
| [jocoplay_clean](https://huggingface.co/yio-ye2004/lora_collection/blob/main/jocoplay_clean_pytorch_lora_weights.safetensors) | PlanGen LoRA trained on `jocoplay_clean` | |
| [jocoplay_traj](https://huggingface.co/yio-ye2004/lora_collection/blob/main/jocoplay_traj_pytorch_lora_weights.safetensors) | PlanGen LoRA trained on `jocoplay_traj` | |
| [rt1_clean](https://huggingface.co/yio-ye2004/lora_collection/blob/main/rt1_clean_pytorch_lora_weights.safetensors) | PlanGen LoRA trained on `rt1_clean` | |
| [rt1_traj](https://huggingface.co/yio-ye2004/lora_collection/blob/main/rt1_traj_pytorch_lora_weights.safetensors) | PlanGen LoRA trained on `rt1_traj` | |

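Every task-specific entry in the table follows the same file naming pattern, `<name>_pytorch_lora_weights.safetensors`. A small helper, sketched here as a convenience and not taken from the repository, can map a checkpoint name to the `weight_name` argument used in the inference example. The function name and the `TASK_CHECKPOINTS` tuple are illustrative; the base checkpoint is left out because its file does not follow this pattern.

```python
# Task-specific checkpoint names as listed in the Weights table.
TASK_CHECKPOINTS = (
    "bridge_clean", "bridge_traj",
    "jocoplay_clean", "jocoplay_traj",
    "rt1_clean", "rt1_traj",
)


def lora_weight_name(checkpoint):
    """Map a task checkpoint name to its .safetensors file name."""
    if checkpoint not in TASK_CHECKPOINTS:
        raise ValueError(
            f"unknown checkpoint {checkpoint!r}; expected one of {TASK_CHECKPOINTS}"
        )
    return f"{checkpoint}_pytorch_lora_weights.safetensors"


print(lora_weight_name("bridge_traj"))
```

The result can be passed directly as `weight_name=lora_weight_name("bridge_traj")` when calling `pipeline.load_lora_weights("yio-ye2004/lora_collection", ...)`.
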
## Results
<div align="center">
<img src='https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation/raw/main/assets/Visual_Comparisons.png'/>
</div>

## Citation
If you find our work useful, please cite the paper:

```bibtex
@article{ye2025image,
  title={Image Generation as a Visual Planner for Robotic Manipulation},
  author={Ye, Pang},
  journal={arXiv preprint arXiv:2512.00532},
  year={2025}
}
```