---
license: other
library_name: diffusers
tags:
- motion-transfer
- comfyui
- video-generation
- image-to-video
- video-edit
pipeline_tag: video-to-video
base_model:
- alibaba-pai/Wan2.2-Fun-5B-Control
---
<div align="center">
# FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control
<a href="https://arxiv.org/abs/2602.13185"><img src="https://img.shields.io/badge/arXiv-2602.13185-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/IGL-HKUST/FlexAM"><img src="https://img.shields.io/badge/GitHub-Repository-181717.svg?logo=github&logoColor=white" alt="GitHub"></a>
<a href="assets/flexam_workflow.json"><img src="https://img.shields.io/badge/ComfyUI-Download_Workflow-4fd63d" alt="ComfyUI"></a>
<br>
<br>
Mingzhi Sheng<sup>1*</sup>, Zekai Gu<sup>2*</sup>, Peng Li<sup>2</sup>, Cheng Lin<sup>3</sup>, Hao-Xiang Guo<sup>4</sup>, Ying-Cong Chen<sup>1,2β </sup>, Yuan Liu<sup>2β </sup>
<br>
<sup>1</sup>HKUST(GZ), <sup>2</sup>HKUST, <sup>3</sup>MUST, <sup>4</sup>Tsinghua University
<br>
<small><sup>*</sup>Equal Contribution, <sup>β </sup>Corresponding Authors</small>
</div>
<br>

## π° News
- **[2026.02.14]** π The paper is available on arXiv.
- **[2026.02.13]** π We have released the inference code and **ComfyUI** support!
## π οΈ Installation
> π’ **System Requirements**: Both the official Python inference code and the ComfyUI workflow were tested on **Ubuntu 20.04** with **Python 3.10**, **PyTorch 2.5.1**, and **CUDA 12.1** on an **NVIDIA A800** GPU.
Before running any inference (Python or ComfyUI), please set up the environment and download the checkpoints.
### 1. Create environment
Clone the repository and create conda environment:
```
git clone https://github.com/IGL-HKUST/FlexAM
conda create -n flexam python=3.10
conda activate flexam
```
Install PyTorch; we recommend `PyTorch 2.5.1` with `CUDA 12.1`:
```
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
```
Then install the remaining dependencies:
```
pip install -r requirements.txt
```
### 2. Download Submodules
We rely on several external modules (MoGe, Pi3, etc.).
```
mkdir -p submodules
git submodule update --init --recursive
pip install -r requirements.txt
```
<details>
<summary><em>(Optional) Manual clone if submodule update fails</em></summary>

```
# DELTA
git clone https://github.com/snap-research/DELTA_densetrack3d.git submodules/DELTA_densetrack3d
# Pi3
git clone https://github.com/yyfz/Pi3.git submodules/Pi3
# MoGe
git clone https://github.com/microsoft/MoGe.git submodules/MoGe
# VGGT
git clone https://github.com/facebookresearch/vggt.git submodules/vggt
```

</details>
### 3. Download checkpoints
Download the FlexAM checkpoint and place it in the `checkpoints/` directory.
- HuggingFace Link: [Wan2.2-Fun-5B-FLEXAM](https://huggingface.co/SandwichZ/Wan2.2-Fun-5B-FLEXAM)
## π Inference
We provide two ways to use FlexAM: Python Script and ComfyUI.
### Option A: ComfyUI Integration
We provide a native node for seamless integration into ComfyUI workflows.
> β οΈ **Note**: Currently, the ComfyUI node supports **Motion Transfer**, **Foreground Edit**, and **Background Edit**. For *Camera Control* and *Object Manipulation*, please use the Python script.
#### 1. Install Node
Since the node is not yet available in ComfyUI Manager, please install it manually:
```
cd ComfyUI/custom_nodes/
git clone https://github.com/IGL-HKUST/FlexAM
cd FlexAM
pip install -r requirements.txt
```
#### 2. Run Workflow
- Step 1: Download the workflow JSON: [workflow.json](assets/flexam_workflow.json)
- Step 2: Drag and drop it into ComfyUI.
- Step 3: Ensure checkpoints are in `ComfyUI/models/checkpoints`.
### Option B: Python Script
We provide an inference script for all tasks. Please refer to `run_demo.sh` for how to run the `demo.py` script.
Alternatively, you can run the tasks one by one as follows.
#### 1. Motion Transfer

---
```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--output_dir <output_dir> \ # output directory
--input_path <input_path> \ # the reference video path
--repaint <True/repaint_path> \ # a repainted first-frame image for the source video, or True to repaint the first frame with FLUX
--video_length=97 \
--sample_size 512 896 \
--generate_type='full_edit' \
--density 10 \ # controls the sparsity of tracking points
--gpu <gpu_id> # the gpu id
```
#### 2. Foreground Edit

```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--output_dir <output_dir> \ # output directory
--input_path <input_path> \ # the reference video path
--repaint <True/repaint_path> \ # a repainted first-frame image for the source video, or True to repaint the first frame with FLUX
--mask_path <mask_path> \ # white (255) marks the foreground to be edited; black (0) remains unchanged
--video_length=97 \
--sample_size 512 896 \
--generate_type='foreground_edit' \
--dilation_pixels=30 \ # dilation pixels for mask processing in foreground_edit mode
--density 10 \ # controls the sparsity of tracking points
--gpu <gpu_id> # the gpu id
```
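To illustrate the mask convention (white = region to edit, black = keep unchanged), here is a minimal sketch that builds such a mask with NumPy and Pillow. The frame size mirrors `--sample_size 512 896`; the rectangular region and output filename are made up for the example.

```python
import numpy as np
from PIL import Image

# Frame size matching --sample_size 512 896 (height, width)
height, width = 512, 896

# Start with an all-black mask: 0 = keep unchanged
mask = np.zeros((height, width), dtype=np.uint8)

# Paint a hypothetical rectangular region white: 255 = foreground to edit
mask[100:300, 200:500] = 255

# Save the mask; the resulting file is what --mask_path expects
Image.fromarray(mask).save("foreground_mask.png")
```

For `background_edit` the same file works with the semantics inverted: the white region is kept and everything else is edited.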
#### 3. Background Edit

```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--output_dir <output_dir> \ # output directory
--input_path <input_path> \ # the reference video path
--repaint <True/repaint_path> \ # a repainted first-frame image for the source video, or True to repaint the first frame with FLUX
--mask_path <mask_path> \ # white (255) marks the foreground to keep unchanged; the background is the area to be edited
--video_length=97 \
--sample_size 512 896 \
--generate_type='background_edit' \
--density 10 \ # controls the sparsity of tracking points
--gpu <gpu_id> # the gpu id
```
#### 4. Camera Control

We provide three camera control methods: (1) use predefined templates; (2) use a pose text file; (3) input another video, from which Pi3 automatically estimates the camera pose and applies it to the video to be generated.
##### 1. Use predefined templates
We provide several template camera motion types; you can choose one of them. In practice, we find that describing the camera motion in the prompt gives better results.
```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--output_dir <output_dir> \ # output directory
--input_path <input_path> \ # the reference image or video path
--camera_motion <camera_motion> \ # the camera motion type, see examples below
--tracking_method <tracking_method> \ # the tracking method (moge, DELTA); for image input, 'moge' is necessary
--override_extrinsics <override/append> \ # how to apply camera motion: "override" replaces the original camera, "append" builds upon it
--video_length=97 \
--sample_size 512 896 \
--density 5 \ # controls the sparsity of tracking points
--gpu <gpu_id> # the gpu id
```
Here are some tips for camera motion:
- `trans`: translation motion; the camera moves along the vector (dx, dy, dz), with each component in [-1, 1]
  - Positive X: move left; Negative X: move right
  - Positive Y: move up; Negative Y: move down
  - Positive Z: zoom out; Negative Z: zoom in
  - e.g., 'trans -0.1 -0.1 -0.1' moves right, down, and zooms in
  - e.g., 'trans -0.1 0.0 0.0 5 45' moves right 0.1 from frame 5 to 45
- `rot`: rotation motion; the camera rotates around the given axis by the given angle
  - X-axis rotation: positive X: pitch down; negative X: pitch up
  - Y-axis rotation: positive Y: yaw left; negative Y: yaw right
  - Z-axis rotation: positive Z: roll counter-clockwise; negative Z: roll clockwise
  - e.g., 'rot y 25' rotates 25 degrees around the y-axis (yaw left)
  - e.g., 'rot x -30 10 40' rotates -30 degrees around the x-axis (pitch up) from frame 10 to 40
- `spiral`: spiral motion; the camera moves along a spiral path with the given radius
  - e.g., 'spiral 2' spiral motion with radius 2
  - e.g., 'spiral 2 15 35' spiral motion with radius 2 from frame 15 to 35

Multiple transformations can be combined using a semicolon (;) as the separator:
- e.g., "trans 0 0 -0.5 0 30; rot x -25 0 30; trans -0.1 0 0 30 48"

This will:
1. Zoom in (z = -0.5) from frame 0 to 30
2. Pitch up (rotate -25 degrees around the x-axis) from frame 0 to 30
3. Move right (x = -0.1) from frame 30 to 48

Notes:
- If start_frame and end_frame are not specified, the motion is applied across all frames (0-48)
- Frames after end_frame maintain the final transformation
- Combined transformations are applied in sequence
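The camera-motion string format above can be sketched as a small parser. The helper below is purely illustrative (it is not part of `demo.py`, whose actual parser may differ); it assumes each clause is `name args... [start_frame end_frame]`, with clauses separated by semicolons.

```python
def parse_camera_motion(spec: str):
    """Parse a camera-motion string such as
    'trans 0 0 -0.5 0 30; rot x -25 0 30' into structured clauses.
    Illustrative only; not the actual FlexAM implementation."""
    clauses = []
    for part in spec.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        name = tokens[0]
        if name == "trans":
            # trans dx dy dz [start end]
            dx, dy, dz = map(float, tokens[1:4])
            frames = list(map(int, tokens[4:6])) or None
            clauses.append({"type": "trans", "vec": (dx, dy, dz), "frames": frames})
        elif name == "rot":
            # rot axis angle [start end]
            axis, angle = tokens[1], float(tokens[2])
            frames = list(map(int, tokens[3:5])) or None
            clauses.append({"type": "rot", "axis": axis, "angle": angle, "frames": frames})
        elif name == "spiral":
            # spiral radius [start end]
            radius = float(tokens[1])
            frames = list(map(int, tokens[2:4])) or None
            clauses.append({"type": "spiral", "radius": radius, "frames": frames})
    return clauses

# The combined example from the tips above yields three clauses
ops = parse_camera_motion("trans 0 0 -0.5 0 30; rot x -25 0 30; trans -0.1 0 0 30 48")
```

A clause with `frames=None` would then be applied across the whole clip, matching the default described in the notes.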
##### 2. Use a pose text file (pose txt)
```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--output_dir <output_dir> \ # output directory
--input_path <input_path> \ # the reference image or video path
--camera_motion "path" \ # if the camera motion type is "path", --pose_file is required
--pose_file <pose_file_txt> \ # txt file of camera poses; each line corresponds to one frame
--tracking_method <tracking_method> \ # the tracking method (moge, DELTA); for image input, 'moge' is necessary
--override_extrinsics <override/append> \ # how to apply camera motion: "override" replaces the original camera, "append" builds upon it
--video_length=97 \
--sample_size 512 896 \
--density 5 \ # controls the sparsity of tracking points
--gpu <gpu_id> # the gpu id
```
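A per-frame pose file can be read with a few lines of NumPy. Note that the exact line format is not specified here; the sketch below assumes, purely for illustration, that each line holds 12 floats forming a flattened 3x4 [R|t] extrinsic matrix. Check the FlexAM repository for the actual convention before relying on this.

```python
import numpy as np

def load_pose_file(path):
    """Read one camera pose per line. Assumes (hypothetically) that each
    line is 12 floats: a flattened 3x4 [R|t] extrinsic matrix."""
    poses = []
    with open(path) as f:
        for line in f:
            values = [float(v) for v in line.split()]
            poses.append(np.array(values).reshape(3, 4))
    return poses

# Write a two-frame example file (identity rotation, zero translation)
identity_line = "1 0 0 0 0 1 0 0 0 0 1 0"
with open("example_poses.txt", "w") as f:
    f.write(identity_line + "\n" + identity_line + "\n")

poses = load_pose_file("example_poses.txt")
```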
##### 3. Input another video to extract the camera pose
```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--output_dir <output_dir> \ # output directory
--input_path <input_path> \ # the reference image or video path
--camera_motion "path" \ # if the camera motion type is "path", --pose_file is required
--pose_file <pose_file_mp4> \ # Pi3 automatically estimates the camera pose from this video file
--tracking_method <tracking_method> \ # the tracking method (moge, DELTA); for image input, 'moge' is necessary
--override_extrinsics <override/append> \ # how to apply camera motion: "override" replaces the original camera, "append" builds upon it
--video_length=97 \
--sample_size 512 896 \
--density 5 \ # controls the sparsity of tracking points
--gpu <gpu_id> # the gpu id
```
#### 5. Object Manipulation

We provide several template object manipulation types; you can choose one of them. In practice, we find that describing the object motion in the prompt gives better results.
```bash
python demo.py \
--prompt <"prompt text"> \ # prompt text
--checkpoint_path <model_path> \ # FlexAM checkpoint path (e.g., checkpoints/Diffusion_Transformer/Wan2.2-Fun-5B-FLEXAM)
--input_path <input_path> \ # the reference image path
--object_motion <object_motion> \ # the object motion type (up, down, left, right)
--object_mask <object_mask_path> \ # the object mask path
--tracking_method <tracking_method> \ # the tracking method (moge, DELTA); for image input, 'moge' is necessary
--sample_size 512 896 \
--video_length=49 \
--density 30 \
--gpu <gpu_id> # the gpu id
```
Note that depending on the tracker you choose, you may need to adjust the translation scale.
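To make the scale caveat concrete: a tracker that reports normalized coordinates (in [0, 1]) needs a much smaller per-frame step than one that reports pixel coordinates. The helper and the specific step values below are hypothetical, illustrating only why the magnitude must match the tracker's coordinate convention.

```python
def object_offsets(direction: str, num_frames: int, step: float):
    """Build per-frame (dx, dy) offsets for a template object motion.
    Hypothetical helper: directions follow image convention (y grows downward)."""
    deltas = {"up": (0.0, -step), "down": (0.0, step),
              "left": (-step, 0.0), "right": (step, 0.0)}
    dx, dy = deltas[direction]
    # Accumulate the offset linearly over the clip
    return [(dx * t, dy * t) for t in range(num_frames)]

# A normalized-coordinate tracker might use a tiny step like 0.005,
# while a pixel-coordinate tracker needs a step on the order of pixels (e.g., 2.0)
normalized = object_offsets("up", 49, 0.005)
pixels = object_offsets("up", 49, 2.0)
```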
## π Acknowledgements
This project builds upon several excellent open source projects:
* [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun)
* [DELTA](https://github.com/snap-research/DELTA_densetrack3d)
* [MoGe](https://github.com/microsoft/MoGe)
* [vggt](https://github.com/facebookresearch/vggt)
* [Pi3](https://github.com/yyfz/Pi3)
We thank the authors and contributors of these projects for their valuable contributions to the open source community!
## π Citation
If you find FlexAM useful for your research, please cite our paper:
```
@misc{sheng2026FlexAM,
  title={FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control},
  author={Sheng, Mingzhi and Gu, Zekai and Li, Peng and Lin, Cheng and Guo, Hao-Xiang and Chen, Ying-Cong and Liu, Yuan},
  year={2026},
  eprint={2602.13185},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.13185},
}
```
## βοΈ License
This model checkpoint is based on **FlexAM**.
- **Model Architecture / Code**: Licensed under **Apache 2.0** (or CC-BY-SA 4.0, consistent with the license in the GitHub repository).
- **Embedded DELTA Weights**: This checkpoint contains weights from **DELTA (Snap Inc.)**, which are restricted to **Non-Commercial, Research-Only** use.
**β οΈ Usage Note:**
By downloading or using these weights, you agree to comply with the **Snap Inc. License** regarding the DELTA modules. Please refer to the [LICENSE](./LICENSE) file in this repository for the full text.