Add text-to-video pipeline tag and project links
#1 · opened by nielsr (HF Staff)

README.md CHANGED
````diff
@@ -1,8 +1,10 @@
 ---
-license: apache-2.0
 base_model:
 - Wan-AI/Wan2.1-T2V-1.3B
+license: apache-2.0
+pipeline_tag: text-to-video
 ---
+
 <div align="center">

 # Causal Forcing
````
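With this hunk applied, the model card's YAML frontmatter reads:

```yaml
---
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
license: apache-2.0
pipeline_tag: text-to-video
---
```

The added `pipeline_tag` is what lets the Hub index the model under the text-to-video task.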
````diff
@@ -25,17 +27,17 @@ base_model:

 </div>
 </p>
-<h3 align="center"><a href="https://arxiv.org/abs/2602.02214">Paper</a> | <a href="https://thu-ml.github.io/CausalForcing.github.io">Website</a> | <a href="https://huggingface.co/zhuhz22/Causal-Forcing/tree/main">Models</a> | <a href="assets/wechat.jpg">WeChat</a></h3>
+<h3 align="center"><a href="https://arxiv.org/abs/2602.02214">Paper</a> | <a href="https://thu-ml.github.io/CausalForcing.github.io">Website</a> | <a href="https://github.com/thu-ml/Causal-Forcing">Code</a> | <a href="https://huggingface.co/zhuhz22/Causal-Forcing/tree/main">Models</a> | <a href="assets/wechat.jpg">WeChat</a></h3>
 </p>

-
-
 -----
+
 Causal Forcing significantly outperforms Self Forcing in **both visual quality and motion dynamics**, while keeping **the same training budget and inference efficiency**—enabling real-time, streaming video generation on a single RTX 4090.

 -----

-
+## Abstract
+To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. We propose **Causal Forcing** that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following.

 ## Quick Start

````
````diff
@@ -82,8 +84,6 @@ python inference.py \
   --data_path prompts/demos.txt
 ```

-
-
 ## Training

 <details>
````
````diff
@@ -94,8 +94,6 @@ First download the dataset (we provide a 6K toy dataset here):
 hf download zhuhz22/Causal-Forcing-data --local-dir dataset
 python utils/merge_and_get_clean.py
 ```
-> If the download gets stuck, Ctrl^C and then resume it.
-

 Then train the AR-diffusion model:
 - Framewise:
````
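The removed tip ("Ctrl^C and then resume it") works because re-running `hf download` picks up partially fetched files rather than starting over. The interrupt-and-retry loop can be automated; the `retry` helper below is our own illustration, not part of the repo:

```shell
#!/bin/sh
# retry: re-run a command until it succeeds, up to a fixed number of attempts.
# Useful for flaky downloads, since `hf download` resumes partial files
# when invoked again with the same arguments.
retry() {
  max_tries=3
  n=1
  until "$@"; do
    if [ "$n" -ge "$max_tries" ]; then
      echo "giving up after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    echo "command failed; retrying ($n/$max_tries)..." >&2
    sleep 1
  done
}

# Example, using the Stage 1 download from the README:
# retry hf download zhuhz22/Causal-Forcing-data --local-dir dataset
```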
````diff
@@ -118,12 +116,8 @@ Then train the AR-diffusion model:
   --logdir logs/ar_diffusion_chunkwise
 ```

-> We recommend training no less than 2K steps, and more steps (e.g., 5~10K) will lead to better performance.
-
-
 </details>

-
 <details>
 <summary> Stage 2: Causal ODE Initialization (Can skip by using our pretrained checkpoints. Click to expand.)</summary>

````
````diff
@@ -133,40 +127,7 @@ hf download zhuhz22/Causal-Forcing framewise/ar_diffusion.pt --local-dir checkpo
 hf download zhuhz22/Causal-Forcing chunkwise/ar_diffusion.pt --local-dir checkpoints
 ```

-In this stage,
-```bash
-# for the frame-wise model
-torchrun --nproc_per_node=8 \
-  get_causal_ode_data_framewise.py \
-  --generator_ckpt checkpoints/framewise/ar_diffusion.pt \
-  --rawdata_path dataset/clean_data \
-  --output_folder dataset/ODE6KCausal_framewise_latents
-
-python utils/create_lmdb_iterative.py \
-  --data_path dataset/ODE6KCausal_framewise_latents \
-  --lmdb_path dataset/ODE6KCausal_framewise
-
-# for the chunk-wise model
-torchrun --nproc_per_node=8 \
-  get_causal_ode_data_chunkwise.py \
-  --generator_ckpt checkpoints/chunkwise/ar_diffusion.pt \
-  --rawdata_path dataset/clean_data \
-  --output_folder dataset/ODE6KCausal_chunkwise_latents
-
-python utils/create_lmdb_iterative.py \
-  --data_path dataset/ODE6KCausal_chunkwise_latents \
-  --lmdb_path dataset/ODE6KCausal_chunkwise
-```
-
-Or you can also directly download our prepared dataset (~300G):
-```bash
-hf download zhuhz22/Causal-Forcing-data --local-dir dataset
-python utils/merge_lmdb.py
-```
-> If the download gets stuck, Ctrl^C and then resume it.
-
-
-And then train ODE initialization models:
+In this stage, train ODE initialization models:
 - Frame-wise:
 ```bash
 torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
````
````diff
@@ -176,37 +137,12 @@ And then train ODE initialization models:
   --config_path configs/causal_ode_framewise.yaml \
   --logdir logs/causal_ode_framewise
 ```
-- Chunk-wise:
-```bash
-torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
-  --rdzv_backend=c10d \
-  --rdzv_endpoint $MASTER_ADDR \
-  train.py \
-  --config_path configs/causal_ode_chunkwise.yaml \
-  --logdir logs/causal_ode_chunkwise
-```
-
-> We recommend training no less than 1K steps, and more steps (e.g., 5~10K) will lead to better performance.
-
 </details>

 ### Stage 3: DMD

 > This stage is compatible with Self Forcing training, so you can migrate seamlessly by using our configs and checkpoints.

-
-First download the dataset:
-```bash
-hf download gdhe17/Self-Forcing vidprom_filtered_extended.txt --local-dir prompts
-```
-If you have skipped Stage 2, you need to download the pretrained checkpoints:
-```bash
-hf download zhuhz22/Causal-Forcing framewise/causal_ode.pt --local-dir checkpoints
-hf download zhuhz22/Causal-Forcing chunkwise/causal_ode.pt --local-dir checkpoints
-```
-
-And then train DMD models:
-
 - Frame-wise model:
 ```bash
 torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
````
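The `torchrun` commands in these hunks assume an 8-node cluster with a c10d rendezvous at `$MASTER_ADDR`. For a quick single-machine trial, a reduced invocation might look like the sketch below; this is our own hypothetical adaptation (using torchrun's `--standalone` mode), not a command from the repo:

```shell
# Single-node sketch: --standalone replaces the multi-node rendezvous flags
# (--nnodes, --rdzv_id, --rdzv_backend, --rdzv_endpoint); the training
# entry point and its arguments are unchanged from the README.
torchrun --standalone --nproc_per_node=8 train.py \
  --config_path configs/causal_ode_framewise.yaml \
  --logdir logs/causal_ode_framewise
```

Expect to scale down batch size or steps accordingly, since the original recipes were tuned for 64 GPUs.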
````diff
@@ -216,28 +152,12 @@ And then train DMD models:
   --config_path configs/causal_forcing_dmd_framewise.yaml \
   --logdir logs/causal_forcing_dmd_framewise
 ```
-> We recommend training 500 steps. More than 1K steps will reduce dynamic degree.
-
-
-- Chunk-wise model:
-```bash
-torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
-  --rdzv_backend=c10d \
-  --rdzv_endpoint $MASTER_ADDR \
-  train.py \
-  --config_path configs/causal_forcing_dmd_chunkwise.yaml \
-  --logdir logs/causal_forcing_dmd_chunkwise
-```
-> We recommend training 100~200 steps. More than 1K steps will reduce dynamic degree.
-
-Such models are the final models used to generate videos.

 ## Acknowledgements
 This codebase is built on top of the open-source implementation of [CausVid](https://github.com/tianweiy/CausVid), [Self Forcing](https://github.com/guandeh17/Self-Forcing) and the [Wan2.1](https://github.com/Wan-Video/Wan2.1) repo.

 ## References
-
-```
+```bibtex
 @misc{zhu2026causalforcingautoregressivediffusion,
   title={Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation},
   author={Hongzhou Zhu and Min Zhao and Guande He and Hang Su and Chongxuan Li and Jun Zhu},
````