---
language:
- en
base_model:
- tencent/HunyuanVideo
- tencent/HunyuanVideo-1.5
---

<div align=center>

# *DisCa*: Accelerating Video Diffusion Transformers with *Dis*tillation-Compatible Learnable Feature *Ca*ching
<h3>CVPR 2026</h3>

**[EPIC Lab@SAI, SJTU](http://zhanglinfeng.tech)** | **[Tencent Hunyuan](https://github.com/Tencent-Hunyuan)**

[Chang Zou](https://shenyi-z.github.io/),
[Changlin Li](https://scholar.google.com/citations?user=wOQjqCMAAAAJ&hl=en),
Yang Li,
Patrol Li,
Jianbing Wu,
Xiao He,
Songtao Liu,
Zhao Zhong,
Kailin Huang,
[Linfeng Zhang](https://zhanglinfeng.tech)<sup>†</sup>

<a href="https://arxiv.org/abs/2602.05449" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2602.05449-B31B1B?logo=arxiv&logoColor=white" height="25"/></a>
<a href="#citation"><img alt="Citation" src="https://img.shields.io/badge/Citation-BibTeX-6C63FF?logo=bookstack&logoColor=white" height="25"/></a>
<a href="https://huggingface.co/Tencent/DisCa" target="_blank"><img alt="HuggingFace Models" src="https://img.shields.io/badge/huggingface-DisCa-f0c542?logo=huggingface&logoColor=white" height="25"/></a>
</div>

## 📖 Overview

> **<p align="justify"> Abstract:** *While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden.
Among existing acceleration methods, feature caching is popular for its training-free nature and considerable speedup, but it inevitably suffers semantic and detail loss under further compression. Another widely adopted method, training-aware step distillation, though successful in image generation, also degrades drastically in video generation at few steps. Furthermore, the quality loss becomes even more severe when training-free feature caching is naively applied to step-distilled models, owing to their sparser sampling steps.
This paper introduces, for the first time, a **distillation-compatible learnable** feature caching mechanism. We employ **a lightweight learnable neural predictor** instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative **Restricted MeanFlow** approach to achieve more stable and lossless distillation. Together, these initiatives push the acceleration boundary further while preserving generation quality.* </p>

## ⚙️ Technical Details

> **<p align="justify"> See more details in the paper.** </p>

## 🛠️ Installation

```bash
git clone https://github.com/Tencent-Hunyuan/DisCa.git
cd DisCa
```

## 💪 Experiments

### Text-to-Video (T2V) Task on HunyuanVideo-1.0
The experiments for the Text-to-Video task are conducted on **HunyuanVideo-1.0**. Since the original project does not provide inference scripts for MeanFlow, we supply them in this project.

We provide two scripts that implement the proposed methods:
* `infer_r_meanflow.sh` for **Restricted MeanFlow**
* `infer_disca.sh` for **DisCa**

The environment setup follows the official HunyuanVideo-1.0 configuration. Command-line argument passing is enabled, so you can either modify the arguments directly inside the bash files or pass them on the command line.
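
As a sketch, overriding arguments on the command line might look like the following. The flag names shown here (`--prompt`, `--infer-steps`, `--save-path`) are illustrative assumptions, not the authoritative list — open the bash files to see the arguments they actually accept.

```shell
# Hypothetical invocation — the flags below are assumptions for
# illustration; check infer_r_meanflow.sh for the real argument list.
bash infer_r_meanflow.sh \
    --prompt "A cat walks on the grass, realistic style." \
    --infer-steps 8 \
    --save-path ./results
```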

### Image-to-Video (I2V) Task on HunyuanVideo-1.5
The experiments for the Image-to-Video task are conducted on **HunyuanVideo-1.5**. Because HunyuanVideo-1.5 is already well adapted for MeanFlow inference, this project provides a lightweight, concise codebase.

Since the Image-to-Video task inherently carries stronger control signals and is easier to distill, and HunyuanVideo-1.5 itself is more capable, the I2V implementation in this project achieves a higher compression ratio.

**Getting Started:**
1. Follow the instructions in [`disca_i2v_hyvideo15/README.md`](disca_i2v_hyvideo15/README.md) to quickly set up the environment and configure the code.
2. Run inference directly with `infer_disca.sh`.
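
Put together, the quick start amounts to something like the sketch below. The working directory is an assumption based on the `disca_i2v_hyvideo15` folder referenced above; the environment setup itself is whatever that README specifies.

```shell
# I2V quick-start sketch — complete the environment setup described in
# disca_i2v_hyvideo15/README.md before running this.
cd disca_i2v_hyvideo15
bash infer_disca.sh
```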

**Restricted MeanFlow Component:**
For the corresponding Restricted MeanFlow component, you can use the [HunyuanVideo-1.5-480P-I2V-step-distill](https://huggingface.co/tencent/HunyuanVideo-1.5/tree/main/transformer/480p_i2v_step_distilled) model provided in the official HunyuanVideo-1.5 repository. This model is distilled with similar techniques and has undergone further optimization.

## 📝 Citation

```bibtex
@inproceedings{zou2026disca,
  abbr      = {CVPR},
  title     = {DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching},
  author    = {Zou, Chang and Li, Changlin and Liu, Songtao and Zhong, Zhao and Huang, Kailin and Zhang, Linfeng},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.05449},
  note      = {to appear}
}
```

## 🚧 Limitations
Despite its promising performance, DisCa still leaves significant room for optimization, particularly around inter-frame jitter and the push toward higher compression ratios. How can we train a better predictor? It may be worth exploring further techniques (such as DMD, post-training, etc.) to improve performance.

## 🙏 Acknowledgements
Thanks to HunyuanVideo-1.0 and HunyuanVideo-1.5 for open-sourcing their code and models, which made this work possible.