Commit 272fb7e (verified) by APolarishen, parent a8c2af8: Create README.md

---
language:
- en
base_model:
- tencent/HunyuanVideo
- tencent/HunyuanVideo-1.5
---

<div align="center">

# *DisCa*: Accelerating Video Diffusion Transformers with *Dis*tillation-Compatible Learnable Feature *Ca*ching
<h3>CVPR 2026</h3>

**[EPIC Lab@SAI, SJTU](http://zhanglinfeng.tech)** | **[Tencent Hunyuan](https://github.com/Tencent-Hunyuan)**

[Chang Zou](https://shenyi-z.github.io/),
[Changlin Li](https://scholar.google.com/citations?user=wOQjqCMAAAAJ&hl=en),
Yang Li,
Patrol Li,
Jianbing Wu,
Xiao He,
Songtao Liu,
Zhao Zhong,
Kailin Huang,
[Linfeng Zhang](https://zhanglinfeng.tech)<sup>†</sup>

<a href="https://arxiv.org/abs/2602.05449" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2602.05449-B31B1B?logo=arxiv&logoColor=white" height="25"/></a>
<a href="#citation"><img alt="Citation" src="https://img.shields.io/badge/Citation-BibTeX-6C63FF?logo=bookstack&logoColor=white" height="25"/></a>
<a href="https://huggingface.co/Tencent/DisCa" target="_blank"><img alt="HuggingFace Models" src="https://img.shields.io/badge/huggingface-DisCa-f0c542?logo=huggingface&logoColor=white" height="25"/></a>
</div>

## 📑 Overview
![Heading](assets/imgs/DisCa_Heading.png)
> **<p align="justify">Abstract:** *While diffusion models have achieved great success in video generation, this progress comes with a rapidly escalating computational burden.
Among existing acceleration methods, feature caching is popular for its training-free nature and considerable speedup, but it inevitably suffers semantic and detail degradation under further compression. Another widely adopted method, training-aware step distillation, though successful in image generation, also degrades drastically in video generation at few steps. Furthermore, the quality loss becomes even more severe when training-free feature caching is naively applied to step-distilled models, because their sampling steps are sparser.
This paper introduces a **distillation-compatible learnable** feature caching mechanism for the first time. We employ **a lightweight learnable neural predictor** instead of traditional training-free heuristics, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative **Restricted MeanFlow** approach to achieve more stable and lossless distillation. Together, these techniques push the acceleration boundary further while preserving generation quality.* </p>

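The core idea of a learnable cache can be illustrated with a toy sketch. This is not the paper's architecture: the linear dynamics, the linear least-squares "predictor", and all shapes below are illustrative assumptions. It only shows why fitting a lightweight predictor to the feature trajectory beats naively reusing a stale cached feature:

```python
import numpy as np

# Toy stand-in for per-step feature dynamics (illustrative, not DisCa's).
rng = np.random.default_rng(0)
d = 16
A = np.eye(d) + 0.02 * rng.normal(size=(d, d))  # features drift slowly per step
feats = [rng.normal(size=d)]
for _ in range(32):
    feats.append(A @ feats[-1])

X = np.stack(feats[:-1])   # cached features at step i
Y = np.stack(feats[1:])    # true features at step i+1

# "Learnable predictor": fit W so that X @ W approximates Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

reuse_err = np.linalg.norm(Y - X)       # error of training-free reuse (Y ≈ X)
pred_err = np.linalg.norm(Y - X @ W)    # error of the learned prediction
print(pred_err < reuse_err)  # True
```

In this toy setting the predictor recovers the step-to-step feature map almost exactly, whereas naive reuse carries the full drift as error; DisCa's actual predictor is a small neural network trained on real DiT features.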
## ✍️ Technical Details
![Tech Details](assets/imgs/DisCa_Pipeline.png)
> **<p align="justify">See more details in the paper.** </p>

## 🛠 Installation

```bash
git clone https://github.com/Tencent-Hunyuan/DisCa.git
cd DisCa
```

## 🪁 Experiments

### Text-to-Video (T2V) Task on HunyuanVideo-1.0
Experiments for the Text-to-Video task are conducted on **HunyuanVideo-1.0**. Since the original project does not provide inference scripts for MeanFlow, we supply them in this project.

We provide two scripts that implement the proposed methods:
* `infer_r_meanflow.sh` for **Restricted MeanFlow**
* `infer_disca.sh` for **DisCa**

The environment setup follows the official HunyuanVideo-1.0 configuration. Both scripts accept command-line arguments; you can either modify the arguments inside the bash files or pass them on the command line.

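The override pattern such launch scripts typically use can be sketched generically. The flag names below (`--steps`, `--prompt`) and the `demo_infer.sh` file are purely illustrative, not the real interface of `infer_disca.sh` or `infer_r_meanflow.sh`; check those scripts for the actual arguments:

```shell
# Generic sketch: a launch script with defaults that command-line flags override.
cat > demo_infer.sh <<'EOF'
#!/usr/bin/env bash
STEPS=50                 # default sampling steps
PROMPT="a cat"           # default prompt
while [ "$#" -gt 0 ]; do
  case "$1" in
    --steps)  STEPS="$2";  shift 2 ;;
    --prompt) PROMPT="$2"; shift 2 ;;
    *) shift ;;
  esac
done
echo "steps=$STEPS prompt=$PROMPT"
EOF
bash demo_infer.sh --steps 8 --prompt "a corgi"
```

Running it prints `steps=8 prompt=a corgi`: the command-line values replace the in-file defaults, which is why editing the bash files and passing flags are interchangeable.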
### Image-to-Video (I2V) Task on HunyuanVideo-1.5
Experiments for the Image-to-Video task are conducted on **HunyuanVideo-1.5**. Because HunyuanVideo-1.5 is already well adapted for MeanFlow inference, this project provides a lightweight and concise codebase.

Since the Image-to-Video task carries stronger control signals and is therefore easier to distill, and HunyuanVideo-1.5 itself is more capable, the I2V implementation in this project achieves a higher compression ratio.

**Getting Started:**
1. Follow the instructions in [`disca_i2v_hyvideo15/README.md`](disca_i2v_hyvideo15/README.md) to set up the environment and configure the code.
2. Run inference directly with `infer_disca.sh`.

**Restricted MeanFlow Component:**
For the corresponding Restricted MeanFlow component, you can use the [HunyuanVideo-1.5-480P-I2V-step-distill](https://huggingface.co/tencent/HunyuanVideo-1.5/tree/main/transformer/480p_i2v_step_distilled) model from the official HunyuanVideo-1.5 repository, which is distilled with similar techniques and further optimized.

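For intuition, the MeanFlow-style one-step sampling rule used by such step-distilled models can be sketched on a toy straight-line flow. The oracle `u_mean` below is an illustrative stand-in for the learned mean-velocity network, and this sketch does not cover the paper's restricted training scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)   # data endpoint (the "clean" sample)
x1 = rng.normal(size=4)   # noise endpoint

# On a straight path x_t = t*x1 + (1-t)*x0 the instantaneous velocity
# v = x1 - x0 is constant, so the mean velocity over any [r, t] equals v.
def u_mean(x_t, r, t):
    return x1 - x0        # oracle stand-in for the learned mean-velocity net

x_t = x1                                          # start from noise at t = 1
x_r = x_t - (1.0 - 0.0) * u_mean(x_t, 0.0, 1.0)   # one-step MeanFlow update
print(np.allclose(x_r, x0))  # True
```

Because the model predicts the *average* velocity over an interval rather than the instantaneous one, a single large jump can land on the data endpoint, which is what makes few-step and one-step distillation possible.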
## 📚 Citation

```bibtex
@inproceedings{zou2026disca,
  abbr      = {CVPR},
  title     = {DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching},
  author    = {Zou, Chang and Li, Changlin and Liu, Songtao and Zhong, Zhao and Huang, Kailin and Zhang, Linfeng},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.05449},
  note      = {to appear},
}
```

## 🧐 Limitations
Despite its promising performance, DisCa still leaves significant room for optimization, particularly around inter-frame jitter and the push toward higher compression ratios. How to train a better predictor remains an open question; techniques such as DMD and post-training may further improve performance.

## 🎉 Acknowledgements
Thanks to HunyuanVideo-1.0 and HunyuanVideo-1.5 for open-sourcing their code and models, which made this work possible.