Update model card for HoloCine: Add metadata, links, abstract, and usage

#1
by nielsr (HF Staff)
Files changed (1)
  1. README.md +180 -1
README.md CHANGED
@@ -1,3 +1,182 @@
  ---
  license: cc-by-nc-sa-4.0
- ---
  ---
  license: cc-by-nc-sa-4.0
+ pipeline_tag: text-to-video
+ library_name: diffusers
+ ---
+
+ # HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
+
+ [πŸ“„ Paper](https://huggingface.co/papers/2510.20822) - [🌐 Project Page](https://holo-cine.github.io/) - [πŸ’» Code](https://github.com/yihao-meng/HoloCine)
+
+ **[Yihao Meng<sup>1,2</sup>](https://yihao-meng.github.io/), [Hao Ouyang<sup>2</sup>](https://ken-ouyang.github.io/), [Yue Yu<sup>1,2</sup>](https://bruceyy.com/), [Qiuyu Wang<sup>2</sup>](https://github.com/qiuyu96), [Wen Wang<sup>2,3</sup>](https://github.com/encounter1997), [Ka Leong Cheng<sup>2</sup>](https://felixcheng97.github.io/), <br>[Hanlin Wang<sup>1,2</sup>](https://scholar.google.com/citations?user=0uO4fzkAAAAJ&hl=zh-CN), [Yixuan Li<sup>2,4</sup>](https://yixuanli98.github.io/), [Cheng Chen<sup>2,5</sup>](https://scholar.google.com/citations?user=nNQU71kAAAAJ&hl=zh-CN), [Yanhong Zeng<sup>2</sup>](https://zengyh1900.github.io/), [Yujun Shen<sup>2</sup>](https://shenyujun.github.io/), [Huamin Qu<sup>1</sup>](http://huamin.org/)**
+ <br>
+ <sup>1</sup>HKUST, <sup>2</sup>Ant Group, <sup>3</sup>ZJU, <sup>4</sup>CUHK, <sup>5</sup>NTU
+
+ <div align="center">
+ <img src="https://github.com/user-attachments/assets/c4dee993-7c6c-4604-a93d-a8eb09cfd69b"/>
+ </div>
+
+ ## TLDR
+ * **What it is:** A text-to-video model that generates full scenes, not just isolated clips.
+ * **Key feature:** It maintains consistency of characters, objects, and style across all shots in a scene.
+ * **How it works:** You provide shot-by-shot text prompts, giving you directorial control over the final video.
+
+ We strongly recommend visiting our [demo page](https://holo-cine.github.io/).
+
+ If you enjoy the videos we created, please consider giving us a star 🌟.
27
+
28
+ ## Abstract
29
+ State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.
+
+ ## Installation
+ Create a conda environment and install the requirements:
+ ```shell
+ git clone https://github.com/yihao-meng/HoloCine.git
+ cd HoloCine
+ conda create -n HoloCine python=3.10
+ conda activate HoloCine
+ pip install -e .
+ ```
+ We use FlashAttention-3 to implement the sparse inter-shot attention, and we highly recommend it for its speed. To install FlashAttention-3:
+
+ ```shell
+ git clone https://github.com/Dao-AILab/flash-attention.git
+ cd flash-attention/hopper
+ python setup.py install
+ ```
+ If you encounter environment problems when installing FlashAttention-3, refer to the official GitHub page: https://github.com/Dao-AILab/flash-attention.
+
+ If you cannot install FlashAttention-3, you can use FlashAttention-2 as an alternative; our code automatically detects the installed FlashAttention version. It is slower than FlashAttention-3 but still produces correct results.
+
+ To install FlashAttention-2, use the following command:
+ ```shell
+ pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+ ```
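
If you are unsure which backend ended up in your environment, a quick check like the following can help. This is an illustrative sketch, not part of the HoloCine codebase, and the module names are assumptions based on the two packages' usual layouts (FlashAttention-2 ships as `flash_attn`; the FlashAttention-3 hopper build exposes `flash_attn_interface`):

```python
# Hypothetical sketch (not the repo's actual detection code): check which
# FlashAttention backend is importable, mirroring the automatic fallback
# described above. Module names are assumptions.
import importlib.util


def detect_flash_attention() -> str:
    """Return which FlashAttention backend appears to be installed."""
    if importlib.util.find_spec("flash_attn_interface") is not None:
        return "flash-attn-3"
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash-attn-2"
    return "none"


if __name__ == "__main__":
    print(f"Detected backend: {detect_flash_attention()}")
```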
+
+ ## Checkpoint
+
+ ### Step 1: Download the Wan 2.2 VAE and T5
+ If you have already downloaded Wan 2.2 14B T2V, skip this section.
+
+ If not, you need the T5 text encoder and the VAE from the original Wan 2.2 repository:
+ [https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B)
+
+ Based on the repository's file structure, you **only** need to download `models_t5_umt5-xxl-enc-bf16.pth` and `Wan2.1_VAE.pth`.
+
+ You do **not** need to download the `google`, `high_noise_model`, or `low_noise_model` folders, nor any other files.
+
+ #### Recommended Download (CLI)
+
+ We recommend using `huggingface-cli` to download only the necessary files. Make sure you have `huggingface_hub` installed (`pip install huggingface_hub`).
+
+ This command downloads *only* the required T5 and VAE files into the correct directory:
+
+ ```bash
+ huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
+   --local-dir checkpoints/Wan2.2-T2V-A14B \
+   --include "models_t5_*.pth" "Wan2.1_VAE.pth"
+ ```
+
+ #### Manual Download
+
+ Alternatively, go to the "Files" tab on the Hugging Face repo and manually download the following two files:
+
+ * `models_t5_umt5-xxl-enc-bf16.pth`
+ * `Wan2.1_VAE.pth`
+
+ Place both files inside a new folder named `checkpoints/Wan2.2-T2V-A14B/`.
+
+ ### Step 2: Download the HoloCine Model (HoloCine\_dit)
+
+ Download our fine-tuned high-noise and low-noise DiT checkpoints from the following link:
+
+ **[➑️ Download the HoloCine\_dit model checkpoints [here](https://huggingface.co/hlwang06/HoloCine)]**
+
+ This download contains four fine-tuned model files: two for the full attention version (`full_high_noise.safetensors`, `full_low_noise.safetensors`) and two for the sparse inter-shot attention version (`sparse_high_noise.safetensors`, `sparse_low_noise.safetensors`). The sparse version is still uploading.
+
+ You can choose one version to download, or try both if you want.
+
+ The full attention version performs better, so we suggest starting with it. The sparse inter-shot attention version is slightly less stable (but still great in most cases) and faster than the full attention version.
+
+ For the full attention version:
+ Create a new folder named `checkpoints/HoloCine_dit/full/` and place both the high- and low-noise files inside.
+
+ For the sparse attention version:
+ Create a new folder named `checkpoints/HoloCine_dit/sparse/` and place both the high- and low-noise files inside.
106
+
107
+ ### Step 3: Final Directory Structure
108
+
109
+ If you downloaded the `full` model, your `checkpoints` directory should look like this:
110
+
111
+ ```
112
+ checkpoints/
113
+ β”œβ”€β”€ Wan2.2-T2V-A14B/
114
+ β”‚ β”œβ”€β”€ models_t5_umt5-xxl-enc-bf16.pth
115
+ β”‚ └── Wan2.1_VAE.pth
116
+ └── HoloCine_dit/
117
+ └── full/
118
+ β”œβ”€β”€ full_high_noise.safetensors
119
+ └── full_low_noise.safetensors
120
+ ```
121
+ (If you downloaded the `sparse` model, replace `full` with `sparse`.)
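
To confirm everything landed in the right place before running inference, a small check along these lines can be handy. This is a hypothetical helper, not part of the HoloCine repo; the paths simply mirror the tree above, and `variant` selects the `full` or `sparse` checkpoints:

```python
# Hypothetical helper (not part of the HoloCine repo): verify the expected
# checkpoint layout shown above before launching inference.
from pathlib import Path


def missing_checkpoints(root: str = "checkpoints", variant: str = "full") -> list[str]:
    """Return a list of expected checkpoint files that are missing."""
    base = Path(root)
    expected = [
        base / "Wan2.2-T2V-A14B" / "models_t5_umt5-xxl-enc-bf16.pth",
        base / "Wan2.2-T2V-A14B" / "Wan2.1_VAE.pth",
        base / "HoloCine_dit" / variant / f"{variant}_high_noise.safetensors",
        base / "HoloCine_dit" / variant / f"{variant}_low_noise.safetensors",
    ]
    return [str(p) for p in expected if not p.is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(variant="full")
    print("All checkpoints present." if not missing else f"Missing: {missing}")
```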
+
+ ## Sample Usage (Inference)
+ We release two versions of the model: one uses full attention to model the multi-shot sequence (our default), the other uses sparse inter-shot attention.
+
+ To use the full attention version:
+
+ ```shell
+ python HoloCine_inference_full_attention.py
+ ```
+
+ To use the sparse inter-shot attention version:
+
+ ```shell
+ python HoloCine_inference_sparse_attention.py
+ ```
+
+ ### Prompt Format - Structured Input Example
+ This is the easiest way to create new multi-shot prompts. You provide the components as separate arguments inside the script, and our helper function formats them correctly.
+
+ **Example (inside `HoloCine_inference_full_attention.py`):**
+
+ ```python
+ run_inference(
+     pipe=pipe,
+     negative_prompt=scene_negative_prompt,
+     output_path="test_structured_output.mp4",
+
+     # Choice 1 inputs
+     global_caption="The scene is set in a lavish, 1920s Art Deco ballroom during a masquerade party. [character1] is a mysterious woman with a sleek bob, wearing a sequined silver dress and an ornate feather mask. [character2] is a dapper gentleman in a black tuxedo, his face half-hidden by a simple black domino mask. The environment is filled with champagne fountains, a live jazz band, and dancing couples in extravagant costumes. This scene contains 5 shots.",
+     shot_captions=[
+         "Medium shot of [character1] standing by a pillar, observing the crowd, a champagne flute in her hand.",
+         "Close-up of [character2] watching her from across the room, a look of intrigue on his visible features.",
+         "Medium shot as [character2] navigates the crowd and approaches [character1], offering a polite bow.",
+         "Close-up on [character1]'s eyes through her mask, as they crinkle in a subtle, amused smile.",
+         "A stylish medium two-shot of them standing together, the swirling party out of focus behind them, as they begin to converse.",
+     ],
+     num_frames=241,
+ )
+ ```
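
One easy mistake with structured input is letting the shot count declared in the global caption ("This scene contains N shots.") drift out of sync with the number of shot captions. The sketch below is illustrative only: the repo's actual helper does the real formatting, and both function names and the flat output format here are assumptions:

```python
# Illustrative sketch (all names hypothetical, not the repo's helper):
# sanity-check the declared shot count against the caption list, then
# assemble one plausible flat prompt string.
import re


def check_shot_count(global_caption: str, shot_captions: list[str]) -> None:
    """Raise if the declared shot count disagrees with the caption list."""
    m = re.search(r"contains (\d+) shots", global_caption)
    if m and int(m.group(1)) != len(shot_captions):
        raise ValueError(
            f"Global caption declares {m.group(1)} shots, "
            f"but {len(shot_captions)} shot captions were given."
        )


def assemble_prompt(global_caption: str, shot_captions: list[str]) -> str:
    """One plausible flat format: global caption, then numbered shot lines."""
    check_shot_count(global_caption, shot_captions)
    shots = "\n".join(f"Shot {i + 1}: {c}" for i, c in enumerate(shot_captions))
    return f"{global_caption}\n{shots}"


if __name__ == "__main__":
    print(assemble_prompt(
        "A quiet cafe at dawn. This scene contains 2 shots.",
        ["Wide shot of the empty cafe.", "Close-up of steam rising from a cup."],
    ))
```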
+
+ ## Citation
+
+ If you find this work useful, please consider citing our paper:
+
+ ```bibtex
+ @article{meng2025holocine,
+   title={HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives},
+   author={Meng, Yihao and Ouyang, Hao and Yu, Yue and Wang, Qiuyu and Wang, Wen and Cheng, Ka Leong and Wang, Hanlin and Li, Yixuan and Chen, Cheng and Zeng, Yanhong and Shen, Yujun and Qu, Huamin},
+   journal={arXiv preprint arXiv:2510.20822},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ This project is licensed under CC BY-NC-SA 4.0 ([Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)).
+
+ The code is provided for academic research purposes only.
+
+ For any questions, please contact ymengas@cse.ust.hk.