Lawrence-cj commited on
Commit
92323d1
·
verified ·
1 Parent(s): cc751da

Create README.md

Files changed (1)
  1. README.md +138 -0
README.md ADDED
@@ -0,0 +1,138 @@
1
+ ---
2
+ license: other
3
+ license_name: nvidia-open-model-license
4
+ license_link: >-
5
+ https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
6
+ library_name: sana, sana-video
7
+ tags:
8
+ - text-to-video
9
+ - SANA-Video
10
+ - 480p_long_model
11
+ - BF16
12
+ - diffusion
13
+ - Minute-Length-Video-generation
14
+ language:
15
+ - en
16
+ - zh
17
+ base_model:
18
+ - Efficient-Large-Model/SANA-Video_2B_480p_LongLive_diffusers
19
+ pipeline_tag: text-to-video
20
+ ---
21
+ <p align="center" style="border-radius: 10px">
22
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/645b5b09bc7518912e1f9733/N0VlE-y1pau-4O1RlijQd.png" width="98%" alt="logo"/>
23
+ </p>
24
+
25
+ <div style="display:flex;justify-content: center">
26
+ <a href="https://hf.co/collections/Efficient-Large-Model/sana-video"><img src="https://img.shields.io/static/v1?label=Weights&message=Huggingface&color=yellow"></a> &ensp;
27
+ <a href="https://github.com/NVlabs/Sana"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a> &ensp;
28
+ <a href="https://nvlabs.github.io/Sana/Video/"><img src="https://img.shields.io/static/v1?label=Project&message=Github&color=blue&logo=github-pages"></a> &ensp;
29
+ <a href="https://arxiv.org/pdf/2509.24695"><img src="https://img.shields.io/static/v1?label=Arxiv&message=SANA-Video&color=red&logo=arxiv"></a> &ensp;
30
+ </div>
31
+
32
+
33
+ # 🐱 SANA-Video Model Card
34
+
35
+ <!-- <div align="center">
36
+ <a href="https://www.youtube.com/watch?v=nI_Ohgf8eOU" target="_blank">
37
+ <img src="https://img.youtube.com/vi/nI_Ohgf8eOU/0.jpg" alt="Demo Video of SANA-Video" style="width: 48%; display: block; margin: 0 auto; display: inline-block;">
38
+ </a>
39
+ <a href="https://www.youtube.com/watch?v=OOZzkirgsAc" target="_blank">
40
+ <img src="https://img.youtube.com/vi/OOZzkirgsAc/0.jpg" alt="Demo Video of SANA-Video" style="width: 48%; display: block; margin: 0 auto; display: inline-block;">
41
+ </a>
42
+ </div> -->
43
+
44
+
45
+ SANA-Video is a small, ultra-efficient diffusion model designed for rapid generation of high-quality, minute-long videos at resolutions up to 720×1280.
46
+
47
+ Key innovations and efficiency drivers include:
48
+
49
+ (1) **Linear DiT**: Uses linear attention as the core operation, which is significantly more efficient than vanilla attention when processing the large number of tokens required for video generation.
50
+
51
+ (2) **Constant-Memory KV Cache for Block Linear Attention**: Implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV cache bottleneck and enabling efficient, minute-long video synthesis.
52
+
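The constant-memory idea in (2) can be illustrated with a generic linear-attention recurrence. This is a minimal NumPy sketch, not the SANA-Video implementation: the ReLU feature map, the function names, and the block layout are assumptions made for illustration. It shows that the context carried between blocks is a fixed-size pair of sums, regardless of how many blocks have been processed.

```python
import numpy as np

def feature_map(x):
    # Assumed non-negative feature map (ReLU), chosen only for illustration.
    return np.maximum(x, 0.0)

def block_linear_attention(q_blocks, k_blocks, v_blocks, eps=1e-6):
    """Process blocks in order, carrying a fixed-size state between them.

    Each block attends to itself and to all earlier blocks. Instead of
    caching every past key and value, only two running sums are kept:
    S (d_k x d_v) and z (d_k,).
    """
    d_k = q_blocks[0].shape[-1]
    d_v = v_blocks[0].shape[-1]
    S = np.zeros((d_k, d_v))  # running sum of phi(k)^T v over all tokens seen
    z = np.zeros(d_k)         # running sum of phi(k) over all tokens seen
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current block into the state; its size never grows.
        S += phi_k.T @ v
        z += phi_k.sum(axis=0)
        # Normalized read-out for every query token in the block.
        out = (phi_q @ S) / (phi_q @ z + eps)[:, None]
        outputs.append(out)
    return np.concatenate(outputs, axis=0), S, z

# Hypothetical sizes: 3 blocks of 4 tokens, key dim 8, value dim 4.
rng = np.random.default_rng(0)
q_blocks = [rng.normal(size=(4, 8)) for _ in range(3)]
k_blocks = [rng.normal(size=(4, 8)) for _ in range(3)]
v_blocks = [rng.normal(size=(4, 4)) for _ in range(3)]
out, S, z = block_linear_attention(q_blocks, k_blocks, v_blocks)
```

A softmax-attention KV cache would grow with every generated block, whereas here `S` and `z` keep the same shape whether 3 or 300 blocks have been processed.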
53
+ SANA-Video achieves exceptional efficiency and cost savings: its training cost is only **1%** of MovieGen's (**12 days on 64 H100 GPUs**). Compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1 and SkyReels-V2), SANA-Video maintains competitive performance while being **16×** faster in measured latency.
54
+ SANA-Video is deployable on RTX 5090 GPUs, where the inference time for a 5-second 720p video drops from 71s to 29s (a 2.4× speedup), setting a new standard for low-cost, high-quality video generation.
55
+
56
+ Source code is available at https://github.com/NVlabs/Sana.
57
+
58
+ # 🐱 How to Run Inference
59
+
60
+ ```python
61
+ import torch
62
+ from diffusers import LongSanaVideoPipeline, DPMSolverMultistepScheduler
63
+ from diffusers import AutoencoderKLWan
64
+ from diffusers.utils import export_to_video
65
+
66
+ pipe = LongSanaVideoPipeline.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_LongLive_diffusers", torch_dtype=torch.bfloat16)
67
+ pipe.vae.to(torch.float32)
68
+ pipe.text_encoder.to(torch.bfloat16)
69
+ pipe.to("cuda")
70
+
71
+ prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
72
+ negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
73
+
74
+ video = pipe(
75
+     prompt=prompt,
76
+     negative_prompt=negative_prompt,
77
+     height=480,
78
+     width=832,
79
+     frames=161,
80
+     guidance_scale=1.0,
81
+     timesteps=[1000, 960, 889, 727, 0],  # Multi-step denoising per chunk
82
+     generator=torch.Generator(device="cuda").manual_seed(42),
83
+ ).frames[0]
84
+ export_to_video(video, "longsana.mp4", fps=16)
85
+
86
+ ```
87
+
88
+ ### Model Description
89
+
90
+ - **Developed by:** NVIDIA, Sana
91
+ - **Model type:** Efficient Video Generation with Block Linear Diffusion Transformer
92
+ - **Model size:** 2B parameters
93
+ - **Model precision:** torch.bfloat16 (BF16)
94
+ - **Model resolution:** This model generates 480p videos of 5 to 60 seconds (81 to 961 frames) with multi-scale height and width.
95
+ - **Model Description:** This is a model that can be used to generate and modify videos based on text prompts.
96
+ It is a Linear Diffusion Transformer that uses the 8× Wan-VAE or a 32× spatially compressed latent feature encoder ([DC-AE-V](https://arxiv.org/abs/2509.25182)).
97
+ - **Resources for more information:** Check out our [GitHub Repository](https://github.com/NVlabs/Sana) and the [SANA-Video report on arXiv](https://arxiv.org/pdf/2509.24695).
98
+
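The frame counts quoted under **Model resolution** line up with the 16 fps export rate used in the inference example if a clip of `s` seconds has `16 * s + 1` frames. That relation is inferred here from the 81-frame and 961-frame figures and is not stated in the model card, so treat the hypothetical helpers below as a rough sketch for picking a `frames` value.

```python
FPS = 16  # matches export_to_video(..., fps=16) in the inference example

def frames_for_duration(seconds, fps=FPS):
    # Assumed convention: one initial frame plus fps frames per second.
    return fps * seconds + 1

def duration_for_frames(frames, fps=FPS):
    # Inverse of the assumed convention above.
    return (frames - 1) / fps

five_sec = frames_for_duration(5)       # lower end of the stated range
one_minute = frames_for_duration(60)    # upper end of the stated range
example_len = duration_for_frames(161)  # length of the example clip, in seconds
```

Under this assumption, the `frames=161` setting in the inference example corresponds to a 10-second clip.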
99
+ ### Model Sources
100
+
101
+ For research purposes, we recommend our `Sana` GitHub repository (https://github.com/NVlabs/Sana), which is better suited to both training and inference.
102
+ - **Repository:** https://github.com/NVlabs/Sana
103
+ - **Guidance:** https://github.com/NVlabs/Sana/asset/docs/sana_video.md
104
+
105
+ ## License/Terms of Use
106
+
107
+ GOVERNING TERMS: This trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
108
+
109
+ ## Uses
110
+
111
+ ### Direct Use
112
+
113
+ The model is intended for research purposes only. Possible research areas and tasks include:
114
+
115
+ - Generation of artworks and use in design and other artistic processes.
116
+ - Applications in educational or creative tools.
117
+ - Research on generative models.
118
+ - Safe deployment of models which have the potential to generate harmful content.
119
+
120
+ - Probing and understanding the limitations and biases of generative models.
121
+
122
+ Excluded uses are described below.
123
+
124
+ ### Out-of-Scope Use
125
+
126
+ The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
127
+
128
+ ## Limitations and Bias
129
+
130
+ ### Limitations
131
+
132
+ - The model does not achieve perfect photorealism.
133
+ - The model cannot render complex legible text.
134
+ - Fingers and similar fine details may not, in general, be generated properly.
135
+ - The autoencoding part of the model is lossy.
136
+
137
+ ### Bias
138
+ While the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases.