Efficient-Large-Model
/

SANA-Video_2B_720p

+---
+license: other
+license_name: nvidia-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
+library_name: sana, sana-video
+tags:
+- text-to-video
+- SANA-Video
+- 720p_5s_pretrained_model
+- BF16
+- diffusion
+- LTX2-VAE
+language:
+- en
+- zh
+base_model:
+- Efficient-Large-Model/SANA-Video_2B_720p
+pipeline_tag: text-to-video
+---
+<p align="center" style="border-radius: 10px">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/645b5b09bc7518912e1f9733/N0VlE-y1pau-4O1RlijQd.png" width="98%" alt="logo"/>
+</p>
+<div style="display:flex;justify-content: center">
+  <a href="https://hf.co/collections/Efficient-Large-Model/sana-video"><img src="https://img.shields.io/static/v1?label=Weights&message=Huggingface&color=yellow"></a> &ensp;
+  <a href="https://github.com/NVlabs/Sana"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a> &ensp;
+  <a href="https://nvlabs.github.io/Sana/Video/"><img src="https://img.shields.io/static/v1?label=Project&message=Github&color=blue&logo=github-pages"></a> &ensp;
+  <a href="https://arxiv.org/pdf/2509.24695"><img src="https://img.shields.io/static/v1?label=Arxiv&message=SANA-Video&color=red&logo=arxiv"></a> &ensp;
+</div>
+# 🐱 SANA-Video Model Card
+<!-- <div align="center">
+  <a href="https://www.youtube.com/watch?v=nI_Ohgf8eOU" target="_blank">
+    <img src="https://img.youtube.com/vi/nI_Ohgf8eOU/0.jpg" alt="Demo Video of SANA-Video" style="width: 48%; display: block; margin: 0 auto; display: inline-block;">
+  </a>
+  <a href="https://www.youtube.com/watch?v=OOZzkirgsAc" target="_blank">
+    <img src="https://img.youtube.com/vi/OOZzkirgsAc/0.jpg" alt="Demo Video of SANA-Video" style="width: 48%; display: block; margin: 0 auto; display: inline-block;">
+  </a>
+</div> -->
+SANA-Video is a small, ultra-efficient diffusion model designed for rapid generation of high-quality, minute-long videos at resolutions up to 720×1280.
+Key innovations and efficiency drivers include:
+(1) **Linear DiT**: Leverages linear attention as the core operation, offering significantly more efficiency than vanilla attention when processing the massive number of tokens required for video generation.
+(2) **Constant-Memory KV Cache for Block Linear Attention**: Implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV cache bottleneck and enabling efficient, minute-long video synthesis.
+SANA-Video achieves exceptional efficiency and cost savings: its training cost is only **1%** of MovieGen's (**12 days on 64 H100 GPUs**). Compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1 and SkyReels-V2), SANA-Video maintains competitive performance while being **16×** faster in measured latency.
+SANA-Video is deployable on RTX 5090 GPUs, where the inference time for a 5-second 720p video is reduced from 71s to 29s (a 2.4× speedup), setting a new standard for low-cost, high-quality video generation.
+Source code is available at https://github.com/NVlabs/Sana.
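The constant-memory idea in point (2) above can be illustrated with a small, dependency-free sketch. It is not the SANA-Video implementation: the ReLU feature map, the function names, and the block-causal masking rule (each token sees every token up to the end of its own block) are assumptions chosen for illustration. The point it shows is that a running state of fixed size (`S` and `z`) reproduces the result of the quadratic computation, so memory does not grow with the number of blocks.

```python
import random
from typing import List

Vector = List[float]


def relu_map(x: Vector) -> Vector:
    """Non-negative feature map phi (ReLU is an illustrative assumption)."""
    return [max(0.0, value) for value in x]


def block_linear_attention(q, k, v, block_size, eps=1e-6):
    """Block-causal linear attention using a fixed-size running state."""
    d_k, d_v = len(k[0]), len(v[0])
    # The state is d_k x d_v plus d_k numbers, independent of sequence length.
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for start in range(0, len(q), block_size):
        end = min(start + block_size, len(q))
        # Fold the current block's keys and values into the running state.
        for t in range(start, end):
            phi_k = relu_map(k[t])
            for i in range(d_k):
                z[i] += phi_k[i]
                for j in range(d_v):
                    S[i][j] += phi_k[i] * v[t][j]
        # Read out every query in the block from the state alone.
        for t in range(start, end):
            phi_q = relu_map(q[t])
            denom = sum(phi_q[i] * z[i] for i in range(d_k)) + eps
            out.append(
                [sum(phi_q[i] * S[i][j] for i in range(d_k)) / denom
                 for j in range(d_v)]
            )
    return out


def reference_attention(q, k, v, block_size, eps=1e-6):
    """Quadratic reference: explicit weights over all visible tokens."""
    out = []
    for t in range(len(q)):
        visible = min((t // block_size + 1) * block_size, len(q))
        phi_q = relu_map(q[t])
        weights = [
            sum(a * b for a, b in zip(phi_q, relu_map(k[s])))
            for s in range(visible)
        ]
        denom = sum(weights) + eps
        out.append(
            [sum(w * v[s][j] for s, w in enumerate(weights)) / denom
             for j in range(len(v[0]))]
        )
    return out


rng = random.Random(0)


def rand_matrix(n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


q, k, v = rand_matrix(10, 4), rand_matrix(10, 4), rand_matrix(10, 3)
fast = block_linear_attention(q, k, v, block_size=4)
slow = reference_attention(q, k, v, block_size=4)
max_err = max(abs(a - b) for fr, sr in zip(fast, slow) for a, b in zip(fr, sr))
print(f"max abs difference: {max_err:.2e}")
```

In the sketch, the reference version revisits every earlier key for each query, while the block version only touches the current block plus the fixed-size state, which is what keeps memory constant as the video gets longer.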
+# 🐱 How to Run Inference
+Refer to: https://github.com/NVlabs/Sana/blob/main/asset/docs/sana_video.md#1-inference-with-txt-file
+# Diffusers Pipeline
+Refer to: https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_720p_diffusers
+### Model Description
+- **Developed by:** NVIDIA, Sana
+- **Model type:** Efficient Video Generation with Block Linear Diffusion Transformer
+- **Model size:** 2B parameters
+- **Model precision:** torch.bfloat16 (BF16)
+- **Model resolution:** This model is developed to generate 720p videos of 81 frames (5s) with multi-scale height and width.
+- **Model Description:** This is a model that can be used to generate and modify videos based on text prompts.
+It is a Linear Diffusion Transformer that uses the LTX2 VAE, a latent feature encoder with 32x32x8 spatio-temporal compression ([LTX2](https://github.com/Lightricks/LTX-2)).
+- **Resources for more information:** Check out our [GitHub Repository](https://github.com/NVlabs/Sana) and the [SANA-Video report on arXiv](https://arxiv.org/pdf/2509.24695).
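To make the 32x32x8 compression figure above concrete, here is a small arithmetic sketch. The helper name `latent_grid` is hypothetical, and two conventions in it are assumptions rather than statements from this card: that the first frame is encoded on its own (so `frames` must have the form 8n + 1, which 81 satisfies), and that height and width must be multiples of 32. Because 720 is not a multiple of 32, the example uses 704×1280 purely for illustration; how the actual pipeline buckets 720p inputs is not described here.

```python
def latent_grid(frames, height, width, spatial=32, temporal=8):
    """Return (latent_frames, latent_height, latent_width) for a video.

    Assumes a causal-style temporal layout, 1 + (frames - 1) / temporal,
    and spatial sizes that are exact multiples of the spatial factor.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    if (frames - 1) % temporal:
        raise ValueError("frames must have the form temporal * n + 1")
    return (1 + (frames - 1) // temporal, height // spatial, width // spatial)


t, h, w = latent_grid(frames=81, height=704, width=1280)
positions = t * h * w
print(f"latent grid: {t} x {h} x {w} = {positions} positions")
```

Under these assumptions, an 81-frame clip maps to a latent grid several thousand times smaller than the pixel grid, which is part of why the sequence the transformer has to process stays manageable.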
+### Model Sources
+For research purposes, we recommend our `Sana` GitHub repository (https://github.com/NVlabs/Sana), which is more suitable for both training and inference.
+- **Repository:** https://github.com/NVlabs/Sana
+- **Guidance:** https://github.com/NVlabs/Sana/blob/main/asset/docs/sana_video.md
+## License/Terms of Use
+GOVERNING TERMS: This trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
+## Uses
+### Direct Use
+The model is intended for research purposes only. Possible research areas and tasks include:
+- Generation of artworks and use in design and other artistic processes.
+- Applications in educational or creative tools.
+- Research on generative models.
+- Safe deployment of models which have the potential to generate harmful content.
+- Probing and understanding the limitations and biases of generative models.
+Excluded uses are described below.
+### Out-of-Scope Use
+The model was not trained to produce factual or true representations of people or events, so using it to generate such content is outside the scope of its abilities.
+## Limitations and Bias
+### Limitations
+- The model does not achieve perfect photorealism.
+- The model cannot render complex legible text.
+- Fingers and similar fine details may not be generated properly.
+- The autoencoding part of the model is lossy.
+### Bias
+While the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases.