<div align="center">
<img src="assets/logo.png" width="30%" alt="logo"/>
<h1>🐻 URSA: Uniform Discrete Diffusion with Metric Path<br>for Video Generation</h1>
<p align="center">
<a href="https://arxiv.org/abs/2510.24717"><img src="https://img.shields.io/badge/ArXiv-2510.24717-%23840707.svg" alt="ArXiv"></a>
<a href="https://huggingface.co/collections/BAAI/ursa"><img src="https://img.shields.io/badge/π€ Weights-BAAI/URSA-rgb(166,109,59).svg" alt=""></a>
<a href="https://huggingface.co/spaces/BAAI/nova-d48w1024-osp480"><img src="https://img.shields.io/badge/π€ Demo-TI2V-%26840707.svg" alt="TI2VDemo"></a>
<a href="http://bitterdhg.github.io/URSA_page"><img src="https://img.shields.io/badge/Project-URSA-%237CB4F7.svg" alt="Project"></a>
</p>
<p align="center">
<a href="https://scholar.google.com/citations?user=S2sbvjgAAAAJ&hl">Haoge Deng</a><sup>1,4*</sup>, <a href="https://scholar.google.com/citations?&user=qQv6YbsAAAAJ">Ting Pan</a><sup>2,4*</sup>, <a href="https://scholar.google.com/citations?user=VsJ39HMAAAAJ">Fan Zhang</a><sup>4*</sup>, <a href="https://scholar.google.com/citations?user=9JcQ2hwAAAAJ&hl">Yang Liu</a><sup>3,4*</sup>, <a href="https://scholar.google.com/citations?user=mKQhEsIAAAAJ&hl">Zhuoyan Luo</a><sup>4</sup>, <a href="https://scholar.google.com/citations?user=5Ydha2EAAAAJ&hl">Yufeng Cui</a><sup>4</sup>, <a href="https://scholar.google.com/citations?user=75OyC-oAAAAJ&hl">Wenxuan Wang</a><sup>4</sup><br>
<a href="https://scholar.google.com/citations?user=Ljk2BvIAAAAJ&hl">Chunhua Shen</a><sup>3</sup>, <a href="https://scholar.google.com/citations?user=Vkzd7MIAAAAJ&hl">Shiguang Shan</a><sup>2</sup>, <a href="https://scholar.google.com/citations?user=qxWfV6cAAAAJ&hl">Zhaoxiang Zhang</a><sup>1†</sup>, <a href="https://scholar.google.com/citations?user=DPz0DjYAAAAJ&hl">Xinlong Wang</a><sup>4†</sup><br>
<a href="http://english.ia.cas.cn">CASIA</a><sup>1</sup>, <a href="http://english.ict.cas.cn">CASICT</a><sup>2</sup>, <a href="https://www.zju.edu.cn/english">ZJU</a><sup>3</sup>, <a href="https://www.baai.ac.cn/en">BAAI</a><sup>4</sup><br>
<sup>*</sup> Equal Contribution, <sup>†</sup> Corresponding Author
<br><br><img src="assets/model_preview.gif"/>
<br><br><img src="assets/model_overview.png"/>
</div>
We present **URSA** (**U**niform disc**R**ete diffu**S**ion with metric p**A**th), a simple yet powerful framework that bridges the gap between discrete diffusion and continuous approaches. **URSA** formulates video generation as iterative global refinement of discrete spatiotemporal tokens and scales efficiently to long video generation while requiring fewer inference steps. With an asynchronous timestep scheduling strategy, **URSA** enables multi-task video generation in a single unified model.
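
As a rough intuition for this iterative global refinement, the toy sketch below (our illustration only, not the released sampler; `denoiser` is a hypothetical stand-in for the learned model) starts every token at a uniform-random codebook index and jointly re-predicts all tokens at each step, committing a growing fraction as the noise level anneals. In URSA itself, the timestep can additionally differ per frame via asynchronous scheduling.

```python
import torch

vocab_size, num_tokens, num_steps = 64, 16, 8
x = torch.randint(vocab_size, (num_tokens,))  # uniform discrete prior over codebook indices


def denoiser(x, t):
    """Hypothetical stand-in for the learned model: logits over the codebook."""
    return torch.randn(x.shape[0], vocab_size)


for step in range(num_steps):
    t = 1.0 - step / num_steps                 # noise level anneals from 1 to 0
    proposal = denoiser(x, t).argmax(-1)       # re-predict every token globally
    keep = torch.rand(num_tokens) < (1.0 - t)  # commit a growing fraction each step
    x = torch.where(keep, proposal, x)         # iterative global refinement
```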
## 📢 News
- ```[Feb 2026]``` Accepted by ICLR 2026 [[OpenReview]](https://openreview.net/forum?id=GFU5yCbILk).
- ```[Jan 2026]``` Released [Training Guide](./docs/training.md).
- ```[Oct 2025]``` 🎉 URSA is part of [Emu3.5](https://github.com/baaivision/Emu3.5) as DiDA (Discrete Diffusion Adaptation)!
- ```[Oct 2025]``` Released <a href="https://huggingface.co/spaces/BAAI/nova-d48w1024-osp480"><b>TI2V</b></a> 🤗 Demo.
- ```[Oct 2025]``` Released [Paper](https://arxiv.org/abs/2510.24717) & [Project Page](http://bitterdhg.github.io/URSA_page) & [Evaluation Guide](./docs/evaluation.md).
## ✨ Highlights
- 🔥 **Novel Approach**: Uniform Discrete Diffusion with Metric Path.
- 🔥 **SOTA Performance**: High efficiency with state-of-the-art T2I/T2V/I2V results.
- 🔥 **Unified Modeling**: Multi-task capabilities in a single unified model.
## 🏛️ Models
<a id="model-zoo"></a>
### 🖼️ Text to Image
| Model | Resolution | Data | Weight | GenEval | DPGBench |
|:-----:|:----------:|:----:|:------:|:-------:|:--------:|
| URSA-0.6B-IBQ1024 | 1024x1024 | 30M | [🤗 HF](https://huggingface.co/BAAI/URSA-0.6B-IBQ1024) \| [🤖 ModelScope](https://www.modelscope.cn/models/BAAI/URSA-0.6B-IBQ1024) | 0.79 | 85.6 |
| URSA-1.7B-IBQ1024 | 1024x1024 | 30M | [🤗 HF](https://huggingface.co/BAAI/URSA-1.7B-IBQ1024) \| [🤖 ModelScope](https://www.modelscope.cn/models/BAAI/URSA-1.7B-IBQ1024) | 0.80 | 86.0 |
### 🎬 Text to Video
| Model | Resolution | Data | Weight | VBench-T2V | VBench-I2V |
|:-----:|:----------:|:----:|:------:|:----------:|:----------:|
| URSA-0.6B-FSQ320 | 49x512x320 | 24M | [🤗 HF](https://huggingface.co/BAAI/URSA-0.6B-FSQ320) \| [🤖 ModelScope](https://www.modelscope.cn/models/BAAI/URSA-0.6B-FSQ320) | 81.4 | 86.0 |
| URSA-1.7B-FSQ320 | 49x512x320 | 24M | [🤗 HF](https://huggingface.co/BAAI/URSA-1.7B-FSQ320) \| [🤖 ModelScope](https://www.modelscope.cn/models/BAAI/URSA-1.7B-FSQ320) | 82.4 | 86.2 |
## 📋 Table of Contents
- [🔧 Installation](#installation)
- [🔥 Quick Start](#quick-start)
  - [🖼️ Image Generation](#quickstart-image-generation)
  - [🎬 Video Generation](#quickstart-video-generation)
- [💻 Gradio Demo](#gradio-demo)
- [🎯 Evaluation](./docs/evaluation.md)
- [🤗 Training](./docs/training.md)
## 🔧 Installation
<a id="installation"></a>
Clone this repository to your local disk and install:
```bash
pip install diffusers "transformers>=4.57.1" accelerate imageio imageio-ffmpeg omegaconf wandb
git clone https://github.com/baaivision/URSA.git
cd URSA && pip install .
```
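
If the install succeeded, the package should be importable as `diffnext` (the module used in the examples below):

```bash
python -c "from diffnext.pipelines import URSAPipeline; print('URSA ready')"
```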
## 🔥 Quick Start
<a id="quick-start"></a>
### 🖼️ Image Generation
<a id="quickstart-image-generation"></a>
```python
import torch
from diffnext.pipelines import URSAPipeline

# Load the 1.7B text-to-image model in fp16 on GPU.
model_id, height, width = "BAAI/URSA-1.7B-IBQ1024", 1024, 1024
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))

prompt = "The bear, calm and still, gazes upward as if lost in contemplation of the cosmos."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"

# `pipe(**locals())` forwards the locals above (prompt, negative_prompt, height, width, ...) as pipeline kwargs.
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")
```
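
The `pipe(**locals())` shorthand relies on the pipeline ignoring unrelated locals such as `model_args`; if you prefer explicit arguments, the same call can be written as:

```python
image = pipe(prompt=prompt, negative_prompt=negative_prompt, height=height, width=width).frames[0]
```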
### 🎬 Video Generation
<a id="quickstart-video-generation"></a>
```python
import os, torch, numpy
from diffnext.pipelines import URSAPipeline
from diffnext.utils import export_to_video

# Reduce CUDA memory fragmentation for long-video generation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_id, height, width = "BAAI/URSA-1.7B-FSQ320", 320, 512
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))

text_prompt = "a lone grizzly bear walks through a misty forest at dawn, sunlight catching its fur."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"

# Text-to-Image: generate a single frame.
prompt = text_prompt
num_frames, num_inference_steps = 1, 25
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")

# Image-to-Video: animate the image above; the "motion=" prefix sets the motion strength.
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_1+48f.mp4", fps=12)

# Text-to-Video: clear the image/video conditions to generate from text alone.
image, video = None, None
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_49f.mp4", fps=12)

# Video-to-Video: extend the clip by re-conditioning on its last 13 frames.
prompt = f"motion=5.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
num_cond_frames, cond_noise_scale = 13, 0.1
for i in range(12):
    video, start_video = video[-num_cond_frames:], video
    video = pipe(**locals()).frames[0]
    video = numpy.concatenate([start_video, video[num_cond_frames:]])
    export_to_video(video, "ursa_{}f.mp4".format(video.shape[0]), fps=12)
```
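
Each video-to-video pass above conditions on the last `num_cond_frames = 13` frames and regenerates a 49-frame window, so every iteration appends `49 - 13 = 36` new frames. A quick sanity check of the final clip length:

```python
frames = 49                  # initial 49-frame text-to-video clip
for _ in range(12):          # mirrors the extension loop above
    frames += 49 - 13        # each pass appends 36 new frames
print(frames, frames / 12)   # 481 frames, ~40 seconds at 12 fps
```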
## 💻 Gradio Demo
<a id="gradio-demo"></a>
```bash
# Text-to-Image (T2I)
python scripts/app_ursa_t2i.py --model "BAAI/URSA-1.7B-IBQ1024" --device 0
# Text-to-Image-to-Video (TI2V)
python scripts/app_ursa_ti2v.py --model "BAAI/URSA-1.7B-FSQ320" --device 0
```
## 📝 Todo List
- [X] [Model Zoo](#model-zoo)
- [X] [Quick Start](#quick-start)
- [X] [Gradio Demo](#gradio-demo)
- [X] [Evaluation Guide](./docs/evaluation.md)
- [X] [Training Guide](./docs/training.md)
- [ ] 4B Model
## 📜 Citation
If you find this repository useful, please consider giving it a star ⭐ and a citation:
```bibtex
@article{deng2025ursa,
title={Uniform Discrete Diffusion with Metric Path for Video Generation},
author={Deng, Haoge and Pan, Ting and Zhang, Fan and Liu, Yang and Luo, Zhuoyan and Cui, Yufeng and Shen, Chunhua and Shan, Shiguang and Zhang, Zhaoxiang and Wang, Xinlong},
journal={arXiv preprint arXiv:2510.24717},
year={2025}
}
```
```bibtex
@article{deng2024nova,
title={Autoregressive Video Generation without Vector Quantization},
author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
journal={arXiv preprint arXiv:2412.14169},
year={2024}
}
```
## 🤝 Acknowledgement
We thank the following repositories:
- [NOVA](https://github.com/baaivision/NOVA). ✨ NOVA is the predecessor of 🐻 URSA.
- [FlowMatching](https://github.com/facebookresearch/flow_matching). This codebase systematically provides continuous flow matching (CFM) and discrete flow matching (DFM) implementations.
- [FUDOKI](https://github.com/fudoki-hku/FUDOKI). This codebase provides a naive multimodal DFM implementation.
- [CodeWithGPU](https://github.com/seetacloud/codewithgpu). The CodeWithGPU library is the core of our data-loading pipeline.
## License
Code and models are licensed under [Apache License 2.0](LICENSE).