---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-to-video
language:
- en
- zh
tags:
- text-to-video
- video-generation
- diffusion
- flow-matching
- sparse-attention
- skiparse
- sequence-parallel
- mix-grpo
- lora
- hif8
- quantization
- npu
- ascend
- open-sora-plan
- ospnext
base_model:
- Wan-AI/Wan2.1-T2V-14B
---
<div align="center">
<img src="https://raw.githubusercontent.com/PKU-YuanGroup/OSP-Next/main/assets/logo.png" alt="OSP-Next" width="220">
**Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning**
**Open-Sora Plan · Next Generation**
A scalable **sparse** text-to-video diffusion model, introducing **Skiparse-2D Attention**,
**Sparse Sequence Parallelism (SSP)**, **HiF8 quantization**, and
**Mix-GRPO + LoRA** RL post-training.
</div>
<h5 align="center">
[![arXiv](https://img.shields.io/badge/Arxiv-OSP--Next-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2605.28691)
[![GitHub](https://img.shields.io/badge/GitHub-OSP--Next-181717.svg?logo=github)](https://github.com/PKU-YuanGroup/OSP-Next)
</h5>
---
## 🧠 Model Summary
OSP-Next is a 14B-parameter **text-to-video diffusion** model built on top of
the **Wan 2.1** text encoder / VAE backbone, with four tightly co-designed
contributions:
| | What it is | Why it matters |
|---|---|---|
| 🧩 **Skiparse-2D Attention** | Fixed-rule 2D sparse attention applied along H/W. | Approaches 3D full attention in quality, **natively FlashAttention compatible**. |
| 🔗 **Sparse Sequence Parallelism (SSP)** | A parallel strategy natively co-designed with Skiparse-2D. | **−75% inter-rank comm**, per-block comm rounds **4 → 1**. |
| 🪶 **HiF8 Quantization** *(NPU only)* | Dynamic-precision 8-bit (exponent / mantissa allocation). | First joint **8-bit + sparse fine-tuning**: up to **2.27× speedup** on a single Ascend 950PR with only **−0.4 pt** on VBench. |
| 🎯 **Mix-GRPO + LoRA RL** | RL post-training on top of the sparse model. | First RL pipeline for **sparse** video diffusion. |
### 📊 End-to-end speed-ups (vs. Wan 2.1 baseline, 5 s · 81-frame video)
| Hardware | 720P (padded) | 768P (native) |
|---|---|---|
| ⚡ NVIDIA H200 (BF16 · FA3 · `torch.compile`) | **1.53×** / 1.42× (1× / 8× GPU) | **1.64×** / 1.52× |
| 🟣 Ascend 950PR (BF16 · SDPA) | 1.27× (1× NPU) | 1.76× |
| 🪶 Ascend 950PR (HiF8 · 8-bit · SDPA) | **1.69×** | **2.27×** |
> 🏆 OSP-Next reaches **VBench total = 83.73%** (Wan 2.1 baseline 83.69%);
> OSP-Next-HiF8 keeps 83.29% with only a 0.4 pt drop. Full benchmark tables,
> ablations, and qualitative comparisons live in the
> [paper](https://arxiv.org/abs/<ARXIV_ID>).
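
To read the speed-up factors above as wall-clock savings: a factor of s means a run takes 1/s of the baseline time. A minimal sketch of that conversion follows; the helper name `time_saved` is ours and is not part of the OSP-Next code.

```python
def time_saved(speedup: float) -> float:
    """Fraction of baseline wall-clock time removed by a given speed-up factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speed-up factors copied from the 768P column of the table above.
for name, factor in [
    ("H200 BF16, 1 GPU", 1.64),
    ("Ascend 950PR BF16", 1.76),
    ("Ascend 950PR HiF8", 2.27),
]:
    print(f"{name}: {factor:.2f}x faster = {time_saved(factor):.1%} less time")
```
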
---
## 📦 What's in this repository
| File / folder | Description |
|---|---|
| `OSP-Next-14B/` | OSP-Next 14B BF16 diffusion weights (FSDP `model.pt` + config) |
| `OSP-Next-HiF8-14B/` | HiF8-quantized 14B weights (NPU inference) |
| `config.json` | OSP-Next model architecture metadata |
> ℹ️ OSP-Next reuses **Wan 2.1's T5 (UMT5-XXL) text encoder** and **WAN VAE**
> verbatim. We do **not** re-host them; see
> [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
> for the upstream weights.
---
## 🚀 Quick Start
OSP-Next ships as a **standalone training & inference repository** rather than
a pip-installable model class: the sparse attention / SSP comm / HiF8 kernels
all live inside the project. The typical flow:
```bash
# 1. Clone the code repo and install it into a fresh environment
git clone https://github.com/PKU-YuanGroup/OSP-Next.git
cd OSP-Next
conda create -n ospnext python=3.10 -y && conda activate ospnext
pip install -e .

# 2a. Download OSP-Next weights from this Hugging Face repo
huggingface-cli download yunyangge/OSP-Next --local-dir ./checkpoints/osp_next_14b

# 2b. Download Wan 2.1's T5 text encoder, tokenizer, and WAN VAE (the components we reuse).
#     All patterns go through --include: the CLI ignores --include when positional
#     filenames are also given, which would skip the tokenizer files.
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
    --include "models_t5_umt5-xxl-enc-bf16.pth" "Wan2.1_VAE.pth" "google/umt5-xxl/*" \
    --local-dir ./checkpoints/Wan2.1-T2V-14B

# 3. Point the inference config at the downloaded checkpoints
$EDITOR configs/infer/gpu/osp_14b.yaml

# 4. Run inference
bash scripts/infer/gpu/infer_osp_14b.sh
```
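
The download steps above can also be scripted from Python. The sketch below assumes `huggingface_hub` is installed; `download_plan` and `fetch_all` are illustrative names, not part of the OSP-Next repo. It uses `snapshot_download` with `allow_patterns` to restrict the Wan 2.1 download to the three reused components.

```python
from pathlib import Path

# Repo IDs and file names are copied from the shell commands above.
OSP_REPO = "yunyangge/OSP-Next"
WAN_REPO = "Wan-AI/Wan2.1-T2V-14B"
WAN_PATTERNS = [
    "models_t5_umt5-xxl-enc-bf16.pth",  # T5 (UMT5-XXL) encoder weights
    "Wan2.1_VAE.pth",                   # WAN VAE weights
    "google/umt5-xxl/*",                # tokenizer files
]


def download_plan(root: str = "./checkpoints") -> list[dict]:
    """Return the keyword arguments for one snapshot_download call per repo."""
    base = Path(root)
    return [
        {"repo_id": OSP_REPO, "local_dir": str(base / "osp_next_14b")},
        {
            "repo_id": WAN_REPO,
            "local_dir": str(base / "Wan2.1-T2V-14B"),
            "allow_patterns": WAN_PATTERNS,
        },
    ]


def fetch_all(root: str = "./checkpoints") -> None:
    """Download both repos; needs network access and huggingface_hub."""
    # Imported lazily so the plan can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    for kwargs in download_plan(root):
        snapshot_download(**kwargs)
```

Calling `fetch_all()` performs the actual download; `download_plan()` alone only lists what would be fetched.
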
In the inference YAML you'll fill in:
```yaml
model_config:
  pretrained_model_dir_or_checkpoint: "./checkpoints/osp_next_14b"
vae_config:
  vae_path: "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth"
text_encoder_config:
  checkpoint_path: "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth"
  text_tokenizer_path: "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/"
```
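
Before launching a long inference job it can help to confirm that every path in the config exists. The following is a small pre-flight sketch, not part of the OSP-Next code: `missing_paths` is a hypothetical helper, and the dict mirrors the YAML keys above with the nesting assumed rather than read from the real config file.

```python
from pathlib import Path


def missing_paths(config: dict) -> list[str]:
    """Walk a nested config dict and return string values that look like
    local paths (start with './' or '/') but do not exist on disk."""
    missing = []
    for value in config.values():
        if isinstance(value, dict):
            missing.extend(missing_paths(value))
        elif isinstance(value, str) and value.startswith(("./", "/")):
            if not Path(value).exists():
                missing.append(value)
    return missing


# Same keys and values as the YAML fragment above, written as a dict.
config = {
    "model_config": {
        "pretrained_model_dir_or_checkpoint": "./checkpoints/osp_next_14b",
    },
    "vae_config": {
        "vae_path": "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth",
    },
    "text_encoder_config": {
        "checkpoint_path": "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
        "text_tokenizer_path": "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/",
    },
}

for path in missing_paths(config):
    print(f"missing: {path}")
```
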
> 🟣 **On Ascend NPU?** Follow the
> [NPU setup](https://github.com/PKU-YuanGroup/OSP-Next#-npu-ascend) in the
> code repo (CANN 8.5.0 + `pip install -e .[npu]` + source-build `decord`),
> then run `scripts/infer/npu/infer_osp_14b.sh` instead.
### 🐍 Programmatic loading
The diffusion model itself can also be loaded as a regular `OSPNextModel`:
```python
import torch
from ospnext.modules.osp_next import OSPNextModel

model = OSPNextModel.from_pretrained("./checkpoints/osp_next_14b")
# .to() needs a torch.dtype, not the string "bfloat16"
model = model.to("cuda", dtype=torch.bfloat16).eval()
```
For the full text-to-video pipeline (T5 encoding → diffusion → VAE decoding),
use `ospnext.pipelines.t2v_pipeline.T2VPipeline`; see
[`infer/infer_osp.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/infer/infer_osp.py)
for a complete example.
---
## 🏋️ Training & RL Post-Training
OSP-Next supports both **SFT** (`train/train_osp.py`) and
**Mix-GRPO + LoRA RL post-training** (`train/train_osp_RL.py`) using the same
FSDP2 + Sparse-SP backbone. Highlights of the RL pipeline:
- **LoRA-only updates** on the frozen base model.
- **Mix-GRPO**: mixed ODE/SDE flow-matching RL with a configurable SDE step
  count, KL penalty, and group advantage clipping.
- **VideoAlign** as the multi-axis reward model.
- RL checkpoints **only store the LoRA adapter** (no base model duplication),
  plus an **EMA-LoRA** companion for inference. Merge the LoRA weights back
  into the base model with [`merge_lora_weights.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/merge_lora_weights.py)
  before running inference.
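
For readers unfamiliar with what merging a LoRA adapter means, the standard rule is `W_merged = W + (alpha / rank) * B @ A`. The sketch below illustrates it on a toy layer in dependency-free Python. It is not the repo's `merge_lora_weights.py`, which may use a different scaling convention, key naming, or EMA handling; the names `matmul` and `merge_lora` and the toy values are ours.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / rank) * lora_b @ lora_a.

    Shapes: weight is out x in, lora_a is rank x in, lora_b is out x rank.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (values chosen for illustration only).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # rank x in
lora_b = [[0.5], [0.0]]  # out x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2.0)
print(merged)
```

After merging, the layer is an ordinary dense weight again, so inference needs no adapter-aware code path.
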
Full training / RL recipes, config reference, sequence-parallel sizing tables
and troubleshooting tips are in the
[code repository README](https://github.com/PKU-YuanGroup/OSP-Next#%EF%B8%8F-training-pipeline).
---
## 🧪 Intended Use & Limitations
**Intended uses**
- Research on **sparse** video diffusion: Skiparse-2D, Sparse Sequence
Parallelism, joint sparse + 8-bit quantization, sparse-model RL.
- Text-to-video generation for non-commercial creative / educational use.
**Out of scope**
- Generating photo-realistic or identifiable likenesses of real individuals.
- Generating illegal, deceptive, harmful, sexually explicit, or
copyright-infringing content.
**Known limitations**
- 14B model: single-GPU inference needs an 80 GB-class accelerator
  (H100 / H200 / A100 80GB / Ascend 910B / 950PR). Multi-GPU is supported and
  recommended via the included SSP / FSDP2 launch scripts.
- HiF8 weights are tuned for the Ascend NPU custom kernel; the BF16 model is
  the recommended starting point on NVIDIA GPUs.
- Multi-NPU 950PR numbers are not yet reported; current 950PR results in the
  paper / model card are single-NPU only.
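
As a rough guide to the memory requirement above, the weights alone can be estimated from the parameter count and the bytes per parameter. This is back-of-the-envelope arithmetic, not a measured figure: activations, the T5 text encoder, and the VAE come on top, and this card does not state peak usage. The helper name `weight_memory_gb` is ours.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 14e9  # 14B parameters, as stated in the model summary

print(f"BF16 weights:  {weight_memory_gb(PARAMS, 2):.0f} GB")  # 2 bytes per parameter
print(f"8-bit weights: {weight_memory_gb(PARAMS, 1):.0f} GB")  # 1 byte per parameter
```
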
---
## 📚 Training Data
OSP-Next is trained on the same large-scale text-video corpus used by the
**Open-Sora-Plan** lineage, further processed by internal data-filtering and
re-captioning pipelines (see the paper for details). No personally
identifiable information is intentionally included, and sensitive content is
filtered prior to training to the best of our ability.
The RL post-training uses a **text-only prompt corpus** scored by
[VideoAlign](https://github.com/KwaiVGI/VideoAlign).
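
In GRPO-style training, reward scores are typically turned into a learning signal by normalizing them within the group of samples generated for the same prompt and then clipping the result. The sketch below shows that generic computation only. How OSP-Next combines VideoAlign's reward axes, and which clip value it uses, is not stated here, so `group_advantages`, `clip=5.0`, and the example rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, clip=5.0, eps=1e-8):
    """Normalize rewards within one prompt's group, then clip to [-clip, clip]."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [max(-clip, min(clip, (r - mu) / (sigma + eps))) for r in rewards]


# Hypothetical reward-model scores for four videos sampled from one prompt.
rewards = [0.2, 0.5, 0.5, 0.8]
print(group_advantages(rewards))
```

Samples that score above the group mean get a positive advantage and those below get a negative one, so the update depends on relative quality within the group rather than on the absolute reward scale.
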
---
## 📝 Citation
If you find OSP-Next useful in your research, please cite:
```bibtex
@misc{ge2026ospnextefficienthighqualityvideo,
title={OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning},
author={Yunyang Ge and Xianyi He and Zezhong Zhang and Bin Lin and Bin Zhu and Xinhua Cheng and Li Yuan},
year={2026},
eprint={2605.28691},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.28691},
}
```
This work builds on:
```bibtex
@article{wan2025wan,
title={Wan: Open and advanced large-scale video generative models},
author={Wan, Team and Wang, Ang and Ai, Baole and Wen, Bin and Mao, Chaojie and Xie, Chen-Wei and Chen, Di and Yu, Feiwu and Zhao, Haiming and Yang, Jianxiao and others},
journal={arXiv preprint arXiv:2503.20314},
year={2025}
}
@article{lin2024open,
title={Open-sora plan: Open-source large video generation model},
author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
journal={arXiv preprint arXiv:2412.00131},
year={2024}
}
@article{li2025mixgrpo,
title={Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde},
author={Li, Junzhe and Cui, Yutao and Huang, Tao and Ma, Yinping and Fan, Chun and Cheng, Yiming and Yang, Miles and Zhong, Zhao and Bo, Liefeng},
journal={arXiv preprint arXiv:2507.21802},
year={2025}
}
```
---
## 🙏 Acknowledgements
- 🌊 [**Wan**](https://github.com/Wan-Video/Wan2.1): WAN-VAE and T5 backbone.
- 🎬 [**Open-Sora-Plan**](https://github.com/PKU-YuanGroup/Open-Sora-Plan): the ecosystem this project extends.
- 🏅 [**VideoAlign**](https://github.com/KwaiVGI/VideoAlign): reward model for RL post-training.
- 🎯 [**Mix-GRPO**](https://arxiv.org/abs/2507.21802): mixed ODE-SDE flow-matching RL.
---
## 📄 License
Released under **Apache 2.0**; see
[`LICENSE.txt`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/LICENSE.txt)
in the code repository.
The reused Wan 2.1 T5 / VAE weights are governed by **their own licenses** at
[`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B).