---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-to-video
language:
- en
- zh
tags:
- text-to-video
- video-generation
- diffusion
- flow-matching
- sparse-attention
- skiparse
- sequence-parallel
- mix-grpo
- lora
- hif8
- quantization
- npu
- ascend
- open-sora-plan
- ospnext
base_model:
- Wan-AI/Wan2.1-T2V-14B
---

<div align="center">

<img src="https://raw.githubusercontent.com/PKU-YuanGroup/OSP-Next/main/assets/logo.png" alt="OSP-Next" width="220">

**Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning**

**Open-Sora Plan · Next Generation**

A scalable **sparse** text-to-video diffusion model, introducing **Skiparse-2D Attention**,
**Sparse Sequence Parallelism (SSP)**, **HiF8 quantization**, and
**Mix-GRPO + LoRA** RL post-training.

</div>

<h5 align="center">

[Paper (arXiv:2605.28691)](https://arxiv.org/abs/2605.28691)
[Code (GitHub)](https://github.com/PKU-YuanGroup/OSP-Next)

</h5>

---

## Model Summary

OSP-Next is a 14B-parameter **text-to-video diffusion** model built on top of
the **Wan 2.1** text encoder / VAE backbone, with four tightly co-designed
contributions:

| Contribution | What it is | Why it matters |
|---|---|---|
| 🧩 **Skiparse-2D Attention** | Fixed-rule 2D sparse attention applied along H/W. | Approaches 3D full attention in quality, **natively FlashAttention compatible**. |
| **Sparse Sequence Parallelism (SSP)** | A parallel strategy natively co-designed with Skiparse-2D. | **−75% inter-rank comm**, per-block comm rounds **4 → 1**. |
| 🪶 **HiF8 Quantization** *(NPU only)* | Dynamic-precision 8-bit (exponent / mantissa allocation). | First joint **8-bit + sparse fine-tuning**, with up to **2.27× speedup** on a single Ascend 950PR and only **−0.4 pt** on VBench. |
| 🎯 **Mix-GRPO + LoRA RL** | RL post-training on top of the sparse model. | First RL pipeline for **sparse** video diffusion. |
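
To illustrate why a fixed skip rule can stay FlashAttention-compatible, here is a minimal sketch of one plausible reading of the pattern. It is an assumption, not the OSP-Next implementation: tokens are grouped by `(h % k, w % k)` with a hypothetical sparsity factor `k`, and each group then runs ordinary dense attention. The function names are illustrative only.

```python
from collections import defaultdict


def skip_groups_2d(frames, height, width, k):
    """Partition (t, h, w) token coordinates into k*k groups by (h % k, w % k).

    Hypothetical illustration: each group keeps every frame but only every
    k-th row and column, so attention inside a group is plain dense attention.
    """
    if height % k or width % k:
        raise ValueError("height and width must be divisible by k")
    groups = defaultdict(list)
    for t in range(frames):
        for h in range(height):
            for w in range(width):
                groups[(h % k, w % k)].append((t, h, w))
    return dict(groups)


def attention_pairs(groups):
    """Count query-key pairs when each group attends only within itself."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


# Toy latent grid: 4 frames of 8 x 8 tokens, assumed sparsity factor k = 2.
groups = skip_groups_2d(frames=4, height=8, width=8, k=2)
total_tokens = 4 * 8 * 8
full_pairs = total_tokens ** 2
sparse_pairs = attention_pairs(groups)
print(len(groups), sparse_pairs / full_pairs)
```

With `k = 2` this toy grid yields 4 groups and a quarter of the query-key pairs of full attention. The actual Skiparse-2D rule and its ratios are defined in the paper and code repository.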

### End-to-end speed-ups (vs. Wan 2.1 baseline, 5 s · 81-frame video)

| Hardware | 720P (padded) | 768P (native) |
|---|---|---|
| ⚡ NVIDIA H200 (BF16 · FA3 · `torch.compile`) | **1.53×** / 1.42× (1× / 8× GPU) | **1.64×** / 1.52× |
| 🟣 Ascend 950PR (BF16 · SDPA) | 1.27× (1× NPU) | 1.76× |
| 🪶 Ascend 950PR (HiF8 · 8-bit · SDPA) | **1.69×** | **2.27×** |
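
A speed-up factor can be read as a wall-clock saving with a little arithmetic. The sketch below uses only the Ascend 950PR numbers from the table and assumes both rows are measured against the same Wan 2.1 baseline on the same hardware; the helper name is illustrative.

```python
def latency_reduction(speedup):
    """Fraction of baseline wall-clock time saved at a given speed-up factor."""
    return 1.0 - 1.0 / speedup


# Speed-ups vs. the Wan 2.1 baseline, copied from the table (Ascend 950PR, 1 NPU).
bf16 = {"720P": 1.27, "768P": 1.76}
hif8 = {"720P": 1.69, "768P": 2.27}

for res in bf16:
    # Extra gain from HiF8 on top of the sparse BF16 model (assumes a shared baseline).
    extra = hif8[res] / bf16[res]
    print(
        f"{res}: HiF8 saves {latency_reduction(hif8[res]):.1%} of baseline time, "
        f"{extra:.2f}x over sparse BF16"
    )
```

Under that assumption, the 2.27× figure corresponds to roughly 56% less generation time than the baseline, of which about 1.29× comes from HiF8 on top of the sparse BF16 model.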

> OSP-Next reaches **VBench total = 83.73%** (Wan 2.1 baseline 83.69%);
> OSP-Next-HiF8 retains 83.29%, a drop of only about 0.4 pt. Full benchmark tables,
> ablations and qualitative comparisons live in the
> [paper](https://arxiv.org/abs/2605.28691).

---

## 📦 What's in this repository

| File / folder | Description |
|---|---|
| `OSP-Next-14B/` | OSP-Next 14B BF16 diffusion weights (FSDP `model.pt` + config) |
| `OSP-Next-HiF8-14B/` | HiF8-quantized 14B weights (NPU inference) |
| `config.json` | OSP-Next model architecture metadata |

> ℹ️ OSP-Next reuses **Wan 2.1's T5 (UMT5-XXL) text encoder** and **WAN VAE**
> verbatim. We do **not** re-host them; see
> [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
> for the upstream weights.

---

## Quick Start

OSP-Next ships as a **standalone training & inference repository** rather than
a pip-installable model class, because the sparse attention, SSP communication
and HiF8 kernels all live inside the project. The typical flow:

```bash
# 1. Clone the code repo and install it into a fresh environment
git clone https://github.com/PKU-YuanGroup/OSP-Next.git
cd OSP-Next
conda create -n ospnext python=3.10 -y && conda activate ospnext
pip install -e .

# 2a. Download OSP-Next weights from this Hugging Face repo
huggingface-cli download yunyangge/OSP-Next --local-dir ./checkpoints/osp_next_14b

# 2b. Download Wan 2.1's T5 text encoder and WAN VAE (the components we reuse).
#     All three patterns go through --include, because the CLI ignores --include
#     when explicit filenames are also passed.
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
    --include "models_t5_umt5-xxl-enc-bf16.pth" "Wan2.1_VAE.pth" "google/umt5-xxl/*" \
    --local-dir ./checkpoints/Wan2.1-T2V-14B

# 3. Point the inference config at the downloaded checkpoint paths
$EDITOR configs/infer/gpu/osp_14b.yaml

# 4. Run inference
bash scripts/infer/gpu/infer_osp_14b.sh
```

In the inference YAML, fill in:

```yaml
model_config:
  pretrained_model_dir_or_checkpoint: "./checkpoints/osp_next_14b"
vae_config:
  vae_path: "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth"
text_encoder_config:
  checkpoint_path: "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth"
  text_tokenizer_path: "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/"
```
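
Before launching a long inference job, it can help to confirm that every path in the config actually exists. The following is a minimal sketch: the `EXPECTED` dictionary simply mirrors the YAML values above, and `missing_paths` is a hypothetical helper, not part of the OSP-Next code base.

```python
from pathlib import Path

# Paths copied from the inference YAML; the flat key names are illustrative only.
EXPECTED = {
    "pretrained_model_dir_or_checkpoint": "./checkpoints/osp_next_14b",
    "vae_path": "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth",
    "checkpoint_path": "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
    "text_tokenizer_path": "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/",
}


def missing_paths(paths, root="."):
    """Return the config keys whose path does not exist under `root`."""
    root = Path(root)
    return [key for key, rel in paths.items() if not (root / rel).exists()]


missing = missing_paths(EXPECTED)
if missing:
    print("Missing checkpoint paths:", ", ".join(missing))
else:
    print("All checkpoint paths found.")
```

Run it from the root of the cloned repository so the relative paths resolve the same way the inference script would see them.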

> 🟣 **On Ascend NPU?** Follow the
> [NPU setup](https://github.com/PKU-YuanGroup/OSP-Next#-npu-ascend) in the
> code repo (CANN 8.5.0 + `pip install -e .[npu]` + source-build `decord`),
> then run `scripts/infer/npu/infer_osp_14b.sh` instead.

### Programmatic loading

The diffusion model itself can also be loaded as a regular `OSPNextModel`:

```python
import torch

from ospnext.modules.osp_next import OSPNextModel

model = OSPNextModel.from_pretrained("./checkpoints/osp_next_14b")
# nn.Module.to() expects a torch.dtype rather than a string
model = model.to("cuda", dtype=torch.bfloat16).eval()
```

For the full text-to-video pipeline (T5 encoding → diffusion → VAE decoding),
use `ospnext.pipelines.t2v_pipeline.T2VPipeline`; see
[`infer/infer_osp.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/infer/infer_osp.py)
for a complete example.

---

## Training & RL Post-Training

OSP-Next supports both **SFT** (`train/train_osp.py`) and
**Mix-GRPO + LoRA RL post-training** (`train/train_osp_RL.py`) using the same
FSDP2 + Sparse-SP backbone. Highlights of the RL pipeline:

- **LoRA-only updates** on the frozen base model.
- **Mix-GRPO**: mixed ODE/SDE flow-matching RL with a configurable SDE step
  count, KL penalty and group advantage clipping.
- **VideoAlign** as the multi-axis reward model.
- RL checkpoints **only store the LoRA adapter** (no base model duplication),
  plus an **EMA-LoRA** companion for inference. Merge them back into the base
  with [`merge_lora_weights.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/merge_lora_weights.py)
  before running inference.
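
For readers new to GRPO-style training, the group-relative advantage with clipping mentioned in the list can be sketched as follows. This is the generic GRPO normalization, not code from `train/train_osp_RL.py`; the function name, the clip range and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, clip=5.0, eps=1e-8):
    """Normalize rewards within one prompt group, then clip to [-clip, clip].

    `rewards` holds one scalar reward per video sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [max(-clip, min(clip, (r - mu) / (sigma + eps))) for r in rewards]


# Made-up reward scores for four samples of a single prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_advantages(rewards, clip=1.0)
print([round(a, 3) for a in advantages])
```

Samples that score above the group mean receive a positive advantage and the rest a negative one, while clipping keeps a single outlier reward from dominating the LoRA update.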

Full training / RL recipes, config reference, sequence-parallel sizing tables
and troubleshooting tips are in the
[code repository README](https://github.com/PKU-YuanGroup/OSP-Next#%EF%B8%8F-training-pipeline).

---

## 🧪 Intended Use & Limitations

**Intended uses**

- Research on **sparse** video diffusion: Skiparse-2D, Sparse Sequence
  Parallelism, joint sparse + 8-bit quantization, sparse-model RL.
- Text-to-video generation for non-commercial creative / educational use.

**Out of scope**

- Generating photo-realistic or identifiable likenesses of real individuals.
- Generating illegal, deceptive, harmful, sexually explicit, or
  copyright-infringing content.

**Known limitations**

- 14B model: single-GPU inference needs an 80 GB-class accelerator
  (H100 / H200 / A100 80GB / Ascend 910B / 950PR). Multi-GPU is supported and
  recommended via the included SSP / FSDP2 launch scripts.
- HiF8 weights are tuned for the Ascend NPU custom kernel; the BF16 model is
  the recommended starting point on NVIDIA GPUs.
- Multi-NPU 950PR numbers are not yet reported; current 950PR results in the
  paper / model card are single-NPU only.
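
As a rough sanity check on the memory requirement, the weight footprint alone can be estimated from the parameter count. The sketch below assumes a nominal 14e9 parameters and counts only the diffusion weights; activations, the T5 text encoder and the VAE come on top, so the result is a lower bound rather than a measured figure.

```python
def weight_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


PARAMS = 14e9  # nominal "14B"; the exact parameter count is an assumption

bf16_gib = weight_gib(PARAMS, 2)  # BF16 stores 2 bytes per parameter
hif8_gib = weight_gib(PARAMS, 1)  # an 8-bit format stores 1 byte per parameter

print(f"BF16 weights:  {bf16_gib:.1f} GiB")
print(f"8-bit weights: {hif8_gib:.1f} GiB")
```

The BF16 weights come to about 26 GiB before any activations are allocated, which leaves limited headroom on smaller cards once the long video token sequence is taken into account.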

---

## Training Data

OSP-Next is trained on the same large-scale text-video corpus used by the
**Open-Sora-Plan** lineage, plus internal data filtering / re-captioning
pipelines (see the paper for details). No personally identifiable information is
intentionally included, and any sensitive content is filtered prior to
training to the best of our ability.

The RL post-training uses a **text-only prompt corpus**, with the generated
videos scored by [VideoAlign](https://github.com/KwaiVGI/VideoAlign).

---

## Citation

If you find OSP-Next useful in your research, please cite:

```bibtex
@misc{ge2026ospnextefficienthighqualityvideo,
  title={OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning},
  author={Yunyang Ge and Xianyi He and Zezhong Zhang and Bin Lin and Bin Zhu and Xinhua Cheng and Li Yuan},
  year={2026},
  eprint={2605.28691},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.28691},
}
```

This work builds on:

```bibtex
@article{wan2025wan,
  title={Wan: Open and advanced large-scale video generative models},
  author={Wan, Team and Wang, Ang and Ai, Baole and Wen, Bin and Mao, Chaojie and Xie, Chen-Wei and Chen, Di and Yu, Feiwu and Zhao, Haiming and Yang, Jianxiao and others},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}

@article{lin2024open,
  title={Open-sora plan: Open-source large video generation model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}

@article{li2025mixgrpo,
  title={Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde},
  author={Li, Junzhe and Cui, Yutao and Huang, Tao and Ma, Yinping and Fan, Chun and Cheng, Yiming and Yang, Miles and Zhong, Zhao and Bo, Liefeng},
  journal={arXiv preprint arXiv:2507.21802},
  year={2025}
}
```

---

## Acknowledgements

- [**Wan**](https://github.com/Wan-Video/Wan2.1): WAN-VAE and T5 backbone.
- 🎬 [**Open-Sora-Plan**](https://github.com/PKU-YuanGroup/Open-Sora-Plan): the ecosystem this project extends.
- 🏅 [**VideoAlign**](https://github.com/KwaiVGI/VideoAlign): reward model for RL post-training.
- 🎯 [**Mix-GRPO**](https://arxiv.org/abs/2507.21802): mixed ODE-SDE flow-matching RL.

---

## License

Released under **Apache 2.0**; see
[`LICENSE.txt`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/LICENSE.txt)
in the code repository.

The reused Wan 2.1 T5 / VAE weights are governed by **their own licenses** at
[`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B).