---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-to-video
language:
- en
- zh
tags:
- text-to-video
- video-generation
- diffusion
- flow-matching
- sparse-attention
- skiparse
- sequence-parallel
- mix-grpo
- lora
- hif8
- quantization
- npu
- ascend
- open-sora-plan
- ospnext
base_model:
- Wan-AI/Wan2.1-T2V-14B
---
# OSP-Next

**Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning**

**Open-Sora Plan · Next Generation**

A scalable **sparse** text-to-video diffusion model that introduces **Skiparse-2D Attention**, **Sparse Sequence Parallelism (SSP)**, **HiF8 quantization**, and **Mix-GRPO + LoRA** RL post-training.

[![arXiv](https://img.shields.io/badge/Arxiv-OSP--Next-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2605.28691) [![GitHub](https://img.shields.io/badge/GitHub-OSP--Next-181717.svg?logo=github)](https://github.com/PKU-YuanGroup/OSP-Next)


---

## 🧠 Model Summary

OSP-Next is a 14B-parameter **text-to-video diffusion** model built on top of the **Wan 2.1** text encoder / VAE backbone, with four tightly co-designed contributions:

| Component | What it is | Why it matters |
|---|---|---|
| 🧩 **Skiparse-2D Attention** | Fixed-rule 2D sparse attention applied along H/W. | Approaches 3D full attention in quality, **natively FlashAttention compatible**. |
| 🔗 **Sparse Sequence Parallelism (SSP)** | A parallel strategy natively co-designed with Skiparse-2D. | **−75% inter-rank comm**, per-block comm rounds **4 → 1**. |
| 🪶 **HiF8 Quantization** *(NPU only)* | Dynamic-precision 8-bit (exponent / mantissa allocation). | First joint **8-bit + sparse fine-tuning**: up to **2.27× speedup** on a single Ascend 950PR with only **−0.4 pt** on VBench. |
| 🎯 **Mix-GRPO + LoRA RL** | RL post-training on top of the sparse model. | First RL pipeline for **sparse** video diffusion. |

### 📊 End-to-end speed-ups (vs. Wan 2.1 baseline, 5 s · 81-frame video)

| Hardware | 720P (padded) | 768P (native) |
|---|---|---|
| ⚡ NVIDIA H200 (BF16 · FA3 · `torch.compile`) | **1.53×** / 1.42× (1× / 8× GPU) | **1.64×** / 1.52× |
| 🟣 Ascend 950PR (BF16 · SDPA) | 1.27× (1× NPU) | 1.76× |
| 🪶 Ascend 950PR (HiF8 · 8-bit · SDPA) | **1.69×** | **2.27×** |

> 🏆 OSP-Next reaches **VBench total = 83.73%** (Wan 2.1 baseline 83.69%);
> OSP-Next-HiF8 keeps 83.29% with only a 0.4 pt drop. Full benchmark tables,
> ablations and qualitative comparisons live in the
> [paper](https://arxiv.org/abs/).

---

## 📦 What's in this repository

| File / folder | Description |
|---|---|
| `OSP-Next-14B/` | OSP-Next 14B BF16 diffusion weights (FSDP `model.pt` + config) |
| `OSP-Next-HiF8-14B/` | HiF8-quantized 14B weights (NPU inference) |
| `config.json` | OSP-Next model architecture metadata |

> ℹ️ OSP-Next reuses **Wan 2.1's T5 (UMT5-XXL) text encoder** and **WAN VAE**
> verbatim. We do **not** re-host them; see
> [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
> for the upstream weights.

---

## 🚀 Quick Start

OSP-Next ships as a **standalone training & inference repository** rather than a pip-installable model class: the sparse attention, SSP communication, and HiF8 kernels all live inside the project. The typical flow:

```bash
# 1. Clone the code repo and install it into a fresh environment
git clone https://github.com/PKU-YuanGroup/OSP-Next.git
cd OSP-Next
conda create -n ospnext python=3.10 -y && conda activate ospnext
pip install -e .

# 2a. Download OSP-Next weights from this Hugging Face repo
huggingface-cli download yunyangge/OSP-Next --local-dir ./checkpoints/osp_next_14b

# 2b. Download Wan 2.1's T5 text encoder and WAN VAE (the components we reuse).
#     All three targets are passed as --include patterns, because
#     huggingface-cli ignores --include when explicit filenames are also given.
huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
    --include "models_t5_umt5-xxl-enc-bf16.pth" "Wan2.1_VAE.pth" "google/umt5-xxl/*" \
    --local-dir ./checkpoints/Wan2.1-T2V-14B

# 3. Point the inference config at the downloaded checkpoints
$EDITOR configs/infer/gpu/osp_14b.yaml

# 4. Run inference
bash scripts/infer/gpu/infer_osp_14b.sh
```

In the inference YAML you'll fill in:

```yaml
model_config:
  pretrained_model_dir_or_checkpoint: "./checkpoints/osp_next_14b"
vae_config:
  vae_path: "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth"
text_encoder_config:
  checkpoint_path: "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth"
  text_tokenizer_path: "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/"
```

> 🟣 **On Ascend NPU?** Follow the
> [NPU setup](https://github.com/PKU-YuanGroup/OSP-Next#-npu-ascend) in the
> code repo (CANN 8.5.0 + `pip install -e .[npu]` + source-build `decord`),
> then run `scripts/infer/npu/infer_osp_14b.sh` instead.

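
Before launching inference, it can help to confirm that every path entered in the YAML actually exists on disk. The following is a minimal sketch and not part of the OSP-Next codebase: `find_missing_paths` is a hypothetical helper, the dotted key names are only labels, and `EXPECTED_PATHS` simply mirrors the example values above, so adjust it if your layout differs.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the example YAML values above.
# The dotted keys are labels for readability, not a real config API.
EXPECTED_PATHS = {
    "model_config.pretrained_model_dir_or_checkpoint": "checkpoints/osp_next_14b",
    "vae_config.vae_path": "checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth",
    "text_encoder_config.checkpoint_path": "checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
    "text_encoder_config.text_tokenizer_path": "checkpoints/Wan2.1-T2V-14B/google/umt5-xxl",
}


def find_missing_paths(expected, root="."):
    """Return the labels whose path does not exist under `root`."""
    base = Path(root)
    return [key for key, rel in expected.items() if not (base / rel).exists()]


if __name__ == "__main__":
    missing = find_missing_paths(EXPECTED_PATHS)
    if missing:
        print("Missing checkpoints for:", ", ".join(missing))
    else:
        print("All checkpoint paths found.")
```

Run it from the root of the cloned code repo; an empty result means all four entries resolve to an existing file or directory.
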

### 🐍 Programmatic loading

The diffusion model itself can also be loaded as a regular `OSPNextModel`:

```python
import torch

from ospnext.modules.osp_next import OSPNextModel

# Load the BF16 diffusion weights downloaded in the Quick Start.
model = OSPNextModel.from_pretrained("./checkpoints/osp_next_14b")
# `Module.to` expects a torch.dtype object, not a string.
model = model.to("cuda", dtype=torch.bfloat16).eval()
```

For the full text-to-video pipeline (T5 encoding → diffusion → VAE decoding), use `ospnext.pipelines.t2v_pipeline.T2VPipeline`; see [`infer/infer_osp.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/infer/infer_osp.py) for a complete example.

---

## 🏋️ Training & RL Post-Training

OSP-Next supports both **SFT** (`train/train_osp.py`) and **Mix-GRPO + LoRA RL post-training** (`train/train_osp_RL.py`) using the same FSDP2 + Sparse-SP backbone.

Highlights of the RL pipeline:

- **LoRA-only updates** on the frozen base model.
- **Mix-GRPO**: mixed ODE/SDE flow-matching RL with a configurable SDE step count, KL penalty and group advantage clipping.
- **VideoAlign** as the multi-axis reward model.
- RL checkpoints **only store the LoRA adapter** (no base model duplication), plus an **EMA-LoRA** companion for inference. Merge them back into the base with [`merge_lora_weights.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/merge_lora_weights.py) before running inference.

Full training / RL recipes, config reference, sequence-parallel sizing tables and troubleshooting tips are in the [code repository README](https://github.com/PKU-YuanGroup/OSP-Next#%EF%B8%8F-training-pipeline).

---

## 🧪 Intended Use & Limitations

**Intended uses**

- Research on **sparse** video diffusion: Skiparse-2D, Sparse Sequence Parallelism, joint sparse + 8-bit quantization, sparse-model RL.
- Text-to-video generation for non-commercial creative / educational use.

**Out of scope**

- Generating photo-realistic or identifiable likenesses of real individuals.
- Generating illegal, deceptive, harmful, sexually explicit, or copyright-infringing content.


**Known limitations**

- 14B model: single-GPU inference needs an 80 GB-class accelerator (H100 / H200 / A100 80GB / Ascend 910B / 950PR). Multi-GPU is supported and recommended via the included SSP / FSDP2 launch scripts.
- HiF8 weights are tuned for the Ascend NPU custom kernel; the BF16 model is the recommended starting point on NVIDIA GPUs.
- Multi-NPU 950PR numbers are not yet reported: current 950PR results in the paper / model card are single-NPU only.

---

## 📚 Training Data

OSP-Next is trained on the same large-scale text-video corpus used by the **Open-Sora-Plan** lineage, plus internal data filtering / re-captioning pipelines (see the paper for details). No personally identifiable information is intentionally included, and any sensitive content is filtered prior to training to the best of our ability.

The RL post-training uses a **text-only prompt corpus** scored by [VideoAlign](https://github.com/KwaiVGI/VideoAlign).

---

## 📝 Citation

If you find OSP-Next useful in your research, please cite:

```bibtex
@misc{ge2026ospnextefficienthighqualityvideo,
  title={OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning},
  author={Yunyang Ge and Xianyi He and Zezhong Zhang and Bin Lin and Bin Zhu and Xinhua Cheng and Li Yuan},
  year={2026},
  eprint={2605.28691},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.28691},
}
```

This work builds on:

```bibtex
@article{wan2025wan,
  title={Wan: Open and advanced large-scale video generative models},
  author={Wan, Team and Wang, Ang and Ai, Baole and Wen, Bin and Mao, Chaojie and Xie, Chen-Wei and Chen, Di and Yu, Feiwu and Zhao, Haiming and Yang, Jianxiao and others},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}

@article{lin2024open,
  title={Open-sora plan: Open-source large video generation model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}

@article{li2025mixgrpo,
  title={Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde},
  author={Li, Junzhe and Cui, Yutao and Huang, Tao and Ma, Yinping and Fan, Chun and Cheng, Yiming and Yang, Miles and Zhong, Zhao and Bo, Liefeng},
  journal={arXiv preprint arXiv:2507.21802},
  year={2025}
}
```

---

## 🙏 Acknowledgements

- 🌊 [**Wan**](https://github.com/Wan-Video/Wan2.1): WAN-VAE and T5 backbone.
- 🎬 [**Open-Sora-Plan**](https://github.com/PKU-YuanGroup/Open-Sora-Plan): the ecosystem this project extends.
- 🏅 [**VideoAlign**](https://github.com/KwaiVGI/VideoAlign): reward model for RL post-training.
- 🎯 [**Mix-GRPO**](https://arxiv.org/abs/2507.21802): mixed ODE-SDE flow-matching RL.

---

## 📄 License

Released under **Apache 2.0**; see [`LICENSE.txt`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/LICENSE.txt) in the code repository. The reused Wan 2.1 T5 / VAE weights are governed by **their own licenses** at [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B).