	---
	license: apache-2.0
	library_name: pytorch
	pipeline_tag: text-to-video
	language:
	- en
	- zh
	tags:
	- text-to-video
	- video-generation
	- diffusion
	- flow-matching
	- sparse-attention
	- skiparse
	- sequence-parallel
	- mix-grpo
	- lora
	- hif8
	- quantization
	- npu
	- ascend
	- open-sora-plan
	- ospnext
	base_model:
	- Wan-AI/Wan2.1-T2V-14B
	---

	<div align="center">

	<img src="https://raw.githubusercontent.com/PKU-YuanGroup/OSP-Next/main/assets/logo.png" alt="OSP-Next" width="220">

	Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

	Open-Sora Plan · Next Generation

	A scalable sparse text-to-video diffusion model, introducing Skiparse-2D Attention,
	Sparse Sequence Parallelism (SSP), HiF8 quantization, and
	Mix-GRPO + LoRA RL post-training.

	</div>

	<h5 align="center">

	[![arXiv](https://img.shields.io/badge/Arxiv-OSP--Next-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2605.28691)
	[![GitHub](https://img.shields.io/badge/GitHub-OSP--Next-181717.svg?logo=github)](https://github.com/PKU-YuanGroup/OSP-Next)

	</h5>

	---

	## 🧠 Model Summary

	OSP-Next is a 14B-parameter text-to-video diffusion model built on top of
	the Wan 2.1 text encoder / VAE backbone, with four tightly co-designed
	contributions:

	| Contribution | What it is | Why it matters |
	|---|---|---|
	| 🧩 Skiparse-2D Attention | Fixed-rule 2D sparse attention applied along H/W. | Approaches 3D full attention in quality, natively FlashAttention compatible. |
	| 🔗 Sparse Sequence Parallelism (SSP) | A parallel strategy natively co-designed with Skiparse-2D. | −75% inter-rank comm, per-block comm rounds 4 → 1. |
	| 🪶 HiF8 Quantization (NPU only) | Dynamic-precision 8-bit (exponent / mantissa allocation). | First joint 8-bit + sparse fine-tuning: up to 2.27× speedup on a single Ascend 950PR with only −0.4 pt on VBench. |
	| 🎯 Mix-GRPO + LoRA RL | RL post-training on top of the sparse model. | First RL pipeline for sparse video diffusion. |
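
The table describes Skiparse-2D only as a fixed-rule sparse pattern along H/W. As a rough illustration of what such a rule can look like, the sketch below assigns each spatial token to an interleaved group by its row and column index modulo a stride, and lets attention run only inside a group. The grouping rule, the `stride` parameter, and both function names are assumptions made for illustration; the actual Skiparse-2D rule is defined in the paper and the code repository.

```python
from collections import defaultdict


def skip_groups_2d(height, width, stride):
    """Partition an H x W token grid into stride*stride interleaved groups.

    Hypothetical rule: token (h, w) joins group (h % stride, w % stride).
    """
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by stride")
    groups = defaultdict(list)
    for h in range(height):
        for w in range(width):
            groups[(h % stride, w % stride)].append((h, w))
    return dict(groups)


def attention_pair_ratio(height, width, stride):
    """Fraction of query-key pairs kept relative to dense attention on the grid."""
    groups = skip_groups_2d(height, width, stride)
    dense_pairs = (height * width) ** 2
    sparse_pairs = sum(len(tokens) ** 2 for tokens in groups.values())
    return sparse_pairs / dense_pairs


groups = skip_groups_2d(4, 4, 2)
print(groups[(0, 0)])                 # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(attention_pair_ratio(4, 4, 2))  # 0.25
```

Under this assumed rule, every group is itself a short dense sequence, so a standard dense attention kernel can process each group unchanged.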

	### 📊 End-to-end speed-ups (vs. Wan 2.1 baseline, 5 s · 81-frame video)

	| Hardware | 720P (padded) | 768P (native) |
	|---|---|---|
	| ⚡ NVIDIA H200 (BF16 · FA3 · `torch.compile`) | 1.53× / 1.42× (1× / 8× GPU) | 1.64× / 1.52× |
	| 🟣 Ascend 950PR (BF16 · SDPA) | 1.27× (1× NPU) | 1.76× |
	| 🪶 Ascend 950PR (HiF8 · 8-bit · SDPA) | 1.69× | 2.27× |

	> 🏆 OSP-Next reaches VBench total = 83.73% (Wan 2.1 baseline 83.69%);
	> OSP-Next-HiF8 keeps 83.29% with only a 0.4 pt drop. Full benchmark tables,
	> ablations and qualitative comparisons live in the
	> [paper](https://arxiv.org/abs/2605.28691).
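
A few derived figures follow directly from the numbers above. The short script below computes how much HiF8 adds on top of BF16 OSP-Next on the Ascend 950PR, and the VBench differences. It assumes both Ascend rows are measured against the same Wan 2.1 baseline run, so that their ratio is meaningful; the variable and function names are illustrative only.

```python
# Speed-ups over the Wan 2.1 baseline on a single Ascend 950PR (from the table above).
ASCEND_950PR = {
    "BF16": {"720P": 1.27, "768P": 1.76},
    "HiF8": {"720P": 1.69, "768P": 2.27},
}

# VBench totals in percent (from the note above).
VBENCH = {"Wan 2.1": 83.69, "OSP-Next": 83.73, "OSP-Next-HiF8": 83.29}


def hif8_gain_over_bf16(resolution):
    """Ratio of the HiF8 speed-up to the BF16 speed-up at one resolution."""
    return ASCEND_950PR["HiF8"][resolution] / ASCEND_950PR["BF16"][resolution]


def vbench_drop(reference, candidate):
    """VBench points lost when moving from `reference` to `candidate`."""
    return VBENCH[reference] - VBENCH[candidate]


for res in ("720P", "768P"):
    print(f"{res}: HiF8 vs BF16 OSP-Next = {hif8_gain_over_bf16(res):.2f}x")
print(f"HiF8 VBench drop: {vbench_drop('OSP-Next', 'OSP-Next-HiF8'):.2f} pt")
```

Under that assumption, HiF8 contributes roughly a further 1.3× over BF16 OSP-Next at both resolutions, and the quoted 0.4 pt drop is 0.44 pt before rounding.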

	---

	## 📦 What's in this repository

	| File / folder | Description |
	|---|---|
	| `OSP-Next-14B/` | OSP-Next 14B BF16 diffusion weights (FSDP `model.pt` + config) |
	| `OSP-Next-HiF8-14B/` | HiF8-quantized 14B weights (NPU inference) |
	| `config.json` | OSP-Next model architecture metadata |

	> ℹ️ OSP-Next reuses Wan 2.1's T5 (UMT5-XXL) text encoder and WAN VAE
	> verbatim. We do not re-host them — see
	> [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)
	> for the upstream weights.

	---

	## 🚀 Quick Start

	OSP-Next ships as a standalone training & inference repository rather than
	a pip-installable model class — the sparse attention / SSP comm / HiF8 kernels
	all live inside the project. The typical flow:

	```bash
	# 1. Clone the code repo
	git clone https://github.com/PKU-YuanGroup/OSP-Next.git
	cd OSP-Next
	conda create -n ospnext python=3.10 -y && conda activate ospnext
	pip install -e .

	# 2a. Download OSP-Next weights from this Hugging Face repo
	huggingface-cli download yunyangge/OSP-Next --local-dir ./checkpoints/osp_next_14b

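	# 2b. Download Wan 2.1's T5 text encoder, tokenizer, and VAE (the components we reuse).
	#     All three are passed as --include patterns: when explicit filenames are given,
	#     huggingface-cli ignores --include, so the tokenizer folder would be skipped.
	huggingface-cli download Wan-AI/Wan2.1-T2V-14B \
	  --include "models_t5_umt5-xxl-enc-bf16.pth" "Wan2.1_VAE.pth" "google/umt5-xxl/*" \
	  --local-dir ./checkpoints/Wan2.1-T2V-14B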

	# 3. Point the inference config at the downloaded weights, VAE, text encoder, and tokenizer
	$EDITOR configs/infer/gpu/osp_14b.yaml

	# 4. Run inference
	bash scripts/infer/gpu/infer_osp_14b.sh
	```
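
If you prefer to script the downloads, steps 2a and 2b can be sketched in Python with `huggingface_hub.snapshot_download`, which accepts `repo_id`, `local_dir`, and `allow_patterns`. The helper names and the `root` argument are hypothetical; the repository IDs, file names, and target folders are the ones used in the shell commands above.

```python
def build_download_plan(root="./checkpoints"):
    """Describe the two downloads as keyword arguments for snapshot_download."""
    return [
        {
            # Step 2a: the OSP-Next weights (whole repository).
            "repo_id": "yunyangge/OSP-Next",
            "local_dir": f"{root}/osp_next_14b",
            "allow_patterns": None,
        },
        {
            # Step 2b: only the reused Wan 2.1 components.
            "repo_id": "Wan-AI/Wan2.1-T2V-14B",
            "local_dir": f"{root}/Wan2.1-T2V-14B",
            "allow_patterns": [
                "models_t5_umt5-xxl-enc-bf16.pth",
                "Wan2.1_VAE.pth",
                "google/umt5-xxl/*",
            ],
        },
    ]


def download_all(plan):
    """Run every job in the plan; needs `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download

    for job in plan:
        snapshot_download(**job)


for job in build_download_plan():
    print(job["repo_id"], "->", job["local_dir"])
# download_all(build_download_plan())  # uncomment to start the (large) downloads
```

Keeping the plan separate from the download call makes it easy to inspect or adjust target folders before any data is transferred.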

	In the inference YAML you'll fill in:

	```yaml
	model_config:
	  pretrained_model_dir_or_checkpoint: "./checkpoints/osp_next_14b"
	vae_config:
	  vae_path: "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth"
	text_encoder_config:
	  checkpoint_path: "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth"
	  text_tokenizer_path: "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/"
	```
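
A wrong path usually only surfaces after the launcher has started, so a quick pre-flight check can save time. The sketch below takes the config as a plain dictionary (for example, the result of loading the YAML with PyYAML's `yaml.safe_load`) and reports entries that are unset or missing on disk. It assumes each key sits under the `*_config` section listed directly above it; `REQUIRED_PATHS` and `missing_paths` are hypothetical names, not part of the OSP-Next code.

```python
from pathlib import Path

# (section, key) pairs taken from the YAML above; the nesting is an assumption.
REQUIRED_PATHS = [
    ("model_config", "pretrained_model_dir_or_checkpoint"),
    ("vae_config", "vae_path"),
    ("text_encoder_config", "checkpoint_path"),
    ("text_encoder_config", "text_tokenizer_path"),
]


def missing_paths(config):
    """Return 'section.key -> value' for entries that are unset or not on disk."""
    problems = []
    for section, key in REQUIRED_PATHS:
        value = config.get(section, {}).get(key)
        if not value or not Path(value).exists():
            problems.append(f"{section}.{key} -> {value}")
    return problems


config = {
    "model_config": {"pretrained_model_dir_or_checkpoint": "./checkpoints/osp_next_14b"},
    "vae_config": {"vae_path": "./checkpoints/Wan2.1-T2V-14B/Wan2.1_VAE.pth"},
    "text_encoder_config": {
        "checkpoint_path": "./checkpoints/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
        "text_tokenizer_path": "./checkpoints/Wan2.1-T2V-14B/google/umt5-xxl/",
    },
}
for problem in missing_paths(config):
    print("missing:", problem)
```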

	> 🟣 On Ascend NPU? Follow the
	> [NPU setup](https://github.com/PKU-YuanGroup/OSP-Next#-npu-ascend) in the
	> code repo (CANN 8.5.0 + `pip install -e .[npu]` + source-build `decord`),
	> then run `scripts/infer/npu/infer_osp_14b.sh` instead.

	### 🐍 Programmatic loading

	The diffusion model itself can also be loaded as a regular `OSPNextModel`:

	```python
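	import torch

	from ospnext.modules.osp_next import OSPNextModel

	# Load the diffusion transformer from the downloaded checkpoint directory.
	model = OSPNextModel.from_pretrained("./checkpoints/osp_next_14b")
	# `.to()` expects a torch.dtype object, not a string.
	model = model.to("cuda", dtype=torch.bfloat16).eval()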
	```

	For the full text-to-video pipeline (T5 encoding → diffusion → VAE decoding),
	use `ospnext.pipelines.t2v_pipeline.T2VPipeline` — see
	[`infer/infer_osp.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/infer/infer_osp.py)
	for a complete example.

	---

	## 🏋️ Training & RL Post-Training

	OSP-Next supports both SFT (`train/train_osp.py`) and
	Mix-GRPO + LoRA RL post-training (`train/train_osp_RL.py`) using the same
	FSDP2 + Sparse-SP backbone. Highlights of the RL pipeline:

	- LoRA-only updates on the frozen base model.
	- Mix-GRPO — mixed ODE/SDE flow-matching RL with a configurable SDE step
	count, KL penalty and group advantage clipping.
	- VideoAlign as the multi-axis reward model.
	- RL checkpoints only store the LoRA adapter (no base model duplication),
	plus an EMA-LoRA companion for inference. Merge them back into the base
	with [`merge_lora_weights.py`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/merge_lora_weights.py)
	before running inference.
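
For readers unfamiliar with what merging a LoRA adapter involves, the standard formulation adds a scaled low-rank product to each adapted weight: W' = W + (alpha / r) · B · A. The sketch below shows that arithmetic on tiny nested lists. It illustrates the general LoRA merge rule, not the contents of `merge_lora_weights.py`; the function names and the `alpha` / `rank` values are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (out x in)
A = [[1.0, 2.0]]              # LoRA down-projection, shape (r x in), r = 1
B = [[0.5], [0.0]]            # LoRA up-projection, shape (out x r)
print(merge_lora(W, A, B, alpha=2.0, rank=1))  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, because their effect is folded into the base weight.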

	Full training / RL recipes, config reference, sequence-parallel sizing tables
	and troubleshooting tips are in the
	[code repository README](https://github.com/PKU-YuanGroup/OSP-Next#%EF%B8%8F-training-pipeline).

	---

	## 🧪 Intended Use & Limitations

	Intended uses

	- Research on sparse video diffusion: Skiparse-2D, Sparse Sequence
	Parallelism, joint sparse + 8-bit quantization, sparse-model RL.
	- Text-to-video generation for non-commercial creative / educational use.

	Out of scope

	- Generating photo-realistic or identifiable likenesses of real individuals.
	- Generating illegal, deceptive, harmful, sexually explicit, or
	copyright-infringing content.

	Known limitations

	- 14B model: single-GPU inference needs an 80 GB-class accelerator
	(H100 / H200 / A100 80GB / Ascend 910B / 950PR). Multi-GPU is supported and
	recommended via the included SSP / FSDP2 launch scripts.
	- HiF8 weights are tuned for the Ascend NPU custom kernel; the BF16 model is
	the recommended starting point on NVIDIA GPUs.
	- Multi-NPU 950PR numbers are not yet reported — current 950PR results in the
	paper / model card are single-NPU only.
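
As a back-of-the-envelope check on the memory requirement, the weights alone can be sized from the parameter count. The sketch below treats 14B as exactly 14e9 parameters (a nominal figure) and counts only the diffusion model's weights; activations, the T5 text encoder, the VAE, and any quantization scale factors come on top and are not estimated here.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory for the raw weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 14e9  # assumption: "14B" taken as exactly 14e9 parameters

print(f"BF16 weights:  {weight_memory_gib(NOMINAL_PARAMS, 2):.1f} GiB")
print(f"8-bit weights: {weight_memory_gib(NOMINAL_PARAMS, 1):.1f} GiB")
```

The BF16 weights come to roughly 26 GiB, which leaves the remainder of an 80 GB device for the other components and for activations on long video sequences.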

	---

	## 📚 Training Data

	OSP-Next is trained on the same large-scale text-video corpus used by the
	Open-Sora-Plan lineage, processed with internal data filtering and
	re-captioning pipelines (see the paper for details). No personally
	identifiable information is intentionally included, and sensitive content is
	filtered prior to training to the best of our ability.

	The RL post-training uses a text-only prompt corpus scored by
	[VideoAlign](https://github.com/KwaiVGI/VideoAlign).
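
To make the role of the reward scores concrete: in GRPO-style training, several videos are sampled for the same prompt, each receives a reward, and the rewards are normalized within that group to obtain advantages. The sketch below shows the usual group normalization with an optional clip. It assumes one scalar reward per sample and interprets clipping as bounding the advantage values; how OSP-Next aggregates VideoAlign's multiple axes and applies clipping is not specified here, and the function name and `clip` value are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, clip=None, eps=1e-6):
    """Normalize rewards within one prompt's sample group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps), optionally clipped
    to the interval [-clip, clip].
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    if clip is not None:
        advantages = [max(-clip, min(clip, a)) for a in advantages]
    return advantages


# Three videos sampled for one prompt, with hypothetical scalar rewards.
print(group_advantages([1.0, 2.0, 3.0]))
print(group_advantages([1.0, 2.0, 3.0], clip=1.0))  # [-1.0, 0.0, 1.0]
```

Because the normalization is relative to the group, a prompt whose samples all score the same contributes zero advantage and therefore no learning signal.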

	---

	## 📝 Citation

	If you find OSP-Next useful in your research, please cite:

	```bibtex
	@misc{ge2026ospnextefficienthighqualityvideo,
	title={OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning},
	author={Yunyang Ge and Xianyi He and Zezhong Zhang and Bin Lin and Bin Zhu and Xinhua Cheng and Li Yuan},
	year={2026},
	eprint={2605.28691},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2605.28691},
	}
	```

	This work builds on:

	```bibtex
	@article{wan2025wan,
	title={Wan: Open and advanced large-scale video generative models},
	author={Wan, Team and Wang, Ang and Ai, Baole and Wen, Bin and Mao, Chaojie and Xie, Chen-Wei and Chen, Di and Yu, Feiwu and Zhao, Haiming and Yang, Jianxiao and others},
	journal={arXiv preprint arXiv:2503.20314},
	year={2025}
	}

	@article{lin2024open,
	title={Open-sora plan: Open-source large video generation model},
	author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
	journal={arXiv preprint arXiv:2412.00131},
	year={2024}
	}

	@article{li2025mixgrpo,
	title={Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde},
	author={Li, Junzhe and Cui, Yutao and Huang, Tao and Ma, Yinping and Fan, Chun and Cheng, Yiming and Yang, Miles and Zhong, Zhao and Bo, Liefeng},
	journal={arXiv preprint arXiv:2507.21802},
	year={2025}
	}

	```

	---

	## 🙏 Acknowledgements

	- 🌊 [Wan](https://github.com/Wan-Video/Wan2.1) — WAN-VAE and T5 backbone.
	- 🎬 [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan) — the ecosystem this project extends.
	- 🏅 [VideoAlign](https://github.com/KwaiVGI/VideoAlign) — reward model for RL post-training.
	- 🎯 [Mix-GRPO](https://arxiv.org/abs/2507.21802) — mixed ODE-SDE flow-matching RL.

	---

	## 📄 License

	Released under Apache 2.0 — see
	[`LICENSE.txt`](https://github.com/PKU-YuanGroup/OSP-Next/blob/main/LICENSE.txt)
	in the code repository.

	The reused Wan 2.1 T5 / VAE weights are governed by their own licenses at
	[`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B).