---
license: apache-2.0
tags:
- video-generation
- diffusion
- transformer
- megatron-lm
- megatron-checkpoints
language:
- en
---

# MUG-V 10B Training Checkpoints

Pre-trained Megatron-format checkpoints for the [MUG-V 10B](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) video generation model.

## Available Checkpoints

### MUG-V-10B-torch_dist (Recommended)

**Torch Distributed Checkpoint** - Flexible parallelism support

- **Format**: Torch Distributed (`.distcp`)
- **Parallelism**: Can be loaded with **any TP/PP configuration**
- **Use Case**: Production training, flexible distributed setup

```bash
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"
```
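
If you prefer to script the download, the same filtered fetch works through `huggingface_hub` (a minimal sketch using the standard `snapshot_download` API; the repo ID and patterns match the CLI command above):

```python
# Programmatic equivalent of the CLI download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="MUG-V/MUG-V-training",
    local_dir="./checkpoints",
    allow_patterns=["MUG-V-10B-torch_dist/*"],  # fetch only this checkpoint
)
```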

### MUG-V-10B-TP4-legacy

**Torch Format (Legacy)** - Fixed TP=4

- **Format**: Torch format (`mp_rank_XX/model_optim_rng.pt`)
- **Parallelism**: Must be loaded with **TP=4**
- **Use Case**: Fixed TP setup or conversion to Torch Distributed

```bash
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-TP4-legacy/*"
```
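
With TP=4 the checkpoint is split into four tensor-parallel shards, one per rank. The expected layout, sketched from the file pattern above (an iteration subdirectory such as `iter_xxxxxxx` may also sit in between):

```
MUG-V-10B-TP4-legacy/
├── mp_rank_00/model_optim_rng.pt
├── mp_rank_01/model_optim_rng.pt
├── mp_rank_02/model_optim_rng.pt
└── mp_rank_03/model_optim_rng.pt
```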

## Quick Start

### Option 1: Direct Training

Use the Torch Distributed checkpoint directly for training:

```bash
# Download checkpoint
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"

# Download sample data
huggingface-cli download MUG-V/MUG-V-Training-Samples --repo-type dataset --local-dir ./sample_dataset

# Set environment variables
export CHECKPOINT_DIR="./checkpoints/MUG-V-10B-torch_dist/torch_dist"
export MODEL_TYPE="mugdit_10b"
export DATA_TRAIN="./sample_dataset/train.csv"

# Start training (8 GPUs)
bash examples/mugv/pretrain_slurm.sh
```
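
Before launching, a quick pre-flight check can confirm the shards and sample data landed where the environment variables expect them (a sketch, assuming the download layout above):

```python
# Minimal pre-flight check for the paths exported above.
from pathlib import Path

ckpt = Path("./checkpoints/MUG-V-10B-torch_dist/torch_dist")
data = Path("./sample_dataset/train.csv")

shards = list(ckpt.rglob("*.distcp"))
print(f"{len(shards)} .distcp shard files under {ckpt}")
assert shards, "no .distcp shards found -- re-check the checkpoint download"
assert data.is_file(), "train.csv missing -- re-check the dataset download"
```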

### Option 2: Convert to HuggingFace Format

Convert a Megatron checkpoint to HuggingFace format for inference:

```bash
python -m examples.mugv.convertor.mugdit_mcore2hf \
    --dcp-dir ./checkpoints/MUG-V-10B-torch_dist/torch_dist/iter_0000000 \
    --output ./mugdit_10b_hf.pt \
    --model-size 10B
```
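
The conversion writes a single `.pt` file that loads with plain `torch.load`. A quick way to inspect the result (a sketch; the key names printed are whatever the converter emits, not guaranteed here):

```python
# Load the converted single-file checkpoint on CPU and peek at its contents.
import torch

state = torch.load("./mugdit_10b_hf.pt", map_location="cpu")
print(type(state))
if isinstance(state, dict):
    for k, v in list(state.items())[:5]:
        shape = tuple(v.shape) if hasattr(v, "shape") else type(v).__name__
        print(k, shape)
```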

## Checkpoint Format Comparison

| Format | Parallelism | File Structure | Training | Conversion |
|--------|-------------|----------------|----------|------------|
| **Torch Distributed** | Flexible TP/PP | `*.distcp` files | ✅ Recommended | ✅ To HF |
| **Torch (Legacy)** | Fixed TP=4 | `mp_rank_XX/` dirs | ⚠️ TP=4 only | ✅ To Torch Dist / HF |
| **HuggingFace** | None (inference) | Single `.pt` file | ❌ Not for training | - |
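
Since each format has a distinct on-disk layout, the three are easy to tell apart programmatically. A small helper, sketched purely from the file patterns in the table (not part of the training repo):

```python
# Distinguish checkpoint formats by on-disk layout, per the table above.
from pathlib import Path

def checkpoint_format(path: str) -> str:
    p = Path(path)
    if p.is_file() and p.suffix == ".pt":
        return "huggingface"         # converted single-file checkpoint
    if p.is_dir():
        if any(p.rglob("*.distcp")):
            return "torch_dist"      # Torch Distributed shards
        if any(p.glob("mp_rank_*")):
            return "torch_legacy"    # fixed-TP legacy shards
    return "unknown"

print(checkpoint_format("./checkpoints/MUG-V-10B-torch_dist/torch_dist"))
```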

## Model Architecture

- **Parameters**: ~10 billion (see the back-of-the-envelope check below)
- **Architecture**: Diffusion Transformer (DiT)
- **Hidden Size**: 3456
- **Attention Heads**: 48
- **Layers**: 56
- **Compression**: VideoVAE 8×8×8
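
The ~10B headline is consistent with these dimensions. A back-of-the-envelope check, assuming a generic DiT block with self-attention, cross-attention, and a 4× MLP (the exact MUG-V block layout, embeddings, and adaLN/modulation parameters are not specified here and are ignored):

```python
# Rough parameter count from the architecture numbers above.
h, layers = 3456, 56
attn  = 4 * h * h   # Q, K, V, and output projections
cross = 4 * h * h   # assumed cross-attention of the same shape
mlp   = 8 * h * h   # assumed 4x expansion: up + down projections
total = layers * (attn + cross + mlp)
print(f"~{total / 1e9:.1f}B parameters")  # ~10.7B, near the ~10B headline
```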

## Related Resources

- **Training Code**: [MUG-V-Megatron-LM-Training](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training)
- **Inference Code**: [MUG-V](https://github.com/Shopee-MUG/MUG-V)
- **Inference Weights**: [MUG-V-inference](https://huggingface.co/MUG-V/MUG-V-inference)
- **Sample Dataset**: [MUG-V-Training-Samples](https://huggingface.co/datasets/MUG-V/MUG-V-Training-Samples)

## Documentation

- **Training Guide**: [examples/mugv/README.md](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training/blob/main/examples/mugv/README.md)
- **Checkpoint Conversion**: [Conversion Guide](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training/blob/main/examples/mugv/README.md#checkpoint-conversion)

## Citation

```bibtex
@article{zhang2025mugv10b,
  title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
  author={Zhang, Yongshun and Fan, Zhongyi and Zhang, Yonghang and Li, Zhangzikang and Chen, Weifeng and Feng, Zhongwei and Wang, Chaoyue and Hou, Peng and Zeng, Anxiang},
  journal={arXiv preprint},
  year={2025}
}
```

## License

Apache License 2.0

---

**Developed by the Shopee Multimodal Understanding and Generation (MUG) Team**