---
license: apache-2.0
tags:
- video-generation
- diffusion
- transformer
- megatron-lm
- megatron-checkpoints
language:
- en
---

# MUG-V 10B Training Checkpoints

Pre-trained Megatron-format checkpoints for the [MUG-V 10B](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) video generation model.

## Available Checkpoints

### MUG-V-10B-torch_dist (Recommended)

**Torch Distributed Checkpoint** - Flexible parallelism support

- **Format**: Torch Distributed (`.distcp`)
- **Parallelism**: Can be loaded with **any TP/PP configuration**
- **Use Case**: Production training, flexible distributed setups

```bash
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"
```

### MUG-V-10B-TP4-legacy

**Torch Format (Legacy)** - Fixed TP=4

- **Format**: Torch format (`mp_rank_XX/model_optim_rng.pt`)
- **Parallelism**: Must be loaded with **TP=4**
- **Use Case**: Fixed TP=4 setups, or conversion to Torch Distributed

```bash
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-TP4-legacy/*"
```

## Quick Start

### Option 1: Direct Training

Use the Torch Distributed checkpoint directly for training:

```bash
# Download checkpoint
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"

# Download sample data
huggingface-cli download MUG-V/MUG-V-Training-Samples --repo-type dataset --local-dir ./sample_dataset

# Set environment variables
export CHECKPOINT_DIR="./checkpoints/MUG-V-10B-torch_dist/torch_dist"
export MODEL_TYPE="mugdit_10b"
export DATA_TRAIN="./sample_dataset/train.csv"

# Start training (8 GPUs)
bash examples/mugv/pretrain_slurm.sh
```

### Option 2: Convert to HuggingFace Format

Convert a Megatron checkpoint to HuggingFace format for inference:

```bash
python -m examples.mugv.convertor.mugdit_mcore2hf \
  --dcp-dir ./checkpoints/MUG-V-10B-torch_dist/torch_dist/iter_0000000 \
  --output ./mugdit_10b_hf.pt \
  --model-size 10B
```
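After conversion, you can sanity-check the output with plain PyTorch. The sketch below is illustrative, not part of the official workflow: it only assumes the converter writes a single torch-serializable file (the comparison table below describes the HuggingFace format as a single `.pt` file); the actual key layout of the checkpoint is not documented here.

```python
# Minimal sketch: inspect the converted HuggingFace-format checkpoint.
# Assumption: ./mugdit_10b_hf.pt holds one torch-serializable object,
# e.g. a flat state dict. Key names/layout are not specified by this card.
import torch

ckpt = torch.load("./mugdit_10b_hf.pt", map_location="cpu")

# If the file is a plain state dict, print a few parameter names and shapes.
if isinstance(ckpt, dict):
    for name, value in list(ckpt.items())[:10]:
        desc = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
        print(f"{name}: {desc}")
```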
## Checkpoint Formats Comparison

| Format | Parallelism | File Structure | Training | Conversion |
|--------|-------------|----------------|----------|------------|
| **Torch Distributed** | Flexible TP/PP | `*.distcp` files | ✅ Recommended | ✅ To HF |
| **Torch (Legacy)** | Fixed TP=4 | `mp_rank_XX/` dirs | ⚠️ TP=4 only | ✅ To Torch Dist / HF |
| **HuggingFace** | None (inference) | Single `.pt` file | ❌ Not for training | - |

## Model Architecture

- **Parameters**: ~10 billion
- **Architecture**: Diffusion Transformer (DiT)
- **Hidden Size**: 3456
- **Attention Heads**: 48
- **Layers**: 56
- **Compression**: VideoVAE 8×8×8

## Related Resources

- **Training Code**: [MUG-V-Megatron-LM-Training](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training)
- **Inference Code**: [MUG-V](https://github.com/Shopee-MUG/MUG-V)
- **Inference Weights**: [MUG-V-inference](https://huggingface.co/MUG-V/MUG-V-inference)
- **Sample Dataset**: [MUG-V-Training-Samples](https://huggingface.co/datasets/MUG-V/MUG-V-Training-Samples)

## Documentation

- **Training Guide**: [examples/mugv/README.md](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training/blob/main/examples/mugv/README.md)
- **Checkpoint Conversion**: [Conversion Guide](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training/blob/main/examples/mugv/README.md#checkpoint-conversion)

## Citation

```bibtex
@article{zhang2025mugv10b,
  title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
  author={Zhang, Yongshun and Fan, Zhongyi and Zhang, Yonghang and Li, Zhangzikang and Chen, Weifeng and Feng, Zhongwei and Wang, Chaoyue and Hou, Peng and Zeng, Anxiang},
  journal={arXiv preprint},
  year={2025}
}
```

## License

Apache License 2.0

---

**Developed by Shopee Multimodal Understanding and Generation (MUG) Team**