# Expert parallelism

[Expert parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=expert_parallelism) is a parallelism strategy for [mixture-of-experts (MoE) models](https://huggingface.co/blog/moe). The experts' feedforward layers are distributed across hardware accelerators, so each device holds only a subset of the experts. A router dispatches each token to the appropriate experts and gathers the results. Because each token activates only a few experts, this approach scales models to far larger parameter counts without a proportional increase in compute cost.
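To make the routing step concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It is an illustration only: the `top_k_route` helper, the example logits, and the choice to apply a softmax over the selected logits are assumptions, not the routing code used by any particular model.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Hypothetical helper for illustration; real routers operate on batched tensors.
    """
    # Pick the indices of the k largest router logits.
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# One token, 8 experts, 2 active experts per token (illustrative numbers).
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
routes = top_k_route(logits, k=2)
selected = [expert for expert, _ in routes]
weights = [weight for _, weight in routes]
```

Only the experts in `selected` run for this token, which is why the compute per token stays roughly constant as the total number of experts grows.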
## DistributedConfig

> [!WARNING]
> The `DistributedConfig` API is experimental and its usage may change in the future.

Enable expert parallelism by passing a `DistributedConfig` with `enable_expert_parallel=True` to `from_pretrained`.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed.configuration_utils import DistributedConfig

# Turn on expert parallelism for MoE layers.
distributed_config = DistributedConfig(enable_expert_parallel=True)

# Each process loads only the expert weights assigned to its device.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    dtype="auto",
    distributed_config=distributed_config,
)
```
> [!TIP]
> Expert parallelism automatically enables [tensor parallelism](./perf_infer_gpu_multi) for attention layers.

Setting `enable_expert_parallel=True` switches the model to the `ep_plan` (expert parallel plan) defined in each MoE model's config file. The `GroupedGemmParallel` class splits the expert weights so each device loads only its local experts. The `ep_router` routes tokens to experts, and an all-reduce operation combines their outputs.
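The sharding and combine steps described above can be sketched in plain Python. This is a simulation under stated assumptions, not the `GroupedGemmParallel` or `ep_router` implementation: the contiguous expert-to-rank layout, the `local_expert_ids` and `rank_partial_output` helpers, and the toy `expert_fn` are all hypothetical.

```python
def local_expert_ids(num_experts, world_size, rank):
    """Expert indices owned by `rank`, assuming a contiguous, even split."""
    if num_experts % world_size != 0:
        raise ValueError("world_size must evenly divide num_experts")
    per_rank = num_experts // world_size
    start = rank * per_rank
    return list(range(start, start + per_rank))


def expert_fn(expert_id, x):
    # Stand-in for an expert's feedforward layer (hypothetical toy function).
    return (expert_id + 1) * x


def rank_partial_output(x, routes, local_ids):
    # A rank evaluates only the experts it owns; other routes contribute nothing.
    return sum(w * expert_fn(e, x) for e, w in routes if e in local_ids)


num_experts, world_size = 8, 4
routes = [(1, 0.75), (7, 0.25)]  # (expert index, routing weight) for one token
x = 2.0

partials = [
    rank_partial_output(x, routes, local_expert_ids(num_experts, world_size, r))
    for r in range(world_size)
]
# Summing the per-rank partial results simulates the all-reduce.
combined = sum(partials)
```

In this sketch, ranks that own none of the selected experts contribute zero, so the sum across ranks equals the weighted sum over the selected experts.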
Launch your inference script with [torchrun](https://pytorch.org/docs/stable/elastic/run.html) and specify how many devices to use. The number of devices must evenly divide the total number of experts.

```zsh
torchrun --nproc-per-node 8 your_script.py
```
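Because the device count must evenly divide the expert count, it can help to fail early with a clear message. The sketch below reads the `RANK` and `WORLD_SIZE` environment variables that torchrun sets for each process. The `check_expert_parallel_setup` helper and the expert count of 128 are assumptions for illustration; read the real expert count from your model's config.

```python
import os


def check_expert_parallel_setup(num_experts, env=os.environ):
    """Validate the launch configuration and return (rank, world_size, experts_per_device).

    Hypothetical helper; not part of transformers or torch.
    """
    # torchrun exports these for every process it spawns; default to a single process.
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    if num_experts % world_size != 0:
        raise ValueError(
            f"{world_size} devices do not evenly divide {num_experts} experts"
        )
    return rank, world_size, num_experts // world_size


# Simulate the environment of process 3 in an 8-process launch.
rank, world_size, experts_per_device = check_expert_parallel_setup(
    128, env={"RANK": "3", "WORLD_SIZE": "8"}
)
```

Calling the helper at the top of `your_script.py`, before loading the model, surfaces a bad `--nproc-per-node` value before any weights are read.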