---
title: "Multi-GPU"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-tools: true
execute:
  enabled: false
---

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

## Overview

Axolotl supports several methods for multi-GPU training:

- DeepSpeed (recommended)
- FSDP (Fully Sharded Data Parallel)
- FSDP + QLoRA

## DeepSpeed

DeepSpeed is the recommended approach for multi-GPU training due to its stability and performance. It provides various optimization levels through ZeRO stages.

### Configuration

Add the following to your YAML config:

```{.yaml}
deepspeed: deepspeed_configs/zero1.json
```

### Usage

```{.bash}
# DeepSpeed settings are read from the YAML config
axolotl train config.yml

# Or point at a DeepSpeed config on the command line
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```
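
To pin a run to specific devices, the standard `CUDA_VISIBLE_DEVICES` environment variable works as usual (this is generic CUDA behavior, not an Axolotl-specific flag); a sketch:

```{.bash}
# Train on GPUs 0 and 1 only
CUDA_VISIBLE_DEVICES=0,1 axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```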

### Available Configurations

We provide default configurations for:

- ZeRO Stage 1 (`zero1.json`)
- ZeRO Stage 2 (`zero2.json`)
- ZeRO Stage 3 (`zero3.json`)

Choose a stage based on your memory requirements and performance needs: higher stages shard more optimizer, gradient, and parameter state across GPUs, saving memory at the cost of extra communication.
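
For example, switching the earlier config to ZeRO Stage 3 is a one-line change (assuming the default config paths above):

```{.yaml}
# ZeRO Stage 3: shards optimizer state, gradients, and parameters
deepspeed: deepspeed_configs/zero3.json
```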

## FSDP

An example FSDP configuration:

```{.yaml}
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

### FSDP + QLoRA

For combining FSDP with QLoRA, see our [dedicated guide](fsdp_qlora.qmd).

## Liger Kernel Integration

Please see the [Liger integration docs](custom_integrations.qmd#liger) for more info.

## Troubleshooting

### NCCL Issues

For NCCL-related problems, see our [NCCL troubleshooting guide](nccl.qmd).

### Common Issues

::: {.panel-tabset}

#### Out of Memory

If you hit out-of-memory errors:

- Reduce `micro_batch_size`
- Reduce `eval_batch_size`
- Adjust `gradient_accumulation_steps`
- Consider using a higher ZeRO stage

#### Training Instability

If loss diverges or behaves erratically:

- Start with DeepSpeed ZeRO-2
- Monitor loss values
- Check learning rates

:::
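
The batch-size levers above interact: the effective global batch size is `micro_batch_size × gradient_accumulation_steps × num_gpus`, so halving `micro_batch_size` while doubling `gradient_accumulation_steps` keeps it constant. A quick sanity check with hypothetical values:

```{.bash}
# Hypothetical settings: per-GPU micro batch 2, 4 accumulation steps, 8 GPUs
micro_batch_size=2
gradient_accumulation_steps=4
num_gpus=8

# Effective global batch size
echo $((micro_batch_size * gradient_accumulation_steps * num_gpus))  # prints 64
```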

For more detailed troubleshooting, see our [debugging guide](debugging.qmd).