# Multi-GPU Training with Accelerate

This guide shows you how to train policies on multiple GPUs using [Hugging Face Accelerate](https://huggingface.co/docs/accelerate).

## Installation

First, ensure you have accelerate installed:

```bash
pip install accelerate
```
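
You can quickly confirm that accelerate is installed and that PyTorch can see your GPUs (both commands below are standard; nothing here is LeRobot-specific):

```bash
# Print accelerate's version and its view of your environment
accelerate env

# Count the GPUs visible to PyTorch
python -c "import torch; print(torch.cuda.device_count())"
```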

## Launch Training

You can launch training in two ways:

### Option 1: Direct launch with parameters

You can specify all parameters directly in the command without running `accelerate config`:

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy=act \
  --batch_size=8 \
  --steps=100000
```
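
If you prefer not to hard-code the GPU count, one option is to derive `--num_processes` at launch time. A minimal sketch, assuming an NVIDIA machine where `nvidia-smi` is available:

```bash
# Launch one process per visible GPU (assumes nvidia-smi is installed)
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)

accelerate launch \
  --multi_gpu \
  --num_processes="$NUM_GPUS" \
  $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy=act
```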

**Key accelerate parameters:**

- `--multi_gpu`: enables distributed training across multiple GPUs
- `--num_processes`: the number of processes to launch (typically one per GPU)
- `--mixed_precision`: set to `fp16` or `bf16` to train in mixed precision
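
These flags can be combined. For example, a sketch that trains on two specific GPUs of a larger machine in bf16 (the `--gpu_ids` flag and the GPU indices here are illustrative; the dataset and policy are the same ones used throughout this guide):

```bash
# Use only GPUs 0 and 1, with bf16 mixed precision
accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  --gpu_ids=0,1 \
  --mixed_precision=bf16 \
  $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy=act
```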

### Option 2: Configure accelerate first

If you prefer to save your configuration, you can optionally configure accelerate for your hardware setup by running:

```bash
accelerate config
```

This interactive setup asks you questions about your training environment (number of GPUs, mixed precision settings, etc.) and saves the configuration for future use. For a simple multi-GPU setup on a single machine, you can use these recommended settings:

- Compute environment: This machine
- Number of machines: 1
- Number of processes: (number of GPUs you want to use)
- GPU ids to use: (leave empty to use all)
- Mixed precision: fp16 or bf16 (recommended for faster training)
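
Answering the prompts writes a config file, by default under `~/.cache/huggingface/accelerate/default_config.yaml`, which `accelerate launch` picks up automatically. As a rough illustration of what the answers above produce (exact contents depend on your setup):

```bash
cat ~/.cache/huggingface/accelerate/default_config.yaml
# compute_environment: LOCAL_MACHINE
# distributed_type: MULTI_GPU
# num_processes: 2
# gpu_ids: all
# mixed_precision: fp16
# ...
```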

Then launch training with:

```bash
accelerate launch $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy=act \
  --batch_size=8 \
  --steps=100000
```
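
If you maintain several saved setups, you can select one explicitly with accelerate's `--config_file` flag (the file path below is just an example):

```bash
# Point accelerate at a specific saved config instead of the default one
accelerate launch --config_file ~/configs/accelerate_2gpu.yaml \
  $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy=act
```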

## How It Works

When you launch training with accelerate:

1. **Automatic detection**: LeRobot automatically detects if it's running under accelerate
2. **Data distribution**: Your batch is automatically split across GPUs
3. **Gradient synchronization**: Gradients are synchronized across GPUs during backpropagation
4. **Single-process logging**: Only the main process logs to wandb and saves checkpoints
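
Under the hood, accelerate spawns one process per GPU and identifies each one through the standard torch distributed environment variables. A quick way to see this without training anything, using accelerate's `--no_python` flag to run an arbitrary command (a sketch; output order may vary):

```bash
# Each of the two processes prints its own rank; the world size is shared
accelerate launch --multi_gpu --num_processes=2 --no_python \
  bash -c 'echo "rank=$RANK local_rank=$LOCAL_RANK world_size=$WORLD_SIZE"'
```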

## Learning Rate and Training Steps Scaling

**Important:** LeRobot does **NOT** automatically scale learning rates or training steps based on the number of GPUs. This gives you full control over your training hyperparameters.

### Why No Automatic Scaling?

Many distributed training frameworks automatically scale the learning rate by the number of GPUs (e.g., `lr = base_lr × num_gpus`). However, LeRobot keeps the learning rate exactly as you specify it.

### When and How to Scale

If you want to scale your hyperparameters when using multiple GPUs, you should do it manually:

**Learning Rate Scaling:**

```bash
# Example: 2 GPUs with linear LR scaling
# Base LR: 1e-4, with 2 GPUs -> 2e-4
accelerate launch --num_processes=2 $(which lerobot-train) \
  --optimizer.lr=2e-4 \
  --dataset.repo_id=lerobot/pusht \
  --policy=act
```

**Training Steps Scaling:**

Since the effective batch size increases with multiple GPUs (`batch_size × num_gpus`), you may want to reduce the number of training steps proportionally:

```bash
# Example: 2 GPUs with a 2x larger effective batch size
# Original: batch_size=8, steps=100000
# With 2 GPUs: batch_size=8 per GPU (16 effective), steps=50000
accelerate launch --num_processes=2 $(which lerobot-train) \
  --batch_size=8 \
  --steps=50000 \
  --dataset.repo_id=lerobot/pusht \
  --policy=act
```

## Notes

- The `--policy.use_amp` flag in `lerobot-train` is only used when **not** running with accelerate. When using accelerate, mixed precision is controlled by accelerate's configuration (see the sketch after this list).
- Training logs, checkpoints, and hub uploads are only done by the main process to avoid conflicts. Non-main processes have console logging disabled to prevent duplicate output.
- The effective batch size is `batch_size × num_gpus`. For example, if you use 4 GPUs with `--batch_size=8`, the effective batch size is 32.
- Learning rate scheduling is handled correctly across multiple processes: LeRobot sets `step_scheduler_with_optimizer=False` to prevent accelerate from adjusting scheduler steps based on the number of processes.
- When saving or pushing models, LeRobot automatically unwraps the model from accelerate's distributed wrapper to ensure compatibility.
- WandB integration automatically initializes only on the main process, preventing multiple runs from being created.
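
For instance, the first note means mixed precision is requested in a different place depending on how you launch. A sketch contrasting the two (same dataset and policy as above):

```bash
# Single GPU, no accelerate: AMP is controlled by lerobot-train itself
lerobot-train \
  --dataset.repo_id=lerobot/pusht \
  --policy=act \
  --policy.use_amp=true

# Multi-GPU with accelerate: --policy.use_amp is not used;
# request mixed precision through accelerate instead
accelerate launch --multi_gpu --num_processes=2 --mixed_precision=bf16 \
  $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy=act
```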

For more advanced configurations and troubleshooting, see the [Accelerate documentation](https://huggingface.co/docs/accelerate). If you want to learn more about how to train on a large number of GPUs, check out this awesome guide: the [Ultrascale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook).