# Troubleshoot

This guide provides solutions to some issues you might encounter when using Accelerate. Not all errors are covered because Accelerate is an active library that is continuously evolving and there are many different use cases and distributed training setups. If the solutions described here don't help with your specific error, please take a look at the [Ask for help](#ask-for-help) section to learn where and how to get help.

## Logging

Logging can help you identify where an error is coming from. In a distributed setup with multiple processes, logging can be a challenge, but Accelerate provides the `accelerate.logging` utility to ensure logs are synchronized.

To troubleshoot an issue, use `accelerate.logging` instead of the standard Python [`logging`](https://docs.python.org/3/library/logging.html#module-logging) module. Set the verbosity level (`INFO`, `DEBUG`, `WARNING`, `ERROR`, `CRITICAL`) with the `log_level` parameter, and then you can either:

1. Export the `log_level` as the `ACCELERATE_LOG_LEVEL` environment variable (an example is shown after the code block below).
2. Pass the `log_level` directly to `get_logger`.

For example, to set `log_level="INFO"`:

```py
from accelerate.logging import get_logger

logger = get_logger(__name__, log_level="INFO")
```
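For the first option, you can export the environment variable when launching your script instead. A minimal sketch, following the same launch-command convention used elsewhere in this guide (`{my_script.py}` and the arguments are placeholders):

```bash
ACCELERATE_LOG_LEVEL="INFO" accelerate launch {my_script.py} --arg1 --arg2
```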
By default, the log is called on the main process only. To call it on all processes, pass `main_process_only=False`.
If a log should be called on all processes and in order, also pass `in_order=True`.

```py
from accelerate.logging import get_logger

logger = get_logger(__name__, log_level="DEBUG")
# log all processes
logger.debug("thing_to_log", main_process_only=False)
# log all processes in order
logger.debug("thing_to_log", main_process_only=False, in_order=True)
```
## Hanging code and timeout errors

There can be many reasons why your code is hanging. Let's take a look at how to solve some of the most common issues that can cause your code to hang.

### Mismatched tensor shapes

Mismatched tensor shapes are a common issue that can cause your code to hang for a significant amount of time on a distributed setup.

When running scripts in a distributed setup, functions such as [Accelerator.gather()](/docs/accelerate/pr_4021/en/package_reference/accelerator#accelerate.Accelerator.gather) and [Accelerator.reduce()](/docs/accelerate/pr_4021/en/package_reference/accelerator#accelerate.Accelerator.reduce) are necessary to grab tensors across devices to collectively perform operations on them. These (and other) functions rely on `torch.distributed` to perform a `gather` operation, which requires tensors to have the **exact same shape** across all processes. When the tensor shapes don't match, your code hangs and you'll eventually hit a timeout exception.
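For example, the following sketch would run into this problem on a two-process launch, because each process builds a tensor of a different length (the sizes here are purely illustrative):

```py
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Each process creates a tensor whose length depends on its rank,
# so the shapes no longer match across processes.
tensor = torch.ones(accelerator.process_index + 1, device=accelerator.device)

# gather() expects identical shapes on every process; with mismatched
# shapes this call hangs (or raises an exception in debug mode, see below).
gathered = accelerator.gather(tensor)
```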
You can use Accelerate's operational debug mode to immediately catch this issue. We recommend enabling this mode during the `accelerate config` setup, but you can also enable it from the CLI, as an environment variable, or by manually editing the `config.yaml` file.

```bash
accelerate launch --debug {my_script.py} --arg1 --arg2
```

If you enable debug mode as an environment variable, you don't need to call `accelerate launch`.

```bash
ACCELERATE_DEBUG_MODE="1" torchrun {my_script.py} --arg1 --arg2
```

Alternatively, add `debug: true` to your `config.yaml` file.

```yaml
compute_environment: LOCAL_MACHINE
debug: true
```

Once you enable debug mode, you should get a traceback that points to the tensor shape mismatch issue.
```py
Traceback (most recent call last):
  File "/home/zach_mueller_huggingface_co/test.py", line 18, in <module>
    main()
  File "/home/zach_mueller_huggingface_co/test.py", line 15, in main
    broadcast_tensor = broadcast(tensor)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py", line 303, in wrapper
accelerate.utils.operations.DistributedOperationException:

Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

Operation: `accelerate.utils.operations.broadcast`
Input shapes:
  - Process 0: [1, 5]
  - Process 1: [1, 2, 5]
```
### Early stopping

For early stopping in distributed training, if each process has a specific stopping condition (e.g. validation loss), it may not be synchronized across all processes. As a result, a break can happen on process 0 but not on process 1, which will cause your code to hang indefinitely until a timeout occurs.

If you have early stopping conditionals, use the `set_trigger` and `check_trigger` methods to make sure all the processes are ended correctly.

```py
# Assume `should_do_breakpoint` is a custom-defined function that returns a conditional,
# and that conditional might be true only on process 1
if should_do_breakpoint(loss):
    accelerator.set_trigger()

# Later in the training script when we need to check for the breakpoint
if accelerator.check_trigger():
    break
```
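Putting this together, here is a minimal sketch of where these calls might sit inside a training loop (the model, optimizer, dataloader, `num_epochs`, and `should_do_breakpoint` are placeholders carried over from the snippet above):

```py
for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()

        # Each process raises the flag only if its own condition is met...
        if should_do_breakpoint(outputs.loss):
            accelerator.set_trigger()

        # ...but every process checks the shared flag, so they all break together.
        if accelerator.check_trigger():
            break
```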
### Low kernel versions on Linux

On Linux with kernel version < 5.5, hanging processes have been reported. To avoid this problem, upgrade your system to a later kernel version.
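You can check which kernel you're running with:

```bash
uname -r
```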
### MPI

If your distributed CPU training job using MPI is hanging, ensure that you have [passwordless SSH](https://www.open-mpi.org/faq/?category=rsh#ssh-keys) setup (using keys) between the nodes. This means that for all nodes in your hostfile, you should be able to SSH from one node to another without being prompted for a password.
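A common way to set this up looks like the following (the usernames and hostnames are placeholders for your own nodes):

```bash
# Generate an SSH key pair if you don't already have one
ssh-keygen -t rsa

# Copy the public key to every other node in your hostfile
ssh-copy-id user@node1
ssh-copy-id user@node2

# Verify you can log in without a password prompt
ssh user@node1 hostname
```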
Next, try to run the `mpirun` command as a sanity check. For example, the command below should print out the hostnames for each of the nodes.

```bash
mpirun -f hostfile -n {number of nodes} -ppn 1 hostname
```
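Here, `hostfile` is assumed to be a plain text file listing one node per line, for example (hostnames are placeholders):

```
node1
node2
```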
## Out-of-Memory

One of the most frustrating errors when it comes to running training scripts is hitting "Out-of-Memory" on devices like CUDA, XPU, or CPU. The entire script needs to be restarted and any progress is lost.

To address this problem, Accelerate provides the [find_executable_batch_size()](/docs/accelerate/pr_4021/en/package_reference/utilities#accelerate.find_executable_batch_size) utility that is heavily based on [toma](https://github.com/BlackHC/toma).
This utility retries code that fails due to OOM (out-of-memory) conditions and automatically lowers batch sizes. For each OOM condition, the algorithm halves the batch size and retries the code until it succeeds.

To use [find_executable_batch_size()](/docs/accelerate/pr_4021/en/package_reference/utilities#accelerate.find_executable_batch_size), restructure your training function to include an inner function decorated with `find_executable_batch_size` and build your dataloaders inside it. At a minimum, this only takes 4 new lines of code.

The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for you. Any object (models, optimizers) that consumes device memory and is passed to the [Accelerator](/docs/accelerate/pr_4021/en/package_reference/accelerator#accelerate.Accelerator) also **must** be declared inside the inner function.
```diff
def training_function(args):
    accelerator = Accelerator()

+   @find_executable_batch_size(starting_batch_size=args.batch_size)
+   def inner_training_loop(batch_size):
+       nonlocal accelerator # Ensure they can be used in our context
+       accelerator.free_memory() # Free all lingering references
        model = get_model()
        model.to(accelerator.device)
        optimizer = get_optimizer()
        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
        lr_scheduler = get_scheduler(
            optimizer,
            num_training_steps=len(train_dataloader)*num_epochs
        )
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
        )
        train(model, optimizer, train_dataloader, lr_scheduler)
        validate(model, eval_dataloader)
+   inner_training_loop()
```
## Non-reproducible results between device setups

If you changed the device setup and observe different model performance, it is likely you didn't update your script when moving from one setup to another. Even if you're using the same script with the same batch size, the results will still be different on a TPU, multi-GPU, and single GPU.

For example, if you were training on a single GPU with a batch size of 16 and you move to a dual GPU setup, you need to change the batch size to 8 to have the same effective batch size. This is because when training with Accelerate, the batch size passed to the dataloader is the **batch size per GPU**.

To make sure you can reproduce the results between the setups, make sure to use the same seed, adjust the batch size accordingly, and consider scaling the learning rate.
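As a rough sketch of that advice (the seed value and the linear learning-rate scaling shown here are illustrative choices, not requirements):

```py
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()
set_seed(42)  # use the same seed on every process

# Keep the *effective* batch size constant when adding devices:
# the dataloader batch size is per device, so divide by the number of processes.
effective_batch_size = 16
per_device_batch_size = effective_batch_size // accelerator.num_processes

# Optionally scale the learning rate linearly with the number of processes
learning_rate = 3e-4 * accelerator.num_processes
```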
For more details and a quick reference for batch sizes, check out the [Comparing performance between different device setups](../concept_guides/performance) guide.
## Performance issues on different GPUs

If your multi-GPU setup consists of different GPUs, you may encounter some performance issues:

- There may be an imbalance in GPU memory between the GPUs. In this case, the GPU with the smaller memory will limit the batch size or the size of the model that can be loaded onto the GPUs.
- If you are using GPUs with different performance profiles, the performance will be driven by the slowest GPU you are using because the other GPUs will have to wait for it to complete its workload.

Vastly different GPUs within the same setup can lead to performance bottlenecks.
## Ask for help

If none of the solutions and advice here helped resolve your issue, you can always reach out to the community and Accelerate team for help.

- Ask for help on the Hugging Face forums by posting your question in the [Accelerate category](https://discuss.huggingface.co/c/accelerate/18). Make sure to write a descriptive post with relevant context about your setup and reproducible code to maximize the likelihood that your problem is solved!
- Post a question on [Discord](http://hf.co/join/discord), and let the team and the community help you.
- Create an Issue on the Accelerate [GitHub repository](https://github.com/huggingface/accelerate/issues) if you think you've found a bug related to the library. Include context regarding the bug and details about your distributed setup to help us better figure out what's wrong and how we can fix it.