	# Debugging

	Debugging distributed training problems typically falls into one of these categories: numerical issues, communication failures, runtime errors, and build errors.

	## Underflow and overflow detection

	Underflow and overflow show up when activations or weights reach `inf` or `nan`, or when `loss=NaN`. To detect them, enable the `DebugUnderflowOverflow` module with the `debug` argument of `TrainingArguments`, or import it and add it to your own training loop.

	```py
	from transformers import TrainingArguments

	args = TrainingArguments(
	    debug="underflow_overflow",
	    ...
	)
	```

	```py
	from transformers.debug_utils import DebugUnderflowOverflow

	debug_overflow = DebugUnderflowOverflow(model)
	```

	[DebugUnderflowOverflow](/docs/transformers/main/en/internal/trainer_utils#transformers.debug_utils.DebugUnderflowOverflow) inserts hooks into the model to test input and output variables and the corresponding model weights after each forward call. When `inf` or `nan` is detected in at least one element of the activations or weights, the module prints a report like the one below.

	The example below is for fp16 mixed precision training with [google/mt5-small](https://huggingface.co/google/mt5-small).

	```shell
	Detected inf/nan during batch_number=0
	Last 21 forward frames:
	abs min abs max metadata
	encoder.block.1.layer.1.DenseReluDense.dropout Dropout
	0.00e+00 2.57e+02 input[0]
	0.00e+00 2.85e+02 output
	[...]
	encoder.block.2.layer.0 T5LayerSelfAttention
	6.78e-04 3.15e+03 input[0]
	2.65e-04 3.42e+03 output[0]
	None output[1]
	2.25e-01 1.00e+04 output[2]
	encoder.block.2.layer.1.layer_norm T5LayerNorm
	8.69e-02 4.18e-01 weight
	2.65e-04 3.42e+03 input[0]
	1.79e-06 4.65e+00 output
	encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
	2.17e-07 4.50e+00 weight
	1.79e-06 4.65e+00 input[0]
	2.68e-06 3.70e+01 output
	encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
	8.08e-07 2.66e+01 weight
	1.79e-06 4.65e+00 input[0]
	1.27e-04 2.37e+02 output
	encoder.block.2.layer.1.DenseReluDense.dropout Dropout
	0.00e+00 8.76e+03 input[0]
	0.00e+00 9.74e+03 output
	encoder.block.2.layer.1.DenseReluDense.wo Linear
	1.01e-06 6.44e+00 weight
	0.00e+00 9.74e+03 input[0]
	3.18e-04 6.27e+04 output
	encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
	1.79e-06 4.65e+00 input[0]
	3.18e-04 6.27e+04 output
	encoder.block.2.layer.1.dropout Dropout
	3.18e-04 6.27e+04 input[0]
	0.00e+00 inf output
	```

	The first line shows the batch number where the error occurred. In this case, it occurred on batch 0.

	Each frame starts with the fully qualified name of the module it reports on, followed by the module's class. For example, the frame below reports on `encoder.block.2.layer.1.layer_norm`, the layer norm in `layer.1` of the encoder's `block.2` (indices are zero-based). The class whose `forward` was called is `T5LayerNorm`.

	```shell
	encoder.block.2.layer.1.layer_norm T5LayerNorm
	8.69e-02 4.18e-01 weight
	2.65e-04 3.42e+03 input[0]
	1.79e-06 4.65e+00 output
	```
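
To make the `abs min` and `abs max` columns concrete, the sketch below computes the same kind of summary for a plain list of floats. It is a pure-Python illustration, not the library's implementation; `summarize` and `has_inf_or_nan` are hypothetical helper names.

```python
import math

def summarize(values):
    """Return (abs_min, abs_max) over a flat list of floats."""
    magnitudes = [abs(v) for v in values]
    return min(magnitudes), max(magnitudes)

def has_inf_or_nan(values):
    """True if any element is inf, -inf, or nan."""
    return any(math.isinf(v) or math.isnan(v) for v in values)

activations = [1.79e-06, -4.65e+00, 0.5]
abs_min, abs_max = summarize(activations)
# Same column layout as a row in the report
print(f"{abs_min:.2e} {abs_max:.2e} output")
# A single inf element is enough to trigger detection
print(has_inf_or_nan(activations + [float("inf")]))
```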

	The last frame reports on `Dropout.forward`, reached through the `dropout` attribute of `encoder.block.2.layer.1`, which runs right after `DenseReluDense`. The overflow (`inf`) appeared in its output, in `encoder.block.2`, on the first batch. The largest input element was 6.27e+04.

	```shell
	encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
	1.79e-06 4.65e+00 input[0]
	3.18e-04 6.27e+04 output
	encoder.block.2.layer.1.dropout Dropout
	3.18e-04 6.27e+04 input[0]
	0.00e+00 inf output
	```

	`T5DenseGatedGeluDense.forward` produced output activations with an absolute maximum of 6.27e+04, which is close to fp16's largest finite value of 65504 (about 6.55e+04). In the next step, `Dropout` zeroes some elements and scales up the remaining ones, pushing the maximum past that limit and causing the overflow.
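
The overflow arithmetic can be checked without a GPU. The sketch below assumes inverted dropout, which scales surviving elements by `1 / (1 - p)`, and a dropout probability of `p = 0.1` (an assumption for illustration; check `config.dropout_rate` for your model). `fits_in_fp16` is a hypothetical helper built on the standard library's half-precision `struct` format.

```python
import struct

def fits_in_fp16(value):
    """True if a finite value can be stored in IEEE half precision."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False

p = 0.1                   # assumed dropout probability
before = 6.27e+04         # abs max entering Dropout in the report
after = before / (1 - p)  # surviving elements are scaled up

print(fits_in_fp16(before))  # True
print(fits_in_fp16(after))   # False
```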

	Now that you know where the error is happening, investigate the modeling code in [modeling_t5.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py).

	```py
	class T5DenseGatedGeluDense(nn.Module):
	    def __init__(self, config):
	        super().__init__()
	        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
	        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
	        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
	        self.dropout = nn.Dropout(config.dropout_rate)
	        self.gelu_act = ACT2FN["gelu_new"]

	    def forward(self, hidden_states):
	        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
	        hidden_linear = self.wi_1(hidden_states)
	        hidden_states = hidden_gelu * hidden_linear
	        hidden_states = self.dropout(hidden_states)
	        hidden_states = self.wo(hidden_states)
	        return hidden_states
	```

	One fix is to go back a few steps in the forward pass, to the point before the values grew too large, and switch to fp32 there so numbers don't overflow when multiplied or summed. Another option is to disable mixed precision training (`amp`) temporarily.

	```py
	import torch

	def forward(self, hidden_states):
	    # `_forward` is assumed to hold the original forward body
	    device_type = hidden_states.device.type
	    if torch.is_autocast_enabled(device_type):
	        # turn autocast off for this block only
	        with torch.amp.autocast(device_type, enabled=False):
	            return self._forward(hidden_states)
	    else:
	        return self._forward(hidden_states)
	```

	The report only covers inputs and outputs of full frames. To analyze intermediate values inside any `forward` function, add `detect_overflow` after each forward call to track `inf` or `nan` in `forwarded_states`.

	```py
	from transformers.debug_utils import detect_overflow

	class T5LayerFF(nn.Module):
	    [...]

	    def forward(self, hidden_states):
	        forwarded_states = self.layer_norm(hidden_states)
	        detect_overflow(forwarded_states, "after layer_norm")
	        forwarded_states = self.DenseReluDense(forwarded_states)
	        detect_overflow(forwarded_states, "after DenseReluDense")
	        return hidden_states + self.dropout(forwarded_states)
	```

	Configure the number of frames printed by [DebugUnderflowOverflow](/docs/transformers/main/en/internal/trainer_utils#transformers.debug_utils.DebugUnderflowOverflow) with `max_frames_to_save`. The report above shows the last 21 frames.

	```py
	from transformers.debug_utils import DebugUnderflowOverflow

	debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
	```

	### Batch tracing

	[DebugUnderflowOverflow](/docs/transformers/main/en/internal/trainer_utils#transformers.debug_utils.DebugUnderflowOverflow) can also trace the absolute minimum and maximum values in each batch with underflow and overflow detection disabled. This helps you locate where values start diverging in your model.

	The example below traces batches 1 and 3 (batches are zero-indexed).

	```py
	debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
	```

	```shell
	* Starting batch number=1 *
	abs min abs max metadata
	shared Embedding
	1.01e-06 7.92e+02 weight
	0.00e+00 2.47e+04 input[0]
	5.36e-05 7.92e+02 output
	[...]
	decoder.dropout Dropout
	1.60e-07 2.27e+01 input[0]
	0.00e+00 2.52e+01 output
	decoder T5Stack
	not a tensor output
	lm_head Linear
	1.01e-06 7.92e+02 weight
	0.00e+00 1.11e+00 input[0]
	6.06e-02 8.39e+01 output
	T5ForConditionalGeneration
	not a tensor output

	* Starting batch number=3 *
	abs min abs max metadata
	shared Embedding
	1.01e-06 7.92e+02 weight
	0.00e+00 2.78e+04 input[0]
	5.36e-05 7.92e+02 output
	[...]
	```

	[DebugUnderflowOverflow](/docs/transformers/main/en/internal/trainer_utils#transformers.debug_utils.DebugUnderflowOverflow) reports many frames, which makes it easier to spot where values diverge. If you know the problem is around batch 150, focus the trace on batches 149 and 150 to compare where the numbers start to differ.
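
One way to compare traced batches is to look at how much each `abs max` grew between them. The sketch below is a hypothetical helper, not part of Transformers; it assumes you have already copied the `abs max` values from two trace reports into dictionaries keyed by a name of your choosing. The numbers are the `shared Embedding` rows from the trace above.

```python
def growth_ratios(earlier, later):
    """Return {name: later / earlier} for entries present in both batches."""
    ratios = {}
    for name, old in earlier.items():
        new = later.get(name)
        if new is not None and old > 0:
            ratios[name] = new / old
    return ratios

# abs max values copied by hand from the two traced batches
batch_1 = {"shared.weight": 7.92e+02, "shared.input[0]": 2.47e+04, "shared.output": 7.92e+02}
batch_3 = {"shared.weight": 7.92e+02, "shared.input[0]": 2.78e+04, "shared.output": 7.92e+02}

# Largest growth first
ranked = sorted(growth_ratios(batch_1, batch_3).items(), key=lambda kv: -kv[1])
for name, ratio in ranked:
    print(f"{name}: x{ratio:.2f}")
```

Entries with a ratio well above 1 are the first places to inspect.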

	You can also stop the trace after a specific batch number, for example batch 3.

	```py
	debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
	```

	## Communication

	Distributed training requires inter-process and inter-node communication, which is a common source of errors.

	Download the script below to diagnose network issues, then run it to test GPU communication. The command below tests two GPUs. Adjust `--nproc_per_node` and `--nnodes` for your system.

	```bash
	wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
	python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
	```

	The script prints `OK` if both GPUs communicate and allocate memory successfully. See the diagnostic script for more details and a recipe for running it in a SLURM environment.

	Set `NCCL_DEBUG=INFO` to get detailed NCCL debugging output.

	```bash
	NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
	```

	## DeepSpeed

	When you hit an error, first check whether DeepSpeed is the cause by retrying your setup without it. If the error persists, DeepSpeed is not the cause, and you should report the issue as a regular (non-DeepSpeed) problem. If the error only appears with DeepSpeed and is unrelated to the Transformers integration, open an issue on the DeepSpeed [repository](https://github.com/microsoft/DeepSpeed).

	For issues related to the Transformers integration, include the following information.

	* The full DeepSpeed config file.
	* The command line arguments for [Trainer](/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) or the [TrainingArguments](/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) if you're scripting the [Trainer](/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) setup yourself (don't dump the entire [TrainingArguments](/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) which contains many irrelevant entries).
	* The outputs of these commands.

	```bash
	python -c 'import torch; print(f"torch: {torch.__version__}")'
	python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
	python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
	```

	* A link to a Google Colab notebook to reproduce the issue.
	* A standard or non-custom dataset or an existing example to reproduce the issue.
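
The three version commands can also be collected in one script. This is a convenience sketch using the standard library's `importlib.metadata`; `collect_versions` is a hypothetical helper name, and packages that aren't installed are reported as such rather than raising. The reported strings come from package metadata, so they may differ slightly in form from `__version__`.

```python
from importlib import metadata

def collect_versions(packages=("torch", "transformers", "deepspeed")):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

for name, version in collect_versions().items():
    print(f"{name}: {version}")
```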

	### Process killed at startup

	If the DeepSpeed process is killed during launch without a traceback, it usually means the program tried to allocate more CPU memory than the system has or than the process is allowed to use. The OS kernel terminates the process in either case.

	Check whether your config file has `offload_optimizer`, `offload_param`, or both configured to offload to the CPU.

	If you have NVMe and ZeRO-3 set up, try offloading to the NVMe instead. [Estimate](https://deepspeed.readthedocs.io/en/latest/memory.html) the memory requirements of your model first.
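
A quick way to check is to load the config and look at the offload devices. The sketch below assumes the offload settings live under `zero_optimization`, each with a `device` key, as in typical ZeRO configs; `cpu_offload_targets` is a hypothetical helper and `/local_nvme` is a placeholder path.

```python
import json

def cpu_offload_targets(config):
    """Return the ZeRO offload sections whose device is set to 'cpu'."""
    zero = config.get("zero_optimization", {})
    found = []
    for section in ("offload_optimizer", "offload_param"):
        if zero.get(section, {}).get("device") == "cpu":
            found.append(section)
    return found

# Inline example config; in practice, use json.load() on your own config file.
config = json.loads("""
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"}
  }
}
""")
print(cpu_offload_targets(config))  # ['offload_optimizer']
```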

	### NaN loss

	NaN loss often occurs when a model pretrained in bf16 is then used with fp16 (this is especially common with TPU-trained models). To resolve this, use fp32, or bf16 if your hardware supports it (TPUs, Ampere GPUs or newer).

	fp16 can also cause overflow. If your config file looks like the one below, you may see overflow errors in the logs.

	```json
	{
	"fp16": {
	"enabled": "auto",
	"loss_scale": 0,
	"loss_scale_window": 1000,
	"initial_scale_power": 16,
	"hysteresis": 2,
	"min_loss_scale": 1
	}
	}
	```

	The `OVERFLOW!` error below means the DeepSpeed loss scaler couldn't find a scaling coefficient to overcome the loss overflow. Try a higher `initial_scale_power` value (32 usually works).

	```bash
	0%| | 0/189 [00:00<?, ?it/s]
	[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 262144
	1%|▌ | 1/189 [00:00<01:26, 2.17it/s]
	[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072.0
	1%|█▏
	[...]
	[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
	14%|████████████████▌ | 27/189 [00:14<01:13, 2.21it/s]
	[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
	15%|█████████████████▏ | 28/189 [00:14<01:13, 2.18it/s]
	[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
	15%|█████████████████▊ | 29/189 [00:15<01:13, 2.18it/s]
	[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
	[...]
	```
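
To see why the log bottoms out at a loss scale of 1, consider a simplified model of dynamic loss scaling: start at `2 ** initial_scale_power`, halve on every overflow, and never go below `min_loss_scale`. This deliberately ignores `hysteresis` and `loss_scale_window`, and it is not DeepSpeed's implementation; `simulate_overflows` is a hypothetical name.

```python
def simulate_overflows(initial_scale_power, n_overflows, min_loss_scale=1):
    """Loss scale after n consecutive overflowing steps (simplified model)."""
    scale = 2 ** initial_scale_power
    for _ in range(n_overflows):
        # halve on overflow, but never drop below the configured floor
        scale = max(scale / 2, min_loss_scale)
    return scale

print(simulate_overflows(16, 10))  # 64.0
print(simulate_overflows(16, 20))  # 1, the scaler has hit min_loss_scale
print(simulate_overflows(32, 20))  # 4096.0, still above the floor
```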

	## DeepSpeed CUDA

	DeepSpeed compiles CUDA C++ code, which is a common source of build errors for PyTorch extensions that require CUDA. These errors depend on how CUDA is installed on your system.

	```bash
	pip install deepspeed
	```

	> [!TIP]
	> For any other installation issues, [open an issue](https://github.com/microsoft/DeepSpeed/issues) with the DeepSpeed team.

	### Non-identical toolkits

	PyTorch ships with its own CUDA toolkit, but building DeepSpeed requires an identical CUDA version installed system-wide. For example, if you installed PyTorch with `cudatoolkit==10.2` in your Python environment, you also need CUDA 10.2 installed system-wide.

	The exact location varies by system, but `/usr/local/cuda-10.2` is the most common path on Unix systems. Once CUDA is set up and added to your `PATH`, find the installation location with this command.

	```bash
	which nvcc
	```
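
To compare the system toolkit against the CUDA version PyTorch was built with, you can parse the release number from `nvcc --version`. The sketch below assumes the usual `release X.Y` line in that output and uses a sample string instead of calling `nvcc`; `parse_nvcc_release` is a hypothetical helper. In a real check, compare the result with `torch.version.cuda`.

```python
import re

def parse_nvcc_release(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' line of `nvcc --version`, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

# Sample text in the shape `nvcc --version` prints; yours will differ.
sample = "Cuda compilation tools, release 10.2, V10.2.89"
system_cuda = parse_nvcc_release(sample)
pytorch_cuda = "10.2"  # stand-in for torch.version.cuda

print(system_cuda == pytorch_cuda)  # True
```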

	### Multiple toolkits

	Your system may have more than one CUDA toolkit installed.

	```text
	/usr/local/cuda-10.2
	/usr/local/cuda-11.0
	```

	Package installers typically set paths to the last installed version. If the build fails because it can't find the right CUDA version, configure `PATH` and `LD_LIBRARY_PATH` to point to the correct path.

	Check these environment variables first.

	```bash
	echo $PATH
	echo $LD_LIBRARY_PATH
	```

	`PATH` lists executable locations. `LD_LIBRARY_PATH` lists shared library locations. Earlier entries take priority, and `:` separates multiple entries. Prepend the correct CUDA path to prioritize it.

	```bash
	# adjust the version and full path if needed
	export PATH=/usr/local/cuda-10.2/bin:$PATH
	export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
	```

	Also verify the assigned directories exist. The `lib64` sub-directory contains CUDA `.so` objects like `libcudart.so`. Check the actual filenames and update accordingly.
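
Because earlier entries win, it helps to see the order in which CUDA directories appear. The sketch below is a hypothetical helper that lists the CUDA-looking entries of a `PATH`-style string in priority order; matching on the substring `cuda` is an assumption that fits the `/usr/local/cuda-X.Y` layout used here.

```python
import os

def cuda_entries(path_value, sep=os.pathsep):
    """Return entries containing 'cuda', in the order they are searched."""
    return [entry for entry in path_value.split(sep) if "cuda" in entry.lower()]

example = "/usr/local/cuda-11.0/bin:/usr/bin:/usr/local/cuda-10.2/bin"
print(cuda_entries(example, sep=":"))
# ['/usr/local/cuda-11.0/bin', '/usr/local/cuda-10.2/bin']

# For your own shell environment:
print(cuda_entries(os.environ.get("PATH", "")))
```

In the example string, `cuda-11.0` is searched first, so `cuda-10.2` would need to be prepended to take priority.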

	### Older versions

	Older CUDA versions sometimes require older compiler versions. For example, if CUDA requires `gcc-7` but your system only has `gcc-9`, the build will fail. Install the required older compiler and create a symlink so the CUDA build system can find it.

	```bash
	# adjust the path to your system
	sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
	sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
	```

	### Prebuild

	If you're still having trouble installing DeepSpeed or building it at runtime, prebuild the DeepSpeed modules first. Run the commands below for a local build.

	```bash
	git clone https://github.com/deepspeedai/DeepSpeed/
	cd DeepSpeed
	rm -rf build
	TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
	    --global-option="build_ext" --global-option="-j8" --no-cache -v \
	    --disable-pip-version-check 2>&1 | tee build.log
	```

	> [!TIP]
	> Add `DS_BUILD_AIO=1` to the build command to use NVMe offload. Make sure you install the libaio-dev package system-wide.

	Next, set your GPU architecture in `TORCH_CUDA_ARCH_LIST`. A complete list of NVIDIA GPUs and their architectures is on the [CUDA GPUs page](https://developer.nvidia.com/cuda-gpus). To check which architectures your installed PyTorch build was compiled for, run the command below.

	```bash
	python -c "import torch; print(torch.cuda.get_arch_list())"
	```

	Find the architecture for a GPU with the following command.

	```bash
	CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
	```

	Alternatively, print the full device properties for GPU `0`. The output includes `major` and `minor` values that together form the GPU architecture. The example below shows architecture `8.6`.

	```bash
	CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_properties(torch.device('cuda')))"
	# _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
	```

	For a result of `8, 6`, set `TORCH_CUDA_ARCH_LIST="8.6"`. For multiple GPUs with different architectures, list them like `TORCH_CUDA_ARCH_LIST="6.1;8.6"`.
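
If you collect the compute capability of each GPU as `(major, minor)` tuples, the value for `TORCH_CUDA_ARCH_LIST` can be assembled as below. `arch_list` is a hypothetical helper; the tuples stand in for the results of `torch.cuda.get_device_capability()` on each device.

```python
def arch_list(capabilities):
    """Join unique (major, minor) pairs into a 'M.m;M.m' string, sorted."""
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)

print(arch_list([(8, 6)]))                  # 8.6
print(arch_list([(8, 6), (6, 1), (8, 6)]))  # 6.1;8.6
```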

	You can omit `TORCH_CUDA_ARCH_LIST` and let the build program detect the GPU architecture automatically, but it might not match the actual GPU on the target machine. Explicitly setting the architecture is more reliable.

	For training on multiple machines with the same setup, build a binary wheel.

	```bash
	git clone https://github.com/deepspeedai/DeepSpeed/
	cd DeepSpeed
	rm -rf build
	TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
	python setup.py build_ext -j8 bdist_wheel
	```

	This generates a binary wheel like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`. Install it locally or on another machine.

	```bash
	pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
	```
