# Model training anatomy

To understand the performance optimization techniques one can apply to improve the speed and memory utilization of model training, it's helpful to get familiar with how the GPU is utilized during training and how compute intensity varies depending on the operation performed.

Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration, we'll need to install a few libraries:

```bash
pip install transformers datasets accelerate nvidia-ml-py3
```

The nvidia-ml-py3 library allows us to monitor the memory usage of the models from within Python. You might be familiar with the nvidia-smi command in the terminal - this library allows us to access the same information directly in Python.

Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. In total, we get 512 sequences each with length 512 and store them in a [~datasets.Dataset] with PyTorch format.

```py
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # exclusive upper bound of 2 so the labels are actually binary (0 or 1)
    "labels": np.random.randint(0, 2, (dataset_size,)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")
```

To print summary statistics for the GPU utilization and the training run with the [Trainer], we define two helper functions:

```py
from pynvml import *

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()
```

Let's verify that we start with free GPU memory:

```py
print_gpu_utilization()
```

```
GPU memory occupied: 0 MB.
```

That looks good: the GPU memory is not occupied, as we would expect before we load any models. If that's not the case on your machine, make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by the user. When a model is loaded to the GPU, the CUDA kernels are also loaded, which can take up 1-2GB of memory. To see how much it is, we load a tiny tensor into the GPU, which triggers the kernels to be loaded as well.

```py
import torch

torch.ones((1, 1)).to("cuda")
print_gpu_utilization()
```

```
GPU memory occupied: 1343 MB.
```

We see that the kernels alone take up 1.3 GB of GPU memory. Now let's see how much space the model uses.

## Load Model

First, we load the google-bert/bert-large-uncased model. We load the model weights directly to the GPU so that we can check how much space just the weights use.

```py
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-uncased").to("cuda")
print_gpu_utilization()
```

```
GPU memory occupied: 2631 MB.
```

We can see that the model weights alone take up 1.3 GB of GPU memory (the 2631 MB total minus the 1343 MB of kernels). The exact number depends on the specific GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an optimized fashion that speeds up the usage of the model. Now we can also quickly check if we get the same result as with the nvidia-smi CLI:

```bash
nvidia-smi
```

```bash
Tue Jan 11 08:58:05 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2     On   | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    39W / 300W |   2631MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3721      C   nvs/codeparrot/bin/python       2629MiB  |
+-----------------------------------------------------------------------------+
```

We get the same number as before, and you can also see that we are using a V100 GPU with 16GB of memory. So now we can start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments:

```py
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}
```

If you plan to run multiple experiments, restart the Python kernel between them in order to properly clear the memory.

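If restarting is impractical, a partial cleanup can reclaim most of the memory in-process. This is a minimal sketch of ours (not from the original guide), meant to run after an experiment such as the training run below has finished; cached allocations and fragmentation mean it may not return the GPU to a fully clean state:

```py
import gc

import torch

# After a training run: drop references to the large objects first, then force
# garbage collection and release the memory PyTorch's allocator has cached.
del model, trainer  # hypothetical: assumes the names used in this guide's training example
gc.collect()
torch.cuda.empty_cache()
print_gpu_utilization()
```
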
## Memory utilization at vanilla training

Let's use the [Trainer] and train the model without using any GPU performance optimization techniques and a batch size of 4:

```py
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
```

```
Time: 57.82
Samples/second: 8.86
GPU memory occupied: 14949 MB.
```

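As a side note of ours (not part of the original walkthrough), PyTorch's own allocator counters give a finer-grained view than nvidia-smi; they exclude the memory held by the CUDA context and kernels, so they read lower than the totals above:

```py
import torch

# Memory currently held by tensors, and the peak since the start of the run.
# Both exclude the CUDA context/kernels, so they are lower than nvidia-smi's numbers.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MB")
```
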
We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our model's needs and not to the GPU limitations. What's interesting is that we use much more memory than just the size of the model. To understand a bit better why this is the case, let's have a look at a model's operations and memory needs.

## Anatomy of Model's Operations

The Transformers architecture includes 3 main groups of operations, grouped below by compute intensity.

1. **Tensor Contractions**

    Linear layers and components of Multi-Head Attention all do batched matrix-matrix multiplications. These operations are the most compute-intensive part of training a transformer.

2. **Statistical Normalizations**

    Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more reduction operations, the result of which is then applied via a map.

3. **Element-wise Operators**

    These are the remaining operators: biases, dropout, activations, and residual connections. These are the least compute-intensive operations.

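To make the ordering concrete, here is a rough micro-benchmark sketch of ours (the tensor shapes are arbitrary choices, and absolute timings vary by GPU; the relative cost across the three groups is the point):

```py
import torch

def time_op(fn, warmup=10, iters=100):
    # Average GPU time per call in milliseconds, measured with CUDA events.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(16, 512, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

print(f"matmul (tensor contraction):         {time_op(lambda: x @ w):.3f} ms")
print(f"softmax (statistical normalization): {time_op(lambda: torch.softmax(x, dim=-1)):.3f} ms")
print(f"add (element-wise):                  {time_op(lambda: x + 1.0):.3f} ms")
```
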
This knowledge can be helpful when analyzing performance bottlenecks.

This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072).

## Anatomy of Model's Memory

We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there are many components during training that use GPU memory. The components in GPU memory are the following:

1. model weights
2. optimizer states
3. gradients
4. forward activations saved for gradient computation
5. temporary buffers
6. functionality-specific memory

A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory: 6 bytes for the weights, 8 for the optimizer states, and 4 for the gradients, as detailed below. For inference there are no optimizer states and gradients, so we can subtract those. And thus we end up with 6 bytes per model parameter for mixed precision inference, plus activation memory.

Let's look at the details.

**Model Weights:**

- 4 bytes * number of parameters for fp32 training
- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)

**Optimizer States:**

- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like bitsandbytes
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)

**Gradients**

- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)

**Forward Activations**

- size depends on many factors, the key ones being sequence length, hidden size and batch size.

There are the inputs and outputs that are being passed and returned by the forward and backward functions, as well as the forward activations saved for gradient computation.

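As a sanity check against these numbers, here is a minimal back-of-the-envelope sketch of ours that applies the per-parameter costs above to the bert-large-uncased model loaded earlier:

```py
# Assumes mixed precision + AdamW; per-parameter costs from the breakdown above.
num_params = sum(p.numel() for p in model.parameters())

bytes_per_param = 6 + 8 + 4  # fp32+fp16 weights, AdamW states, fp32 gradients
fixed_gib = num_params * bytes_per_param / 1024**3

print(f"parameters: {num_params / 1e6:.0f}M")
print(f"estimated weights + optimizer states + gradients: {fixed_gib:.1f} GiB")
# The gap between this estimate and the ~15 GB observed during training is
# largely the forward activations, temporary buffers, and the CUDA kernels.
```
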
**Temporary Memory**

Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the moment these could require additional memory and could trigger an OOM error. Therefore, when coding it's crucial to think strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed.

**Functionality-specific memory**

Then, your software could have special memory needs. For example, when generating text using beam search, the software needs to maintain multiple copies of inputs and outputs, as sketched below.

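For instance, a hedged sketch of ours (the model and generation parameters are illustrative, not from the original): beam search keeps num_beams candidate sequences, and their caches, alive at once:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
inputs = tokenizer("GPU memory during generation", return_tensors="pt").to("cuda")

# Beam search tracks 4 candidate sequences at once; compare GPU memory
# against the same call with num_beams=1 (greedy decoding).
output = gen_model.generate(**inputs, max_new_tokens=32, num_beams=4, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print_gpu_utilization()
```
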
## `forward` vs `backward` Execution Speed

For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates into a backward pass that is ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually bandwidth-limited, and it's typical for an activation to have to read more data in the backward than in the forward (e.g. the activation forward reads once and writes once, while the activation backward reads twice, gradOutput and the output of the forward, and writes once, gradInput).

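A rough sketch of ours to observe this on a single linear layer (shapes arbitrary); timing the forward alone versus forward plus backward lets the difference approximate the backward cost:

```py
import time

import torch

layer = torch.nn.Linear(4096, 4096).to("cuda")
x = torch.randn(8, 1024, 4096, device="cuda", requires_grad=True)

def timed(fn, warmup=5, iters=20):
    # Average wall-clock milliseconds per call, synchronizing so GPU work is included.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000 / iters

fwd = timed(lambda: layer(x))
fwd_bwd = timed(lambda: layer(x).sum().backward())  # each call builds a fresh graph
print(f"forward: {fwd:.2f} ms, backward (approx.): {fwd_bwd - fwd:.2f} ms")
```
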
As you can see, there are potentially a few places where we could save GPU memory or speed up operations. Now that you understand what affects GPU utilization and computation speed, refer to the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about performance optimization techniques.