---
license: apache-2.0
---

## Unpadding implementation for transformers

This repository contains an implementation of automatic unpadding and sequence packing for Hugging Face Transformers models.

In version 5 of Transformers, attention handling was standardized across models, enabling older architectures such as BERT and RoBERTa to use Flash Attention. With Flash Attention support, it becomes possible to implement the unpadding trick previously used in [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-large) (see [Unpadding and Sequence Packing](https://huggingface.co/blog/modernbert#unpadding-and-sequence-packing)). This technique automatically removes padding tokens, transforming the input tokens, typically represented as an N x M matrix (where N is the batch size and M is the length of the longest sequence in the batch), into a single long vector containing the concatenated, padding-free sequences. This vector is then processed by the embedding layer and attention blocks. Right before the final model layer (e.g., LM head, classification head), it is transformed back into a matrix format.

Unpadding can yield significant gains in computational efficiency and memory usage. The gains are largest when there is high variance in the lengths of the processed texts. If the texts are of roughly equal length, the benefits will be minimal, though the model should still run faster than the original implementation. Below are several benchmarks supporting these claims.

### Inference

The experiment involved reranking with the [polish-reranker-roberta-v3](https://huggingface.co/sdadas/polish-reranker-roberta-v3) model on a subset of queries from the [PIRB benchmark](https://huggingface.co/spaces/sdadas/pirb). We used 9 datasets from the Web Datasets group, with a maximum of 1,000 queries per dataset. Each query had 100 candidate documents. The model's task was to evaluate the relevance of the query-document pairs, resulting in over 800,000 predictions in total. The test was conducted on a single NVIDIA RTX A6000 GPU. Across all tested implementations, we used a fixed batch size of 32. The results are presented in the table below.
| Implementation | Total time | Queries per second | Max VRAM |
|---|---|---|---|
| SDPA (transformers default) | 1h 38m 46s | 1.39 | 9006 MB |
| Flash-Attention | 1h 16m 3s | 1.81 | 6286 MB |
| Flash-Attention with unpadding | 39m 13s | 3.50 | 2368 MB |
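The unpadding transformation described above can be sketched with plain index arithmetic. The snippet below is a minimal NumPy illustration (not the actual implementation, which operates on GPU tensors): `unpad` flattens a padded batch into a single packed vector plus the cumulative sequence lengths (`cu_seqlens`) that variable-length attention kernels expect, and `pad` scatters the result back into matrix form. The function and variable names are illustrative, not taken from this repository.

```python
import numpy as np

def unpad(input_ids: np.ndarray, attention_mask: np.ndarray):
    """Flatten a padded (N, M) batch into one packed 1-D vector.

    Returns the packed tokens, the flat indices of the non-pad
    positions (needed to restore the original layout later), and
    cumulative sequence lengths marking sequence boundaries.
    """
    seqlens = attention_mask.sum(axis=1)               # tokens per sequence
    indices = np.nonzero(attention_mask.flatten())[0]  # flat non-pad positions
    cu_seqlens = np.concatenate([[0], np.cumsum(seqlens)])
    packed = input_ids.flatten()[indices]
    return packed, indices, cu_seqlens

def pad(packed: np.ndarray, indices: np.ndarray, batch_shape, pad_id: int = 0):
    """Scatter a packed vector back into a padded (N, M) matrix."""
    out = np.full(int(np.prod(batch_shape)), pad_id, dtype=packed.dtype)
    out[indices] = packed
    return out.reshape(batch_shape)

# Example: a batch of 2 sequences, the second padded to length 5.
ids = np.array([[101, 7, 8, 9, 102],
                [101, 7, 102, 0, 0]])
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])
packed, idx, cu = unpad(ids, mask)
# packed -> [101, 7, 8, 9, 102, 101, 7, 102]; cu -> [0, 5, 8]
restored = pad(packed, idx, ids.shape)
assert (restored == ids).all()
```

In the real models, the embedding layer and attention blocks run on `packed` (with `cu_seqlens` telling the Flash Attention kernel where each sequence starts and ends), so no compute is spent on padding tokens.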
### Training

We also compared the implementations during fine-tuning. In every configuration, the total batch size (micro batch size × gradient accumulation steps) is 32.

| Micro batch size | Gradient accumulation | SDPA (default)<br>Total time | SDPA (default)<br>Max VRAM | Flash-Attention<br>Total time | Flash-Attention<br>Max VRAM | Flash-Attention with unpadding<br>Total time | Flash-Attention with unpadding<br>Max VRAM |
|---|---|---|---|---|---|---|---|
| **POLEMO-IN (KLEJ)** | | | | | | | |
| 32 | 1 | 1054s | 40.88GB | 931s | 35.71GB | 489s | 12.64GB |
| **BANKING-LONG** | | | | | | | |
| 2 | 16 | 4734s | 28.98GB | 4906s | 22.82GB | 4290s | 16.81GB |
| 4 | 8 | OOM | OOM | 3434s | 36.43GB | 2347s | 19.30GB |
| 8 | 4 | OOM | OOM | OOM | OOM | 1753s | 21.25GB |
| 16 | 2 | OOM | OOM | OOM | OOM | 1563s | 24.04GB |
| 32 | 1 | OOM | OOM | OOM | OOM | 1475s | 32.00GB |