---
license: apache-2.0
---

## Unpadding implementation for transformers

This repository contains an implementation of automatic unpadding and sequence packing for Hugging Face Transformers models.

In version 5 of Transformers, attention handling was standardized across models, enabling older architectures such as BERT and RoBERTa to use Flash Attention. With Flash Attention support, it becomes possible to implement the unpadding trick previously used in [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-large) (see [Unpadding and Sequence Packing](https://huggingface.co/blog/modernbert#unpadding-and-sequence-packing)). This technique automatically removes padding tokens, transforming the input tokens, typically represented as an N x M matrix (where N is the batch size and M is the length of the longest sequence in the batch), into a single long vector containing the concatenated, padding-free sequences. This vector is then processed by the embedding layer and attention blocks. Right before the final model layer (e.g., LM head, classification head), it is transformed back into a matrix format.

Unpadding can yield significant gains in computational efficiency and memory usage. The gains are largest when there is high variance in the lengths of the processed texts. If the texts are of roughly equal length, the benefits will be minimal, though the model should still run faster than the original implementation. Below are several benchmarks supporting these claims.

### Inference

The experiment involved reranking with the [polish-reranker-roberta-v3](https://huggingface.co/sdadas/polish-reranker-roberta-v3) model on a subset of queries from the [PIRB benchmark](https://huggingface.co/spaces/sdadas/pirb). We used 9 datasets from the Web Datasets group, with a maximum of 1,000 queries per dataset. Each query had 100 candidate documents. The model's task was to evaluate the relevance of the query-document pairs, resulting in over 800,000 predictions in total. The test was conducted on a single NVIDIA RTX A6000 GPU. Across all tested implementations, we used a fixed batch size of 32. The results are presented in the table below.
| Implementation | Total time | Queries per second | Max VRAM |
|---|---|---|---|
| SDPA (transformers default) | 1h 38m 46s | 1.39 | 9006 MB |
| Flash-Attention | 1h 16m 3s | 1.81 | 6286 MB |
| Flash-Attention with unpadding | 39m 13s | 3.50 | 2368 MB |
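The unpadding transformation described above can be sketched with plain index arithmetic. The snippet below is a minimal NumPy illustration (not the actual implementation, which operates on GPU tensors): `unpad` flattens a padded batch into a single packed vector plus the cumulative sequence lengths (`cu_seqlens`) that variable-length attention kernels expect, and `pad` scatters the result back into matrix form. The function and variable names are illustrative, not taken from this repository.

```python
import numpy as np

def unpad(input_ids: np.ndarray, attention_mask: np.ndarray):
    """Flatten a padded (N, M) batch into one packed 1-D vector.

    Returns the packed tokens, the flat indices of the non-pad
    positions (needed to restore the original layout later), and
    cumulative sequence lengths marking sequence boundaries.
    """
    seqlens = attention_mask.sum(axis=1)               # tokens per sequence
    indices = np.nonzero(attention_mask.flatten())[0]  # flat non-pad positions
    cu_seqlens = np.concatenate([[0], np.cumsum(seqlens)])
    packed = input_ids.flatten()[indices]
    return packed, indices, cu_seqlens

def pad(packed: np.ndarray, indices: np.ndarray, batch_shape, pad_id: int = 0):
    """Scatter a packed vector back into a padded (N, M) matrix."""
    out = np.full(int(np.prod(batch_shape)), pad_id, dtype=packed.dtype)
    out[indices] = packed
    return out.reshape(batch_shape)

# Example: a batch of 2 sequences, the second padded to length 5.
ids = np.array([[101, 7, 8, 9, 102],
                [101, 7, 102, 0, 0]])
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])
packed, idx, cu = unpad(ids, mask)
# packed -> [101, 7, 8, 9, 102, 101, 7, 102]; cu -> [0, 5, 8]
restored = pad(packed, idx, ids.shape)
assert (restored == ids).all()
```

In the real models, the embedding layer and attention blocks run on `packed` (with `cu_seqlens` telling the Flash Attention kernel where each sequence starts and ends), so no compute is spent on padding tokens.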
### Training

We also compared the implementations during fine-tuning. In every configuration, the total batch size (micro batch size × gradient accumulation steps) is 32.

| Micro batch size | Gradient accumulation | SDPA (default)<br>Total time | SDPA (default)<br>Max VRAM | Flash-Attention<br>Total time | Flash-Attention<br>Max VRAM | Flash-Attention with unpadding<br>Total time | Flash-Attention with unpadding<br>Max VRAM |
|---|---|---|---|---|---|---|---|
| **POLEMO-IN (KLEJ)** | | | | | | | |
| 32 | 1 | 1054s | 40.88GB | 931s | 35.71GB | 489s | 12.64GB |
| **BANKING-LONG** | | | | | | | |
| 2 | 16 | 4734s | 28.98GB | 4906s | 22.82GB | 4290s | 16.81GB |
| 4 | 8 | OOM | OOM | 3434s | 36.43GB | 2347s | 19.30GB |
| 8 | 4 | OOM | OOM | OOM | OOM | 1753s | 21.25GB |
| 16 | 2 | OOM | OOM | OOM | OOM | 1563s | 24.04GB |
| 32 | 1 | OOM | OOM | OOM | OOM | 1475s | 32.00GB |