---
license: apache-2.0
---
## Unpadding implementation for transformers
This repository contains the implementation of automatic unpadding and sequence packing for Hugging Face Transformers models.
In version 5 of Transformers, attention handling across models was standardized, enabling older architectures such as BERT and RoBERTa to use Flash Attention.
With Flash Attention support, it becomes possible to implement the unpadding trick previously utilized in [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-large) (see [Unpadding and Sequence Packing](https://huggingface.co/blog/modernbert#unpadding-and-sequence-packing)).
This technique automatically removes padding tokens, transforming the input tokens, typically represented as an N × M matrix (where N is the batch size and M is the length of the longest sequence in the batch), into a single long vector containing the concatenated, padding-free sequences.
This vector is then processed by the embedding layer and attention blocks. Right before the final model layer (e.g., LM head, classification head), it is transformed back into a matrix format.
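The flattening and restoring steps can be sketched in plain PyTorch. This is only a minimal illustration, not this repository's actual code; the `unpad`/`pad` helper names are made up here, and the `cu_seqlens` layout follows the cumulative-lengths convention used by flash-attn's variable-length kernels:

```python
import torch

def unpad(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Flatten an N x M padded batch into one packed 1-D sequence."""
    # Positions of the non-padding tokens in the flattened batch.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    packed = input_ids.flatten()[indices]
    # Cumulative sequence lengths (prefixed with 0), as expected by
    # flash-attn's varlen attention kernels.
    seqlens = attention_mask.sum(dim=1, dtype=torch.int32)
    cu_seqlens = torch.nn.functional.pad(
        torch.cumsum(seqlens, 0, dtype=torch.int32), (1, 0)
    )
    return packed, indices, cu_seqlens

def pad(packed: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int):
    """Scatter the packed tokens back into an N x M padded matrix."""
    out = torch.zeros(batch * seqlen, dtype=packed.dtype)
    out[indices] = packed
    return out.view(batch, seqlen)

# Example: two sequences of lengths 3 and 1, padded to length 4.
ids = torch.tensor([[5, 6, 7, 0], [8, 0, 0, 0]])
mask = torch.tensor([[1, 1, 1, 0], [1, 0, 0, 0]])
packed, idx, cu = unpad(ids, mask)
print(packed.tolist())  # [5, 6, 7, 8]
print(cu.tolist())      # [0, 3, 4]
restored = pad(packed, idx, 2, 4)
```

In the real model, the same index bookkeeping is applied to hidden states rather than token ids, so the restore step only happens once, right before the head.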
Unpadding can yield significant benefits in terms of computational efficiency and memory savings.
The most substantial gains are achieved when there is high variance in the lengths of the processed texts.
If the texts are of roughly equal length, the benefits will be minimal, though the model should still run faster than the original implementation. Below are several benchmarks supporting these claims.
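A toy calculation (with made-up token counts, not numbers from the benchmarks below) shows why length variance matters: with padding, every sequence in a batch is processed at the length of the longest one.

```python
def padding_fraction(lengths):
    """Fraction of tokens in a padded N x M batch that are padding."""
    padded_total = len(lengths) * max(lengths)
    return 1 - sum(lengths) / padded_total

# Hypothetical token counts for two batches of eight texts each.
high_variance = [12, 480, 37, 512, 64, 250, 8, 300]
near_uniform = [500, 480, 512, 490, 505, 495, 510, 488]
print(f"high variance: {padding_fraction(high_variance):.0%} padding")  # 59% padding
print(f"near-uniform:  {padding_fraction(near_uniform):.0%} padding")   # 3% padding
```

In the high-variance batch, more than half of the compute goes into padding tokens that unpadding eliminates entirely; in the near-uniform batch there is almost nothing to remove.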
### Inference
The experiment involved performing reranking using the [polish-reranker-roberta-v3](https://huggingface.co/sdadas/polish-reranker-roberta-v3) model on a subset of queries from the [PIRB benchmark](https://huggingface.co/spaces/sdadas/pirb).
We used 9 datasets from the Web Datasets group, with a maximum of 1,000 queries per dataset. Each query had 100 candidate documents.
The model's task was to evaluate the relevance of the query-document pairs, resulting in over 800,000 predictions in total.
The test was conducted on a single NVIDIA RTX A6000 GPU. Across all tested implementations, we used a fixed batch size of 32.
The results are presented in the table below.
<table>
<thead>
<tr>
<th>Implementation<br></th>
<th>Total time<br></th>
<th>Queries per second<br></th>
<th>Max VRAM<br></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>SDPA (transformers default)</strong></td>
<td>1h 38m 46s<br></td>
<td>1.39<br></td>
<td>9006 MB<br></td>
</tr>
<tr>
<td><strong>Flash-Attention</strong></td>
<td>1h 16m 3s<br></td>
<td>1.81<br></td>
<td>6286 MB</td>
</tr>
<tr>
<td><strong>Flash-Attention with unpadding</strong></td>
<td bgcolor="#DAF2D0">39m 13s</td>
<td bgcolor="#DAF2D0">3.50</td>
<td bgcolor="#DAF2D0">2368 MB</td>
</tr>
</tbody>
</table>
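The evaluation loop above can be approximated schematically as follows. This is a sketch, not the benchmark code: `score_batch` stands in for a forward pass of the cross-encoder reranker and is replaced here by a dummy scorer so the example is self-contained.

```python
def rerank(query, documents, score_batch, batch_size=32):
    """Score (query, document) pairs in fixed-size batches, sort by score."""
    scores = []
    for i in range(0, len(documents), batch_size):
        pairs = [(query, doc) for doc in documents[i:i + batch_size]]
        scores.extend(score_batch(pairs))
    order = sorted(range(len(documents)), key=lambda j: scores[j], reverse=True)
    return [documents[j] for j in order]

# Dummy scorer: relevance is read off the document name, highest index wins.
docs = [f"doc-{i:03d}" for i in range(100)]
ranked = rerank("example query", docs,
                lambda pairs: [float(p[1][-3:]) for p in pairs])
print(ranked[0])  # doc-099
```

With 100 candidates per query and a batch size of 32, each query costs four forward passes; unpadding shrinks each of those passes to the actual token count of its pairs.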
### Fine-Tuning
In the next experiment, we fine-tuned the [polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k) model on classification tasks.
We measured the total time required to train the model for 10 epochs, including the time taken for evaluation on the validation split after each epoch and evaluation on the test set upon training completion.
Two datasets were selected for this experiment: <strong>POLEMO-IN</strong> from the [KLEJ benchmark](https://klejbenchmark.com/) (short to medium text lengths) and <strong>banking-long</strong> from the [FinBench](https://arxiv.org/abs/2603.12191) benchmark (medium to very long texts).
The tests were conducted on a single NVIDIA RTX A6000 GPU. The results are presented in the table below (OOM - CUDA out-of-memory error).
For the dataset with shorter texts, training with unpadding is more than twice as fast and uses roughly a third of the memory of the default SDPA implementation.
For the dataset with long texts, unpadding allows for training with 16 times larger batch sizes and is over three times faster than the original model.
<table>
<thead>
<tr>
<th colspan="2">Total batch size</th>
<th colspan="2">SDPA (default)</th>
<th colspan="2">Flash-Attention</th>
<th colspan="2">Flash-Attention<br>with unpadding</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Micro <br>batch size</strong></td>
<td><strong>Gradient<br>accumulation</strong></td>
<td><strong>Total<br>time</strong></td>
<td><strong>Max<br>VRAM</strong></td>
<td><strong>Total<br>time</strong></td>
<td><strong>Max<br>VRAM</strong></td>
<td><strong>Total<br>time</strong></td>
<td><strong>Max<br>VRAM</strong></td>
</tr>
<tr>
<td colspan="8"><strong>POLEMO-IN (KLEJ)</strong></td>
</tr>
<tr>
<td>32</td>
<td>1</td>
<td>1054s</td>
<td>40.88GB</td>
<td>931s</td>
<td>35.71GB</td>
<td bgcolor="#DAF2D0">489s</td>
<td bgcolor="#DAF2D0">12.64GB</td>
</tr>
<tr>
<td colspan="8"><strong>BANKING-LONG</strong></td>
</tr>
<tr>
<td>2</td>
<td>16</td>
<td>4734s</td>
<td>28.98GB</td>
<td>4906s</td>
<td>22.82GB</td>
<td>4290s</td>
<td bgcolor="#DAF2D0">16.81GB</td>
</tr>
<tr>
<td>4</td>
<td>8</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td>3434s</td>
<td>36.43GB</td>
<td>2347s</td>
<td>19.30GB</td>
</tr>
<tr>
<td>8</td>
<td>4</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td>1753s</td>
<td>21.25GB</td>
</tr>
<tr>
<td>16</td>
<td>2</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td>1563s</td>
<td>24.04GB</td>
</tr>
<tr>
<td>32</td>
<td>1</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#FFD3D3">OOM</td>
<td bgcolor="#DAF2D0">1475s</td>
<td>32.00GB</td>
</tr>
</tbody>
</table> |
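Note that every row of the table keeps the effective batch size fixed at 32 by trading micro batch size against gradient accumulation steps, so all runs are comparable:

```python
# Effective batch size = micro batch size x gradient accumulation steps.
# The (micro batch, accumulation) pairs below are the rows of the table.
def effective_batch_size(micro_batch, grad_accum_steps):
    return micro_batch * grad_accum_steps

configs = [(32, 1), (2, 16), (4, 8), (8, 4), (16, 2)]
print([effective_batch_size(m, g) for m, g in configs])  # [32, 32, 32, 32, 32]
```

Larger micro batches reduce accumulation overhead, which is why the unpadded runs get faster as the micro batch grows, but they also raise peak VRAM, which is what pushes the SDPA and plain Flash-Attention configurations into OOM.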