More than 40 GB of VRAM consumed by PaddleOCR-VL on an A100 GPU!!!
This is running on Colab with the official code snippet from the model card, using the `transformers` library. It consumed more than 40 GB of VRAM during inference on a single page, even though it is a simple page containing no more than 300 Arabic words. It also took more than 2 minutes to generate the result, which is far from optimized. I will investigate whether this is caused by a missing configuration, the attention implementation, or something else.
For reference, here's the sample image I ran it on:
I modified the code snippet to use the FlashAttention 2 implementation, and performance improved from over 2 minutes to 19 seconds, with GPU VRAM usage dropping from the massive 45 GB to just 3.3 GB.
Could I open a pull request with the new snippet?
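For reference, a minimal sketch of the kind of change involved, assuming the standard `transformers` `from_pretrained` API; the model id here is an assumption, so use the exact id and classes from the model card:

```python
# Sketch of loading with FlashAttention 2 enabled (assumed model id below).
# from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "PaddlePaddle/PaddleOCR-VL"  # assumed id; check the model card

# Passing attn_implementation="flash_attention_2" at load time is the key change.
# FlashAttention 2 requires half-precision weights, hence the dtype setting.
LOAD_KWARGS = {
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "trust_remote_code": True,
}

# model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **LOAD_KWARGS)
# processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(LOAD_KWARGS["attn_implementation"])  # → flash_attention_2
```

Note that `flash-attn` must be installed separately (`pip install flash-attn`) and only works on supported NVIDIA GPUs.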
@maksym-ostapenko
Hi! I recently updated the README to modify the Transformers code and enable the use of FlashAttention 2. This change significantly reduces memory usage and improves performance.
Here's the pull request I opened for the model card README: Model Card PR
> I modified the code snippet to use the FlashAttention 2 implementation, and performance improved from over 2 minutes to 19 seconds, with GPU VRAM usage dropping from the massive 45 GB to just 3.3 GB.
> Could I open a pull request with the new snippet?
Contributions are highly welcome!
@Vinci
Hi! You can find the updated, optimized Colab code under the new section "Click to expand: Use flash-attn to boost performance and reduce memory usage" on the model card.
And no problem, here's the full notebook of the experiment. Please comment out the first part, as I assume it will crash on the free T4 GPU due to limited memory; go directly to the second section, which uses flash-attn.
paddle-paddle-inference
I found that using `with torch.inference_mode():` makes a much larger difference than using flash-attn.
At the end of the script:

```python
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False, use_cache=True)
outputs = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(outputs)
```
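The effect of `inference_mode` can be seen on a tiny standalone example: inside the context, autograd tracking is disabled, so no computation graph (and its extra memory) is built:

```python
import torch

x = torch.ones(4, requires_grad=True)

# Normal forward pass: autograd records the graph for backprop.
y = (x * 2).sum()
print(y.requires_grad)  # True

# Under inference_mode, autograd tracking is disabled entirely,
# so no graph is built and tensor ops skip the bookkeeping overhead.
with torch.inference_mode():
    z = (x * 2).sum()
print(z.requires_grad)  # False
```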
Hi @alex-dinh,
What is the difference in time and VRAM when processing one image? And what is your technical analysis, if this holds up?
Here's my benchmarking data from my RTX 5060 Ti machine for one image (dimensions 1468x1140). SDPA is the default attention mode.
| Inference mode? | Attn mode | Inference Time (seconds) | VRAM Usage (GB) |
|---|---|---|---|
| Yes | flash_attention_2 | 9.196 | 2.844 |
| No | flash_attention_2 | 10.058 | 2.825 |
| Yes | sdpa | 7.390 | 2.954 |
| No | sdpa | 8.053 | 2.979 |
I'm not too familiar with the different attention types, but I know that inference mode prevents gradients from being computed during model execution, which saves a lot of compute; gradients only matter during training.
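For anyone wanting to reproduce numbers like these, here is a small hedged sketch of how such a measurement could be taken with standard PyTorch utilities (`time.perf_counter` plus `torch.cuda.max_memory_allocated`); the dummy workload is a placeholder for the real `model.generate(...)` call:

```python
import time
import torch

def benchmark(fn):
    """Time one call and report peak CUDA memory (GB) if a GPU is present."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # flush pending kernels before timing
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the GPU to finish
    elapsed = time.perf_counter() - start
    peak_gb = (torch.cuda.max_memory_allocated() / 1e9
               if torch.cuda.is_available() else 0.0)
    return result, elapsed, peak_gb

# Dummy workload; replace with lambda: model.generate(**inputs, ...)
_, secs, gb = benchmark(lambda: sum(range(10_000)))
print(f"{secs:.3f}s, {gb:.3f} GB peak VRAM")
```

The `synchronize` calls matter: CUDA kernels launch asynchronously, so without them the timer can stop before the GPU has actually finished.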
For any Linux users, remember to run `sudo apt install -y nvidia-cuda-toolkit`.
@alex-dinh
Hi, please share a Colab notebook so these numbers can be reproduced.
For the experiment I did, you can find a notebook above reproducing what I found.

