More than 40 GB of VRAM consumed by PaddleOCR-VL on an A100 GPU!!!
This is running on Colab with the official code snippet from the model card, using the `transformers` library. It consumed more than 40 GB of VRAM during inference on a single page, even though it is a simple page containing no more than 300 Arabic words. It also took more than 2 minutes to generate the result, which is far from optimized. I will investigate whether this is caused by a missing configuration, the attention implementation, or something else.
For reference, here's the sample image I ran it on:
I modified the code snippet to use the FlashAttention 2 implementation, and performance improved from over 2 minutes to 19 seconds, with GPU VRAM usage dropping from the massive 45 GB to just 3.3 GB.
Could I open a pull request with the new snippet?
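For reference, a minimal sketch of the kind of change involved, assuming the standard `transformers` `from_pretrained` API; the model id here is an assumption, so use the exact id and classes from the model card:

```python
# Sketch of loading with FlashAttention 2 enabled (assumed model id below).
# from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "PaddlePaddle/PaddleOCR-VL"  # assumed id; check the model card

# Passing attn_implementation="flash_attention_2" at load time is the key change.
# FlashAttention 2 requires half-precision weights, hence the dtype setting.
LOAD_KWARGS = {
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "trust_remote_code": True,
}

# model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **LOAD_KWARGS)
# processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(LOAD_KWARGS["attn_implementation"])  # → flash_attention_2
```

Note that `flash-attn` must be installed separately (`pip install flash-attn`) and only works on supported NVIDIA GPUs.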
@maksym-ostapenko
Hi! I recently updated the README to modify the Transformers code and enable the use of FlashAttention 2. This change significantly reduces memory usage and improves performance.
Here's the pull request I opened for the model card README: Model Card PR
> I modified the code snippet to use the FlashAttention 2 implementation, and performance improved from over 2 minutes to 19 seconds, with GPU VRAM usage dropping from the massive 45 GB to just 3.3 GB.
> Could I open a pull request with the new snippet?
Contributions are highly welcome!
@Vinci
Hi! You can find the updated, optimized Colab code under the new section "Click to expand: Use flash-attn to boost performance and reduce memory usage" on the model card.
And no problem, here's the full notebook of the experiment. Please comment out the first part, as I assume it will crash on the free T4 GPU due to limited memory; go directly to the second section, which uses flash-attn.
paddle-paddle-inference
I found that using `with torch.inference_mode():` makes a much larger difference than using flash-attn.
At the end of the script:

```python
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False, use_cache=True)
outputs = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(outputs)
```
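The effect of `inference_mode` can be seen on a tiny standalone example: inside the context, autograd tracking is disabled, so no computation graph (and its extra memory) is built:

```python
import torch

x = torch.ones(4, requires_grad=True)

# Normal forward pass: autograd records the graph for backprop.
y = (x * 2).sum()
print(y.requires_grad)  # True

# Under inference_mode, autograd tracking is disabled entirely,
# so no graph is built and tensor ops skip the bookkeeping overhead.
with torch.inference_mode():
    z = (x * 2).sum()
print(z.requires_grad)  # False
```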
Hi @alex-dinh,
What is the difference in time and VRAM when processing one image? And what is your technical analysis, if this holds up?
Here's my benchmarking data from my RTX 5060 Ti machine for one image (dimensions 1468x1140). SDPA is the default attention mode.
| Inference mode? | Attn mode | Inference Time (seconds) | VRAM Usage (GB) |
|---|---|---|---|
| Yes | flash_attention_2 | 9.196 | 2.844 |
| No | flash_attention_2 | 10.058 | 2.825 |
| Yes | sdpa | 7.390 | 2.954 |
| No | sdpa | 8.053 | 2.979 |
I'm not too familiar with the different attention types, but I know that inference mode prevents gradients from being computed during model execution, which saves a lot of compute; gradients only matter during training.
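For anyone wanting to reproduce numbers like these, here is a small hedged sketch of how such a measurement could be taken with standard PyTorch utilities (`time.perf_counter` plus `torch.cuda.max_memory_allocated`); the dummy workload is a placeholder for the real `model.generate(...)` call:

```python
import time
import torch

def benchmark(fn):
    """Time one call and report peak CUDA memory (GB) if a GPU is present."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # flush pending kernels before timing
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the GPU to finish
    elapsed = time.perf_counter() - start
    peak_gb = (torch.cuda.max_memory_allocated() / 1e9
               if torch.cuda.is_available() else 0.0)
    return result, elapsed, peak_gb

# Dummy workload; replace with lambda: model.generate(**inputs, ...)
_, secs, gb = benchmark(lambda: sum(range(10_000)))
print(f"{secs:.3f}s, {gb:.3f} GB peak VRAM")
```

The `synchronize` calls matter: CUDA kernels launch asynchronously, so without them the timer can stop before the GPU has actually finished.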
For any Linux users, remember to run `sudo apt install -y nvidia-cuda-toolkit`.
@alex-dinh
Hi, please share a Colab notebook so these numbers can be reproduced.
For the experiment I did, you can find a notebook above reproducing what I found.

