Extremely slow inference (~130s) on RTX 3080 - "slow image processor" warning
#3 · by simpple28 · opened
TranslateGemma-4b-it extremely slow inference (~130 seconds per translation) - Am I doing something wrong?
Environment
- OS: Windows 11
- GPU: NVIDIA GeForce RTX 3080 (10GB VRAM)
- Python: 3.13
- PyTorch: 2.6.0+cu124
- Transformers: 4.57.6
Issue
I'm seeing extremely slow inference (~80-130 seconds) even for a single-word translation like "Hello". The delay is roughly constant regardless of input length: short and long text take about the same time.
Warning Message
Every time the model loads, I see this warning:
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model.
What I've Tried
- Official documentation code (direct initialization) - Still slow
- 4-bit quantization with bitsandbytes - Still slow
- Setting use_fast=True on AutoProcessor - Warning still appears
- Using AutoImageProcessor with use_fast=True - Warning still appears (a sketch of both use_fast attempts follows this list)
- Bypassing the pipeline and calling model.generate() directly - Still slow
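For completeness, roughly what I did for the two use_fast attempts (a minimal sketch; AutoProcessor forwards use_fast to its image-processor component):

import torch
from transformers import AutoImageProcessor, AutoProcessor

model_id = "google/translategemma-4b-it"

# Attempt 1: forward use_fast through the composite processor.
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

# Attempt 2: load the image processor directly with use_fast=True.
image_processor = AutoImageProcessor.from_pretrained(model_id, use_fast=True)

In both cases the "slow image processor" warning was still printed at load time.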
My Code (Official Documentation Style)
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/translategemma-4b-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "en",
                "target_lang_code": "ko",
                "text": "Hello",
            }
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

input_len = len(inputs['input_ids'][0])

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
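To localize the overhead, I also tried timing instrumentation like the following (timed_generate is my own helper, just a sketch). It separates one-time warm-up from steady-state speed and counts generated tokens, since max_new_tokens=1024 with do_sample=False could itself explain a long run if the model never emits EOS early:

import time

def timed_generate(model, inputs, **gen_kwargs):
    torch.cuda.synchronize()  # measure actual GPU work, not just kernel launches
    start = time.perf_counter()
    with torch.inference_mode():
        out = model.generate(**inputs, **gen_kwargs)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{elapsed:.1f}s, {n_new} new tokens, {n_new / elapsed:.2f} tok/s")
    return out

# First call includes one-time warm-up (CUDA context, kernel selection);
# the second call shows steady-state throughput.
timed_generate(model, inputs, max_new_tokens=1024, do_sample=False)
timed_generate(model, inputs, max_new_tokens=1024, do_sample=False)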
Observations
- GPU VRAM usage: ~9GB (model loads successfully)
- GPU utilization during inference: High
- The delay appears to be a fixed overhead (~90 seconds) regardless of input length
- Short text (1 word) and long text (500 words) both take similar time
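One further check (a sketch; hf_device_map is populated by Accelerate when device_map="auto" is used): with 10GB of VRAM and a roughly 8GB bf16 model, device_map="auto" can silently offload some layers to CPU, which would produce exactly this kind of large, input-independent slowdown.

# If device_map="auto" offloaded anything, some entries will say "cpu"
# or "disk" instead of a GPU index.
print(model.hf_device_map)

# Quick sanity check on parameter placement and dtype.
print(next(model.parameters()).device, next(model.parameters()).dtype)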
Questions
- Is this expected behavior for TranslateGemma-4b-it on consumer hardware?
- Is the "slow image processor" warning causing the delay?
- Is there a way to use a fast image processor for text-only translation?
- Are there any recommended optimizations for Windows + CUDA environment?
Any help would be greatly appreciated!
bitsandbytes is slow.
Try AWQ + vLLM, or EXL3, with a pre-quantized checkpoint:
- kaitchup/translategemma-4b-it-FP8-Dynamic
- kaitchup/translategemma-4b-it-NVFP4
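A minimal vLLM sketch for the FP8 checkpoint (assuming vLLM supports this model's architecture; note the RTX 3080 is Ampere, which has no native FP8 tensor cores, so vLLM may fall back to a weight-only kernel, and the NVFP4 checkpoint targets Blackwell GPUs and likely won't run on a 3080 at all):

from vllm import LLM, SamplingParams

# FP8-Dynamic checkpoint from above; the prompt string is a placeholder --
# build it with the model's chat template as in the Transformers example.
llm = LLM(model="kaitchup/translategemma-4b-it-FP8-Dynamic", max_model_len=2048)
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["<prompt built from the chat template>"], params)
print(outputs[0].outputs[0].text)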