Extremely slow inference (~130s) on RTX 3080 - "slow image processor" warning

#3 · opened by simpple28

TranslateGemma-4b-it: extremely slow inference (~130 seconds per translation). Am I doing something wrong?

Environment

  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 3080 (10GB VRAM)
  • Python: 3.13
  • PyTorch: 2.6.0+cu124
  • Transformers: 4.57.6

Issue

I'm seeing extremely slow inference (~80-130 seconds) even for simple single-word translations like "Hello". The delay is roughly constant regardless of input length: short and long inputs take about the same time.

Warning Message

Every time the model loads, I see this warning:

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model.
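
The fast processor can be requested explicitly at load time; this is the variant I tried (item 3 below), though the warning persists:

from transformers import AutoProcessor

# Request the fast (torchvision-backed) image processor explicitly.
# This should only affect image preprocessing speed, not text-only generation.
processor = AutoProcessor.from_pretrained(
    "google/translategemma-4b-it",
    use_fast=True,
)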

What I've Tried

  1. Official documentation code (direct initialization) - Still slow
  2. 4-bit quantization with bitsandbytes - Still slow (config sketch after this list)
  3. Setting use_fast=True on AutoProcessor - Warning still appears
  4. Using AutoImageProcessor with use_fast=True - Warning still appears
  5. Bypassing pipeline, using model.generate() directly - Still slow
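
For completeness, the 4-bit attempt (item 2) looked roughly like this; exact settings reconstructed from memory:

import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# NF4 4-bit quantization via bitsandbytes. This reduces VRAM usage but adds
# dequantization overhead, so it is not expected to speed up inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "google/translategemma-4b-it",
    device_map="auto",
    quantization_config=bnb_config,
)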

My Code (Official Documentation Style)

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/translategemma-4b-it"
# Load the processor, and the model in bfloat16 with automatic device placement.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "en",
                "target_lang_code": "ko",
                "text": "Hello",
            }
        ],
    }
]

# Build the prompt via the chat template and move the batch to the GPU.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

input_len = len(inputs['input_ids'][0])

# Greedy decoding (do_sample=False); the output includes the prompt tokens,
# so slice them off before decoding.
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Observations

  • GPU VRAM usage: ~9GB (model loads successfully)
  • GPU utilization during inference: High
  • The delay appears to be a fixed overhead (~90 seconds) regardless of input length (see the timing sketch below)
  • Short text (1 word) and long text (500 words) both take similar time
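
To check where the time goes, I used a rough timing harness like the following (reusing processor, model, and messages from the script above) to separate prompt processing from generation:

import time
import torch

# Time preprocessing and generation separately. torch.cuda.synchronize()
# ensures the GPU work has actually finished before each timestamp is taken.
t0 = time.perf_counter()
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
torch.cuda.synchronize()
t1 = time.perf_counter()

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
torch.cuda.synchronize()
t2 = time.perf_counter()

new_tokens = generation.shape[-1] - inputs["input_ids"].shape[-1]
print(f"preprocess: {t1 - t0:.2f}s, generate: {t2 - t1:.2f}s, "
      f"tokens/s: {new_tokens / (t2 - t1):.2f}")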

Questions

  1. Is this expected behavior for TranslateGemma-4b-it on consumer hardware?
  2. Is the "slow image processor" warning causing the delay?
  3. Is there a way to use a fast image processor for text-only translation?
  4. Are there any recommended optimizations for Windows + CUDA environment?

Any help would be greatly appreciated!


bitsandbytes is slow.
Try AWQ + vLLM, or EXL3, instead. For example, one of these quantized checkpoints:

kaitchup/translategemma-4b-it-FP8-Dynamic
kaitchup/translategemma-4b-it-NVFP4
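
A minimal vLLM sketch with the FP8 checkpoint. This assumes vLLM supports the model architecture and that its chat template accepts the language-code fields; the exact message schema may need adjusting:

from vllm import LLM, SamplingParams

# Sketch only: assumes vLLM can load this architecture and that the
# chat template handles the language-code fields shown below.
llm = LLM(model="kaitchup/translategemma-4b-it-FP8-Dynamic")
params = SamplingParams(temperature=0.0, max_tokens=256)

messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "source_lang_code": "en",
        "target_lang_code": "ko",
        "text": "Hello",
    }],
}]

# llm.chat applies the model's chat template before generating.
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)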
