Significant Performance Regression and Content Omission after 8-bit Quantization

#13
by Emily901017 - opened
  1. 8-bit Latency Regression
    Observation: Inference time jumped from 4s (FP16) to 22s (8-bit).

Memory: Dropped from 8.6GB to 5.5GB.

Problem: The quantization overhead is significantly higher than the memory savings, making 8-bit unusable in this environment.
We are currently using bitsandbytes for this; are there any alternative quantization methods or optimizations recommended for this specific model to avoid such a penalty?

  2. Max Tokens Latency Bug
    Observation: Increasing max_new_tokens (e.g., 1024) drastically increases latency, even if the model generates only a few words.

Problem: The system seems to incur costs proportional to the limit set, rather than the actual tokens generated.

  3. Vision Task Omission
    Observation: For images with dense text, the model omits significant portions of the content.

Problem: Raising max_new_tokens does not improve content coverage; it only slows down the process.

Google org

Hi @Emily901017 ,

The issues you are facing are specific to how this architecture interacts with the standard Hugging Face transformers and bitsandbytes libraries.

The bitsandbytes library was designed primarily to fit larger models into limited VRAM, not to speed up inference. When you run in 8-bit mode, the weights are stored in int8. However, the GPU cannot natively perform matrix multiplication between int8 weights and fp16 activations in the specific way bitsandbytes implements it.
As a result, for every layer the library must fetch the int8 weights, dequantize them back to fp16 on the fly, and perform the compute in fp16. You save memory, but you add a substantial computational step to every forward pass. For smaller models, this compute overhead dwarfs the bandwidth savings.
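To make the overhead concrete, here is a minimal numpy sketch of the dequantize-on-the-fly pattern described above (a simplified stand-in, not the actual bitsandbytes kernels; the shapes, per-column absmax scaling, and fp32-instead-of-fp16 arithmetic are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w_fp = rng.standard_normal((64, 64)).astype(np.float32)   # full-precision weights
x = rng.standard_normal((1, 64)).astype(np.float32)       # activations

# Quantize: per-column absmax scaling down to int8 (this is what saves memory)
scale = np.abs(w_fp).max(axis=0) / 127.0
w_int8 = np.round(w_fp / scale).astype(np.int8)

# 8-bit inference path: every forward pass must first dequantize the weights
# back to float, THEN do the matmul -- extra work on every layer, every step.
w_deq = w_int8.astype(np.float32) * scale
y_8bit = x @ w_deq

# Full-precision path: the matmul runs directly on the stored weights.
y_fp = x @ w_fp

# The outputs are close, but the 8-bit path did strictly more compute.
print(float(np.max(np.abs(y_8bit - y_fp))))
```

The stored weights shrink 4x (int8 vs fp32 here), which is the VRAM saving you observed, but the dequantization step runs on every forward pass, which is where the latency regression comes from.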
Try using AutoAWQ or GGUF quantizations.

For the max-token latency, the behaviour indicates your inference pipeline is using a static KV cache: the cache is pre-allocated to the full max_new_tokens length, so each step pays for the limit you set rather than the tokens actually generated. Try attn_implementation="flash_attention_2" and ensure dynamic caching is enabled.
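The cost difference can be illustrated with a toy model of the two caching strategies (this is a conceptual sketch, not transformers internals; `generation_cost` and its "work units" are invented for illustration):

```python
# Toy illustration of why a static KV cache makes cost scale with
# max_new_tokens: its buffers are sized for the limit, so every step touches
# the full pre-allocated length, while a dynamic cache grows with the tokens
# actually produced.

def generation_cost(max_new_tokens: int, eos_step: int, static_cache: bool) -> int:
    """Count attention 'work units' (cache positions visited) until EOS or the limit."""
    cost = 0
    for step in range(max_new_tokens):
        # static: attend over the whole pre-allocated cache every step;
        # dynamic: attend only over the tokens generated so far.
        cost += max_new_tokens if static_cache else (step + 1)
        if step + 1 >= eos_step:  # model emits EOS after `eos_step` tokens
            break
    return cost

# Model emits only 5 tokens, but the limit is 1024:
print(generation_cost(1024, 5, static_cache=True))   # -> 5120 (pays for the limit)
print(generation_cost(1024, 5, static_cache=False))  # -> 15   (pays for actual tokens)
```

This matches your observation: with a static cache, raising max_new_tokens to 1024 inflates latency even when only a few words are generated.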

For the vision omission: as noted in the model card, input images are normalised to 896 x 896 resolution and encoded to 256 tokens each. This fixed token budget cannot be raised by changing generation parameters, so you can implement a sliding-window (image-tiling) strategy in your input pipeline instead. One approach: split the dense document into 4 overlapping quadrants, run translategemma on each quadrant independently, and combine the text outputs. This effectively increases the resolution available per region.
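A minimal sketch of the quadrant-tiling idea, assuming a hypothetical helper `quadrant_boxes` that computes 4 overlapping crop rectangles (the `overlap` fraction is an assumption you would tune so that text on the seams is not cut in half):

```python
# Compute 4 overlapping crop boxes for a dense image so each tile can be
# cropped and sent through the model independently.

def quadrant_boxes(width: int, height: int, overlap: float = 0.1):
    """Return 4 (left, top, right, bottom) boxes covering overlapping quadrants."""
    ox, oy = int(width * overlap), int(height * overlap)  # overlap in pixels
    mx, my = width // 2, height // 2                      # midpoints
    return [
        (0, 0, mx + ox, my + oy),           # top-left
        (mx - ox, 0, width, my + oy),       # top-right
        (0, my - oy, mx + ox, height),      # bottom-left
        (mx - ox, my - oy, width, height),  # bottom-right
    ]

# Each box can be cropped (e.g. with PIL's Image.crop), run through the model
# separately, and the per-tile outputs concatenated / de-duplicated afterwards.
for box in quadrant_boxes(1792, 1792):
    print(box)
```

Since each 896 x 896 tile still gets the full 256-token budget, the effective resolution over the original document is roughly quadrupled.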

Thank you!
