Is the training speed of T5Gemma really much slower than Qwen/InternVL?

#4
by Hzzone - opened

I am trying to train a VL model using t5gemma. I expected the encoder-decoder to be slower than the decoder-only structure, but I didn't expect it to be this much slower.

Compared to InternVL, the number of cropped images is similar for both models. However, InternVL achieves 2 it/s, while T5Gemma is at 4 s/it.

This gap is a bit too large. Is there something wrong?

Google org

Hey @Hzzone ,

While an encoder-decoder model like T5Gemma-2 is naturally heavier than a decoder-only architecture, the 8x gap is larger than what is typically expected from architectural differences alone. Encoder-decoder models do incur additional compute from running both stacks plus cross-attention, but that overhead by itself usually doesn't translate into such a dramatic slowdown. This suggests there may be an optimisation bottleneck in your current setup.
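For reference, the "8x" figure follows directly from the two throughput numbers reported above, once both are expressed in seconds per iteration:

```python
# InternVL was reported at 2 iterations/second,
# T5Gemma at 4 seconds/iteration. Convert to a common unit.
internvl_s_per_it = 1 / 2.0   # 2 it/s -> 0.5 s/it
t5gemma_s_per_it = 4.0        # already in s/it

gap = t5gemma_s_per_it / internvl_s_per_it
print(gap)  # -> 8.0
```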
One common cause is the attention implementation. T5Gemma-2 uses a merged attention design that combines self- and cross-attention, and performance can vary significantly depending on whether optimised kernels are enabled. If the training run is falling back to an unfused or eager attention path, throughput can drop substantially on modern GPUs. It's worth checking the `attn_implementation` setting and the logs to confirm which backend is active.
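As an illustrative sketch (not T5Gemma-2's actual merged-attention kernels), the difference between an eager attention path and a fused one can be seen with PyTorch's `scaled_dot_product_attention`: both compute the same result, but the fused kernel is typically much faster on modern GPUs.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Eager (unfused) attention: explicit softmax(Q K^T / sqrt(d)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
eager_out = scores.softmax(dim=-1) @ v

# Fused attention via PyTorch's optimised kernel
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(eager_out, fused_out, atol=1e-5))  # True
```

In Transformers, the backend is requested via the `attn_implementation` argument to `from_pretrained` (for example `"sdpa"` or `"flash_attention_2"`, subject to what the model and your environment support).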
You can also verify that the SigLIP vision encoder is frozen (requires_grad=False). If the vision tower is being trained unintentionally, the backward pass cost increases significantly and can easily widen the performance gap compared to setups where the vision backbone is frozen.
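A minimal sketch of the freezing check, using a toy stand-in for the model (`vision_tower` here is a hypothetical module name; use your model's actual attribute layout):

```python
import torch.nn as nn

# Toy stand-in for a VL model: a "vision tower" plus a decoder head.
model = nn.ModuleDict({
    "vision_tower": nn.Linear(32, 32),
    "decoder": nn.Linear(32, 32),
})

# Freeze the vision encoder so it receives no gradients in backward.
for p in model["vision_tower"].parameters():
    p.requires_grad = False

# Count trainable vs total parameters to confirm the freeze took effect.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable, total)  # only the decoder's parameters remain trainable
```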
Please also make sure the comparison is end to end: confirm total token counts (input + output), the vision tokenisation strategy, and numeric precision. T5Gemma-2 is optimised for bfloat16, and running in FP32 or with higher effective token counts can materially impact throughput. Checking these factors should help explain, and likely close, a significant portion of the gap.
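A quick way to verify the precision point is to check the dtype of the weights and activations; the sketch below casts a toy module, as one would with `torch_dtype=torch.bfloat16` in a Transformers `from_pretrained` call:

```python
import torch
import torch.nn as nn

# Cast a small module to bfloat16 and confirm both the weights
# and the outputs are actually running in reduced precision.
model = nn.Linear(16, 16).to(torch.bfloat16)

x = torch.randn(4, 16, dtype=torch.bfloat16)
y = model(x)
print(model.weight.dtype, y.dtype)  # torch.bfloat16 torch.bfloat16
```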

Thank you!

Thank you for your reply. I have frozen the ViT. I found that the main cause is gradient_checkpointing; enabling it has a much greater impact here than on a decoder-only model. After turning it off, the batch size can stay the same, but training speed improves significantly: 120 samples/s vs 87 samples/s. I think this comparison meets expectations.
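For anyone hitting the same issue: gradient checkpointing drops activations in the forward pass and recomputes them during backward, so the output is identical but backward costs extra compute, which is the slowdown observed above. A toy sketch (in Transformers the switch is typically `model.gradient_checkpointing_enable()` / `gradient_checkpointing_disable()`):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64, requires_grad=True)

# Plain forward: intermediate activations are stored for backward.
plain = block(x)

# Checkpointed forward: activations are discarded and recomputed in
# backward, trading extra compute for lower memory.
ckpt = checkpoint(block, x, use_reentrant=False)

print(torch.allclose(plain, ckpt))  # True — same result either way
```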

Google org

Thank you,
It's great to hear that the speed has increased.
Please reach out if you have any other questions.
