Review of gemma-3-270M
Hello, this is a very good model for its size; it is certainly very capable, with impressive benchmark results.
However, I am used to training my own models from scratch at about the same size (270M) without any problems.
When I try to fine-tune your Gemma, it crashes my session as if it were a much bigger model.
So it is unusable for fine-tuning in an open-source context, sorry.
For original creations, it's best to start from scratch.
Hi @Clemylia-LLMs
Thanks for the feedback and for trying Gemma.
Just to clarify from your message: it sounds like you usually train models from scratch (~270M parameters), and when attempting to fine-tune Gemma, your session crashes due to memory usage.
Even with similar parameter counts, different architectures can have very different training-time memory footprints. Factors such as attention implementation, optimizer states, sequence length, and framework defaults can significantly affect memory usage.
To help understand what’s happening in your case, could you share a bit more detail about your setup?
- Which Gemma checkpoint are you using?
- What hardware / GPU and VRAM are you running on?
- Are you attempting full fine-tuning or PEFT?
- What sequence length and batch size are you using?
- Are you using the Transformers Trainer, Accelerate, or a custom script?
Thanks
Hello!
I used the Gemma shown here (gemma-3-270m).
I tried fine-tuning it on a T4 GPU, doing full fine-tuning.
I don't remember any further details.
I'm used to training much larger models from scratch, so I was surprised by the crash.
A T4 (16 GB VRAM) should generally be capable of fine-tuning a ~270M parameter model, but with Gemma-3-270M there are a few architectural and training-time factors that can increase memory usage compared to similarly sized models trained from scratch.
A few things that could explain the crash:
- Optimizer state memory
Full fine-tuning with optimizers like AdamW requires storing model weights + gradients + optimizer states (~3×–4× the parameter size). Depending on framework defaults, this can grow quickly.
- Larger vocabulary
Gemma 3 270M uses a relatively large vocabulary (~256k tokens), so the embedding and output layers, along with their gradients and optimizer states, are proportionally larger than in most similarly sized models.
- Sequence length
If a default context length of 8k or higher was used, attention activations can dominate memory usage during training.
- Precision settings on T4
The T4 (Turing architecture) does not support bf16. If your training script defaults to bf16, some frameworks may fall back to fp32, which significantly increases memory usage.
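To put rough numbers on the optimizer-state and vocabulary points above, here is a back-of-the-envelope sketch. The batch size and sequence length are illustrative assumptions, not your actual settings:

```python
# Rough memory estimate for full fine-tuning of a ~270M-parameter
# model with AdamW in fp32. Batch size and sequence length below
# are assumed for illustration.

PARAMS = 270e6      # model parameters
BYTES_FP32 = 4

# AdamW keeps two extra fp32 states (momentum + variance) per
# parameter, so weights + gradients + optimizer states is roughly
# 16 bytes per parameter.
state_gb = PARAMS * (BYTES_FP32 * 4) / 1e9
print(f"weights + grads + AdamW states: ~{state_gb:.1f} GB")

# The output logits alone scale with batch * seq_len * vocab_size,
# and Gemma 3's vocabulary is ~262k entries, far larger than many
# similarly sized from-scratch models.
batch, seq_len, vocab = 8, 2048, 262_144
logits_gb = batch * seq_len * vocab * BYTES_FP32 / 1e9
print(f"fp32 logits for one step:       ~{logits_gb:.1f} GB")
```

With these assumed settings, the optimizer states plus the logits tensor already exceed a T4's 16 GB of VRAM before counting any other activations, which is consistent with the crash you saw.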
An alternative would be to use parameter-efficient methods like LoRA, which can significantly reduce memory usage by freezing most of the base model and training only small adapter layers.
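To make the LoRA savings concrete, here is a rough trainable-parameter count. The hidden size, layer count, rank, and choice of adapted modules below are assumptions for the sake of the arithmetic, not Gemma's exact configuration:

```python
# Illustrative comparison of trainable parameters under LoRA vs.
# full fine-tuning. Hidden size, layer count, rank, and target
# modules are assumed values, not Gemma's exact config.

def lora_params(d_in, d_out, rank):
    # A LoRA adapter factors the weight update into two low-rank
    # matrices: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

hidden = 640    # assumed hidden size
layers = 20     # assumed number of transformer layers
rank = 16       # LoRA rank

# Adapt only the q and v projections in each layer (a common choice).
trainable = layers * 2 * lora_params(hidden, hidden, rank)
full = 270e6

print(f"LoRA trainable params:  {trainable:,}")
print(f"fraction of full model: {trainable / full:.2%}")
```

Since only the adapter weights need gradients and optimizer states, the AdamW overhead shrinks by the same fraction, which is usually enough to fit comfortably on a T4.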
If you're able to share a minimal training script or configuration that reproduces the crash, we'd be happy to try running it on our side.
Thanks