token masking and space_between_special_tokens during finetuning

#36
by Darkhn - opened

Good morning team Gemma,

After fine-tuning gemma-4-31B-it, I have a few questions, since the way the model reasons has changed.

I make extensive use of asterisks in my dataset as paragraph delimiters. After training, the model no longer reasons at all unless I prefill an asterisk right after the thought channel opener, like so: <|channel>thought\n*

My questions are:

  1. Does it require space_between_special_tokens, meaning I should process my dataset like so?

<|channel>thought\n *

  2. Since the prompt-formatting guide says to add those control tokens to non-reasoning datasets (which I did), must those tokens be masked?

https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4

image

The first two screenshots show processed samples with reasoning; the third is from a non-reasoning dataset.

image

  3. I see that EOS and EOT are different. Must EOS be added to the end of every sample, or was EOS used only during pre-training? (The chat template does not seem to add it.)

Thanks for your time. The model is simply one of the best. It does seem that hybrid attention (sliding + full) leads to great results; even Qwen3.5 27B is punching above its weight, but Gemma 4, with its world/media knowledge, punches way above its weight. I was still running Llama 3.3 70B and Mistral Large 2 123B, since there was no real dense alternative; most models in the 24-32B range were not getting close.

Google org

Hi @Darkhn, apologies for the late response.
The dependency on the asterisk (*) is a result of pattern association during SFT: since your dataset consistently used it as a delimiter, the model now treats it as the statistical trigger for the reasoning state. To fix this, you should diversify your training data so it does not rely on a single character as an anchor.
If your training data contains a space (<|channel>thought\n ) but your inference prefill does not (<|channel>thought\n), you are introducing a distribution shift. You must ensure your inference prefill exactly matches the token sequence used during training.
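One way to check this is to encode a training sample and the inference prefill with the same tokenizer and look for the first position where they diverge. The following is only a minimal sketch: the helper names `first_mismatch` and `check_prefill` are hypothetical, and the character-level `toy_encode` is a stand-in assumption so the example runs without downloading a tokenizer.

```python
from typing import Callable, List, Optional


def first_mismatch(train_ids: List[int], prefill_ids: List[int]) -> Optional[int]:
    """Return the first index where the prefill diverges from the training
    sequence, or None if the prefill is an exact prefix of it."""
    for i, tok in enumerate(prefill_ids):
        if i >= len(train_ids) or train_ids[i] != tok:
            return i
    return None


def check_prefill(
    encode: Callable[[str], List[int]], train_text: str, prefill_text: str
) -> Optional[int]:
    """Encode both strings with the same tokenizer and compare token ids."""
    return first_mismatch(encode(train_text), encode(prefill_text))


# Stand-in tokenizer (assumption): one id per character.
def toy_encode(text: str) -> List[int]:
    return [ord(c) for c in text]


opener = "<|channel>thought\n"
train_sample = opener + " * first paragraph"   # training data has a space
prefill_with_space = opener + " *"             # matches training
prefill_without_space = opener + "*"           # does not match training

print(check_prefill(toy_encode, train_sample, prefill_with_space))
print(check_prefill(toy_encode, train_sample, prefill_without_space))
```

With a real Hugging Face tokenizer you would pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` as `encode`. A subword tokenizer may also merge characters differently at the end of a prefill than in the middle of a full sample, so a mismatch near the end is worth inspecting by hand.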
To answer your second question: when fine-tuning larger Gemma models with a dataset that does not include thinking, you can achieve better results by adding the empty channel to your training prompts.
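Whether those added control tokens should also be masked is not stated here. If you do decide to exclude them from the loss, a common approach is to set their label positions to -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below assumes you already have the token ids of the span you want to mask; `mask_span` is a hypothetical helper and the ids in the example are made up.

```python
from typing import List, Sequence


def mask_span(
    input_ids: Sequence[int], span_ids: Sequence[int], ignore_index: int = -100
) -> List[int]:
    """Return labels equal to input_ids, with every occurrence of span_ids
    replaced by ignore_index so those positions do not contribute to the loss."""
    ids = list(input_ids)
    span = list(span_ids)
    labels = list(ids)
    n = len(span)
    if n == 0:
        return labels
    i = 0
    while i <= len(ids) - n:
        if ids[i:i + n] == span:
            labels[i:i + n] = [ignore_index] * n
            i += n  # skip past the masked span
        else:
            i += 1
    return labels


# Made-up ids: pretend [7, 8, 9] is the tokenized empty channel.
example_ids = [1, 7, 8, 9, 42, 43, 2]
print(mask_span(example_ids, [7, 8, 9]))
```

This only changes the labels; the control tokens stay in `input_ids`, so the model still sees them as context.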
Thanks
