Extra tokens for UL2

by memyprokotow - opened Sep 22, 2025

Sep 22, 2025

I have a question about the tokenizer. This model was trained on the UL2 task, but its tokenizer config has no additional_special_tokens entries of the form extra_id_{number}. Is this intentional, or are you using different tokens in place of extra_id? They were present in the previous UL2-trained model: https://huggingface.co/google/ul2/blob/main/tokenizer_config.json

Renu11

Google org Oct 3, 2025

Yes, it's intentional. The model uses tokens (e.g. <unused0> to <unused98>) that serve the same purpose as sentinel tokens for the UL2 denoising objective. These tokens are already part of the model's main vocabulary (see added_tokens_decoder) and do not need to be listed separately under additional_special_tokens. Please have a look at the tokenizer_config.json file of t5gemma-2b-2b-ul2.
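In case it helps anyone reading along, here is a minimal sketch of how a span-corruption pair could be laid out with these tokens in place of extra_id_{number}. It is only an illustration, not the actual training pipeline: the helper names (sentinel, corrupt_spans), the whitespace tokenization, and the T5-style input/target layout are all assumptions, and the limit of 99 sentinels just mirrors the <unused0> to <unused98> range quoted above.

```python
from typing import List, Tuple

# Assumption: 99 sentinels, mirroring the <unused0>..<unused98> range quoted
# in the reply. Check added_tokens_decoder in tokenizer_config.json to confirm.
MAX_SENTINELS = 99


def sentinel(i: int) -> str:
    """Return the i-th sentinel token (hypothetical helper)."""
    if not 0 <= i < MAX_SENTINELS:
        raise ValueError(f"sentinel index {i} is outside 0..{MAX_SENTINELS - 1}")
    return f"<unused{i}>"


def corrupt_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each (start, end) span (end exclusive) with one sentinel.

    Returns (input_text, target_text) in a T5-style layout. This layout is an
    assumption for illustration, not a confirmed description of how the model
    was trained.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        inputs.extend(tokens[cursor:start])  # text kept before the span
        inputs.append(sentinel(i))           # the span collapses to a sentinel
        targets.append(sentinel(i))          # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the removed text
        cursor = end
    inputs.extend(tokens[cursor:])           # text kept after the last span
    return " ".join(inputs), " ".join(targets)


words = "The quick brown fox jumps over the lazy dog".split()
source, target = corrupt_spans(words, [(1, 3), (6, 8)])
print(source)  # The <unused0> fox jumps over <unused1> dog
print(target)  # <unused0> quick brown <unused1> the lazy
```

Real inputs would go through the model's tokenizer rather than str.split(), and UL2 mode prefixes are left out here since the thread does not say how this model handles them.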
