why UMT5

by pszemraj - opened Apr 4, 2024

Apr 4, 2024

Why does this use UMT5 for the model class/arch (for a model trained primarily on English), yet the card says nothing about it?

From some test fine-tuning of this model, the gradients do not seem to update anything except the LM head when using run_summarization.py, which might be related to this. Using t5-v1_1 in this model's place works fine.
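For anyone wanting to reproduce this kind of check, here is a minimal sketch of one way to see which parts of a model actually moved during fine-tuning. It is not the script used above. It assumes you record one scalar checksum per parameter before and after a few optimizer steps, for example with `{n: p.detach().double().sum().item() for n, p in model.named_parameters()}` in PyTorch. The helper name `changed_parameter_groups` and the toy values are hypothetical, and a sum checksum can in principle miss updates that cancel out.

```python
from collections import defaultdict


def changed_parameter_groups(before, after, depth=1, tol=0.0):
    """Count changed vs. total parameters per name prefix.

    `before` and `after` map parameter names to a scalar checksum
    (assumption: e.g. the sum of each tensor's elements).
    `depth` is how many dot-separated name parts form the group key.
    """
    groups = defaultdict(lambda: [0, 0])  # prefix -> [changed, total]
    for name, old in before.items():
        prefix = ".".join(name.split(".")[:depth])
        groups[prefix][1] += 1
        if abs(after[name] - old) > tol:
            groups[prefix][0] += 1
    return {prefix: tuple(counts) for prefix, counts in groups.items()}


# Toy checksums standing in for real snapshots (hypothetical values).
before = {
    "encoder.block.0.layer.0.SelfAttention.q.weight": 1.25,
    "decoder.block.0.layer.0.SelfAttention.q.weight": -0.5,
    "lm_head.weight": 3.0,
}
# Only the LM head differs after "training" in this toy example.
after = dict(before, **{"lm_head.weight": 3.1})

report = changed_parameter_groups(before, after)
for prefix, (changed, total) in report.items():
    print(f"{prefix}: {changed}/{total} parameters changed")
```

If only the `lm_head` group reports changes on a real run, that matches the symptom described here; passing `depth=3` would break the report down per block instead.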

lintang

EleutherAI org Apr 4, 2024

Hi, UMT5 model checkpoints were originally trained with T5x while T5v1.1 uses the text-to-text repository. I used T5x for this and since it’s compatible, I figured it would be easier to use UMT5. Please also note this is still a WIP and an official release/blogpost is coming soon.

I can also check. What script was this from?

pszemraj

Apr 16, 2024

hey! sorry for the delay. So in the process of going through my stuff/writing this response, I realized that this model uses a verbatim T5 tokenizer, while both the smaller (base) and larger (xl) checkpoints use the llama tokenizer. Is this model supposed to use that too?

lintang

EleutherAI org Apr 17, 2024

Thanks for letting me know. I've updated it.

lintang changed discussion status to closed Apr 17, 2024

pszemraj

Apr 17, 2024

awesome thanks! let me know if I should create an issue elsewhere, but either I'm doing something wrong, or the UMT5 arch has a bug where params do not update for anything but the task-specific head. Have you guys fine-tuned your actual checkpoints on HF with any of the example scripts or similar?

Running summarization with your pile t5 base

if I update the state_dict etc to use standard T5 arch/ T5ForConditionalGeneration

if you find it useful/want to explore further the wandb project is open here

stellaathena

EleutherAI org Apr 28, 2024

This seems like a HF-specific bug. Very frustrating, but we did also release the T5x-compatible checkpoints which don't have this issue (add -t5x to the end of the URL).

zokica

Jun 29, 2024

edited Jun 29, 2024

I do not understand. Did you use the T5 tokenizer or the llama tokenizer to train the large model?

The results of the large model are actually worse than those of the other models, so I guess you made a mistake and used the wrong tokenizer. Otherwise you would see improvements, as with the other models.
