Model vocab size and tokenizer vocab size are not the same. Is this a problem for training?

by GokhanAI - opened Feb 16, 2024

GokhanAI

Feb 16, 2024

edited Feb 16, 2024

Which one is correct, and how do I solve this problem?
model.config.vocab_size = 157824
tokenizer.vocab_size = 157797

We worked around the problem like this:

"""
# Resize the embedding matrix only when the two sizes disagree.
if model.config.vocab_size != tokenizer.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
"""

But will this solution cause problems during training? It also reduces the embedding vocab size.

tosh97

Feb 16, 2024

Hey GokhanAI!

You don’t need to resize the embedding matrix. It’s fine if the model's embedding matrix has more rows than the tokenizer has tokens. We pad the embedding matrix to a multiple of 64 to take advantage of Ampere GPUs. That is why model.config.vocab_size > tokenizer.vocab_size.
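
The padding arithmetic can be checked against the two numbers reported in this thread. This is a minimal sketch, assuming the padding simply rounds the tokenizer size up to the next multiple of 64; `pad_to_multiple` and `ids_fit` are hypothetical helper names for illustration, not part of any library:

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def ids_fit(embedding_rows: int, tokenizer_size: int) -> bool:
    """True if every token id in [0, tokenizer_size) has an embedding row."""
    return tokenizer_size <= embedding_rows


tokenizer_size = 157797  # tokenizer.vocab_size reported in this thread
embedding_rows = 157824  # model config vocab size reported in this thread

padded = pad_to_multiple(tokenizer_size, 64)
unused_rows = embedding_rows - tokenizer_size

# Prints: 157824 27 True
print(padded, unused_rows, ids_fit(embedding_rows, tokenizer_size))
```

The 27 extra rows correspond to ids the tokenizer never produces, so they are never looked up as inputs. The reverse case (a tokenizer larger than the embedding matrix) is the one that would cause index errors. If you do need to resize, recent versions of transformers let `resize_token_embeddings` take a `pad_to_multiple_of` argument so the padded shape can be kept.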

tosh97

Feb 16, 2024

Please let me know if that works; otherwise, I can provide a short snippet on how to further fine-tune the model for a downstream task.

GokhanAI

Feb 17, 2024

Thank you for your reply.

I want to apply the SFT and DPO recipes from huggingface/alignment-handbook. I plan to take the datasets they suggest and adapt them to Turkish, with the goal of building a chat model. Do you think this can succeed? I don't know how well it will work in Turkish, so I would appreciate your thoughts and advice.

tosh97

Feb 19, 2024

Hey Gokhan! Yes, I believe it should work with Turkish. However, this depends on your dataset and its quality. One chat format you could model yours on is the one used by HuggingFaceH4/zephyr-7b-alpha.

That would look something like:

messages = [
    {
        "role": "system",
        "content": "Sen arkadaş canlısı bir sohbet robotusun.",
    },
    {"role": "user", "content": "Adın ne senin??"},
]
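
To see roughly what such a message list becomes as a training or prompt string, here is a small sketch that renders it in a Zephyr-style layout. It is an approximation written for illustration: `render_zephyr_style` is a hypothetical helper, and the `</s>` end-of-sequence token and exact newline placement are assumptions, so check them against your own tokenizer's chat template.

```python
from typing import Dict, List

EOS = "</s>"  # assumption: end-of-sequence token; check your tokenizer's eos_token


def render_zephyr_style(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render each message as a role tag line, the content, then EOS (approximate layout)."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS}\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {
        "role": "system",
        "content": "Sen arkadaş canlısı bir sohbet robotusun.",
    },
    {"role": "user", "content": "Adın ne senin??"},
]

prompt = render_zephyr_style(messages)
print(prompt)
```

In practice you would usually not hand-roll this: with a tokenizer that ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` produces the string for you, and using the same template at training and inference time avoids format mismatches.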

GokhanAI

Feb 28, 2024

Hello, I used huggingface/alignment-handbook for training, and the results were not good at all. I also tried different SFT methods, but the model either produces no answer or produces completely meaningless sentences. I trained on data in Turkish. How can you help with this? Do you have any SFT or DPO code you recommend?
