Can you provide more details on the training?

#10

by dequ777 - opened May 11, 2024

May 11, 2024

I tried to reproduce the result for the 700M BitNet b1.58 model, but I failed.
Instead of being S-shaped, the loss curve showed an exponential decay. The perplexity of my final model was 18.7, whereas the paper reports 12.87, a value that the 700M model you provided also achieves.
I believe my training setup matches the paper exactly, but I don't know how the RedPajama-100B training set was generated. Could you share more details about the training and the dataset?
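
To put the perplexity gap in terms of training loss, here is a minimal sketch. It assumes perplexity is the exponential of the mean per-token cross-entropy in nats, and that both numbers were measured with the same tokenizer and evaluation set, which this thread does not confirm. The function names are illustrative only.

```python
import math


def perplexity_from_loss(mean_nll_nats: float) -> float:
    """Perplexity, assuming the loss is mean per-token cross-entropy in nats."""
    return math.exp(mean_nll_nats)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean per-token cross-entropy (nats) implied by a perplexity."""
    return math.log(ppl)


# Values quoted in this thread.
reproduced_ppl = 18.7
paper_ppl = 12.87

# Difference in mean loss implied by the two perplexities (assumption: same eval setup).
gap_nats = loss_from_perplexity(reproduced_ppl) - loss_from_perplexity(paper_ppl)
print(f"implied loss gap: {gap_nats:.3f} nats per token")
```

If the final training or validation loss differs from the paper's by roughly this amount, the gap is in the optimization or the data rather than in the evaluation script.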

zwz2023

May 16, 2024

I ran into the same problem: my loss curve declined quickly when training a 1.1B model, and the MFU is very low.
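
For anyone comparing numbers, this is a rough sketch of how MFU (model FLOPs utilization) can be estimated. It assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass, ignoring attention FLOPs. The throughput and peak-FLOPs values are hypothetical placeholders, not measurements from this thread.

```python
def estimate_mfu(n_params: float, tokens_per_second: float,
                 peak_flops_per_second: float) -> float:
    """Estimate MFU as achieved FLOPs/s divided by hardware peak FLOPs/s.

    Assumption: training cost is about 6 * n_params FLOPs per token
    (forward + backward), with attention FLOPs ignored.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical example: a 1.1B-parameter model at 20,000 tokens/s per device,
# on a device with a hypothetical 312 TFLOP/s peak.
mfu = estimate_mfu(n_params=1.1e9, tokens_per_second=20_000,
                   peak_flops_per_second=312e12)
print(f"estimated MFU: {mfu:.1%}")
```

Reporting tokens per second per device alongside the hardware type makes it easier for others to check whether a low MFU comes from the kernels or from the data pipeline.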
