Optimization details
#16
by
ShaneTian
- opened
I noticed that StarCoder2 uses two stages of pre-training, with stage 2 used for long-context training.
Taking StarCoder2-15B as an example:
- In stage 1: `rope_theta=1e4`, `warmup=1000`, `max_lr=3e-4`
- In stage 2: `rope_theta=1e5`, `max_lr=3e-5`

So my questions are:
- In stage 1, what is `min_lr`?
- In stage 2, what are `min_lr` and `warmup`?
`min_lr` is `max_lr/10`, and the warmup of the second stage is also 1000.
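Putting the numbers from this thread together, here is a minimal sketch of such a schedule. The warmup length and the `min_lr = max_lr / 10` floor come from the answer above; the linear-warmup shape, cosine decay, and `total_steps` value are assumptions for illustration, since the thread does not specify the decay schedule.

```python
import math

def lr_at_step(step, max_lr, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to min_lr = max_lr / 10.

    warmup_steps and the min_lr ratio follow this thread; the cosine
    shape and total_steps are illustrative assumptions.
    """
    min_lr = max_lr / 10
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr over warmup_steps.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For stage 1 this would warm up to `3e-4` and decay toward `3e-5`; for stage 2 the same function would be called with `max_lr=3e-5`, giving a floor of `3e-6`.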
Thanks
ShaneTian changed discussion status to closed