Optimization details
#16
by
ShaneTian
- opened
I noticed that StarCoder2 uses two stages of pre-training, with stage 2 used for long-context training.
Taking StarCoder2-15B as an example:
- In stage 1: `rope_theta=1e4`, `warmup=1000`, `max_lr=3e-4`
- In stage 2: `rope_theta=1e5`, `max_lr=3e-5`

So my questions are:
- In stage 1, what is `min_lr`?
- In stage 2, what are `min_lr` and `warmup`?
`min_lr` is `max_lr/10`, and the warmup of the second stage is also 1000.
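Putting the numbers from this thread together, here is a minimal sketch of such a schedule. The warmup length and the `min_lr = max_lr / 10` floor come from the answer above; the linear-warmup shape, cosine decay, and `total_steps` value are assumptions for illustration, since the thread does not specify the decay schedule.

```python
import math

def lr_at_step(step, max_lr, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to min_lr = max_lr / 10.

    warmup_steps and the min_lr ratio follow this thread; the cosine
    shape and total_steps are illustrative assumptions.
    """
    min_lr = max_lr / 10
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr over warmup_steps.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For stage 1 this would warm up to `3e-4` and decay toward `3e-5`; for stage 2 the same function would be called with `max_lr=3e-5`, giving a floor of `3e-6`.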
Thanks
ShaneTian changed discussion status to closed