RaymondLi/sc2-1b-data-ablation

	1B-parameter models trained on Python-only datasets. Each branch of this repository holds a model trained on a different version of the Stack:
	- stack v1
	- stack v2 - permissive
	- stack v2 - permissive and unlicensed

	Each model has 24 layers, a hidden size of 2048, and 16 attention heads (multi-query attention).
	The learning rate reaches its peak of $4\times10^{-4}$ after a warmup of $1000$ steps, then follows a cosine decay to $4\times10^{-5}$ at the end of training.
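
The schedule described above can be sketched as a function of the step index. This is a minimal illustration, not the training code: it assumes the warmup is linear from zero (the warmup shape is not stated here) and that the cosine decay spans the remaining steps up to the $100$k iterations of training. The function name `learning_rate` is hypothetical.

```python
import math

PEAK_LR = 4e-4         # learning rate reached at the end of warmup
MIN_LR = 4e-5          # learning rate at the end of training
WARMUP_STEPS = 1_000
TOTAL_STEPS = 100_000  # number of training iterations


def learning_rate(step: int) -> float:
    """Learning rate at a given step (hypothetical helper).

    Assumption: linear warmup from 0, then cosine decay to MIN_LR.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```

Under these assumptions, the rate is $4\times10^{-4}$ at step $1000$ and $4\times10^{-5}$ at step $100$k.
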
	Training uses a batch size of 128 samples of 8192 tokens each (about $1$M tokens per step) for $100$k iterations, so each model sees roughly $100$B tokens by the end of training.
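
The token budget follows directly from the batch size, sequence length, and iteration count. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
BATCH_SIZE = 128      # samples per iteration
SEQ_LEN = 8192        # tokens per sample
ITERATIONS = 100_000  # training iterations

tokens_per_step = BATCH_SIZE * SEQ_LEN        # 1,048,576 tokens
total_tokens = tokens_per_step * ITERATIONS   # 104,857,600,000 tokens

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
```

The exact product is about $105$B tokens, which is the figure rounded to $100$B.
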
	We use a fill-in-the-middle (FIM) rate of $0.5$, the same tokenizer as StarCoder (except for tokenizer ablations), and learned absolute positional embeddings.
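
To illustrate what a FIM rate of $0.5$ means, here is a rough sketch that rewrites a sample into fill-in-the-middle form with probability `fim_rate` and leaves it unchanged otherwise. It is a simplification under several assumptions: it works on characters rather than tokens, it always uses the prefix-suffix-middle ordering, and it uses the StarCoder sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`. The helper `maybe_fim` is hypothetical and is not the preprocessing code used for these models.

```python
import random

# Sentinel strings from the StarCoder tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def maybe_fim(text: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Apply a character-level FIM transform with probability fim_rate."""
    if rng.random() >= fim_rate:
        return text  # keep the sample in ordinary left-to-right form
    # Pick two cut points and split the text into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # Assumption: prefix-suffix-middle ordering, with the middle moved last.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

With `fim_rate=0.5`, about half of the samples passed through `maybe_fim` would be rearranged, and the rest would be left as plain text.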