124M GPT with Symbolic Reasoning Distillation

Trained from scratch on a two-stream data mix with dual-alpha distillation, i.e. a separate teacher-distillation weight (alpha) per stream:

| Stream | Dataset | Alpha | Purpose |
|---|---|---|---|
| General | FineWeb-Edu | 0.2 | Language modeling, light teacher guidance |
| Reasoning | GSM8K chain-of-thought | 0.8 | Heavy distillation: teacher guides step-by-step math reasoning |
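
The card doesn't show the loss itself. A minimal sketch of what "dual-alpha distillation" plausibly means, assuming the standard soft-target formulation (the function name, `temperature`, and masking convention below are illustrative assumptions, not taken from the training code):

```python
import torch.nn.functional as F

def dual_alpha_distill_loss(student_logits, teacher_logits, labels, alpha, temperature=2.0):
    # student_logits, teacher_logits: (batch, seq, vocab); labels: (batch, seq).
    # Teacher logits come from the frozen SmolLM-135M-Instruct forward pass.
    vocab = student_logits.size(-1)

    # Hard-label next-token cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100
    )

    # Soft-target KL between temperature-scaled teacher and student distributions,
    # rescaled by t^2 as in standard knowledge distillation.
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / t, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab) / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # alpha = 0.2 on the general stream, 0.8 on the reasoning stream.
    return (1.0 - alpha) * ce + alpha * kl
```
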
- Teacher: SmolLM-135M-Instruct (frozen)
- Time: ~75 min on 1x A100
- Tokens: 327,680,000 (0 reasoning / 20,000 general batches, i.e. 16,384 tokens per batch)
- Best loss: 186.6474
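
For a quick smoke test, the checkpoint should load through the standard `transformers` causal-LM API if the uploaded config follows a stock GPT-2 layout; a fully custom architecture would instead need `trust_remote_code=True` or its own loading code. The prompt is illustrative only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "farpluto/zubenelgenubi-1.1-124m"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# A GSM8K-style word problem, matching the reasoning stream's format.
prompt = "Question: A pack of 4 pens costs $6. How much do 10 pens cost?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```
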
