
# Data Mixtures for LilTii-v0.2

## Stage 1 (Warmup+Stable) Data Mixture

For this stage, the mixture is 40% Bengali text (40B tokens), 35% educational English text (35B tokens), 14.6% reasoning-focused English text (14.6B tokens), and 9.5% educational math English text (9.5B tokens). The detailed breakdown is as follows:

| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 1 | 5.87B | 2 |
| | Edu Score of 2 | 8.62B | 2 |
| | Edu Score of 3 | 4.25B | 2 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 2 |
| HuggingFaceFW/fineweb-edu | Edu Score of 3 | 35.00B | 1 |
| HuggingFaceTB/finemath | Edu Score of 4 | 8.59B | 1 |
| | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
| allenai/math-meta-reasoning-filtered | All | 1.24B | 2 |
| nvidia/OpenScience | All | 9.87B | 1 |
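The effective token budget per dataset is the subset size times its repetition factor. A minimal sketch that tallies the Stage 1 table (figures copied from the table above; the dictionary layout is just for illustration):

```python
# Stage 1 subsets as (size in billions of tokens, repetition factor),
# copied from the table above (5.50M = 0.0055B).
stage1 = {
    "Polygl0t/gigakriya-v1": [(5.87, 2), (8.62, 2), (4.25, 2), (1.52, 2), (0.0055, 2)],
    "HuggingFaceFW/fineweb-edu": [(35.00, 1)],
    "HuggingFaceTB/finemath": [(8.59, 1), (1.08, 1)],
    "allenai/big-reasoning-traces": [(2.44, 1)],
    "allenai/math-meta-reasoning-filtered": [(1.24, 2)],
    "nvidia/OpenScience": [(9.87, 1)],
}

# Effective tokens seen during training = subset size x repetition factor.
effective = {
    name: sum(size * rep for size, rep in subsets)
    for name, subsets in stage1.items()
}
total = sum(effective.values())
for name, tokens in effective.items():
    print(f"{name}: {tokens:.2f}B ({100 * tokens / total:.1f}%)")
```

This recovers the stage totals quoted above: roughly 40.5B Bengali tokens from gigakriya-v1 out of an overall budget of about 100B.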

During this stage, the learning rate follows a linear warmup for the first 2,000 steps, reaching a peak of 7e-4. It then remains stable at this peak for the next 47,500 steps before transitioning to the next stage.
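The Stage 1 schedule can be sketched as a simple piecewise function (step counts and peak taken from the text; the function name is a hypothetical example):

```python
def stage1_lr(step: int, peak: float = 7e-4, warmup_steps: int = 2_000) -> float:
    """Stage 1 schedule: linear warmup to the peak, then a constant plateau."""
    if step < warmup_steps:
        # Linear warmup over the first 2,000 steps.
        return peak * (step + 1) / warmup_steps
    # Stable at the peak for the remaining 47,500 steps of the stage.
    return peak
```

For example, `stage1_lr(0)` returns a small fraction of the peak, while any step at or beyond 2,000 returns the full 7e-4.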

## Stage 2 (Stable) Data Mixture

For this stage, the mixture is 40% Bengali text (40B tokens), 25% synthetic English text (25B tokens), 14% educational English text (14B tokens), 14.6% reasoning-focused English text (14.6B tokens), and 9.5% educational math English text (9.5B tokens).

| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 1 | 5.87B | 2 |
| | Edu Score of 2 | 8.62B | 2 |
| | Edu Score of 3 | 4.25B | 2 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 2 |
| HuggingFaceFW/fineweb-edu | Edu Score of 4 | 14.22B | 1 |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | All | 25.0B | 1 |
| HuggingFaceTB/finemath | Edu Score of 4 | 8.59B | 1 |
| | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
| allenai/math-meta-reasoning-filtered | All | 1.24B | 2 |
| nvidia/OpenScience | All | 9.87B | 1 |

During this stage, the learning rate remains stable at 7e-4 for the entire duration of 47,500 steps.

## Stage 3 (Stable+LinearDecay) Data Mixture

For this stage, the mixture is roughly 50% Bengali text (15B tokens), 40% synthetic English text (12.5B tokens), 1% highly educational English text (0.27B tokens), 8% reasoning-focused English text (2.4B tokens), and 1% highly educational math English text (1B tokens).

| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 3 | 4.25B | 3 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 3 |
| HuggingFaceFW/fineweb-edu | Edu Score of 5 | 0.27B | 4 |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | Half | 12.5B | 1 |
| HuggingFaceTB/finemath | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |

During this stage, the learning rate starts at 7e-4 and remains stable for the first 3,000 steps. It then linearly decays to 0 over the remaining 12,000 steps. The decay phase covers approximately 25 billion tokens, about 10% of the total training tokens.
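Taken together, the three stages trace a trapezoidal schedule: linear warmup, a long plateau at 7e-4, and a final linear decay to zero. A minimal sketch, assuming the stages run back-to-back with the step counts quoted in the text (the constants and function name are illustrative):

```python
WARMUP = 2_000           # Stage 1: linear warmup
STAGE1_STABLE = 47_500   # Stage 1: plateau
STAGE2_STABLE = 47_500   # Stage 2: plateau
STAGE3_STABLE = 3_000    # Stage 3: plateau
DECAY = 12_000           # Stage 3: linear decay to 0
PEAK = 7e-4

def lr(step: int) -> float:
    """Learning rate at a global step across all three stages."""
    plateau_end = WARMUP + STAGE1_STABLE + STAGE2_STABLE + STAGE3_STABLE
    if step < WARMUP:
        return PEAK * (step + 1) / WARMUP      # linear warmup
    if step < plateau_end:
        return PEAK                            # stable at the peak
    # Linear decay to 0 over the final DECAY steps.
    remaining = plateau_end + DECAY - step
    return max(0.0, PEAK * remaining / DECAY)
```

Under these assumptions the full run spans 112,000 steps, with the decay phase occupying the final 12,000.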