
# Data Mixtures for LilTii-v0.2

## Stage 1 (Warmup+Stable) Data Mixture

For this stage, the mixture is 40% Bengali text (40B tokens), 35% educational English text (35B tokens), 14.6% reasoning-focused English text (14.6B tokens), and 9.5% educational math English text (9.5B tokens). The detailed breakdown is as follows:

| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 1 | 5.87B | 2 |
| | Edu Score of 2 | 8.62B | 2 |
| | Edu Score of 3 | 4.25B | 2 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 2 |
| HuggingFaceFW/fineweb-edu | Edu Score of 3 | 35.00B | 1 |
| HuggingFaceTB/finemath | Edu Score of 4 | 8.59B | 1 |
| | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
| allenai/math-meta-reasoning-filtered | All | 1.24B | 2 |
| nvidia/OpenScience | All | 9.87B | 1 |
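The effective token budget per dataset is the subset size times its repetition factor. A minimal sketch that tallies the Stage 1 table (figures copied from the table above; the dictionary layout is just for illustration):

```python
# Stage 1 subsets as (size in billions of tokens, repetition factor),
# copied from the table above (5.50M = 0.0055B).
stage1 = {
    "Polygl0t/gigakriya-v1": [(5.87, 2), (8.62, 2), (4.25, 2), (1.52, 2), (0.0055, 2)],
    "HuggingFaceFW/fineweb-edu": [(35.00, 1)],
    "HuggingFaceTB/finemath": [(8.59, 1), (1.08, 1)],
    "allenai/big-reasoning-traces": [(2.44, 1)],
    "allenai/math-meta-reasoning-filtered": [(1.24, 2)],
    "nvidia/OpenScience": [(9.87, 1)],
}

# Effective tokens seen during training = subset size x repetition factor.
effective = {
    name: sum(size * rep for size, rep in subsets)
    for name, subsets in stage1.items()
}
total = sum(effective.values())
for name, tokens in effective.items():
    print(f"{name}: {tokens:.2f}B ({100 * tokens / total:.1f}%)")
```

This recovers the stage totals quoted above: roughly 40.5B Bengali tokens from gigakriya-v1 out of an overall budget of about 100B.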

During this stage, the learning rate follows a linear warmup for the first 2,000 steps, reaching a peak of 7e-4. It then remains stable at this peak for the next 47,500 steps before transitioning to the next stage.
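The Stage 1 schedule can be sketched as a simple piecewise function (step counts and peak taken from the text; the function name is a hypothetical example):

```python
def stage1_lr(step: int, peak: float = 7e-4, warmup_steps: int = 2_000) -> float:
    """Stage 1 schedule: linear warmup to the peak, then a constant plateau."""
    if step < warmup_steps:
        # Linear warmup over the first 2,000 steps.
        return peak * (step + 1) / warmup_steps
    # Stable at the peak for the remaining 47,500 steps of the stage.
    return peak
```

For example, `stage1_lr(0)` returns a small fraction of the peak, while any step at or beyond 2,000 returns the full 7e-4.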

## Stage 2 (Stable) Data Mixture

For this stage, the mixture is 40% Bengali text (40B tokens), 25% synthetic English text (25B tokens), 14% educational English text (14B tokens), 14.6% reasoning-focused English text (14.6B tokens), and 9.5% educational math English text (9.5B tokens).

| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 1 | 5.87B | 2 |
| | Edu Score of 2 | 8.62B | 2 |
| | Edu Score of 3 | 4.25B | 2 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 2 |
| HuggingFaceFW/fineweb-edu | Edu Score of 4 | 14.22B | 1 |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | All | 25.0B | 1 |
| HuggingFaceTB/finemath | Edu Score of 4 | 8.59B | 1 |
| | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
| allenai/math-meta-reasoning-filtered | All | 1.24B | 2 |
| nvidia/OpenScience | All | 9.87B | 1 |

During this stage, the learning rate remains stable at 7e-4 for the entire duration of 47,500 steps.

## Stage 3 (Stable+LinearDecay) Data Mixture

For this stage, the mixture is roughly 50% Bengali text (15B tokens), 40% synthetic English text (12.5B tokens), 1% highly educational English text (0.27B tokens), 8% reasoning-focused English text (2.4B tokens), and 1% highly educational math English text (1B tokens).

| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 3 | 4.25B | 3 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 3 |
| HuggingFaceFW/fineweb-edu | Edu Score of 5 | 0.27B | 4 |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | Half | 12.5B | 1 |
| HuggingFaceTB/finemath | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |

During this stage, the learning rate starts at 7e-4 and remains stable for the first 3,000 steps. It then linearly decays to 0 over the remaining 12,000 steps. The decay phase covers approximately 25 billion tokens, about 10% of the total training tokens.
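Taken together, the three stages trace a trapezoidal schedule: linear warmup, a long plateau at 7e-4, and a final linear decay to zero. A minimal sketch, assuming the stages run back-to-back with the step counts quoted in the text (the constants and function name are illustrative):

```python
WARMUP = 2_000           # Stage 1: linear warmup
STAGE1_STABLE = 47_500   # Stage 1: plateau
STAGE2_STABLE = 47_500   # Stage 2: plateau
STAGE3_STABLE = 3_000    # Stage 3: plateau
DECAY = 12_000           # Stage 3: linear decay to 0
PEAK = 7e-4

def lr(step: int) -> float:
    """Learning rate at a global step across all three stages."""
    plateau_end = WARMUP + STAGE1_STABLE + STAGE2_STABLE + STAGE3_STABLE
    if step < WARMUP:
        return PEAK * (step + 1) / WARMUP      # linear warmup
    if step < plateau_end:
        return PEAK                            # stable at the peak
    # Linear decay to 0 over the final DECAY steps.
    remaining = plateau_end + DECAY - step
    return max(0.0, PEAK * remaining / DECAY)
```

Under these assumptions the full run spans 112,000 steps, with the decay phase occupying the final 12,000.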