Data Mixtures for LilTii-v0.2
Stage 1 (Warmup+Stable) Data Mixture
For this stage, 40% is Bengali text (40B tokens), 35% is educational English text (35B tokens), 14.6% is reasoning-focused English text (14.6B tokens), and 9.5% is educational math English text (9.5B tokens); the token counts are effective counts, i.e. after applying the repetition factors listed below. The detailed breakdown is as follows:
| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 1 | 5.87B | 2 |
| | Edu Score of 2 | 8.62B | 2 |
| | Edu Score of 3 | 4.25B | 2 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 2 |
| HuggingFaceFW/fineweb-edu | Edu Score of 3 | 35.00B | 1 |
| HuggingFaceTB/finemath | Edu Score of 4 | 8.59B | 1 |
| | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
| allenai/math-meta-reasoning-filtered | All | 1.24B | 2 |
| nvidia/OpenScience | All | 9.87B | 1 |
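As a sanity check, the stage-1 category totals can be recomputed from the table. A minimal sketch (the category labels and variable names are ours, not from the release) multiplies each subset's unique tokens by its repetition factor:

```python
# Stage 1 rows from the table above: (category, unique tokens in billions,
# repetition factor). The category grouping follows the prose summary.
stage1 = [
    ("bengali",   5.87,   2),  # Polygl0t/gigakriya-v1, edu score 1
    ("bengali",   8.62,   2),  # edu score 2
    ("bengali",   4.25,   2),  # edu score 3
    ("bengali",   1.52,   2),  # edu score 4
    ("bengali",   0.0055, 2),  # edu score 5 (5.50M)
    ("edu_en",    35.00,  1),  # HuggingFaceFW/fineweb-edu, edu score 3
    ("math_en",   8.59,   1),  # HuggingFaceTB/finemath, edu score 4
    ("math_en",   1.08,   1),  # edu score 5
    ("reasoning", 2.44,   1),  # allenai/big-reasoning-traces
    ("reasoning", 1.24,   2),  # allenai/math-meta-reasoning-filtered
    ("reasoning", 9.87,   1),  # nvidia/OpenScience
]

# Effective tokens per category = unique tokens * repetition factor.
totals = {}
for category, tokens_b, repetition in stage1:
    totals[category] = totals.get(category, 0.0) + tokens_b * repetition

grand_total = sum(totals.values())  # ~100B effective tokens
```

The Bengali rows sum to about 40.5B effective tokens and the whole stage to roughly 100B, matching the 40B / 40% framing above.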
During this stage, the learning rate follows a linear warmup for the first 2,000 steps, reaching a peak of 7e-4. It then remains stable at this peak for the next 47,500 steps before transitioning to the next stage.
Stage 2 (Stable) Data Mixture
For this stage, roughly 39% is Bengali text (40B tokens), 24% is synthetic English text (25B tokens), 14% is educational English text (14B tokens), 14% is reasoning-focused English text (14.6B tokens), and 9% is educational math English text (9.5B tokens), out of about 103B tokens in total.
| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 1 | 5.87B | 2 |
| | Edu Score of 2 | 8.62B | 2 |
| | Edu Score of 3 | 4.25B | 2 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 2 |
| HuggingFaceFW/fineweb-edu | Edu Score of 4 | 14.22B | 1 |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | All | 25.0B | 1 |
| HuggingFaceTB/finemath | Edu Score of 4 | 8.59B | 1 |
| | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
| allenai/math-meta-reasoning-filtered | All | 1.24B | 2 |
| nvidia/OpenScience | All | 9.87B | 1 |
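The same check for stage 2 shows each category's share of the effective mix; as before, the category labels are ours and each entry is unique tokens times repetition factor, read off the table above:

```python
# Stage 2 effective tokens per category, in billions.
stage2 = {
    "bengali":   (5.87 + 8.62 + 4.25 + 1.52 + 0.0055) * 2,  # gigakriya-v1
    "edu_en":    14.22,                                      # fineweb-edu, score 4
    "synthetic": 25.0,                                       # Cosmopedia v2
    "math_en":   8.59 + 1.08,                                # finemath
    "reasoning": 2.44 + 1.24 * 2 + 9.87,                     # traces + OpenScience
}

total = sum(stage2.values())                        # ~104B effective tokens
shares = {name: tokens / total for name, tokens in stage2.items()}
```

Bengali comes out just under 39% of the roughly 104B effective total, so the headline percentages are best read as approximate.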
During this stage, the learning rate remains stable at 7e-4 for the entire duration of 47,500 steps.
Stage 3 (Stable+LinearDecay) Data Mixture
For this stage, roughly 48% is Bengali text (15B tokens), 40% is synthetic English text (12.5B tokens), 1% is highly educational English text (0.27B tokens), 8% is reasoning-focused English text (2.4B tokens), and 3% is highly educational math English text (1B tokens), out of about 31B tokens in total.
| Dataset Name | Subset | Size (Tokens) | Repetition Factor |
|---|---|---|---|
| Polygl0t/gigakriya-v1 | Edu Score of 3 | 4.25B | 3 |
| | Edu Score of 4 | 1.52B | 2 |
| | Edu Score of 5 | 5.50M | 3 |
| HuggingFaceFW/fineweb-edu | Edu Score of 5 | 0.27B | 4 |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | Half | 12.5B | 1 |
| HuggingFaceTB/finemath | Edu Score of 5 | 1.08B | 1 |
| allenai/big-reasoning-traces | All | 2.44B | 1 |
During this stage, the learning rate starts at 7e-4 and remains stable for the first 3,000 steps. It then linearly decays to 0 over the remaining 12,000 steps. The decay phase covers approximately 25 billion tokens, about 10% of the total training tokens.
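Summing the per-stage step counts above (2,000 warmup, then 47,500 + 47,500 + 3,000 stable steps, then 12,000 decay steps: 112,000 steps in all), the full warmup-stable-decay schedule can be sketched as one piecewise function. The constant names, function name, and the assumption that the stages run back-to-back are ours:

```python
PEAK_LR = 7e-4
WARMUP_STEPS = 2_000
# Stable plateau: rest of stage 1, all of stage 2, first 3,000 steps of stage 3.
DECAY_START = WARMUP_STEPS + 47_500 + 47_500 + 3_000  # step 100,000
DECAY_STEPS = 12_000
TOTAL_STEPS = DECAY_START + DECAY_STEPS               # step 112,000

def learning_rate(step: int) -> float:
    """Piecewise warmup-stable-decay schedule for the full run."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS          # linear warmup
    if step < DECAY_START:
        return PEAK_LR                                # stable plateau
    # Linear decay to 0 over the final 12,000 steps.
    return PEAK_LR * max(0.0, (TOTAL_STEPS - step) / DECAY_STEPS)
```

At this cadence, the decay phase's ~25B tokens work out to roughly 2.1M tokens per step, consistent with stage 3's ~31B tokens spread over its 15,000 steps.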