---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- pretraining
- educational
- pedagogical
- sutra
- smollm2
- llama
pipeline_tag: text-generation
model-index:
- name: SmolLM2-70M
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: ai2_arc
      name: ARC-Easy
      config: ARC-Easy
    metrics:
    - type: acc_norm
      value: 33.00
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: ai2_arc
      name: ARC-Challenge
      config: ARC-Challenge
    metrics:
    - type: acc_norm
      value: 22.35
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: boolq
      name: BoolQ
    metrics:
    - type: acc
      value: 39.66
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: hellaswag
      name: HellaSwag
    metrics:
    - type: acc_norm
      value: 26.14
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: piqa
      name: PIQA
    metrics:
    - type: acc_norm
      value: 54.84
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: sciq
      name: SciQ
    metrics:
    - type: acc_norm
      value: 45.20
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: winogrande
      name: WinoGrande
    metrics:
    - type: acc
      value: 50.04
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: truthful_qa
      name: TruthfulQA MC2
    metrics:
    - type: acc
      value: 48.02
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: gsm8k
      name: GSM8K
    metrics:
    - type: exact_match
      value: 0.53
      name: Exact Match (5-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: cais/mmlu
      name: MMLU
    metrics:
    - type: acc
      value: 22.96
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: openbookqa
      name: OpenBookQA
    metrics:
    - type: acc_norm
      value: 27.60
      name: Normalized Accuracy (0-shot)
base_model: HuggingFaceTB/SmolLM2-70M
datasets:
- codelion/sutra-10B
---

# SmolLM2-70M

A SmolLM2-70M model pretrained on the [Sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) pedagogical dataset for 3 epochs (~30.6B tokens total). This model demonstrates that a 69M-parameter model can be trained close to its capacity ceiling using dense, curated educational data.

## Model Details

| Property | Value |
|----------|-------|
| Architecture | LlamaForCausalLM |
| Parameters | 69.2M |
| Hidden Size | 384 |
| Layers | 32 |
| Attention Heads | 6 (2 KV heads) |
| Context Length | 8,192 |
| Vocabulary | 49,152 |
| Precision | bfloat16 |
| Base Model | [SmolLM2-70M](https://huggingface.co/HuggingFaceTB/SmolLM2-70M) |
| Training Dataset | [Sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) (10.2B tokens) |
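
For reference, the table above corresponds roughly to the following `LlamaConfig`. This is a minimal sketch: `intermediate_size` and `tie_word_embeddings` are not stated in this card and are assumptions chosen so the parameter count lands near the reported 69.2M.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Dimensions from the table above. intermediate_size and tie_word_embeddings are
# assumptions (not stated in this card), picked so the total comes out near 69.2M.
config = LlamaConfig(
    vocab_size=49152,
    hidden_size=384,
    intermediate_size=1024,         # assumed MLP width
    num_hidden_layers=32,
    num_attention_heads=6,
    num_key_value_heads=2,          # grouped-query attention with 2 KV heads
    max_position_embeddings=8192,
    tie_word_embeddings=True,       # assumed, consistent with the SmolLM2 family
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```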

## Training

The model was trained for 3 epochs on the Sutra-10B dataset using a single NVIDIA L40S GPU (46GB). This checkpoint is the lowest-perplexity checkpoint from epoch 3 (E3-best in the tables below).

| Epoch | Tokens | Training Time | Learning Rate | Best Perplexity |
|-------|--------|---------------|---------------|-----------------|
| 1 | 10.2B | 25.82h | 3e-4 → 3e-5 | 39.50 |
| 2 | 10.2B | 25.78h | 1e-4 → 1e-5 | 37.81 |
| 3 | 10.2B | 26.16h | 3e-5 → 3e-6 | 37.72 |
| **Total** | **30.6B** | **77.76h** | — | **37.72** |

Training configuration (a rough Hugging Face equivalent is sketched after this list):
- Optimizer: AdamW (fused), weight decay 0.1
- Schedule: Cosine with warmup
- Batch size: 4 per device, gradient accumulation 8 (effective ~262K tokens/step)
- Sequence length: 8,192
- Flash Attention 2, TF32 matmul, torch.compile
- Throughput: ~110K tokens/sec
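
The sketch below is an approximation of this configuration as a Hugging Face `TrainingArguments` setup; warmup length and checkpointing cadence are not given in this card, and Flash Attention 2 is enabled at model load time rather than here.

```python
from transformers import TrainingArguments

# Approximate reproduction of the configuration listed above.
# warmup_ratio is an assumption; the card only says "cosine with warmup".
args = TrainingArguments(
    output_dir="smollm2-70m-sutra",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # 4 * 8 * 8192 tokens ≈ 262K tokens/step
    learning_rate=3e-4,              # epoch 1; epochs 2 and 3 restart at 1e-4 and 3e-5
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,               # assumed
    weight_decay=0.1,
    optim="adamw_torch_fused",
    bf16=True,
    tf32=True,
    torch_compile=True,
    num_train_epochs=1,              # one pass over Sutra-10B per run, three runs total
)
# The model itself would be loaded with attn_implementation="flash_attention_2".
```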

## Benchmark Results

All benchmarks were evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11. All tasks are 0-shot except GSM8K (5-shot).
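
The 0-shot numbers should be reproducible with the harness's Python API along the lines below. The exact invocation (batch size, dtype) is an assumption; GSM8K needs a separate run with `num_fewshot=5`.

```python
import lm_eval

# 0-shot evaluation; run GSM8K separately with num_fewshot=5.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=codelion/SmolLM2-70M,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "boolq", "hellaswag", "piqa",
           "sciq", "winogrande", "truthfulqa_mc2", "mmlu", "openbookqa"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```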

### This Model vs Training Progression

| Benchmark | **E3-best** | E3-final | E2-best | E2-final | E1-final |
|-----------|:-----------:|:--------:|:-------:|:--------:|:--------:|
| ARC-Easy | **33.00** | 33.16 | 32.83 | 33.12 | 33.46 |
| ARC-Challenge | **22.35** | 21.67 | 22.61 | 22.44 | 22.44 |
| BoolQ | **39.66** | 39.66 | 39.79 | 39.54 | 39.79 |
| HellaSwag | **26.14** | 26.03 | 26.08 | 25.91 | 26.03 |
| PIQA | **54.84** | 55.01 | 54.24 | 54.13 | 54.62 |
| SciQ | **45.20** | 46.30 | 44.10 | 45.50 | 43.60 |
| WinoGrande | **50.04** | 49.33 | 50.51 | 48.70 | 48.78 |
| TruthfulQA | **48.02** | 47.93 | 48.30 | 48.14 | 48.30 |
| GSM8K | **0.53** | 0.61 | 0.68 | 0.83 | 0.15 |
| MMLU | **22.96** | 22.87 | 23.00 | 22.98 | 22.99 |
| OpenBookQA | **27.60** | 27.60 | — | — | — |
| **Average (10 tasks, excl. OpenBookQA)** | **34.27** | 34.26 | 34.21 | 34.13 | 34.02 |

### Comparison with 1B Token Baselines (SmolLM2-70M)

For comparison, the same SmolLM2-70M model was trained for 1 epoch on various 1B-token datasets from the [Pre-training Dataset Samples](https://huggingface.co/collections/codelion/pre-training-dataset-samples-686bd760abf1a43b0ce32829) collection. Sutra-10B at 3 epochs achieves the highest average performance for this model size.

| Dataset (1B tokens) | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
|---------------------|-----------|------|------------|-------|------|------------|-------|-----|
| **Sutra-10B (3 epochs)** | **26.14** | **54.84** | **50.04** | **22.35** | 22.96 | **48.02** | 0.53 | **34.27** |
| [Sutra-1B](https://huggingface.co/datasets/codelion/sutra-1B) | 25.43 | 53.86 | 49.41 | 23.04 | 22.91 | 49.09 | 1.14 | 32.13 |
| [FineWiki-1B](https://huggingface.co/datasets/HuggingFaceFW/finewiki) | 25.56 | 51.69 | 48.86 | 24.15 | **23.34** | 51.16 | 0.91 | 32.24 |
| [FinePDFs-1B](https://huggingface.co/datasets/HuggingFaceFW/FinePDFs) | 25.58 | 52.56 | 50.51 | 22.44 | 22.95 | 51.41 | 1.21 | 32.38 |
| [DCLM-Baseline-1B](https://huggingface.co/datasets/codelion/dclm-baseline-1B) | 25.85 | 55.17 | 50.20 | 21.08 | 22.97 | 49.21 | 0.68 | 32.16 |
| [FineWeb-Edu-1B](https://huggingface.co/datasets/codelion/fineweb-edu-1B) | 25.72 | 55.11 | 50.36 | 21.25 | 22.96 | 48.11 | 1.21 | 32.10 |
| [Essential-Web-1B](https://huggingface.co/datasets/sumukshashidhar-archive/essential-web-v1.0-sample-1B) | 26.02 | 55.44 | 48.30 | 20.99 | 22.95 | 49.59 | 1.29 | 32.08 |
| [Synth-1B](https://huggingface.co/datasets/codelion/synth-1B) | 26.63 | 50.98 | 48.78 | 21.93 | 23.24 | 47.10 | 1.29 | 31.42 |

## Key Findings

1. **Capacity ceiling**: The 70M parameter model reaches its capacity ceiling at approximately 10B tokens. Additional epochs (up to 30.6B total tokens) yield only marginal improvements in benchmark scores (+0.25 average from epoch 1 to epoch 3), despite continued perplexity improvement (39.50 → 37.72).

2. **Perplexity vs benchmarks**: Perplexity continues to decrease across epochs, but downstream benchmark performance plateaus, suggesting the model's representational capacity is the bottleneck rather than data exposure.

3. **Data quality matters**: Even at 1B tokens, Sutra outperforms or matches larger web-crawled datasets (DCLM, FineWeb-Edu, Essential-Web) on average, demonstrating the value of curated pedagogical content.
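
Perplexity here is presumably the standard exp of the mean cross-entropy loss on held-out text. A minimal sketch of measuring it is shown below; the actual validation split is not published, so the passage used is only a stand-in.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")
model.eval()

# Stand-in for a held-out pedagogical passage; the real validation data is not published.
text = "Photosynthesis converts light energy into chemical energy stored in glucose."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy per token
print(f"perplexity: {math.exp(loss.item()):.2f}")
```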

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained base model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")

# Base (non-instruct) model: give it a passage to complete.
input_text = "The theory of relativity states that"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

- This is a 69M-parameter base model (not instruction-tuned); it generates completions, not conversational responses
- Performance is at the capacity ceiling for this model size; larger models would benefit more from the Sutra-10B dataset
- The model was trained primarily on English educational content

## Related Resources

- **Dataset**: [codelion/sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B), a 10B-token pedagogical pretraining dataset
- **Sutra Framework**: Generates structured educational content optimized for LLM pretraining

## Citation

```bibtex
@misc{sharma2026sutra,
  title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
  author={Sharma, Asankhaya},
  year={2026},
  url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
}
```

## License

Apache 2.0