---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- pretraining
- educational
- pedagogical
- sutra
- smollm2
- llama
pipeline_tag: text-generation
model-index:
- name: SmolLM2-70M
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: ai2_arc
      name: ARC-Easy
      config: ARC-Easy
    metrics:
    - type: acc_norm
      value: 33.00
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: ai2_arc
      name: ARC-Challenge
      config: ARC-Challenge
    metrics:
    - type: acc_norm
      value: 22.35
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: boolq
      name: BoolQ
    metrics:
    - type: acc
      value: 39.66
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: hellaswag
      name: HellaSwag
    metrics:
    - type: acc_norm
      value: 26.14
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: piqa
      name: PIQA
    metrics:
    - type: acc_norm
      value: 54.84
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: sciq
      name: SciQ
    metrics:
    - type: acc_norm
      value: 45.20
      name: Normalized Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: winogrande
      name: WinoGrande
    metrics:
    - type: acc
      value: 50.04
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: truthful_qa
      name: TruthfulQA MC2
    metrics:
    - type: acc
      value: 48.02
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: gsm8k
      name: GSM8K
    metrics:
    - type: exact_match
      value: 0.53
      name: Exact Match (5-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: cais/mmlu
      name: MMLU
    metrics:
    - type: acc
      value: 22.96
      name: Accuracy (0-shot)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: openbookqa
      name: OpenBookQA
    metrics:
    - type: acc_norm
      value: 27.60
      name: Normalized Accuracy (0-shot)
base_model: HuggingFaceTB/SmolLM2-70M
datasets:
- codelion/sutra-10B
---

# SmolLM2-70M

A SmolLM2-70M model pretrained on the [Sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) pedagogical dataset for 3 epochs (~30.6B tokens total). This model demonstrates that a 69M-parameter model can be trained to near its capacity ceiling using dense, curated educational data.

## Model Details

| Property | Value |
|----------|-------|
| Architecture | LlamaForCausalLM |
| Parameters | 69.2M |
| Hidden Size | 384 |
| Layers | 32 |
| Attention Heads | 6 (2 KV heads) |
| Context Length | 8,192 |
| Vocabulary | 49,152 |
| Precision | bfloat16 |
| Base Model | [SmolLM2-70M](https://huggingface.co/HuggingFaceTB/SmolLM2-70M) |
| Training Dataset | [Sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) (10.2B tokens) |

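For reference, the table above maps onto a `LlamaConfig` roughly as sketched below. This is a sketch only: values not listed in the table (notably `intermediate_size` and the RoPE/embedding-tying settings) are assumptions, so the config shipped with the checkpoint is authoritative.

```python
# Sketch: the architecture table as a LlamaConfig. Fields marked ASSUMPTION
# are not stated in this card; load the real config from the Hub instead.
from transformers import AutoConfig, LlamaConfig

sketch = LlamaConfig(
    hidden_size=384,               # from the table
    num_hidden_layers=32,          # from the table
    num_attention_heads=6,         # from the table
    num_key_value_heads=2,         # grouped-query attention with 2 KV heads
    max_position_embeddings=8192,  # context length
    vocab_size=49152,              # vocabulary
    intermediate_size=1024,        # ASSUMPTION: MLP width is not in the card
)

# The authoritative values ship with the checkpoint:
config = AutoConfig.from_pretrained("codelion/SmolLM2-70M")
print(config)
```
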
## Training

The model was trained for 3 epochs on the Sutra-10B dataset using a single NVIDIA L40S GPU (46GB). This checkpoint is the best-perplexity checkpoint from epoch 3.

| Epoch | Tokens | Training Time | Learning Rate | Best Perplexity |
|-------|--------|---------------|---------------|-----------------|
| 1 | 10.2B | 25.82h | 3e-4 → 3e-5 | 39.50 |
| 2 | 10.2B | 25.78h | 1e-4 → 1e-5 | 37.81 |
| 3 | 10.2B | 26.16h | 3e-5 → 3e-6 | 37.72 |
| **Total** | **30.6B** | **77.76h** | – | **37.72** |

Training configuration (a sketch in code follows the list):
- Optimizer: AdamW (fused), weight decay 0.1
- Schedule: Cosine with warmup
- Batch size: 4 per device, gradient accumulation 8 (effective ~262K tokens/step: 4 × 8 × 8,192)
- Sequence length: 8,192
- Flash Attention 2, TF32 matmul, torch.compile
- Throughput: ~110K tokens/sec

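A minimal sketch of that configuration with Hugging Face `Trainer`, assuming the hyperparameters above; the warmup size, Adam betas, and how Sutra-10B is tokenized and packed into 8,192-token sequences are not stated in this card, so those parts are placeholders.

```python
# Minimal sketch of the training setup above, not the authors' exact script.
# Assumptions: warmup_ratio, and a `packed_dataset` of 8,192-token sequences.
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-70M",
    torch_dtype=torch.bfloat16,                # bfloat16 precision
    attn_implementation="flash_attention_2",   # Flash Attention 2, per the card
)

args = TrainingArguments(
    output_dir="smollm2-70m-sutra",
    per_device_train_batch_size=4,   # card: 4 per device
    gradient_accumulation_steps=8,   # card: accumulation 8
    learning_rate=3e-4,              # epoch-1 peak; later epochs restart lower
    lr_scheduler_type="cosine",      # cosine with warmup
    warmup_ratio=0.01,               # ASSUMPTION: warmup size not given
    weight_decay=0.1,
    optim="adamw_torch_fused",       # fused AdamW
    bf16=True,
    tf32=True,                       # TF32 matmul
    torch_compile=True,
    num_train_epochs=1,              # one pass per run; three runs total
)

# trainer = Trainer(model=model, args=args, train_dataset=packed_dataset)
# trainer.train()
```
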
## Benchmark Results

All benchmarks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11. All tasks are 0-shot except GSM8K (5-shot).

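A run consistent with those settings can be reproduced with the harness's Python API (`lm_eval.simple_evaluate` in v0.4.x). The task identifiers below follow the harness's standard registry and are an assumption, not the authors' exact invocation.

```python
# Sketch: reproducing the evaluation with lm-evaluation-harness's Python API.
# Task names are the v0.4.x registry names; `lm-eval --tasks list` shows yours.
import lm_eval

MODEL_ARGS = "pretrained=codelion/SmolLM2-70M,dtype=bfloat16"

zero_shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=["arc_easy", "arc_challenge", "boolq", "hellaswag", "piqa",
           "sciq", "winogrande", "truthfulqa_mc2", "mmlu", "openbookqa"],
    num_fewshot=0,
)

five_shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=["gsm8k"],
    num_fewshot=5,  # GSM8K is reported 5-shot in the tables below
)

for name, scores in {**zero_shot["results"], **five_shot["results"]}.items():
    print(name, scores)
```
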
### This Model vs Training Progression

| Benchmark | **E3-best** | E3-final | E2-best | E2-final | E1-final |
|-----------|:-----------:|:--------:|:-------:|:--------:|:--------:|
| ARC-Easy | **33.00** | 33.16 | 32.83 | 33.12 | 33.46 |
| ARC-Challenge | **22.35** | 21.67 | 22.61 | 22.44 | 22.44 |
| BoolQ | **39.66** | 39.66 | 39.79 | 39.54 | 39.79 |
| HellaSwag | **26.14** | 26.03 | 26.08 | 25.91 | 26.03 |
| PIQA | **54.84** | 55.01 | 54.24 | 54.13 | 54.62 |
| SciQ | **45.20** | 46.30 | 44.10 | 45.50 | 43.60 |
| WinoGrande | **50.04** | 49.33 | 50.51 | 48.70 | 48.78 |
| TruthfulQA | **48.02** | 47.93 | 48.30 | 48.14 | 48.30 |
| GSM8K | **0.53** | 0.61 | 0.68 | 0.83 | 0.15 |
| MMLU | **22.96** | 22.87 | 23.00 | 22.98 | 22.99 |
| OpenBookQA | **27.60** | 27.60 | – | – | – |
| **Average (10)** | **34.27** | 34.26 | 34.21 | 34.13 | 34.02 |

**E3-best** (bolded) is this checkpoint; **Average (10)** is the mean over the ten benchmarks excluding OpenBookQA, which was not evaluated for the earlier checkpoints.

### Comparison with 1B-Token Baselines (SmolLM2-70M)

These baselines train the same SmolLM2-70M model for 1 epoch on various 1B-token datasets from the [Pre-training Dataset Samples](https://huggingface.co/collections/codelion/pre-training-dataset-samples-686bd760abf1a43b0ce32829) collection. Sutra-10B at 3 epochs achieves the highest average for this model size.

| Dataset (1B tokens) | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
|---------------------|-----------|------|------------|-------|------|------------|-------|-----|
| **Sutra-10B (3 epochs)** | **26.14** | **54.84** | **50.04** | **22.35** | 22.96 | **48.02** | 0.53 | **34.27** |
| [Sutra-1B](https://huggingface.co/datasets/codelion/sutra-1B) | 25.43 | 53.86 | 49.41 | 23.04 | 22.91 | 49.09 | 1.14 | 32.13 |
| [FineWiki-1B](https://huggingface.co/datasets/HuggingFaceFW/finewiki) | 25.56 | 51.69 | 48.86 | 24.15 | **23.34** | 51.16 | 0.91 | 32.24 |
| [FinePDFs-1B](https://huggingface.co/datasets/HuggingFaceFW/FinePDFs) | 25.58 | 52.56 | 50.51 | 22.44 | 22.95 | 51.41 | 1.21 | 32.38 |
| [DCLM-Baseline-1B](https://huggingface.co/datasets/codelion/dclm-baseline-1B) | 25.85 | 55.17 | 50.20 | 21.08 | 22.97 | 49.21 | 0.68 | 32.16 |
| [FineWeb-Edu-1B](https://huggingface.co/datasets/codelion/fineweb-edu-1B) | 25.72 | 55.11 | 50.36 | 21.25 | 22.96 | 48.11 | 1.21 | 32.10 |
| [Essential-Web-1B](https://huggingface.co/datasets/sumukshashidhar-archive/essential-web-v1.0-sample-1B) | 26.02 | 55.44 | 48.30 | 20.99 | 22.95 | 49.59 | 1.29 | 32.08 |
| [Synth-1B](https://huggingface.co/datasets/codelion/synth-1B) | 26.63 | 50.98 | 48.78 | 21.93 | 23.24 | 47.10 | 1.29 | 31.42 |

Note: Avg for the baseline rows is the mean of the seven benchmarks shown; the Sutra-10B row carries its 10-benchmark average from the progression table above.

## Key Findings

1. **Capacity ceiling**: The 70M-parameter model reaches its capacity ceiling at approximately 10B tokens. Additional epochs (up to 30.6B total tokens) yield only marginal improvements in benchmark scores (+0.25 average from epoch 1 to epoch 3), despite continued perplexity improvement (39.50 → 37.72).

2. **Perplexity vs benchmarks**: Perplexity keeps decreasing across epochs while downstream benchmark performance plateaus, suggesting that the model's representational capacity, not data exposure, is the bottleneck (see the sanity check after this list).

3. **Data quality matters**: Even at 1B tokens, Sutra outperforms or matches larger web-crawled datasets (DCLM, FineWeb-Edu, Essential-Web) on average, demonstrating the value of curated pedagogical content.

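To put finding 2 on the same scale as training loss: perplexity is the exponential of the mean cross-entropy in nats, so the three-epoch perplexity drop is small in loss terms. A quick check:

```python
# Perplexity is exp(mean cross-entropy), so the epoch-1 -> epoch-3 drop
# (39.50 -> 37.72) corresponds to only ~0.046 nats of loss improvement,
# consistent with the essentially flat benchmark averages.
import math

loss_epoch1 = math.log(39.50)  # ~3.676 nats
loss_epoch3 = math.log(37.72)  # ~3.630 nats
print(f"loss improvement: {loss_epoch1 - loss_epoch3:.3f} nats")
```
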
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")

input_text = "The theory of relativity states that"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

- This is a 69M-parameter base model (not instruction-tuned): it generates completions, not conversational responses
- Performance is at the capacity ceiling for this model size; larger models would benefit more from the Sutra-10B dataset
- The model was trained primarily on English educational content

## Related Resources

- **Dataset**: [codelion/sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) – 10B-token pedagogical pretraining dataset
- **Sutra Framework**: Generates structured educational content optimized for LLM pretraining

## Citation

```bibtex
@article{sharma2026sutra,
  title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
  author={Sharma, Asankhaya},
  year={2026},
  url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
}
```

## License

Apache 2.0