- **Dataset**: [codelion/sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) — 10B token pedagogical pretraining dataset
- **Sutra Framework**: Generates structured educational content optimized for LLM pretraining

## Citation

```bibtex
@article{sharma2026sutra,
  title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
  author={Sharma, Asankhaya},
  year={2026},
  url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
}
```

## License

Apache 2.0