codelion committed
Commit 285b36e · verified · 1 Parent(s): 4785466

Update README.md

Files changed (1):
  1. README.md +11 -0
README.md CHANGED
@@ -238,6 +238,17 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  - **Dataset**: [codelion/sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) — 10B token pedagogical pretraining dataset
  - **Sutra Framework**: Generates structured educational content optimized for LLM pretraining
 
+ ## Citation
+
+ ```bibtex
+ @article{sharma2026sutra,
+   title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
+   author={Sharma, Asankhaya},
+   year={2026},
+   url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
+ }
+ ```
+
  ## License
 
  Apache 2.0
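
The diff context above links the codelion/sutra-10B pretraining dataset. As a minimal sketch of pulling it down for inspection with the Hugging Face `datasets` library (the `train` split name and streaming mode are assumptions, not taken from this commit):

```python
from datasets import load_dataset

# Stream the 10B-token pedagogical pretraining dataset rather than
# downloading it in full; streaming=True and the "train" split are
# assumptions about the dataset layout, not taken from this commit.
ds = load_dataset("codelion/sutra-10B", split="train", streaming=True)

# Peek at the first record to see the document fields.
first = next(iter(ds))
print(first)
```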