joelniklaus (HF Staff) committed
Commit 5477bad · 1 Parent(s): 420ea96

add more citations

app/src/content/bibliography.bib CHANGED
@@ -57,6 +57,16 @@
57
  note = {Blog post}
58
  }
59
 
60
  % Synthetic data methods
61
  @inproceedings{wrap,
62
  title = {Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling},
@@ -110,6 +120,14 @@
110
  url = {https://arxiv.org/abs/2502.02737}
111
  }
112
 
113
  @misc{glm45,
114
  title = {GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
115
author = {{GLM-4.5 Team} and Aohan Zeng and Xin Lv and Qinkai Zheng and Zhenyu Hou and Bin Chen and Chengxing Xie and Cunxiang Wang and Da Yin and Hao Zeng and Jiajie Zhang and Kedong Wang and Lucen Zhong and Mingdao Liu and Rui Lu and Shulin Cao and Xiaohan Zhang and Xuancheng Huang and Yao Wei and Yean Cheng and Yifan An and Yilin Niu and Yuanhao Wen and Yushi Bai and Zhengxiao Du and Zihan Wang and Zilin Zhu and Bohan Zhang and Bosi Wen and Bowen Wu and Bowen Xu and Can Huang and Casey Zhao and Changpeng Cai and Chao Yu and Chen Li and Chendi Ge and Chenghua Huang and Chenhui Zhang and Chenxi Xu and Chenzheng Zhu and Chuang Li and Congfeng Yin and Daoyan Lin and Dayong Yang and Dazhi Jiang and Ding Ai and Erle Zhu and Fei Wang and Gengzheng Pan and Guo Wang and Hailong Sun and Haitao Li and Haiyang Li and Haiyi Hu and Hanyu Zhang and Hao Peng and Hao Tai and Haoke Zhang and Haoran Wang and Haoyu Yang and He Liu and He Zhao and Hongwei Liu and Hongxi Yan and Huan Liu and Huilong Chen and Ji Li and Jiajing Zhao and Jiamin Ren and Jian Jiao and Jiani Zhao and Jianyang Yan and Jiaqi Wang and Jiayi Gui and Jiayue Zhao and Jie Liu and Jijie Li and Jing Li and Jing Lu and Jingsen Wang and Jingwei Yuan and Jingxuan Li and Jingzhao Du and Jinhua Du and Jinxin Liu and Junkai Zhi and Junli Gao and Ke Wang and Lekang Yang and Liang Xu and Lin Fan and Lindong Wu and Lintao Ding and Lu Wang and Man Zhang and Minghao Li and Minghuan Xu and Mingming Zhao and Mingshu Zhai and Pengfan Du and Qian Dong and Shangde Lei and Shangqing Tu and Shangtong Yang and Shaoyou Lu and Shijie Li and Shuang Li and Shuang-Li and Shuxun Yang and Sibo Yi and Tianshu Yu and Wei Tian and Weihan Wang and Wenbo Yu and Weng Lam Tam and Wenjie Liang and Wentao Liu and Xiao Wang and Xiaohan Jia and Xiaotao Gu and Xiaoying Ling and Xin Wang and Xing Fan and Xingru Pan and Xinyuan Zhang and Xinze Zhang and Xiuqing Fu and Xunkai Zhang and Yabo Xu and Yandong Wu and Yida Lu and Yidong Wang and Yilin Zhou and Yiming Pan and Ying Zhang and Yingli Wang and Yingru Li and Yinpei Su and Yipeng Geng and Yitong Zhu and Yongkun Yang and Yuhang Li and Yuhao Wu and Yujiang Li and Yunan Liu and Yunqing Wang and Yuntao Li and Yuxuan Zhang and Zezhen Liu and Zhen Yang and Zhengda Zhou and Zhongpei Qiao and Zhuoer Feng and Zhuorui Liu and Zichen Zhang and Zihan Wang and Zijun Yao and Zikang Wang and Ziqiang Liu and Ziwei Chai and Zixuan Li and Zuodong Zhao and Wenguang Chen and Jidong Zhai and Bin Xu and Minlie Huang and Hongning Wang and Juanzi Li and Yuxiao Dong and Jie Tang},
@@ -130,14 +148,14 @@
130
url = {https://arxiv.org/abs/2508.06471}
131
  }
132
 
133
- @misc{qwen3,
134
- title = {Qwen3 Technical Report},
135
- author = {An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Junyang Lin and Jingren Zhou},
136
- year = {2025},
137
- eprint = {2505.09388},
138
  archiveprefix = {arXiv},
139
  primaryclass = {cs.CL},
140
- url = {https://arxiv.org/abs/2505.09388}
141
  }
142
 
143
  @misc{qwen2,
@@ -150,6 +168,26 @@
150
  url = {https://arxiv.org/abs/2407.10671}
151
  }
152
 
153
  @misc{phi4,
154
  title = {Phi-4 Technical Report},
155
  author = {Marah Abdin and Sahaj Agarwal and Ahmed Awadallah and Vidhisha Balachandran and Harkirat Behl and Lingjiao Chen and Gustavo de Rosa and Suriya Gunasekar and Mojan Javaheripi and Neel Jain and Piero Kauffmann and Yin Tat Lee and Yuanzhi Li and Anh Nguyen and Olatunji Ruwase and Olli Saarikivi and Adil Salim and Shital Shah and Michael Santacroce and Harsha Nori and Xin Wang and Rachel Ward and Philipp Witte and Cyril Zhang and Yi Zhang},
@@ -246,6 +284,28 @@
246
  url = {https://arxiv.org/abs/2508.10925}
247
  }
248
 
249
  % Inference
250
  @inproceedings{vllm,
251
  title = {Efficient Memory Management for Large Language Model Serving with PagedAttention},
@@ -279,6 +339,27 @@
279
  url = {https://arxiv.org/abs/2307.08691}
280
  }
281
 
282
  @misc{dflash,
283
  title = {DFlash: Block Diffusion for Flash Speculative Decoding},
284
  author = {Jian Chen and Yesheng Liang and Zhijian Liu},
@@ -289,6 +370,28 @@
289
  url = {https://arxiv.org/abs/2602.06036}
290
  }
291
 
292
  % Tools
293
  @inproceedings{dspy,
294
  title = {DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines},
@@ -299,4 +402,132 @@
299
  archiveprefix = {arXiv},
300
  primaryclass = {cs.CL},
301
  url = {https://arxiv.org/abs/2310.03714}
302
- }
57
  note = {Blog post}
58
  }
59
 
60
+ @misc{s1k,
61
+ title = {s1: Simple Test-Time Scaling},
62
+ author = {Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
63
+ year = {2025},
64
+ eprint = {2501.19393},
65
+ archiveprefix = {arXiv},
66
+ primaryclass = {cs.CL},
67
+ url = {https://arxiv.org/abs/2501.19393}
68
+ }
69
+
70
  % Synthetic data methods
71
  @inproceedings{wrap,
72
  title = {Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling},
 
120
  url = {https://arxiv.org/abs/2502.02737}
121
  }
122
 
123
+ @misc{smollm3,
124
+ title = {SmolLM3: Smol, Multilingual, Long-Context Reasoner},
125
+ author = {{Hugging Face}},
126
+ year = {2025},
127
+ url = {https://huggingface.co/blog/smollm3},
128
+ note = {Blog post}
129
+ }
130
+
131
  @misc{glm45,
132
  title = {GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
133
author = {{GLM-4.5 Team} and Aohan Zeng and Xin Lv and Qinkai Zheng and Zhenyu Hou and Bin Chen and Chengxing Xie and Cunxiang Wang and Da Yin and Hao Zeng and Jiajie Zhang and Kedong Wang and Lucen Zhong and Mingdao Liu and Rui Lu and Shulin Cao and Xiaohan Zhang and Xuancheng Huang and Yao Wei and Yean Cheng and Yifan An and Yilin Niu and Yuanhao Wen and Yushi Bai and Zhengxiao Du and Zihan Wang and Zilin Zhu and Bohan Zhang and Bosi Wen and Bowen Wu and Bowen Xu and Can Huang and Casey Zhao and Changpeng Cai and Chao Yu and Chen Li and Chendi Ge and Chenghua Huang and Chenhui Zhang and Chenxi Xu and Chenzheng Zhu and Chuang Li and Congfeng Yin and Daoyan Lin and Dayong Yang and Dazhi Jiang and Ding Ai and Erle Zhu and Fei Wang and Gengzheng Pan and Guo Wang and Hailong Sun and Haitao Li and Haiyang Li and Haiyi Hu and Hanyu Zhang and Hao Peng and Hao Tai and Haoke Zhang and Haoran Wang and Haoyu Yang and He Liu and He Zhao and Hongwei Liu and Hongxi Yan and Huan Liu and Huilong Chen and Ji Li and Jiajing Zhao and Jiamin Ren and Jian Jiao and Jiani Zhao and Jianyang Yan and Jiaqi Wang and Jiayi Gui and Jiayue Zhao and Jie Liu and Jijie Li and Jing Li and Jing Lu and Jingsen Wang and Jingwei Yuan and Jingxuan Li and Jingzhao Du and Jinhua Du and Jinxin Liu and Junkai Zhi and Junli Gao and Ke Wang and Lekang Yang and Liang Xu and Lin Fan and Lindong Wu and Lintao Ding and Lu Wang and Man Zhang and Minghao Li and Minghuan Xu and Mingming Zhao and Mingshu Zhai and Pengfan Du and Qian Dong and Shangde Lei and Shangqing Tu and Shangtong Yang and Shaoyou Lu and Shijie Li and Shuang Li and Shuang-Li and Shuxun Yang and Sibo Yi and Tianshu Yu and Wei Tian and Weihan Wang and Wenbo Yu and Weng Lam Tam and Wenjie Liang and Wentao Liu and Xiao Wang and Xiaohan Jia and Xiaotao Gu and Xiaoying Ling and Xin Wang and Xing Fan and Xingru Pan and Xinyuan Zhang and Xinze Zhang and Xiuqing Fu and Xunkai Zhang and Yabo Xu and Yandong Wu and Yida Lu and Yidong Wang and Yilin Zhou and Yiming Pan and Ying Zhang and Yingli Wang and Yingru Li and Yinpei Su and Yipeng Geng and Yitong Zhu and Yongkun Yang and Yuhang Li and Yuhao Wu and Yujiang Li and Yunan Liu and Yunqing Wang and Yuntao Li and Yuxuan Zhang and Zezhen Liu and Zhen Yang and Zhengda Zhou and Zhongpei Qiao and Zhuoer Feng and Zhuorui Liu and Zichen Zhang and Zihan Wang and Zijun Yao and Zikang Wang and Ziqiang Liu and Ziwei Chai and Zixuan Li and Zuodong Zhao and Wenguang Chen and Jidong Zhai and Bin Xu and Minlie Huang and Hongning Wang and Juanzi Li and Yuxiao Dong and Jie Tang},
 
148
url = {https://arxiv.org/abs/2508.06471}
149
  }
150
 
151
+ @misc{qwen,
152
+ title = {Qwen Technical Report},
153
+ author = {Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
154
+ year = {2023},
155
+ eprint = {2309.16609},
156
  archiveprefix = {arXiv},
157
  primaryclass = {cs.CL},
158
+ url = {https://arxiv.org/abs/2309.16609}
159
  }
160
 
161
  @misc{qwen2,
 
168
  url = {https://arxiv.org/abs/2407.10671}
169
  }
170
 
171
+ @misc{qwen25,
172
+ title = {Qwen2.5 Technical Report},
173
+ author = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Junyang Lin and Jingren Zhou},
174
+ year = {2024},
175
+ eprint = {2412.15115},
176
+ archiveprefix = {arXiv},
177
+ primaryclass = {cs.CL},
178
+ url = {https://arxiv.org/abs/2412.15115}
179
+ }
180
+
181
+ @misc{qwen3,
182
+ title = {Qwen3 Technical Report},
183
+ author = {An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Junyang Lin and Jingren Zhou},
184
+ year = {2025},
185
+ eprint = {2505.09388},
186
+ archiveprefix = {arXiv},
187
+ primaryclass = {cs.CL},
188
+ url = {https://arxiv.org/abs/2505.09388}
189
+ }
190
+
191
  @misc{phi4,
192
  title = {Phi-4 Technical Report},
193
  author = {Marah Abdin and Sahaj Agarwal and Ahmed Awadallah and Vidhisha Balachandran and Harkirat Behl and Lingjiao Chen and Gustavo de Rosa and Suriya Gunasekar and Mojan Javaheripi and Neel Jain and Piero Kauffmann and Yin Tat Lee and Yuanzhi Li and Anh Nguyen and Olatunji Ruwase and Olli Saarikivi and Adil Salim and Shital Shah and Michael Santacroce and Harsha Nori and Xin Wang and Rachel Ward and Philipp Witte and Cyril Zhang and Yi Zhang},
 
284
  url = {https://arxiv.org/abs/2508.10925}
285
  }
286
 
287
+ % Architecture
288
+ @inproceedings{gqa,
289
+ title = {GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
290
+ author = {Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebrón and Sumit Sanghai},
291
+ booktitle = {Conference on Empirical Methods in Natural Language Processing},
292
+ year = {2023},
293
+ eprint = {2305.13245},
294
+ archiveprefix = {arXiv},
295
+ primaryclass = {cs.CL},
296
+ url = {https://arxiv.org/abs/2305.13245}
297
+ }
298
+
299
+ @article{rope,
300
+ title = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
301
+ author = {Jianlin Su and Murtadha Ahmed and Yu Lu and Shengfeng Pan and Wen Bo and Yunfeng Liu},
302
+ journal = {Neurocomputing},
303
+ volume = {568},
304
+ pages = {127063},
305
+ year = {2024},
306
+ doi = {10.1016/j.neucom.2023.127063}
307
+ }
308
+
309
  % Inference
310
  @inproceedings{vllm,
311
  title = {Efficient Memory Management for Large Language Model Serving with PagedAttention},
 
339
  url = {https://arxiv.org/abs/2307.08691}
340
  }
341
 
342
+ @misc{flashinfer,
343
+ title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
344
+ author = {Zihao Ye and Lequn Chen and Ruihang Lai and Yilong Zhao and Size Zheng and Junru Shao and Bohan Hou and Hongyi Jin and Yifei Zuo and Liangsheng Yin and Tianqi Chen and Luis Ceze},
345
+ year = {2025},
346
+ eprint = {2501.01005},
347
+ archiveprefix = {arXiv},
348
+ primaryclass = {cs.DC},
349
+ url = {https://arxiv.org/abs/2501.01005}
350
+ }
351
+
352
+ @inproceedings{speculativedecoding,
353
+ title = {Fast Inference from Transformers via Speculative Decoding},
354
+ author = {Yaniv Leviathan and Matan Kalman and Yossi Matias},
355
+ booktitle = {International Conference on Machine Learning},
356
+ year = {2023},
357
+ eprint = {2211.17192},
358
+ archiveprefix = {arXiv},
359
+ primaryclass = {cs.LG},
360
+ url = {https://arxiv.org/abs/2211.17192}
361
+ }
362
+
363
  @misc{dflash,
364
  title = {DFlash: Block Diffusion for Flash Speculative Decoding},
365
  author = {Jian Chen and Yesheng Liang and Zhijian Liu},
 
370
  url = {https://arxiv.org/abs/2602.06036}
371
  }
372
 
373
+ % Training
374
+ @inproceedings{adamw,
375
+ title = {Decoupled Weight Decay Regularization},
376
+ author = {Ilya Loshchilov and Frank Hutter},
377
+ booktitle = {International Conference on Learning Representations},
378
+ year = {2019},
379
+ eprint = {1711.05101},
380
+ archiveprefix = {arXiv},
381
+ primaryclass = {cs.LG},
382
+ url = {https://arxiv.org/abs/1711.05101}
383
+ }
384
+
385
+ @misc{minicpm,
386
+ title = {MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies},
387
+ author = {Shengding Hu and Yuge Tu and Xu Han and Chaoqun He and Ganqu Cui and Xiang Long and Zhi Zheng and Yewei Fang and Yuxiang Huang and Weilin Zhao and Xinrong Zhang and Zheng Leng Thai and Kaihuo Zhang and Chongyi Wang and Yuan Yao and Chenyang Zhao and Jie Zhou and Jie Cai and Zhongwu Zhai and Ning Ding and Chao Jia and Guoyang Zeng and Dahai Li and Zhiyuan Liu and Maosong Sun},
388
+ year = {2024},
389
+ eprint = {2404.06395},
390
+ archiveprefix = {arXiv},
391
+ primaryclass = {cs.CL},
392
+ url = {https://arxiv.org/abs/2404.06395}
393
+ }
394
+
395
  % Tools
396
  @inproceedings{dspy,
397
  title = {DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines},
 
402
  archiveprefix = {arXiv},
403
  primaryclass = {cs.CL},
404
  url = {https://arxiv.org/abs/2310.03714}
405
+ }
406
+
407
+ @software{datatrove,
408
+ title = {DataTrove: Large Scale Data Processing},
409
+ author = {Guilherme Penedo and Hynek Kydlíček and Thomas Wolf and Leandro von Werra},
410
+ year = {2024},
411
+ url = {https://github.com/huggingface/datatrove},
412
+ note = {GitHub repository}
413
+ }
414
+
415
+ % Benchmarks
416
+ @misc{arc,
417
+ title = {Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
418
+ author = {Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
419
+ year = {2018},
420
+ eprint = {1803.05457},
421
+ archiveprefix = {arXiv},
422
+ primaryclass = {cs.AI},
423
+ url = {https://arxiv.org/abs/1803.05457}
424
+ }
425
+
426
+ @misc{hellaswag,
427
+ title = {HellaSwag: Can a Machine Really Finish Your Sentence?},
428
+ author = {Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
429
+ year = {2019},
430
+ eprint = {1905.07830},
431
+ archiveprefix = {arXiv},
432
+ primaryclass = {cs.CL},
433
+ url = {https://arxiv.org/abs/1905.07830}
434
+ }
435
+
436
+ @misc{mmluredux,
437
+ title = {Are We Done with MMLU?},
438
+ author = {Aryo Pradipta Gema and Joshua Ong Jun Leang and Giwon Hong and Alessio Devoto and Alberto Carlo Maria Mancino and Rohit Saxena and Xuanli He and Yu Zhao and Xiaotang Du and Mohammad Reza Ghasemi Madani and Claire Barale and Robert McHardy and Joshua Harris and Jean Kaddour and Emile van Krieken and Pasquale Minervini},
439
+ year = {2024},
440
+ eprint = {2406.04127},
441
+ archiveprefix = {arXiv},
442
+ primaryclass = {cs.CL},
443
+ url = {https://arxiv.org/abs/2406.04127}
444
+ }
445
+
446
+ @inproceedings{xcsqa,
447
+ title = {Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning},
448
+ author = {Bill Yuchen Lin and Seyeon Lee and Xiaoyang Qiao and Xiang Ren},
449
+ booktitle = {Annual Meeting of the Association for Computational Linguistics},
450
+ year = {2021},
451
+ url = {https://aclanthology.org/2021.acl-long.102/}
452
+ }
453
+
454
+ @misc{openbookqa,
455
+ title = {Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
456
+ author = {Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
457
+ year = {2018},
458
+ eprint = {1809.02789},
459
+ archiveprefix = {arXiv},
460
+ primaryclass = {cs.CL},
461
+ url = {https://arxiv.org/abs/1809.02789}
462
+ }
463
+
464
+ @misc{winogrande,
465
+ title = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
466
+ author = {Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
467
+ year = {2019},
468
+ eprint = {1907.10641},
469
+ archiveprefix = {arXiv},
470
+ primaryclass = {cs.AI},
471
+ url = {https://arxiv.org/abs/1907.10641}
472
+ }
473
+
474
+ @misc{piqa,
475
+ title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
476
+ author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
477
+ year = {2019},
478
+ eprint = {1911.11641},
479
+ archiveprefix = {arXiv},
480
+ primaryclass = {cs.CL},
481
+ url = {https://arxiv.org/abs/1911.11641}
482
+ }
483
+
484
+ @misc{squad2,
485
+ title = {Know What You Don't Know: Unanswerable Questions for SQuAD},
486
+ author = {Pranav Rajpurkar and Robin Jia and Percy Liang},
487
+ year = {2018},
488
+ eprint = {1806.03822},
489
+ archiveprefix = {arXiv},
490
+ primaryclass = {cs.CL},
491
+ url = {https://arxiv.org/abs/1806.03822}
492
+ }
493
+
494
+ @misc{drop,
495
+ title = {DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
496
+ author = {Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
497
+ year = {2019},
498
+ eprint = {1903.00161},
499
+ archiveprefix = {arXiv},
500
+ primaryclass = {cs.CL},
501
+ url = {https://arxiv.org/abs/1903.00161}
502
+ }
503
+
504
+ @inproceedings{wikitablequestions,
505
+ title = {Compositional Semantic Parsing on Semi-Structured Tables},
506
+ author = {Panupong Pasupat and Percy Liang},
507
+ booktitle = {Annual Meeting of the Association for Computational Linguistics},
508
+ year = {2015},
509
+ eprint = {1508.00305},
510
+ archiveprefix = {arXiv},
511
+ primaryclass = {cs.CL},
512
+ url = {https://arxiv.org/abs/1508.00305}
513
+ }
514
+
515
+ @misc{triviaqa,
516
+ title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},
517
+ author = {Mandar Joshi and Eunsol Choi and Daniel Weld and Luke Zettlemoyer},
518
+ year = {2017},
519
+ eprint = {1705.03551},
520
+ archiveprefix = {arXiv},
521
+ primaryclass = {cs.CL},
522
+ url = {https://arxiv.org/abs/1705.03551}
523
+ }
524
+
525
+ @misc{gsm8k,
526
+ title = {Training Verifiers to Solve Math Word Problems},
527
+ author = {Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
528
+ year = {2021},
529
+ eprint = {2110.14168},
530
+ archiveprefix = {arXiv},
531
+ primaryclass = {cs.LG},
532
+ url = {https://arxiv.org/abs/2110.14168}
533
+ }
app/src/content/chapters/appendix.mdx CHANGED
@@ -2,7 +2,7 @@
2
 
3
  ### Details on the experiments
4
 
5
- For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention), and an intermediate size of 6144. The model utilized the Llama 3.2 [@llama3] tokenizer ( `hynky/Llama-3.2-1B-no-bos` ) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, accumulating to approximately 21 billion tokens total over 10,000 steps. We employed the AdamW optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training utilized bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment but due to wildly varying number of completion tokens by prompt we sometimes get less than that. In these cases we train on some of the data twice.
6
 
7
  ### Prompts
8
 
@@ -212,7 +212,7 @@ Original Draft: [TEXT]
212
 
213
  ### Decay vs Scratch
214
 
215
- We explored two distinct training paradigms. In the **from-scratch** setup ( `decay_exp=false` ), models were trained for the full 10,000 steps (~21B tokens) on a single dataset or mixture of datasets. In contrast, the **decay** experiments ( `decay_exp=true` ) aimed to obtain quicker signal with fewer rephrased tokens by leveraging a two-stage training approach. These decay experiments resumed training from a checkpoint at step 9,000 of a model previously trained on lower-quality data ( `fw_edu_lq` ), then continued training with a new dataset (or mixture) for the final 1,000 steps (~2B tokens) during the learning rate decay phase. We selected the low quality fineweb dataset for the first training phase so we can see effects of the ablated data mixtures more clearly. This design allowed us to evaluate the impact of high-quality rephrased or synthetic data more efficiently, requiring around 2B rephrased tokens rather than the full 21B needed for from-scratch training, thus reducing computational costs by 90% per experimental condition while still providing meaningful signal about data quality effects. To enable the decay experiments, we used a warmup-stable-decay (WSD) learning rate schedule with 1% warmup (100 steps), 89% stable training, and 10% linear decay (1,000 steps) to a minimum of 5×10⁻⁵.
216
 
217
  #### Variance across seeds and data seeds
218
 
 
2
 
3
  ### Details on the experiments
4
 
5
+ For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model utilized the Llama 3.2 [@llama3] tokenizer ( `hynky/Llama-3.2-1B-no-bos` ) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, amounting to approximately 21 billion tokens total over 10,000 steps. We employed the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training utilized bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but due to the wildly varying number of completion tokens per prompt we sometimes get fewer than that. In these cases we train on some of the data twice.
6
 
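To make this setup easier to scan, here is the same configuration summarized as a plain Python dict (a minimal sketch; the key names are ours, not the exact fields of our training framework):

```python
# Illustrative summary of the ablation setup; key names are ours,
# not the exact fields of the training framework.
model_config = {
    "architecture": "qwen2-style",   # decoder-only transformer
    "num_layers": 28,
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,        # grouped-query attention
    "intermediate_size": 6144,
    "vocab_size": 128_256,           # Llama 3.2 tokenizer
}
training_config = {
    "data_parallel": 64,             # 64x H100 across 8 nodes
    "global_batch_size": 512,
    "sequence_length": 4096,
    "train_steps": 10_000,
    "optimizer": "AdamW",
    "learning_rate": 5e-4,
    "betas": (0.9, 0.95),
    "weight_decay": 0.1,
    "grad_clip": 1.0,
    "precision": "bfloat16",
}
# Sanity check: 512 sequences * 4096 tokens ≈ 2.1M tokens/step,
# so 10,000 steps ≈ 21B tokens, matching the text above.
```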
7
  ### Prompts
8
 
 
212
 
213
  ### Decay vs Scratch
214
 
215
+ We explored two distinct training paradigms. In the **from-scratch** setup ( `decay_exp=false` ), models were trained for the full 10,000 steps (~21B tokens) on a single dataset or mixture of datasets. In contrast, the **decay** experiments ( `decay_exp=true` ) aimed to obtain quicker signal with fewer rephrased tokens by leveraging a two-stage training approach. These decay experiments resumed training from a checkpoint at step 9,000 of a model previously trained on lower-quality data ( `fw_edu_lq` ), then continued training with a new dataset (or mixture) for the final 1,000 steps (~2B tokens) during the learning rate decay phase. We selected the low-quality FineWeb dataset for the first training phase so that the effects of the ablated data mixtures are clearly visible. This design allowed us to evaluate the impact of high-quality rephrased or synthetic data more efficiently, requiring around 2B rephrased tokens rather than the full 21B needed for from-scratch training, thus reducing computational costs by 90% per experimental condition while still providing meaningful signal about data quality effects. To enable the decay experiments, we used a warmup-stable-decay (WSD) [@minicpm] learning rate schedule with 1% warmup (100 steps), 89% stable training, and 10% linear decay (1,000 steps) to a minimum of 5×10⁻⁵.
216
 
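The schedule can be written down in a few lines (a sketch reconstructed from the numbers above; note that step 9,000 is exactly where the decay phase begins, which is why the decay experiments resume from that checkpoint):

```python
def wsd_lr(step: int, total_steps: int = 10_000,
           peak_lr: float = 5e-4, min_lr: float = 5e-5) -> float:
    """Warmup-stable-decay: 1% linear warmup, 89% constant, 10% linear decay."""
    warmup_steps = int(0.01 * total_steps)   # 100 steps
    decay_start = int(0.90 * total_steps)    # step 9,000
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr                       # stable phase
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + frac * (min_lr - peak_lr)  # linear decay to 5e-5
```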
217
  #### Variance across seeds and data seeds
218
 
app/src/content/chapters/experiments.mdx CHANGED
@@ -314,7 +314,7 @@ We hypothesize that the consistently strong performance of SmolLM2 originates fr
314
  So the model family clearly seems to matter. However, SmolLM2 is already a year old. Are newer models better than older ones?
315
  #### Does the model generation matter?
316
 
317
- We compare rephrasing with Qwen models from versions 1.5, 2, 2.5 and 3 using the tutorial prompt, one of the prompts that outperformed the DCLM baseline. While the differences are small we find a trend that newer versions lead to higher evaluation performance.
318
 
319
  <HtmlEmbed
320
  id="model-generation"
 
314
  So the model family clearly seems to matter. However, SmolLM2 is already a year old. Are newer models better than older ones?
315
  #### Does the model generation matter?
316
 
317
+ We compare rephrasing with Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25] and 3 [@qwen3] using the tutorial prompt, one of the prompts that outperformed the DCLM baseline. While the differences are small, we find a trend that newer versions lead to higher evaluation performance.
318
 
319
  <HtmlEmbed
320
  id="model-generation"
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -10,11 +10,11 @@ Synthetic data has emerged as a key ingredient in training modern LLMs, providin
10
 
11
  <Image src={SyDLepVveg_2f81384e_bcac_806f_acb7_fd65c71dd9df} alt="Image" />
12
 
13
- Synthetic data also plays a central role in post-training via *distillation* , where a capable model is used to generate high-quality responses for targeted domains such as reasoning, instruction-following, and tool-use. This data can then be used for supervised fine-tuning or preference optimization, allowing developers to shape a model's behaviour with labels that would be expensive or impractical to obtain from humans. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) was post-trained almost entirely on a few billion tokens of data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
14
 
15
  So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], it turns out that the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
16
 
17
- Today we're excited to announce major extensions to [DataTrove](https://github.com/huggingface/datatrove) to manage this entire process. These extensions package the scaffolding we built for our own synthetic data pipelines and make it accessible to anyone who wants to generate high-quality datasets at scale. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue.
18
 
19
  In this blog post we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 1 billion to 1 trillion parameters. Let's dive in!
20
 
@@ -33,7 +33,7 @@ DataTrove provides two modes to generate synthetic data:
33
  - **Local execution** : Run on a single machine with multiple workers for development and small-scale generation
34
  - **Slurm cluster** : Distribute processing across multiple nodes for large-scale production workloads
35
 
36
- Here's a simple example of local execution on a node of 8 GPUs to generate solutions to math problems from the [s1K dataset](https://huggingface.co/datasets/simplescaling/s1K-1.1) using `Qwen3-4B-Thinking-2507` :
37
 
38
  ```shell
39
  python examples/inference/benchmark/generate_data.py \
@@ -214,7 +214,7 @@ datacard_pipeline = [InferenceDatasetCardGenerator(params=params)]
214
 
215
  For synthetic data generation, we may run language model inference for millions of GPU hours. Finding a configuration that maximizes throughput is critical, as it could accelerate generation by days and save thousands of dollars. In this section, we describe our experiments to identify optimal parameters for a selection of popular models. We run the experiments once for a pre-training dataset and once for a post-training example.
216
 
217
- The Flash-Attn VLLM backend is more than 50% faster than FlashInfer across setups.
218
 
219
  #### Pre-training
220
 
@@ -257,6 +257,6 @@ We also explored what would be required to generate 1T tokens in a day. We belie
257
 
258
  You can find the benchmarking code [here](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark) together with the [yaml config](https://github.com/huggingface/datatrove/blob/main/examples/inference/benchmark/sample_benchmark_config.yaml).
259
 
260
- We experimented with speculative decoding using the [ngram method](https://docs.vllm.ai/en/stable/features/spec_decode.html?h=specula#speculating-by-matching-n-grams-in-the-prompt) but found no consistent speedups. We hypothesize this approach is unhelpful because the input in our benchmarking dataset is relatively short compared to the thinking tokens and output. We expect greater gains for tasks involving more copying from the input.
261
 
262
  TODO: Optimize this section for pretraining: use that prompt and seq length configuration but mention in the end that for post training we can easily rerun this experiment with different prompts and datasets
 
10
 
11
  <Image src={SyDLepVveg_2f81384e_bcac_806f_acb7_fd65c71dd9df} alt="Image" />
12
 
13
+ Synthetic data also plays a central role in post-training via *distillation*, where a capable model is used to generate high-quality responses for targeted domains such as reasoning, instruction-following, and tool-use. This data can then be used for supervised fine-tuning or preference optimization, allowing developers to shape a model's behaviour with labels that would be expensive or impractical to obtain from humans. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on a few billion tokens of data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3 [@qwen3].
14
 
15
  So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], it turns out that the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
16
 
17
+ Today we're excited to announce major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to manage this entire process. These extensions package the scaffolding we built for our own synthetic data pipelines and make it accessible to anyone who wants to generate high-quality datasets at scale. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue.
18
 
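As a rough sketch of what such a pipeline looks like (the reader, writer, and executor below are real DataTrove components; the commented-out inference step is a placeholder for the new extensions, see the linked examples for the actual classes):

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

pipeline = [
    JsonlReader("prompts/"),       # one document per prompt
    # <inference step here>: generate completions with vLLM or SGLang,
    # with chunking, checkpointing, and queueing handled by DataTrove
    JsonlWriter("generations/"),   # checkpointed output shards
]

# Swap in SlurmPipelineExecutor for large-scale distributed runs.
LocalPipelineExecutor(pipeline=pipeline, tasks=8, workers=8).run()
```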
19
  In this blog post we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 1 billion to 1 trillion parameters. Let's dive in!
20
 
 
33
  - **Local execution** : Run on a single machine with multiple workers for development and small-scale generation
34
  - **Slurm cluster** : Distribute processing across multiple nodes for large-scale production workloads
35
 
36
+ Here's a simple example of local execution on a node of 8 GPUs to generate solutions to math problems from the [s1K dataset](https://huggingface.co/datasets/simplescaling/s1K-1.1) [@s1k] using `Qwen3-4B-Thinking-2507` :
37
 
38
  ```shell
39
  python examples/inference/benchmark/generate_data.py \
 
214
 
215
  For synthetic data generation, we may run language model inference for millions of GPU hours. Finding a configuration that maximizes throughput is critical, as it could accelerate generation by days and save thousands of dollars. In this section, we describe our experiments to identify optimal parameters for a selection of popular models. We run the experiments once for a pre-training dataset and once for a post-training example.
216
 
217
+ The Flash-Attn vLLM backend is more than 50% faster than the FlashInfer backend [@flashinfer] across setups.
218
 
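For reference, vLLM lets you switch the attention backend through an environment variable (a hedged example; the set of valid backend names depends on your vLLM version and installed kernels):

```python
import os

# Assumption: these backend names match your vLLM release; check its docs.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"   # or "FLASHINFER"

from vllm import LLM  # import after setting the env var so it takes effect

llm = LLM(model="Qwen/Qwen3-4B-Thinking-2507")
```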
219
  #### Pre-training
220
 
 
257
 
258
  You can find the benchmarking code [here](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark) together with the [yaml config](https://github.com/huggingface/datatrove/blob/main/examples/inference/benchmark/sample_benchmark_config.yaml).
259
 
260
+ We experimented with speculative decoding [@speculativedecoding] using the [ngram method](https://docs.vllm.ai/en/stable/features/spec_decode.html?h=specula#speculating-by-matching-n-grams-in-the-prompt) but found no consistent speedups. We hypothesize this approach is unhelpful because the input in our benchmarking dataset is relatively short compared to the thinking tokens and output. We expect greater gains for tasks involving more copying from the input.
261
 
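For completeness, this is roughly how ngram speculative decoding is enabled (illustrative only; the speculative-decoding parameter names vary across vLLM releases, so consult the docs linked above for your version):

```python
from vllm import LLM, SamplingParams

# Illustrative ngram speculative decoding setup; exact parameter names
# differ between vLLM releases.
llm = LLM(
    model="Qwen/Qwen3-4B-Thinking-2507",
    speculative_config={
        "method": "ngram",            # draft tokens by matching n-grams in the prompt
        "num_speculative_tokens": 4,  # tokens proposed per step
        "prompt_lookup_max": 4,       # longest n-gram to match against the prompt
    },
)
outputs = llm.generate(["Rephrase the following text: ..."],
                       SamplingParams(max_tokens=512))
```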
262
  TODO: Optimize this section for pretraining: use that prompt and seq length configuration but mention in the end that for post training we can easily rerun this experiment with different prompts and datasets
app/src/content/chapters/setup.mdx CHANGED
@@ -53,7 +53,7 @@ We compare against several baseline datasets for pretraining and data rephrasing
53
 
54
  ### Ablation Setup
55
 
56
- For our ablations we train a 1.2B parameter language model using a Qwen2-style architecture (see more details in the Appendix). We evaluate our model on a diverse set of 12 benchmark tasks spanning multiple reasoning and knowledge domains. For reasoning capabilities, we assess performance on ARC, HellaSwag, MMLU Redux, Cross-lingual CommonsenseQA (XCSQA), OpenBookQA, Winogrande, and PIQA. Question answering capabilities are evaluated using SQuAD v2, DROP, WikiTableQuestions, and TriviaQA. Mathematical reasoning is assessed via GSM8K. Given that our model is relatively small and trained on only 20 billion tokens, we employ the continuation format (CF) for most tasks rather than the standard multiple-choice format. The CF setup, which frames evaluation as a next-token prediction task, has been shown to provide more reliable assessments for smaller or less extensively trained models that may struggle with complex instruction following or multiple-choice formatting conventions. All evaluations are conducted using 3-shot prompting with a single seed to ensure reproducibility.
57
 
58
  #### Naming
59
 
 
53
 
54
  ### Ablation Setup
55
 
56
+ For our ablations we train a 1.2B parameter language model using a Qwen2-style architecture (see more details in the Appendix). We evaluate our model on a diverse set of 12 benchmark tasks spanning multiple reasoning and knowledge domains. For reasoning capabilities, we assess performance on ARC [@arc], HellaSwag [@hellaswag], MMLU Redux [@mmluredux], Cross-lingual CommonsenseQA (XCSQA) [@xcsqa], OpenBookQA [@openbookqa], Winogrande [@winogrande], and PIQA [@piqa]. Question answering capabilities are evaluated using SQuAD v2 [@squad2], DROP [@drop], WikiTableQuestions [@wikitablequestions], and TriviaQA [@triviaqa]. Mathematical reasoning is assessed via GSM8K [@gsm8k]. Given that our model is relatively small and trained on only 20 billion tokens, we employ the continuation format (CF) for most tasks rather than the standard multiple-choice format. The CF setup, which frames evaluation as a next-token prediction task, has been shown to provide more reliable assessments for smaller or less extensively trained models that may struggle with complex instruction following or multiple-choice formatting conventions. All evaluations are conducted using 3-shot prompting with a single seed to ensure reproducibility.
57
 
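To make the continuation format concrete, a CF evaluation scores each answer option as a continuation of the context and picks the most likely one (a minimal sketch; `model.loglikelihood` is a hypothetical helper standing in for the evaluation harness):

```python
# Minimal sketch of continuation-format (CF) scoring.
# `model.loglikelihood(context, choice)` is a hypothetical helper
# returning log P(choice tokens | context) under next-token prediction.
def score_cf(model, context: str, choices: list[str]) -> int:
    best, best_score = 0, float("-inf")
    for i, choice in enumerate(choices):
        # length-normalize so longer continuations are not penalized
        score = model.loglikelihood(context, choice) / max(len(choice), 1)
        if score > best_score:
            best, best_score = i, score
    return best  # index of the predicted answer
```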
58
  #### Naming
59