Article The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix Nov 3, 2025 • 56
Awesome SFT datasets Collection A curated list of interesting datasets to fine-tune language models with. • 43 items • Updated Apr 12, 2024 • 147
Ministral 3 Collection A collection of edge models, with Base, Instruct and Reasoning variants, in 3 different sizes: 3B, 8B and 14B. All with vision capabilities. • 9 items • Updated Dec 2, 2025 • 149
Craw4LLM: Efficient Web Crawling for LLM Pretraining Paper • 2502.13347 • Published Feb 19, 2025 • 30
Article Releasing Common Corpus: the largest public domain dataset for training LLMs Mar 20, 2024 • 32
Essential-Web v1.0: 24T tokens of organized web data Paper • 2506.14111 • Published Jun 17, 2025 • 46
Article 🥬 TinyLettuce: Efficient Hallucination Detection with 17–68M Encoders Aug 31, 2025 • 15