Common Pile v0.1 Filtered Data Collection An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6, 2025 • 21
SmolLM3 pretraining datasets Collection datasets used in SmolLM3 pretraining • 15 items • Updated Aug 12, 2025 • 44
view article Article Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face +3 Jul 29, 2025 • 211
olmOCR Collection olmOCR is a document recognition pipeline for efficiently converting documents into plain text. olmocr.allenai.org • 12 items • Updated Dec 23, 2025 • 145