Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano and Super v3. • 26 items • Updated 3 days ago • 88
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training Paper • 2506.01732 • Published Jun 2, 2025 • 6
open-sci-ref-0.01 Collection Research baseline models trained on various open reference datasets. ArXiv: https://arxiv.org/abs/2509.09009 • 8 items • Updated 5 days ago • 4
Article Releasing Common Corpus: the largest public domain dataset for training LLMs Mar 20, 2024 • 32
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 77
Article Assisted Generation: a new direction toward low-latency text generation May 11, 2023 • 77
Common Models Collection The first generation of models pretrained on Common Corpus. • 5 items • Updated Dec 5, 2024 • 42
Qwen2.5 Collection Qwen2.5 language models, including pretrained and instruction-tuned models in 7 sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 43 items • Updated 12 days ago • 701
GEITje 7B: A Large Open Dutch Language Model Collection All models and datasets relating to GEITje • 8 items • Updated Jan 25, 2025 • 5