Article MTEB Leaderboard: From a slow demo to feature-rich leaderboard Samoed • 14 days ago • 22
jina-embeddings-v5-text: Task-Targeted Embedding Distillation Paper • 2602.15547 • Published Feb 17 • 31
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale Paper • 2602.12414 • Published Feb 12 • 4
MTEB-NL Collection Massive Text Embedding Benchmark for Dutch. Check https://github.com/nikolay-banar/mteb-nl-dev to evaluate your models. • 26 items • Updated Nov 7, 2025 • 4
GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training Paper • 2604.00920 • Published Apr 1 • 1
GPT-NL Pretraining Corpus Collection Public Corpus data + Private Corpus metadata • 2 items • Updated Nov 18, 2025 • 6
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano, Super, and Ultra v3. • 50 items • Updated 14 days ago • 167
Article We Got Claude to Fine-Tune an Open Source LLM burtenshaw, evalstate • Dec 4, 2025 • 630
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training Paper • 2506.01732 • Published Jun 2, 2025 • 6
Article mmBERT: ModernBERT goes Multilingual mmarone, orionweller, will-fleshman, eugene-yang, dlawrie, vandurme • Sep 9, 2025 • 148
open-sci-ref-0.01 Collection Research baseline models trained on various open reference datasets. ArXiv: https://arxiv.org/abs/2509.09009 • 8 items • Updated Mar 9 • 4