view article Article Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers tomaarsen • Apr 16 • 71
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale Paper • 2602.12414 • Published Feb 12 • 2
Lost in Backpropagation: The LM Head is a Gradient Bottleneck Paper • 2603.10145 • Published Mar 10 • 13
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24
Gaperon: A Peppered English-French Generative Language Model Suite Paper • 2510.25771 • Published Oct 29, 2025 • 16
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection Paper • 2411.08868 • Published Nov 13, 2024 • 13
Awesome Document AI Collection A collection of open-source document AI 📄 📝 📈 • 27 items • Updated Mar 11, 2024 • 80
Harvesting Textual and Structured Data from the HAL Publication Repository Paper • 2407.20595 • Published Jul 30, 2024 • 22
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Paper • 2406.08707 • Published Jun 13, 2024 • 17