FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale Paper • 2601.22146 • Published 1 day ago • 4
SindBERT, the Sailor: Charting the Seas of Turkish NLP Paper • 2510.21364 • Published Oct 24, 2025 • 1
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources Paper • 2509.25531 • Published Sep 29, 2025 • 9
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 38
BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications Paper • 2509.24908 • Published Sep 29, 2025 • 3
EmbeddingGemma: Powerful and Lightweight Text Representations Paper • 2509.20354 • Published Sep 24, 2025 • 45
Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratings Paper • 2509.14405 • Published Sep 17, 2025 • 2
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans Paper • 2506.22439 • Published May 29, 2025 • 3
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments Paper • 2509.14233 • Published Sep 17, 2025 • 15
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America Paper • 2507.00999 • Published Jul 1, 2025 • 1
Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian Paper • 2509.05668 • Published Sep 6, 2025 • 6
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Paper • 2507.06261 • Published Jul 7, 2025 • 66
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Paper • 2507.06261 • Published Jul 7, 2025 • 66
Multilingual State Space Models for Structured Question Answering in Indic Languages Paper • 2502.01673 • Published Feb 1, 2025 • 2
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 77
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published Jun 2, 2025 • 148
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation Paper • 2504.07072 • Published Apr 9, 2025 • 9