Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling Paper • 2604.28075 • Published 12 days ago • 19
FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale Paper • 2601.22146 • Published Jan 29 • 11
Ukrainian LLM Leaderboard 👁 Measuring LLM capabilities to process Ukrainian texts • Running • Agents • 29
SindBERT, the Sailor: Charting the Seas of Turkish NLP Paper • 2510.21364 • Published Oct 24, 2025 • 2
lang-uk/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k Zero-Shot Image Classification • Updated Oct 17, 2025 • 2
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction Paper • 2509.14504 • Published Sep 18, 2025
OmniGEC Collection This is a collection of multilingual silver-standard datasets and models for the task of Grammatical Error Correction (GEC). • 9 items • Updated Sep 19, 2025 • 8