Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling Paper • 2604.28075 • Published 10 days ago • 18
German LLM Benchmarks Collection Improved German versions of widely used LLM benchmarks • 4 items • Updated 5 days ago • 1
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models Paper • 1511.09249 • Published Nov 30, 2015 • 1
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs Paper • 2604.02045 • Published Apr 2 • 35
Decoding Text Spans for Efficient and Accurate Named-Entity Recognition Paper • 2604.20447 • Published 18 days ago • 2
GlotSuite Collection GlotSuite: Paving the Way for Bringing Generative AI to Underserved Communities • 17 items • Updated 24 days ago • 3
fiNERweb Collection A multilingual dataset for NER covering 91 languages and 25 scripts • 3 items • Updated Dec 16, 2025 • 3
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World Paper • 2603.19223 • Published Mar 19 • 31
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano and Super v3. • 28 items • Updated about 15 hours ago • 136
Nemotron-Cascade 2 Collection Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation • 4 items • Updated about 15 hours ago • 49