Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets Paper • 2602.22207 • Published Feb 25 • 43
Running on CPU Upgrade • The Synthetic Data Playbook: Generating Trillions of the Finest Tokens • 216 • Explore synthetic data experiments as an interactive bookshelf
Lapa v0.1.2 Release Collection Release of SOTA Ukrainian LLM and Datasets • 18 items • Updated Nov 13, 2025 • 28
Running on CPU Upgrade • Featured • The Smol Training Playbook • 3.09k • The secrets to building world-class LLMs
Paused • INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 • 4 • Chat with INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0
Article • Welcome GPT OSS, the new open-source model family from OpenAI! • Aug 5, 2025 • 513
OmniGEC Collection This is a collection of multilingual silver-standard datasets and models for the task of Grammatical Error Correction (GEC). • 9 items • Updated Sep 19, 2025 • 8
Article • Announcing MamayLM, an efficient state-of-the-art Ukrainian LLM • Apr 23, 2025 • 63
Running • Featured • The Tokenizer Playground • 647 • Experiment with and compare different tokenizers