Mimir: Large-scale Multilingual Concept Modeling
Abstract
Researchers propose concept modeling as an alternative to traditional token-based language modeling and introduce Mimir, a 1.6-billion-parameter multilingual concept model trained on large-scale datasets spanning 46 languages.
Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have begun to question both how language models process and derive meaning from tokens and whether operating at a coarser level of granularity could advance the field. This led to the idea of Concept Modeling: training models directly for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn, multilingual instruction-tuning dataset (66,816,428 sentences) covering 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.
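The difference in granularity between next-token and next-concept prediction can be illustrated with a small sketch. It is not the Mimir pipeline: the abstract does not say how concepts are represented, so the sketch assumes one sentence corresponds to one concept (suggested only by the sentence counts reported above). The whitespace tokenizer, the regex sentence splitter, and the helper names `split_sentences` and `next_unit_pairs` are hypothetical stand-ins chosen for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def next_unit_pairs(units):
    """Build (context, target) pairs: each unit is predicted from all preceding units."""
    return [(units[:i], units[i]) for i in range(1, len(units))]


text = "Text is split into units. Tokens are fine-grained units. Sentences are coarser units."

# Token-level view: whitespace tokens stand in for a real subword tokenizer.
token_pairs = next_unit_pairs(text.split())

# Concept-level view: each sentence is treated as a single prediction unit.
concept_pairs = next_unit_pairs(split_sentences(text))

print(len(token_pairs), "token-level training pairs")
print(len(concept_pairs), "concept-level training pairs")
print(concept_pairs[0])
```

The same text yields far fewer prediction steps at the concept level, and each step has to predict a whole sentence-sized unit rather than a single token. A real concept model would presumably operate on learned representations of those units rather than on raw strings.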
Models citing this paper 2
mimir-lcm/Mimir-1.6B-Instruct
Datasets citing this paper 2
mimir-lcm/fineweb-2-sentence-split
mimir-lcm/fineweb-edu-350BT-sentence-split