The exact vocab size of the model

#22
by abdullahamlwakeb - opened

I have read the paper, and this part of it suggests that the model is trained on a vocabulary smaller than the 151K of the original Qwen3. And this part of the paper makes me more confused:

The Qwen3 decoder uses a 151,936-token multilingual vocabulary, much of which is unused for language-specific OCR. We investigate frequency-based vocabulary pruning for English/French documents, reducing to 51k, 32k, and 16k tokens while preserving tokenizer integrity through recursive sub-token frequency propagation.
Table 4 summarizes the trade-offs. Pruning to 16k tokens reduces parameters by 13.8% with minimal OCR degradation on English benchmarks (75.4% vs 76.1% on OlmOCR-Bench). The 32k variant achieves the best speed-accuracy balance: 11.6% faster inference while retaining 96% of base performance. However, non-Latin scripts (Arabic, Chinese) experience ∼3× token count inflation as script-specific tokens are removed. These experiments were conducted on LightOnOCR-1; we release the pruned variants as LightOnOCR-0.9B-32k-1025 and LightOnOCR-0.9B-16k-1025.
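To check my understanding of the pruning step, here is a rough sketch of what I imagine "recursive sub-token frequency propagation" means. This is my own illustration, not the paper's implementation: `token_counts`, `merges`, and `prune_vocab` are made-up names, and the data is toy data.

```python
from collections import Counter

# Toy inputs: corpus token counts and a BPE-style merge table mapping
# each merged token to the pair of sub-tokens it was built from.
token_counts = Counter({"docu": 500, "ment": 480, "document": 450, "doc": 300, "u": 120})
merges = {"document": ("docu", "ment"), "docu": ("doc", "u")}

def propagated_counts(counts, merges):
    """Recursive sub-token frequency propagation: every occurrence of a
    merged token also counts toward its constituent sub-tokens, so a
    sub-token is never ranked below a token that is built from it."""
    prop = Counter(counts)

    def push_down(token, weight):
        left, right = merges.get(token, (None, None))
        if left is None:
            return
        prop[left] += weight
        prop[right] += weight
        push_down(left, weight)
        push_down(right, weight)

    for token, count in counts.items():
        push_down(token, count)
    return prop

def prune_vocab(counts, merges, target_size):
    """Keep the target_size highest-scoring tokens after propagation, and
    drop any merge whose parts were pruned, so the tokenizer stays valid."""
    prop = propagated_counts(counts, merges)
    kept = {tok for tok, _ in prop.most_common(target_size)}
    kept_merges = {t: pair for t, pair in merges.items()
                   if t in kept and pair[0] in kept and pair[1] in kept}
    return kept, kept_merges

vocab, pruned_merges = prune_vocab(token_counts, merges, target_size=4)
print(vocab, pruned_merges)
```

Is that roughly the idea, i.e. pruning by propagated frequency while keeping the merge table closed so encoding never breaks?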

LightOn AI org

Hi,
Vocabulary pruning was used for LightOnOCR-1 to show the speed/performance trade-offs depending on the target languages. For v2, we simply kept the full vocabulary to support all languages.
More details are in the v1 blog post: https://huggingface.co/blog/lightonai/lightonocr#vocabulary-pruning

Thank you so much for your awesome work @staghado
