Fix bos/eos token IDs (config.json + tokenizer_config.json)
#5
by KristianS7
Problem
Both `bos_token` and `eos_token` are set to `<|endoftext|>` (id=0), but Ouro uses the ChatML format, where:
- `bos_token` should be `<|im_start|>` (id=1)
- `eos_token` should be `<|im_end|>` (id=2)
This causes issues with:
- Generation stopping: the model never sees a proper EOS signal
- Tokenizer `add_special_tokens`: the wrong BOS is prepended
- Downstream tools (vLLM, lm-eval-harness) that rely on `eos_token_id` for stop conditions
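The mismatch is easy to check mechanically. A minimal sketch that compares a loaded `config.json` against the ChatML token IDs described above (the expected values come from this report; the helper name is illustrative):

```python
# Expected ChatML special-token IDs for Ouro, per the mapping above.
EXPECTED = {"bos_token_id": 1, "eos_token_id": 2}

def find_mismatches(config: dict) -> list[str]:
    """Return a human-readable line for every token ID that differs
    from the expected ChatML mapping."""
    return [
        f"{key} is {config.get(key)}, expected {expected}"
        for key, expected in EXPECTED.items()
        if config.get(key) != expected
    ]

# The current (broken) values from config.json:
print(find_mismatches({"bos_token_id": 0, "eos_token_id": 0}))
```

Running this against the repo's current `config.json` reports both IDs as wrong; against the fixed values it returns an empty list.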
Fix
`config.json`:
- `bos_token_id`: 0 → 1
- `eos_token_id`: 0 → 2
`tokenizer_config.json`:
- `bos_token`: `<|endoftext|>` → `<|im_start|>`
- `eos_token`: `<|endoftext|>` → `<|im_end|>`
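Both edits can be applied in one go to a local clone of the repo. A sketch, assuming a clone at `./Ouro-1.4B` (the path is an assumption; the key/value pairs are exactly the fixes listed above):

```python
import json
from pathlib import Path

def patch_json(path: Path, updates: dict) -> None:
    """Read a JSON file, overwrite the listed keys, and write it back."""
    cfg = json.loads(path.read_text())
    cfg.update(updates)
    path.write_text(json.dumps(cfg, indent=2) + "\n")

repo = Path("Ouro-1.4B")  # assumed local clone path
if repo.exists():
    patch_json(repo / "config.json",
               {"bos_token_id": 1, "eos_token_id": 2})
    patch_json(repo / "tokenizer_config.json",
               {"bos_token": "<|im_start|>", "eos_token": "<|im_end|>"})
```

Other fields in both files are left untouched; only the four values above change.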
See also: the same fix was merged for the Thinking variant → https://huggingface.co/ByteDance/Ouro-1.4B-Thinking/discussions/4