Fix bos/eos token IDs (config.json + tokenizer_config.json)

#5
by KristianS7 - opened

Problem

Both bos_token and eos_token are set to <|endoftext|> (id=0), but Ouro uses ChatML format where:

  • bos_token should be <|im_start|> (id=1)
  • eos_token should be <|im_end|> (id=2)

This causes issues with:

  • Generation stopping: stop checks compare against eos_token_id (0, <|endoftext|>), so the <|im_end|> token that closes a ChatML turn is never treated as EOS
  • Tokenizer add_special_tokens: wrong BOS is prepended
  • Downstream tools (vLLM, lm-eval-harness) that rely on eos_token_id for stop conditions
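
To illustrate the stopping problem, here is a minimal sketch of an EOS-based stop check. It is not the actual logic of vLLM or lm-eval-harness; the helper truncate_at_eos and the token stream are hypothetical, and only ids 0, 1 and 2 come from this PR.

```python
def truncate_at_eos(token_ids, eos_token_id):
    """Return the tokens before the first eos_token_id (a simplified stop check)."""
    kept = []
    for tok in token_ids:
        if tok == eos_token_id:
            break
        kept.append(tok)
    return kept


# Hypothetical sampled stream: three reply tokens, then <|im_end|> (2),
# then the model starts another turn with <|im_start|> (1) and keeps going.
stream = [101, 102, 103, 2, 1, 104, 105]

# Old setting: eos_token_id = 0 (<|endoftext|>) never appears, so nothing is cut.
with_old_eos = truncate_at_eos(stream, eos_token_id=0)

# Fixed setting: eos_token_id = 2 (<|im_end|>) ends the reply at the turn boundary.
with_new_eos = truncate_at_eos(stream, eos_token_id=2)
```

With the old id the whole stream is kept, including the start of the next turn; with the fixed id only the first three tokens remain.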

Fix

config.json:

  • bos_token_id: 0 → 1
  • eos_token_id: 0 → 2

tokenizer_config.json:

  • bos_token: <|endoftext|> → <|im_start|>
  • eos_token: <|endoftext|> → <|im_end|>
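
For anyone applying the same change to a local copy, the edits above could be scripted roughly as follows. This is a sketch, not the diff in this PR: the helper names are hypothetical, and it assumes bos_token and eos_token are stored as plain strings in tokenizer_config.json (some tokenizers store them as objects instead).

```python
import json
from pathlib import Path


def patch_config(config: dict) -> dict:
    """Return a copy of config.json contents with ChatML bos/eos ids."""
    patched = dict(config)
    patched["bos_token_id"] = 1  # <|im_start|>
    patched["eos_token_id"] = 2  # <|im_end|>
    return patched


def patch_tokenizer_config(tok_config: dict) -> dict:
    """Return a copy of tokenizer_config.json contents with ChatML bos/eos tokens.

    Assumes the token fields are plain strings.
    """
    patched = dict(tok_config)
    patched["bos_token"] = "<|im_start|>"
    patched["eos_token"] = "<|im_end|>"
    return patched


def patch_file(path, patch_fn) -> None:
    """Load a JSON file, apply patch_fn to its contents, and write it back."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    patched = patch_fn(data)
    path.write_text(
        json.dumps(patched, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# Usage, run from a local checkout of the model repo:
#   patch_file("config.json", patch_config)
#   patch_file("tokenizer_config.json", patch_tokenizer_config)
```

The patch functions copy their input, so the original dicts can still be compared against the result before anything is written to disk.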

See also the same fix, already merged for the Thinking variant: https://huggingface.co/ByteDance/Ouro-1.4B-Thinking/discussions/4

Ready to merge
This branch is ready to get merged automatically.