fix: set `clean_up_tokenization_spaces` to `false`

#27
by maxsloef - opened

`clean_up_tokenization_spaces=true` causes `tokenizer.decode()` to silently strip spaces before punctuation, producing incorrect decoded text for Llama 3's BPE tokenizer. The setting was inherited from a default in the HuggingFace transformers library; Llama 2 had it set to false, and Llama 4 already ships with false.
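To illustrate the effect without downloading the model, here is a minimal standalone sketch of the kind of cleanup pass that runs when the flag is true. The function name and the exact replacement list are assumptions modeled on the cleanup step in transformers, not a copy of the library code; see the linked writeup for a reproduction against the real tokenizer.

```python
def clean_up_tokenization(text: str) -> str:
    """Approximate the string cleanup applied after decoding.

    Assumption: the replacement list below mirrors the cleanup step in
    transformers; treat it as illustrative, not authoritative.
    """
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# Text whose spaces before punctuation are intentional.
original = "Wait ! Is 1 , 2 correct ?"
cleaned = clean_up_tokenization(original)

print(repr(original))
print(repr(cleaned))
# The round trip is lossy: the cleaned string no longer matches the input.
print(cleaned == original)
```

Until the config change lands, callers can also pass `clean_up_tokenization_spaces=False` directly to `tokenizer.decode()` to opt out per call.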

See the full writeup with reproduction, impact analysis, and history: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/discussions/356

The fix is a one-line change in tokenizer_config.json.
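For anyone applying the same change to a local copy of the model, the edit can be sketched as below. The helper name `disable_cleanup` is hypothetical, and rewriting the file through `json.dumps` may alter whitespace or key formatting beyond the single line, so editing the value by hand is the safer route if a minimal diff matters.

```python
import json
from pathlib import Path


def disable_cleanup(config_path: str) -> bool:
    """Set clean_up_tokenization_spaces to false in a tokenizer_config.json.

    Hypothetical helper for illustration. Returns True if the value changed.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Anything other than an explicit False counts as needing the fix.
    changed = config.get("clean_up_tokenization_spaces") is not False
    config["clean_up_tokenization_spaces"] = False

    # Note: this rewrites the whole file with 2-space indentation.
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return changed
```

Usage, assuming the file sits in the current directory: `disable_cleanup("tokenizer_config.json")`.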

