fix: set `clean_up_tokenization_spaces` to `false`
#7
by maxsloef - opened
`clean_up_tokenization_spaces=true` causes `tokenizer.decode()` to silently strip spaces before punctuation, producing incorrect decoded text for Llama 3's BPE tokenizer. This was inherited from a HuggingFace transformers library default; Llama 2 had it set to false, and Llama 4 already ships with false.
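To make the effect concrete without downloading the model, here is a minimal sketch of the kind of string cleanup this flag enables. The function below is an illustrative re-implementation, not the library's code, and its replacement list is an assumption modeled on the cleanup step in transformers; the linked writeup has the real reproduction against the Llama 3 tokenizer.

```python
def clean_up_tokenization(text: str) -> str:
    """Illustrative stand-in for the cleanup applied when the flag is true.

    Assumption: the replacement pairs below approximate the library's
    behavior; they are not copied from this PR.
    """
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# Text in which the spaces before punctuation are meaningful.
original = "x = a . b , c"
cleaned = clean_up_tokenization(original)

print(repr(original))  # what the token ids actually encode
print(repr(cleaned))   # what a decode with cleanup would hand back
```

With a real tokenizer, the cleanup can also be bypassed per call by passing `clean_up_tokenization_spaces=False` to `tokenizer.decode()`, but that relies on every caller remembering to do so.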
See the full writeup with reproduction, impact analysis, and history: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/discussions/356
The fix is a one-line change in tokenizer_config.json.
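For anyone who wants to apply the same change to a local copy before this is merged, a small sketch follows. The helper names and the local file path are hypothetical; only the `clean_up_tokenization_spaces` key comes from this PR.

```python
import json
from pathlib import Path


def with_cleanup_disabled(config: dict) -> dict:
    """Return a copy of a tokenizer config with the cleanup flag set to False."""
    patched = dict(config)
    patched["clean_up_tokenization_spaces"] = False
    return patched


def patch_file(path: str) -> None:
    """Rewrite a local tokenizer_config.json in place (hypothetical helper)."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    patched = with_cleanup_disabled(config)
    config_path.write_text(
        json.dumps(patched, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# Example on an in-memory config; other keys are left untouched.
example = {"clean_up_tokenization_spaces": True, "model_max_length": 131072}
patched_example = with_cleanup_disabled(example)
print(patched_example["clean_up_tokenization_spaces"])
```

Calling `patch_file("tokenizer_config.json")` on a downloaded snapshot would apply the edit on disk; note that rewriting the file with `json.dumps` may change its whitespace, so a manual one-line edit gives a smaller diff.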