Fast tokenizer token-ID parity with previous slow tokenizer?

#39
by bullpoint - opened

I’m testing commit 81bcaaa with vLLM / transformers 5.x and noticed the new fast tokenizer does not always produce the same token IDs as the previous tokenization_kimi.py tokenizer.

For example, with the old slow tokenizer:

1234 -> ["123", "4"]

With the new TikTokenTokenizerFast / tokenizer.json:

1234 -> ["12", "34"]

Setting fix_mistral_regex=True appears to make Kimi digit tokenization even less similar to the old tokenizer, e.g. splitting 1234 into individual digits.
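
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a parity check. It takes two tokenize callables and reports the first input where they disagree. The names `first_mismatch`, `toy_old` and `toy_new` are hypothetical, and the toy functions are stand-ins that only mimic the `1234` splits reported above; they are not the real Kimi tokenizers.

```python
from typing import Callable, Iterable, Optional

Tokenize = Callable[[str], list[str]]


def first_mismatch(
    old: Tokenize, new: Tokenize, samples: Iterable[str]
) -> Optional[tuple[str, list[str], list[str]]]:
    """Return (text, old_tokens, new_tokens) for the first sample that differs."""
    for text in samples:
        old_tokens, new_tokens = old(text), new(text)
        if old_tokens != new_tokens:
            return text, old_tokens, new_tokens
    return None  # every sample tokenized identically


# Toy stand-ins (assumption): chunk the input into groups of 3 or 2 characters.
def toy_old(text: str) -> list[str]:
    return [text[i:i + 3] for i in range(0, len(text), 3)]


def toy_new(text: str) -> list[str]:
    return [text[i:i + 2] for i in range(0, len(text), 2)]


result = first_mismatch(toy_old, toy_new, ["12", "1234"])
print(result)  # ('1234', ['123', '4'], ['12', '34'])
```

In a real check, the two callables would be the `tokenize` (or `encode`) methods of the slow and fast tokenizers, and the samples would include numeric and code-heavy prompts.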

1.) Can you confirm whether the new fast tokenizer is intended to be token-ID compatible with the previous slow tokenizer?
2.) If exact compatibility is expected, is the numeric tokenization difference a bug in tokenizer.json or in how Transformers detects/applies the regex patch?
3.) If compatibility is not expected, should users prefer the fast tokenizer despite changed tokenization for numeric/code-heavy prompts?

Thanks!

Jon

Moonshot AI org

@bullpoint thank you for reporting this! We have temporarily reverted this MR and will conduct additional testing.

Changing the String field under pre_tokenizer in tokenizer.json to Regex fixes this. @bigmoyan
Specifically:
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Split",
      "pattern": {
        "Regex": "[\p{Han}]+|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]][\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]](?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]|\s[\r\n]+|\s+(?!\S)|\s+"
      },
      "behavior": "Isolated",
      "invert": false
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": false
    }
  ]
},
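
To illustrate why the String vs. Regex field matters: in the Hugging Face tokenizers library, a Split pattern given as a String is matched as literal text, while a Regex pattern is compiled as a regular expression. The stdlib-only sketch below emulates that difference on the digit rule alone. It is a simplification under stated assumptions: `\d{1,3}` stands in for `\p{N}{1,3}` (Python's `re` has no `\p{N}`), `re.escape` stands in for literal matching, and `split_isolated` is a hypothetical helper, not the library's implementation.

```python
import re


def split_isolated(text: str, pattern: str, literal: bool) -> list[str]:
    """Emulate an 'Isolated' split: each match becomes its own piece,
    and any unmatched text between matches is kept as separate pieces."""
    compiled = re.compile(re.escape(pattern) if literal else pattern)
    pieces, pos = [], 0
    for match in compiled.finditer(text):
        if match.start() == match.end():
            continue  # ignore empty matches
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(match.group())
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


DIGITS = r"\d{1,3}"  # assumption: ASCII-digit stand-in for \p{N}{1,3}

regex_pieces = split_isolated("1234", DIGITS, literal=False)
string_pieces = split_isolated("1234", DIGITS, literal=True)
print(regex_pieces)   # ['123', '4']
print(string_pieces)  # ['1234']
```

With the pattern treated as a literal string, nothing in `1234` matches, so the input reaches BPE as a single piece. That would be consistent with the fast tokenizer producing `["12", "34"]` instead of `["123", "4"]`, though the sketch does not prove that this is the only cause.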

Moonshot AI org

yes, see: https://huggingface.co/moonshotai/Kimi-K2.6/discussions/40
