Error and fix with tokenizer.json and PyDecoderWrapper
While trying to load and test the model, I encountered the following error:
"Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 88 column 3"
I don't know what exactly the problem is here, but some vibe-bugfixing with the help of ChatGPT yielded a potential fix.
On lines 84 and 167 of tokenizer.json I replaced "prepend_scheme": "always", with "add_prefix_space": true, - which resolved the issue in my case. The culprits seemed to be the following snippets:
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Metaspace",
      "replacement": "▁",
      "add_prefix_space": true,
      "split": true
    }
  ]
},
and
"decoder": {
  "type": "Metaspace",
  "replacement": "▁",
  "add_prefix_space": true,
  "split": true
},
(In both snippets, "add_prefix_space": true, is the line that formerly read "prepend_scheme": "always", - note that JSON does not allow inline comments, so don't leave any annotation in the file itself.) According to ChatGPT, the problem is that the field prepend_scheme: "always" is not recognized by the Rust deserializer, which expects the boolean add_prefix_space instead. As far as I can tell, this points to a version mismatch: prepend_scheme was only introduced to the Metaspace serialization in newer releases of the tokenizers package, so an older installed tokenizers cannot parse a tokenizer.json that uses it - upgrading tokenizers itself (not just transformers) might therefore also work.
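Since the same substitution has to be made in two places (and possibly elsewhere in other models), here is a small Python sketch that applies it to every Metaspace component in a tokenizer.json automatically. The helper name patch_metaspace is my own; the field names and the "always" -> true mapping are the same substitution described above, and the mapping of other prepend_scheme values to a boolean is my assumption:

```python
import json

def patch_metaspace(node):
    # Recursively replace the newer "prepend_scheme" field with the older
    # "add_prefix_space" boolean in every Metaspace component.
    # Assumption: "always" maps to true and anything else to false
    # ("first" has no exact boolean equivalent in the old format).
    if isinstance(node, dict):
        if node.get("type") == "Metaspace" and "prepend_scheme" in node:
            scheme = node.pop("prepend_scheme")
            node["add_prefix_space"] = (scheme == "always")
        for value in node.values():
            patch_metaspace(value)
    elif isinstance(node, list):
        for item in node:
            patch_metaspace(item)
    return node

# Minimal example shaped like the pre_tokenizer entry in my tokenizer.json:
config = {
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {"type": "Metaspace", "replacement": "▁",
             "prepend_scheme": "always", "split": True}
        ]
    }
}
patch_metaspace(config)

# To patch a real file in place:
# with open("tokenizer.json", encoding="utf-8") as f:
#     config = json.load(f)
# patch_metaspace(config)
# with open("tokenizer.json", "w", encoding="utf-8") as f:
#     json.dump(config, f, ensure_ascii=False, indent=2)
```

Back up tokenizer.json before running this - it rewrites the file.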
If you need any clarification, feel free to reach out. My problems might be due to my inexperience or something local to my setup, but other suggested fixes, like updating transformers, did not help. Also, when I tried to initialize the tokenizer with use_fast=False, my kernel crashed.
Hope this proves helpful!
Best regards and thanks for your work,
Lasse