https://huggingface.co/ewald1976/oracle-omega-24b

#2467
by ewald1976 - opened

NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Sadly, this is broken for some reason. Did you use the correct tokenizer?

Hello,
The weights should be healthy, I already converted and quantized the model myself via gguf-my-repo without issues, so this isn't a broken-merge situation. The problem might be the tokenizer hash. The model uses the standard Mistral Small 3.2 Tekken tokenizer, but the tokenizer.json is a transformers-converted Tekken file, so its chkhsh isn't in llama.cpp's known list. I tried to swap in a canonical tokenizer.json, but the official mistralai/Mistral-Small-3.2-24B-Instruct-2506 repo only ships tekken.json, so there's no upstream tokenizer.json with a registered hash to copy in.
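For readers unfamiliar with the chkhsh mentioned above: as far as I understand it, llama.cpp's converter encodes a fixed probe string with the model's tokenizer, hashes the resulting token IDs, and looks that hash up in a list of known pre-tokenizers. The sketch below only illustrates that idea. The probe text, the toy encoders, and the `known` table are all made up for the example and are not llama.cpp's real values.

```python
import hashlib
from typing import Callable, Dict, List, Optional

# Hypothetical probe text; the real converter uses its own fixed string.
PROBE_TEXT = "Hello world!  3.14 \t\n mixed CASE and 123 numbers"


def pretokenizer_fingerprint(encode: Callable[[str], List[int]],
                             probe_text: str = PROBE_TEXT) -> str:
    """Hash the token IDs an encoder produces for a fixed probe text."""
    token_ids = encode(probe_text)
    return hashlib.sha256(str(token_ids).encode()).hexdigest()


def lookup_pretokenizer(fingerprint: str,
                        known: Dict[str, str]) -> Optional[str]:
    """Return the registered pre-tokenizer name, or None if unrecognized."""
    return known.get(fingerprint)


# Toy stand-ins for two tokenizers that split the same text differently.
def toy_encode_a(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_encode_b(text: str) -> List[int]:
    return [len(word) for word in text.split()]


# Only encoder A's fingerprint is "registered" in this made-up table.
known = {pretokenizer_fingerprint(toy_encode_a): "toy-a"}

print(lookup_pretokenizer(pretokenizer_fingerprint(toy_encode_a), known))
print(lookup_pretokenizer(pretokenizer_fingerprint(toy_encode_b), known))
```

The point of the sketch is that the fingerprint depends on tokenizer behaviour, not on the weights, so a tokenizer.json that encodes the probe differently from every registered one is rejected even when the model itself is fine.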

Given that, would you be willing to map the pre-tokenizer to "tekken" for this model so the conversion can complete? If there's anything I can do on my side to make it easier (adjust files in the repo, provide additional info), just let me know.
No urgency whatsoever, and thank you again for all the work you share with everyone.

If there is still a problem, just close the request and I will try to remerge.

Thank you very much.

Hi, switching to a known tokenizer would definitely help, as that is the step that breaks. Is it possible to do that in the repo? We only work with mainline llama.cpp, so I'm not able to adjust anything on my side.

Done. I've replaced tokenizer.json and tokenizer_config.json with the ones from unsloth/Mistral-Small-3.2-24B-Instruct-2506. Both committed cleanly (the tokenizer.json has a different hash than before), so mainline llama.cpp should recognize it now. Could you give it another try when you have a moment? Thank you!
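A quick way to confirm locally that a replaced tokenizer.json really differs from the old one is to compare SHA-256 digests of the two files. This is only a sketch: the file paths in the usage comment are hypothetical, and a changed file digest shows the file was swapped, not that the converter will recognize the new tokenizer.

```python
import hashlib
from pathlib import Path
from typing import Union


def file_sha256(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_differ(old_path: Union[str, Path],
                 new_path: Union[str, Path]) -> bool:
    """True when the two files have different contents."""
    return file_sha256(old_path) != file_sha256(new_path)


# Usage with hypothetical paths:
#   files_differ("backup/tokenizer.json", "oracle-omega-24b/tokenizer.json")
```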

queued with high priority while we are both here just to see if it works or not =)

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#oracle-omega-24b-GGUF for quants to appear.

Thank you so much!

Is there a problem with newer MergeKit versions? Because there have been numerous failures for Mistral 24B based models recently.

Perhaps mergekit saves a different tokenizer. Someone else made a similar request (I don't remember who, possibly also a mergekit Mistral merge) and had to replace the tokenizer with a known one, and it might have worked. So perhaps all that's needed is for mergekit to fix how it saves the tokenizer.

@Naphula you are the expert for merging. Does @RichardErkhov's helpful suggestion make sense? I remember you have been experimenting with post-merge tokenisation healing.
