Could you please tell me why there is no tokenizer.json file for some models?
I want to use some of the lightweight cross-encoders, but I need the tokenizer.json file for that, and it is missing for some models.
Could you please tell me how I can generate the tokenizer.json file?
Huh, that's odd. You can generate it like so:
from transformers import AutoTokenizer

# Loading with AutoTokenizer and re-saving regenerates all tokenizer files,
# including tokenizer.json
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L6-v2")
print(tokenizer)
tokenizer.save_pretrained("tmp")
That produces special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.txt. It looks like this repository has all of these except tokenizer.json. Out of curiosity, what do you need the tokenizer.json file for, exactly?
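As an aside, tokenizer.json is the standalone serialization format of the `tokenizers` library, so once generated it can be loaded without `transformers` at all. A minimal sketch (the tiny WordPiece vocab here is made up purely for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Build a toy WordPiece tokenizer; this vocab is hypothetical, just to
# demonstrate the save/load round trip of the tokenizer.json format.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

tok.save("tokenizer.json")                       # writes the single standalone file
loaded = Tokenizer.from_file("tokenizer.json")   # round-trips without transformers
print(loaded.encode("hello world").ids)          # → [1, 2]
```

This is why libraries that ship only the `tokenizers` runtime (rather than full `transformers`) require tokenizer.json to be present in the repository.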
I'm looking into why this file was missing now.
- Tom Aarsen
Resolved via #8; this is also resolved for all other models under https://huggingface.co/cross-encoder
Thank you for reporting!
- Tom Aarsen
I want to use some lightweight cross-encoders together with the Qdrant vector DB. There is a library, fastembed, which needs all four of these files, including tokenizer.json, to use the ONNX model locally.
Thanks for sharing. It should work now!