Could you please tell me why there is no tokenizer.json file for some models?
I want to use some of the lightweight cross-encoders, but I need the tokenizer.json file for that, and it is missing for some models.
Could you please tell me how I can generate the tokenizer.json file?
Huh, that's odd. You can generate it like so:
from transformers import AutoTokenizer

# Loading with AutoTokenizer and re-saving regenerates all tokenizer files,
# including tokenizer.json
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L6-v2")
print(tokenizer)
tokenizer.save_pretrained("tmp")
That produces special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.txt. It looks like this repository has all of these except tokenizer.json. Out of curiosity, what do you need the tokenizer.json file for, exactly?
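As an aside, tokenizer.json is the standalone serialization format of the `tokenizers` library, so once generated it can be loaded without `transformers` at all. A minimal sketch (the tiny WordPiece vocab here is made up purely for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Build a toy WordPiece tokenizer; this vocab is hypothetical, just to
# demonstrate the save/load round trip of the tokenizer.json format.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

tok.save("tokenizer.json")                       # writes the single standalone file
loaded = Tokenizer.from_file("tokenizer.json")   # round-trips without transformers
print(loaded.encode("hello world").ids)          # → [1, 2]
```

This is why libraries that ship only the `tokenizers` runtime (rather than full `transformers`) require tokenizer.json to be present in the repository.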
I'm looking into why this file was missing now.
- Tom Aarsen
Resolved via #8; this is also resolved for all other models under https://huggingface.co/cross-encoder
Thank you for reporting!
- Tom Aarsen
I want to use some lightweight cross-encoders together with the Qdrant vector DB. There is a library, fastembed, which needs all four of these files, including tokenizer.json, to use the ONNX model locally.
Thanks for sharing. It should work now!