How was the `tokenizer.json` created?

#3
by VaishalBusiness - opened

Hi Xenova team,

I'm trying to understand how you generated the `tokenizer.json` file used in your models. Was it directly exported from a SentencePiece model, converted via Hugging Face's transformers tools, or created through a custom process?

Specifically, I'm interested in reproducing the same structure for a custom SentencePiece model so it works with your ONNX/transformers.js pipelines.
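For context on what I'm trying to reproduce: my understanding (an assumption on my part, not a description of your actual pipeline) is that a `tokenizer.json` for a SentencePiece Unigram model typically has a fixed top-level layout once serialized by the fast-tokenizer machinery. The sketch below writes out that layout with placeholder values so it's clear which structure I mean; all field values (the tiny vocab, the charsmap, etc.) are illustrative, not taken from any real model:

```python
import json

# Hypothetical sketch of the top-level layout of a tokenizer.json file
# for a SentencePiece Unigram model. All values below are illustrative
# placeholders, not copied from any actual Xenova model.
tokenizer_json = {
    "version": "1.0",
    "truncation": None,
    "padding": None,
    "added_tokens": [
        {"id": 0, "content": "<unk>", "single_word": False,
         "lstrip": False, "rstrip": False,
         "normalized": False, "special": True},
    ],
    # SentencePiece normalization is usually carried over as a
    # "Precompiled" normalizer with the model's character map.
    "normalizer": {"type": "Precompiled", "precompiled_charsmap": ""},
    # "Metaspace" replaces spaces with U+2581, as SentencePiece does.
    "pre_tokenizer": {"type": "Metaspace", "replacement": "\u2581"},
    "post_processor": None,
    "decoder": {"type": "Metaspace", "replacement": "\u2581"},
    "model": {
        "type": "Unigram",
        "unk_id": 0,
        # vocab is a list of [piece, log-probability] pairs, mirroring
        # the pieces stored in the SentencePiece .model file.
        "vocab": [["<unk>", 0.0], ["\u2581hello", -8.1], ["\u2581world", -9.3]],
    },
}

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer_json, f, ensure_ascii=False, indent=2)
```

Is this roughly the shape your files follow, and if so, did you generate them by hand-converting the SentencePiece pieces like this, or by loading the slow tokenizer in transformers and saving the fast version?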

Could you please share how you built or converted it — and which tools or scripts were used?

Thanks in advance!

VaishalBusiness changed discussion status to closed
VaishalBusiness changed discussion status to open

Hi Xenova team,

I hope you're doing well. I posted the message above on October 22 regarding how the `tokenizer.json` file was generated for your models, but I haven't heard back yet.

I’m still very interested in understanding whether it was exported directly from a SentencePiece model, converted via Hugging Face tools, or created through a custom process — and any guidance for reproducing the same structure for a custom SentencePiece model to work with your ONNX/transformers.js pipelines.

I’d greatly appreciate any insight or pointers whenever you have a chance.

Thank you very much!
