Unable to convert BioGpt slow tokenizer to fast: token out of vocabulary

#13

by Seantaud - opened Mar 7, 2023


Seantaud

Mar 7, 2023

System Info
I was trying to use the BioGpt model in my code for fine-tuning. I would like to construct a fast tokenizer class based on the BioGptTokenizer so that I can use the offsets_mapping to find out which words the tokens originate from. Unfortunately, when creating a BiogptTokenizerFast from PreTrainedTokenizerFast via convert_slow_tokenizer, the following error occurs: Error while initializing BPE: Token -@</w> out of vocabulary.

Reproduction
I copied the relevant code into a Colab notebook. Here is the link: https://colab.research.google.com/drive/1IMhiDz45GiarBLgXG9B2rA_u0ZOmmjJS?usp=sharing

Expected behavior
According to this issue https://github.com/huggingface/transformers/issues/9290, this problem might be caused by some tokens missing from vocab.json or merges.txt. Could you please check it? Thank you very much!
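To make the suspected cause easier to check, here is a minimal sketch of a consistency check between the two files. It assumes that each merge line has the form "left right" (optionally followed by a count), and that a fast BPE model needs both parts and their concatenation to be present in the vocabulary. The function name find_missing_merge_tokens and the toy data are hypothetical and only for illustration; they are not part of transformers or tokenizers.

```python
import json


def find_missing_merge_tokens(vocab, merge_lines):
    """Return (line_number, token) pairs for merge tokens absent from vocab.

    Assumption: each merge line is "left right" with an optional trailing
    count; only the first two fields are used.
    """
    missing = []
    for line_no, line in enumerate(merge_lines, start=1):
        fields = line.split()
        # Skip blank lines and "#version"-style header lines.
        if len(fields) < 2 or line.startswith("#"):
            continue
        left, right = fields[0], fields[1]
        # Both parts and the merged token are looked up in the vocabulary.
        for token in (left, right, left + right):
            if token not in vocab:
                missing.append((line_no, token))
    return missing


def check_files(vocab_path, merges_path):
    """Run the check on local copies of vocab.json and merges.txt."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    with open(merges_path, encoding="utf-8") as f:
        merge_lines = f.read().split("\n")
    return find_missing_merge_tokens(vocab, merge_lines)


# Toy data: the merge "- @</w>" produces "-@</w>", which is not in the vocab.
toy_vocab = {"-": 0, "@</w>": 1, "a": 2, "b</w>": 3, "ab</w>": 4}
toy_merges = ["a b</w> 10", "- @</w> 5"]
print(find_missing_merge_tokens(toy_vocab, toy_merges))
```

Running check_files on local copies of the model's vocab.json and merges.txt should list every merge that refers to a token the vocabulary does not contain, which may help confirm whether -@</w> is the only offending entry.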

tekeshwarhirwani

Jun 14, 2023

Dude, any update?
