Commit 2f87a05 by Dhurmir · verified · 1 Parent(s): 20b2d1b

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -12,7 +12,7 @@ Around 2 years ago i reported an issue in a [forum question](https://huggingface
  so in an effort to help everyone else that might come around this issue I'm submiting the patched sentencepiece model in here.
 
  Anyway, the issue lies in the tokenizer not properly tokenizing the sentinel tokens. Resulting in the breaking of the sentinel tokens into smaller tokens when they should have been kept intact. Long story short, the sentinel tokens were saved in the sentencepiece model with an empty space prefix (e.g., " \<extra_id_N\>" instead of "\<extra_id_N\>") resulting in all this mess, this meant that I had to manually change the sentencepiece model sentinel tokens into the proper format.
- I'm not sure if this might break something else but i'be done some experiments and everything seems fine with the patched model.
+ I'm not sure if this might break something else but i've done some experiments and everything seems fine with the patched model.
 
  I've tried to contact the MT5 repo maintainers several times to no avail in an effort that they publish the fixed model. Anyway, I put this in here in an effort to help others that may encounter the same problem as myself and so that they may know that it is a misconfiguration issue, not necessarily a model performance issue.
  If you are a member of the team and see this hey! :wave: please look up this issue an see if this fixed model works for you too!