Strange behaviour of the tokenizer

#58
by andercorral - opened

Inspecting the tokenizer config in 'tokenizer.json' I've noticed that the normalizer replaces blank spaces with the special character '▁'. Then, the pre-tokenizer is applied but it will never split on blank spaces as the resulting string from the normalizer will be a single string of '▁' concatenated words.

For example:

Original text: 'Hello World!"
Normalized text: "Hello▁World!"
Pretokenized text: "Hello▁World!" (no split at all)

"normalizer": {
    "type": "Replace",
    "pattern": {
      "String": " "
    },
    "content": "▁"
  },
  "pre_tokenizer": {
    "type": "Split",
    "pattern": {
      "String": " "
    },
    "behavior": "MergedWithPrevious",
    "invert": false
  }

Is this something expected? What is the pre-tokenizer for then?

Hi @andercorral ,

Apologies for late reply,
Thanks for bringing this to our attention. I was able to reproduce the issue for this particular edge case and have shared it with our internal team for further review.

why not replace it with 123x123x123 so even if ai wrote _ , the tokenizer would not replace the _ into space or blank

Sign up or log in to comment