How to use witiko/mathberta with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="witiko/mathberta")

# Load the model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
model = AutoModelForMaskedLM.from_pretrained("witiko/mathberta")
Problems with LaTeX tokenization
I would like to report a bug: when updating the transformers library (transformers 4.16.2 -> 4.20.1), the version of the tokenizers library also changed (tokenizers 0.10.3 -> 0.12.1), which changed the behavior of the tokenizer.
Consider an example.
This figure shows the operation of the tokenizer with tokenizer version 0.10.3
This figure shows the operation of the tokenizer with tokenizer version 0.12.1
The difference in this case is that "" is split off as a separate token.
There are also problems with splitting LaTeX commands such as "\cite" and "\Omega" into single tokens, in both versions of the tokenizers library.
Hi @DimOgu ,
Please note that mathberta has been trained with transformers==4.18.0, which requires tokenizers>=0.11.1,!=0.11.3,<0.13. Therefore, we recommend against using mathberta with versions of transformers older than 4.18.0, and we recommend using it with transformers==4.20.1 due to an issue that we have fixed in the meantime.
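As a rough sketch (not part of the original thread), the tokenizers version constraint quoted above can be checked with plain Python version tuples; the helper names parse and tokenizers_version_ok are hypothetical:

```python
def parse(version):
    """Turn a dotted version string like '0.12.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def tokenizers_version_ok(version):
    """Check a version against tokenizers>=0.11.1,!=0.11.3,<0.13."""
    v = parse(version)
    return parse("0.11.1") <= v < parse("0.13") and v != parse("0.11.3")

print(tokenizers_version_ok("0.12.1"))  # version paired with transformers==4.20.1 -> True
print(tokenizers_version_ok("0.10.3"))  # version from the bug report -> False
```

In practice you would pass in tokenizers.__version__; note that this simple tuple comparison does not handle pre-release suffixes the way PEP 440 does.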
If you still need to use it with an older version, check that the wordpieces produced by calling the tokenizer are the same:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
text = r"This \emph{Extended Patience Sorting Algorithm} is similar."
# on transformers==4.20.1 + tokenizers==0.12.1:
tokenizer(text).input_ids
>>> [0, 152, 1437, 57042, 50619, 1437, 11483, 6228, 3769, 11465, 208, 23817, 83, 53143, 50, 3432, 54598, 16, 16207, 4882, 55021, 2]
If the input_ids on your desired library versions match the input_ids from the supported version (transformers==4.20.1), you do not need to worry about the spaces in decoding (normally done by calling decode or batch_decode), and you can use the model without doubts.
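The comparison above can be automated with a small sketch (my own addition, not from the thread); EXPECTED_IDS are the ids quoted above for transformers==4.20.1 + tokenizers==0.12.1, and first_mismatch is a hypothetical helper:

```python
# input_ids quoted above for transformers==4.20.1 + tokenizers==0.12.1
EXPECTED_IDS = [0, 152, 1437, 57042, 50619, 1437, 11483, 6228, 3769, 11465,
                208, 23817, 83, 53143, 50, 3432, 54598, 16, 16207, 4882,
                55021, 2]

def first_mismatch(expected, actual):
    """Return the index of the first differing id, or None if the lists match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One list is a prefix of the other; the shorter length is where they diverge.
        return min(len(expected), len(actual))
    return None

# Compare against the ids produced on your installed versions, e.g.:
#   first_mismatch(EXPECTED_IDS, tokenizer(text).input_ids)
print(first_mismatch(EXPECTED_IDS, list(EXPECTED_IDS)))  # None -> versions agree
```

A return value of None means the tokenizations agree and the model can be used as described above.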

