How to use witiko/mathberta with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="witiko/mathberta")

# Load the model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
model = AutoModelForMaskedLM.from_pretrained("witiko/mathberta")
Problems with LaTeX tokenization
I would like to report a bug: when updating the transformers library (transformers 4.16.2 -> 4.20.1), the version of the tokenizers library also changed (tokenizers 0.10.3 -> 0.12.1), which changed the behavior of the tokenizer.
Consider an example.
This figure shows the operation of the tokenizer with tokenizer version 0.10.3
This figure shows the operation of the tokenizer with tokenizer version 0.12.1
The difference in this case is that "" is split off as a separate token.
There are also problems with splitting LaTeX commands such as "\cite" and "\Omega" into single tokens, in both versions of the tokenizers library.
Hi @DimOgu ,
Please note that mathberta has been trained with transformers==4.18.0, which requires tokenizers>=0.11.1,!=0.11.3,<0.13. Therefore, we recommend against using mathberta with versions of transformers older than 4.18.0, and we recommend using it with transformers==4.20.1 due to an issue that we have fixed in the meantime.
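As a rough sketch (not part of the original thread), the tokenizers version constraint quoted above can be checked with plain Python version tuples; the helper names parse and tokenizers_version_ok are hypothetical:

```python
def parse(version):
    """Turn a dotted version string like '0.12.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def tokenizers_version_ok(version):
    """Check a version against tokenizers>=0.11.1,!=0.11.3,<0.13."""
    v = parse(version)
    return parse("0.11.1") <= v < parse("0.13") and v != parse("0.11.3")

print(tokenizers_version_ok("0.12.1"))  # version paired with transformers==4.20.1 -> True
print(tokenizers_version_ok("0.10.3"))  # version from the bug report -> False
```

In practice you would pass in tokenizers.__version__; note that this simple tuple comparison does not handle pre-release suffixes the way PEP 440 does.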
If you still need to use it with an older version, check that the wordpieces produced by calling the tokenizer are the same:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
text = r"This \emph{Extended Patience Sorting Algorithm} is similar."
# on transformers==4.20.1 + tokenizers==0.12.1:
tokenizer(text).input_ids
>>> [0, 152, 1437, 57042, 50619, 1437, 11483, 6228, 3769, 11465, 208, 23817, 83, 53143, 50, 3432, 54598, 16, 16207, 4882, 55021, 2]
If the input_ids on your desired library versions match the input_ids from the supported version (transformers==4.20.1), you do not need to worry about the spaces in decoding (normally done by calling decode or batch_decode), and you can use the model without doubts.
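The comparison above can be automated with a small sketch (my own addition, not from the thread); EXPECTED_IDS are the ids quoted above for transformers==4.20.1 + tokenizers==0.12.1, and first_mismatch is a hypothetical helper:

```python
# input_ids quoted above for transformers==4.20.1 + tokenizers==0.12.1
EXPECTED_IDS = [0, 152, 1437, 57042, 50619, 1437, 11483, 6228, 3769, 11465,
                208, 23817, 83, 53143, 50, 3432, 54598, 16, 16207, 4882,
                55021, 2]

def first_mismatch(expected, actual):
    """Return the index of the first differing id, or None if the lists match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One list is a prefix of the other; the shorter length is where they diverge.
        return min(len(expected), len(actual))
    return None

# Compare against the ids produced on your installed versions, e.g.:
#   first_mismatch(EXPECTED_IDS, tokenizer(text).input_ids)
print(first_mismatch(EXPECTED_IDS, list(EXPECTED_IDS)))  # None -> versions agree
```

A return value of None means the tokenizations agree and the model can be used as described above.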

