+---
+language: en
+license: mit
+datasets:
+- arxmliv
+- math-stackexchange
+---
+# MathBERTa base model
+A model pretrained on English text using a masked language modeling (MLM)
+objective. It was developed for [the ARQMath-3 shared task evaluation][1] at
+CLEF 2022 and first released in [this repository][2]. This model is case-sensitive:
+it distinguishes between english and English.
+ [1]: https://www.cs.rit.edu/~dprl/ARQMath/
+ [2]: https://github.com/witiko/scm-at-arqmath3
+## Model description
+MathBERTa is [the RoBERTa base transformer model][3] whose tokenizer has been
+extended with LaTeX math symbols and which has been fine-tuned on a large
+corpus of English mathematical texts.
+Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
+(MLM) objective. Given a sentence, the training procedure randomly masks 15% of
+the words and math symbols in the input, runs the entire masked sentence through
+the model, and has the model predict the masked words and symbols. This way, the model
+learns an inner representation of the English language and the language of
+LaTeX that can then be used to extract features useful for downstream tasks.
+ [3]: https://huggingface.co/roberta-base
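The masking step described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the training code used for MathBERTa: it works on whitespace-separated tokens rather than subword tokens, always substitutes `<mask>` for a selected token, and the helper name `mask_tokens` is hypothetical. The original RoBERTa recipe additionally leaves some selected tokens unchanged or swaps in random ones, which is omitted here.

```python
import random

MASK_TOKEN = "<mask>"  # the mask token used in the fill-mask example of this card


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)
    # Assumption of this sketch: mask at least one token (when there is one),
    # so that short inputs still produce a prediction target.
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Whitespace tokenization is used purely for illustration.
tokens = r"If \theta = \pi , then \sin ( \theta ) is zero .".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

With 13 input tokens and a 15% masking rate, two positions are masked per call; which two depends on the seed.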
+## Intended uses & limitations
+You can use the raw model for masked language modeling, but it's mostly
+intended to be fine-tuned on a downstream task. See the [model
+hub][4] to look for fine-tuned versions on a task that interests you.
+Note that this model is primarily aimed at being fine-tuned on tasks that use
+the whole sentence (potentially masked) to make decisions, such as sequence
+classification, token classification, or question answering. For tasks such as
+text generation, you should look at a model like GPT-2.
+ [4]: https://huggingface.co/models?filter=roberta
+### How to use
+You can use this model directly with a pipeline for masked language modeling:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='witiko/mathberta')
+>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.")
+[{'sequence': ' If \theta = \\pi, then\\sin( \theta) is zero.',
+  'score': 0.20843125879764557,
+  'token': 4276,
+  'token_str': ' zero'},
+ {'sequence': ' If \theta = \\pi, then\\sin( \theta) is 0.',
+  'score': 0.15149112045764923,
+  'token': 321,
+  'token_str': ' 0'},
+ {'sequence': ' If \theta = \\pi, then\\sin( \theta) is undefined.',
+  'score': 0.10619527101516724,
+  'token': 45436,
+  'token_str': ' undefined'},
+ {'sequence': ' If \theta = \\pi, then\\sin( \theta) is 1.',
+  'score': 0.09486620128154755,
+  'token': 112,
+  'token_str': ' 1'},
+ {'sequence': ' If \theta = \\pi, then\\sin( \theta) is even.',
+  'score': 0.05402865260839462,
+  'token': 190,
+  'token_str': ' even'}]
+```
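The pipeline returns a list of dictionaries, one per candidate token, so ordinary Python is enough to post-process it. The following sketch keeps only the candidates above a score threshold. The scores and token strings are copied from the output above; the helper name `confident_tokens` and the threshold of 0.1 are illustrative assumptions, not part of the `transformers` API.

```python
# Predictions copied from the fill-mask output above (only the fields used here).
predictions = [
    {"score": 0.20843125879764557, "token_str": " zero"},
    {"score": 0.15149112045764923, "token_str": " 0"},
    {"score": 0.10619527101516724, "token_str": " undefined"},
    {"score": 0.09486620128154755, "token_str": " 1"},
    {"score": 0.05402865260839462, "token_str": " even"},
]


def confident_tokens(predictions, min_score=0.1):
    """Return the whitespace-stripped tokens whose score is at least min_score."""
    return [p["token_str"].strip() for p in predictions if p["score"] >= min_score]


print(confident_tokens(predictions))  # ['zero', '0', 'undefined']
```

Stripping the token strings removes the leading space that RoBERTa-style tokenizers attach to word-initial tokens.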
+Here is how to use this model to get the features of a given text in PyTorch:
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
+model = AutoModel.from_pretrained('witiko/mathberta')
+text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+output = model(**encoded_input)
+```
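Both examples above wrap formulas in `[MATH] ... [/MATH]` markers. If your text uses `$...$` for inline math instead, a small preprocessing step along the following lines may help. This is a sketch under assumptions: the helper name `mark_math` is hypothetical, the regular expression handles neither escaped dollar signs (`\$`) nor `$$...$$` display math, and this card does not state whether the model strictly requires this input format.

```python
import re

# Matches inline math delimited by single dollar signs, e.g. $x^2$.
# Assumption: no escaped dollar signs (\$) and no $$...$$ display math.
INLINE_MATH = re.compile(r"\$([^$]+)\$")


def mark_math(text):
    """Rewrite $...$ spans as [MATH] ... [/MATH], the markers used in the examples above."""
    # A function is used as the replacement so that backslashes in the LaTeX
    # source are inserted literally instead of being parsed as escapes.
    return INLINE_MATH.sub(lambda m: f"[MATH] {m.group(1).strip()} [/MATH]", text)


text = mark_math(r"If $\theta = \pi$, then $\sin(\theta)$ is <mask>.")
print(text)
# If [MATH] \theta = \pi [/MATH], then [MATH] \sin(\theta) [/MATH] is <mask>.
```

The resulting string can be passed to the tokenizer or the fill-mask pipeline in place of hand-written input.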
+## Training data
+The RoBERTa model was fine-tuned on two datasets:
+- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 arXiv documents.
+- [Math StackExchange][6], a dataset of 2,466,080 questions and answers.
+Together, these datasets amount to 52 GB of text and LaTeX.
+ [5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
+ [6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html