license: apache-2.0
datasets:
  - ddrg/math_text
  - ddrg/math_formulas
  - ddrg/math_formula_retrieval
  - ddrg/named_math_formulas
language:
  - en
base_model:
  - google-bert/bert-base-cased

MAMUT-BERT (Math Mutator BERT)

MAMUT-BERT is a pretrained language model based on bert-base-cased, further pretrained on mathematical texts and formulas. It was introduced in the paper "MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training".

The model aims to provide improved mathematical understanding by extending BERT with domain-specific knowledge from mathematical LaTeX formulas and terminology.

Model Details

Overview

MAMUT-BERT was pretrained on four math-specific tasks across four datasets:

  • Mathematical Formulas (MF): A Masked Language Modeling (MLM) task on math formulas written in LaTeX.
  • Mathematical Texts (MT): An MLM task on natural-language text containing inline LaTeX math. The masking probability was biased toward mathematical tokens (those inside a math environment $...$) and domain-specific terms (e.g., sum, one, ...).
  • Named Math Formulas (NMF): A Next-Sentence-Prediction (NSP)-style task: given a formula and the name of a mathematical identity (e.g., Pythagorean Theorem), classify whether they match.
  • Math Formula Retrieval (MFR): Another NSP-style task to decide if two formulas describe the same mathematical identity or concept.
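The biased masking used for the MT task can be illustrated with a small sketch. Everything specific here is an assumption for illustration: the probabilities `P_TEXT` and `P_MATH`, the two-entry `DOMAIN_TERMS` set, and the whitespace tokenization (the actual training operates on the model's subword tokens, and the paper defines the real settings).

```python
import random
import re

# Hypothetical values; the model card does not state the actual probabilities.
P_TEXT = 0.15  # masking probability for ordinary text tokens
P_MATH = 0.30  # masking probability for math tokens and domain terms
DOMAIN_TERMS = {"sum", "one"}  # only the two examples named above


def mask_probabilities(text):
    """Return (token, masking probability) pairs for a text with inline $...$ math."""
    pairs = []
    in_math = False
    # Simplified tokenization: each `$` is its own token, everything else splits on whitespace.
    for token in re.findall(r"\$|[^\s$]+", text):
        if token == "$":
            in_math = not in_math
            pairs.append((token, 0.0))  # delimiters are never masked in this sketch
            continue
        favored = in_math or token.lower() in DOMAIN_TERMS
        pairs.append((token, P_MATH if favored else P_TEXT))
    return pairs


def apply_mask(text, rng, mask_token="[MASK]"):
    """Replace each token by mask_token with its token-specific probability."""
    return [
        mask_token if rng.random() < prob else token
        for token, prob in mask_probabilities(text)
    ]
```

Calling `apply_mask("The sum $a + b$ is small", random.Random(0))` returns a token list in which `sum`, `a`, `+`, and `b` are each masked with the higher probability, while the `$` delimiters are always kept.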

Training Overview

To support mathematical syntax, 300 additional mathematical LaTeX-specific tokens were added to the tokenizer, e.g., \sum, \frac, and pmatrix.
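As a rough sketch of how such a vocabulary extension is typically done with the Hugging Face `transformers` library (not necessarily the authors' exact procedure), new tokens are added to the tokenizer and the embedding matrix is resized to match. `EXAMPLE_TOKENS` lists only the three tokens named above, and the helper name `extend_tokenizer` is hypothetical.

```python
# Only the three examples mentioned above; the full list of about 300 tokens is not given here.
EXAMPLE_TOKENS = ["\\sum", "\\frac", "pmatrix"]


def extend_tokenizer(base_model="google-bert/bert-base-cased", new_tokens=EXAMPLE_TOKENS):
    """Add LaTeX tokens to a tokenizer and resize the model embeddings accordingly."""
    # Imported here so the sketch can be read and loaded without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModel.from_pretrained(base_model)

    # add_tokens returns how many tokens were actually new to the vocabulary.
    num_added = tokenizer.add_tokens(list(new_tokens))

    # New embedding rows are freshly initialized and only become useful after further training.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added
```

After this step, a string such as `\frac` is kept as a single token instead of being split into several subword pieces.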

Model Sources

Uses

MAMUT-BERT is intended for downstream tasks that require improved mathematical understanding, such as:

  • Formula classification
  • Retrieval of semantically similar formulas
  • Math-related question answering

Note: This model was saved without the MLM or NSP heads and requires fine-tuning before use in downstream tasks.
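A minimal sketch of preparing the model for fine-tuning on a pair-classification task (for example, formula and identity name) might look as follows. It assumes the Hugging Face `transformers` library; `model_id` is a placeholder for this model's repository name, which is not stated here, and both helper names are hypothetical.

```python
def build_pair_classifier(model_id, num_labels=2):
    """Load the encoder and attach a new, randomly initialized classification head."""
    # Imported here so the sketch can be read and loaded without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The head is untrained because the checkpoint ships without task heads,
    # so the model must be fine-tuned before its predictions are meaningful.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def encode_pair(tokenizer, formula, name):
    """Encode a (formula, name) pair as a single two-segment BERT input."""
    return tokenizer(formula, name, truncation=True, return_tensors="pt")
```

The encoded pair can then be passed to the model as `model(**encode_pair(tokenizer, "a^2 + b^2 = c^2", "Pythagorean Theorem"))` inside an ordinary fine-tuning loop.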

Similarly trained models are MAMUT-MathBERT, based on tbs17/MathBERT, and MAMUT-MPBERT, based on AnReu/math_structure_bert (the best of the three models according to our evaluation).

Training Details

Training configurations are described in Appendix C of the MAMUT paper.

Evaluation

The model is evaluated in Section 7 and Appendix C.4 of the MAMUT paper (MAMUT-BERT).

Environmental Impact

  • Hardware Type: 8xA100
  • Hours used: 48
  • Compute Region: Germany

Citation

BibTeX:

@misc{drechsel2025mamutnovelframeworkmodifying,
      title={{MAMUT}: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training}, 
      author={Jonathan Drechsel and Anja Reusch and Steffen Herbold},
      year={2025},
      eprint={2502.20855},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.20855}, 
}