ManufactuBERT

ManufactuBERT is a RoBERTa-base model continually pretrained on a high-quality corpus curated for the manufacturing domain. Thanks to a rigorous data selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

Model Details

  • Developed by: Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
  • Language: English
  • Domain: Manufacturing and Industrial Engineering

Training Data

The first ManufactuBERT model was continually pretrained on Manu-FineWeb, a 10-billion-token subset of the FineWeb dataset. The final model was trained on a further refined subset (~2 billion tokens) produced by a multi-stage pipeline that combines MinHash and SemDeDup to remove approximately 80% of the data as redundant. This deduplication accelerated convergence by approximately 33% compared to training on the non-deduplicated corpus.
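The MinHash stage of such a pipeline can be illustrated with a small self-contained sketch (this is not the authors' actual implementation; production deduplication additionally uses locality-sensitive hashing to avoid comparing every pair of documents):

```python
import hashlib

def shingles(text, k=3):
    """Word-level k-shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(doc_shingles, num_perm=64):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value observed over the document's shingles."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy documents: a and b are near-duplicates, c is unrelated.
a = "the cnc milling machine removes material from a rotating workpiece"
b = "the cnc milling machine removes material from a fixed workpiece"
c = "transformers are pretrained on large text corpora"

sa, sb, sc = (minhash_signature(shingles(t)) for t in (a, b, c))
print(estimated_jaccard(sa, sb) > estimated_jaccard(sa, sc))  # near-duplicates score higher
```

Documents whose estimated similarity exceeds a chosen threshold are treated as duplicates and dropped; SemDeDup then removes remaining semantic near-duplicates by comparing embedding vectors rather than n-gram overlap.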

Named Entity Recognition (NER)

| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemDNER | Avg. |
|---|---|---|---|---|---|---|---|
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | 76.50 | 83.88 | 82.10 | 72.60 | 85.88 | 92.00 | 82.16 |
| ManufactuBERT | 75.04 | 84.00 | 84.40 | 73.68 | 86.76 | 91.92 | 82.63 |

Sentence Classification & Relation Extraction

| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
|---|---|---|---|---|
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | 95.48 | 94.72 | 63.80 | 84.67 |
| ManufactuBERT | 94.62 | 94.68 | 65.80 | 85.03 |
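As a quick sanity check, the Avg. columns in both tables are plain macro-averages of the per-task scores and can be recomputed directly:

```python
# Per-task scores copied from the two tables above.
ner = {
    "RoBERTa":       [73.12, 82.48, 82.54, 69.52, 84.04, 90.50],
    "MatSciBERT":    [76.50, 83.88, 82.10, 72.60, 85.88, 92.00],
    "ManufactuBERT": [75.04, 84.00, 84.40, 73.68, 86.76, 91.92],
}
sc_re = {
    "RoBERTa":       [94.32, 94.32, 63.58],
    "MatSciBERT":    [95.48, 94.72, 63.80],
    "ManufactuBERT": [94.62, 94.68, 65.80],
}

avg_ner = {m: round(sum(v) / len(v), 2) for m, v in ner.items()}
avg_sc_re = {m: round(sum(v) / len(v), 2) for m, v in sc_re.items()}
print(avg_ner)    # {'RoBERTa': 80.37, 'MatSciBERT': 82.16, 'ManufactuBERT': 82.63}
print(avg_sc_re)  # {'RoBERTa': 84.07, 'MatSciBERT': 84.67, 'ManufactuBERT': 85.03}
```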

General Domain Evaluation (GLUE)

The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model | GLUE Avg. |
|---|---|
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| ManufactuBERT | 81.78 |

Citation

If you use ManufactuBERT in your research, please cite:

@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing}, 
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135}, 
}