---
license: mit
datasets:
  - HuggingFaceFW/fineweb
language:
  - en
metrics:
  - f1
base_model:
  - FacebookAI/roberta-base
library_name: transformers
tags:
  - manufacturing
---

# ManufactuBERT

ManufactuBERT is a RoBERTa-base model continually pretrained on a high-quality corpus curated specifically for the manufacturing domain. Thanks to a rigorous data selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

## Model Details

- **Developed by:** Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- **Language:** English
- **Domain:** Manufacturing and Industrial Engineering

## Training Data

The first ManufactuBERT model was continually pretrained on Manu-FineWeb, a 10-billion-token subset of the FineWeb dataset. The final model version used a further refined corpus (~2 billion tokens) produced by a multi-stage deduplication pipeline combining MinHash (near-duplicate detection) and SemDeDup (semantic deduplication), which removed approximately 80% of redundant data. This deduplication accelerated convergence by approximately 33% compared to training on the non-deduplicated corpus.
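The MinHash stage can be illustrated with a minimal, self-contained sketch (not the authors' exact pipeline): each document is shingled into word n-grams, a fixed-length signature keeps the minimum salted hash per "permutation", and two documents are flagged as near-duplicates when the fraction of matching signature slots (an estimate of Jaccard similarity) exceeds a threshold. The number of permutations and the 0.7 threshold below are illustrative assumptions.

```python
import hashlib

NUM_PERM = 64    # number of hash "permutations" (illustrative choice)
THRESHOLD = 0.7  # estimated-Jaccard cutoff for near-duplicates (illustrative)

def shingles(text, n=3):
    """Split a document into its set of word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(shingle_set):
    """Signature = min salted hash per permutation; the probability that two
    signatures agree in one slot equals the Jaccard similarity of the sets."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}|{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(NUM_PERM)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

docs = [
    "the milling machine removes material from a workpiece",
    "the milling machine removes material from a workpiece surface",  # near-dup
    "additive manufacturing builds parts layer by layer",
]
sigs = [minhash(shingles(d)) for d in docs]

# Greedy dedup: keep a document only if no earlier kept one is a near-duplicate.
kept = []
for i, sig in enumerate(sigs):
    if all(est_jaccard(sig, sigs[j]) < THRESHOLD for j in kept):
        kept.append(i)
```

In the real pipeline, locality-sensitive hashing bucketing would avoid the all-pairs comparison, and the SemDeDup stage would additionally drop documents whose embeddings are semantically close even when their surface n-grams differ.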

## Named Entity Recognition (NER)

| Model         | Mat. Syn. | FabNER | SOFC  | SOFC Slot | MatScholar | ChemDNER | Avg.  |
|---------------|-----------|--------|-------|-----------|------------|----------|-------|
| RoBERTa       | 73.12     | 82.48  | 82.54 | 69.52     | 84.04      | 90.50    | 80.37 |
| MatSciBERT    | 76.50     | 83.88  | 82.10 | 72.60     | 85.88      | 92.00    | 82.16 |
| ManufactuBERT | 75.04     | 84.00  | 84.40 | 73.68     | 86.76      | 91.92    | 82.63 |

## Sentence Classification & Relation Extraction

| Model         | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg.  |
|---------------|----------------|-----------|-----------------|-------|
| RoBERTa       | 94.32          | 94.32     | 63.58           | 84.07 |
| MatSciBERT    | 95.48          | 94.72     | 63.80           | 84.67 |
| ManufactuBERT | 94.62          | 94.68     | 65.80           | 85.03 |

## General Domain Evaluation (GLUE)

The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model         | GLUE Avg. |
|---------------|-----------|
| RoBERTa       | 86.35     |
| MatSciBERT    | 77.00     |
| SciBERT       | 78.13     |
| ManufactuBERT | 81.78     |

## Citation

If you use ManufactuBERT in your research, please cite:

```bibtex
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
```