ManufactuBERT

ManufactuBERT is a RoBERTa-base model continually pretrained on a high-quality corpus curated for the manufacturing domain. Thanks to a rigorous data selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

Model Details

  • Developed by: Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
  • Language: English
  • Domain: Manufacturing and Industrial Engineering

Training Data

The first ManufactuBERT model was continually pretrained on Manu-FineWeb, a 10-billion-token subset of the FineWeb dataset. The final model was trained on a further refined subset (~2 billion tokens) produced by a multi-stage pipeline that combines MinHash and SemDeDup to remove approximately 80% of the data as redundant. This deduplication accelerated convergence by approximately 33% compared to training on the non-deduplicated corpus.
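The MinHash stage of such a pipeline can be illustrated with a small self-contained sketch (this is not the authors' actual implementation; production deduplication additionally uses locality-sensitive hashing to avoid comparing every pair of documents):

```python
import hashlib

def shingles(text, k=3):
    """Word-level k-shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(doc_shingles, num_perm=64):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value observed over the document's shingles."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy documents: a and b are near-duplicates, c is unrelated.
a = "the cnc milling machine removes material from a rotating workpiece"
b = "the cnc milling machine removes material from a fixed workpiece"
c = "transformers are pretrained on large text corpora"

sa, sb, sc = (minhash_signature(shingles(t)) for t in (a, b, c))
print(estimated_jaccard(sa, sb) > estimated_jaccard(sa, sc))  # near-duplicates score higher
```

Documents whose estimated similarity exceeds a chosen threshold are treated as duplicates and dropped; SemDeDup then removes remaining semantic near-duplicates by comparing embedding vectors rather than n-gram overlap.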

Named Entity Recognition (NER)

| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemDNER | Avg. |
|---|---|---|---|---|---|---|---|
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | 76.50 | 83.88 | 82.10 | 72.60 | 85.88 | 92.00 | 82.16 |
| ManufactuBERT | 75.04 | 84.00 | 84.40 | 73.68 | 86.76 | 91.92 | 82.63 |

Sentence Classification & Relation Extraction

| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
|---|---|---|---|---|
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | 95.48 | 94.72 | 63.80 | 84.67 |
| ManufactuBERT | 94.62 | 94.68 | 65.80 | 85.03 |
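As a quick sanity check, the Avg. columns in both tables are plain macro-averages of the per-task scores and can be recomputed directly:

```python
# Per-task scores copied from the two tables above.
ner = {
    "RoBERTa":       [73.12, 82.48, 82.54, 69.52, 84.04, 90.50],
    "MatSciBERT":    [76.50, 83.88, 82.10, 72.60, 85.88, 92.00],
    "ManufactuBERT": [75.04, 84.00, 84.40, 73.68, 86.76, 91.92],
}
sc_re = {
    "RoBERTa":       [94.32, 94.32, 63.58],
    "MatSciBERT":    [95.48, 94.72, 63.80],
    "ManufactuBERT": [94.62, 94.68, 65.80],
}

avg_ner = {m: round(sum(v) / len(v), 2) for m, v in ner.items()}
avg_sc_re = {m: round(sum(v) / len(v), 2) for m, v in sc_re.items()}
print(avg_ner)    # {'RoBERTa': 80.37, 'MatSciBERT': 82.16, 'ManufactuBERT': 82.63}
print(avg_sc_re)  # {'RoBERTa': 84.07, 'MatSciBERT': 84.67, 'ManufactuBERT': 85.03}
```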

General Domain Evaluation (GLUE)

The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model | GLUE Avg. |
|---|---|
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| ManufactuBERT | 81.78 |

Citation

If you use ManufactuBERT in your research, please cite:

@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing}, 
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135}, 
}