ManufactuBERT
ManufactuBERT is a specialized RoBERTa-base model, continually pretrained on a high-quality corpus curated for the manufacturing domain. Using a rigorous data selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.
Model Details
- Developed by: Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- Language: English
- Domain: Manufacturing and Industrial Engineering
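
A minimal usage sketch follows. It assumes the checkpoint is available on the Hugging Face Hub under the repository id `rarmingaud/ManufactuBERT` (the id shown in this page's model tree), that it ships with its masked-language-modeling head, and that the `transformers` library is installed. The helper names `build_fill_mask` and `top_predictions`, and the `[MASK]` placeholder, are conventions of this sketch only, not part of the released model.

```python
MODEL_ID = "rarmingaud/ManufactuBERT"  # repository id taken from the model page


def build_fill_mask(model_id: str = MODEL_ID):
    """Create a fill-mask pipeline (downloads the weights on first use)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, template: str, k: int = 5):
    """Return (token, score) pairs for the masked position in `template`.

    `template` marks the gap with the placeholder "[MASK]" (a convention of
    this sketch); it is swapped for the tokenizer's own mask token.
    """
    text = template.replace("[MASK]", fill_mask.tokenizer.mask_token)
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in results]
```

Calling `top_predictions(build_fill_mask(), "The shaft was turned on a [MASK].")` would then list the most likely fillers for the gap. For downstream tasks such as NER or sentence classification, the same repository id would typically be passed to a task-specific `AutoModelFor...` class and fine-tuned.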
Training Data
The first ManufactuBERT model was continually pretrained on Manu-FineWeb, a 10-billion-token subset of the FineWeb dataset. The final model was trained on a further refined version (~2 billion tokens), produced by a multi-stage deduplication process using MinHash and SemDeDup that removed approximately 80% of the data as redundant. Deduplication accelerated convergence by approximately 33% compared to the non-deduplicated version.
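
To illustrate the lexical stage of such a pipeline, here is a small, standard-library-only sketch of MinHash near-duplicate filtering. It is not the authors' implementation: the shingle size, signature length, threshold, and all function names are illustrative assumptions, and the semantic stage (SemDeDup, which works on embeddings) is not shown.

```python
import hashlib
import re

NUM_HASHES = 64    # signature length (illustrative value)
SHINGLE_SIZE = 3   # words per shingle (illustrative value)
THRESHOLD = 0.7    # estimated Jaccard similarity treated as "duplicate" (illustrative)


def shingles(text, k=SHINGLE_SIZE):
    """Split text into a set of overlapping k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """For each seed, keep the minimum hash over all shingles of the text."""
    doc_shingles = shingles(text)
    return [min(_seeded_hash(s, seed) for s in doc_shingles)
            for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=THRESHOLD):
    """Keep a document only if it is not too similar to any document already kept."""
    kept, kept_signatures = [], []
    for doc in docs:
        signature = minhash_signature(doc)
        if all(estimated_jaccard(signature, other) < threshold
               for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(signature)
    return kept
```

This sketch compares every new document against all kept ones, which is quadratic in corpus size. At the scale of billions of tokens, MinHash is normally paired with locality-sensitive hashing (banding the signature) so that only likely duplicates are compared.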
Named Entity Recognition (NER)
| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemdNER | Avg. |
|---|---|---|---|---|---|---|---|
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | 76.50 | 83.88 | 82.10 | 72.60 | 85.88 | 92.00 | 82.16 |
| ManufactuBERT | 75.04 | 84.00 | 84.40 | 73.68 | 86.76 | 91.92 | 82.63 |
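
The Avg. column appears to be the unweighted mean of the six per-dataset scores. The short check below, with the scores copied from the table above, recomputes it under that assumption; the variable and function names are purely illustrative.

```python
from statistics import mean

# Per-dataset NER scores copied from the table above, in column order:
# Mat. Syn., FabNER, SOFC, SOFC Slot, MatScholar, ChemdNER.
NER_SCORES = {
    "RoBERTa": [73.12, 82.48, 82.54, 69.52, 84.04, 90.50],
    "MatSciBERT": [76.50, 83.88, 82.10, 72.60, 85.88, 92.00],
    "ManufactuBERT": [75.04, 84.00, 84.40, 73.68, 86.76, 91.92],
}

# Averages as reported in the table's last column.
REPORTED_AVG = {"RoBERTa": 80.37, "MatSciBERT": 82.16, "ManufactuBERT": 82.63}


def recompute_averages(scores):
    """Unweighted mean per model, rounded to two decimals like the table."""
    return {model: round(mean(values), 2) for model, values in scores.items()}


averages = recompute_averages(NER_SCORES)
for model, avg in averages.items():
    print(f"{model}: recomputed {avg:.2f}, reported {REPORTED_AVG[model]:.2f}")
```

Because the mean is unweighted, each dataset contributes equally regardless of its size.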
Sentence Classification & Relation Extraction
| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
|---|---|---|---|---|
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | 95.48 | 94.72 | 63.80 | 84.67 |
| ManufactuBERT | 94.62 | 94.68 | 65.80 | 85.03 |
General Domain Evaluation (GLUE)
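ManufactuBERT retains more general language understanding than the other domain-specific models compared: its GLUE average of 81.78 is higher than MatSciBERT (77.00) and SciBERT (78.13), though it remains below the RoBERTa baseline (86.35).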
| Model | GLUE Avg. |
|---|---|
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| ManufactuBERT | 81.78 |
Citation
If you use ManufactuBERT in your research, please cite:
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
Model tree for rarmingaud/ManufactuBERT
Base model
FacebookAI/roberta-base