---
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
metrics:
- f1
base_model:
- FacebookAI/roberta-base
library_name: transformers
tags:
- manufacturing
---

# ManufactuBERT

ManufactuBERT is a **RoBERTa-base** model continually pretrained on a high-quality corpus curated for the **manufacturing domain**. Thanks to a rigorous data selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

## Model Details

- **Developed by:** Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- **Language:** English
- **Domain:** Manufacturing and Industrial Engineering

## Training Data

The first ManufactuBERT model was continually pretrained on **Manu-FineWeb**, a 10-billion-token subset of the FineWeb dataset. The final model version was trained on a further refined subset (~2 billion tokens), produced by a multi-stage deduplication pipeline combining **MinHash** and **SemDeDup** that removed approximately 80% of redundant data. This deduplication accelerated convergence by approximately 33% compared to training on the non-deduplicated corpus.

## Named Entity Recognition (NER)

Results below are F1 scores.

| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemdNER | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | **76.50** | 83.88 | 82.10 | 72.60 | 85.88 | **92.00** | 82.16 |
| **ManufactuBERT** | 75.04 | **84.00** | **84.40** | **73.68** | **86.76** | 91.92 | **82.63** |

## Sentence Classification & Relation Extraction

| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
| :--- | :---: | :---: | :---: | :---: |
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | **95.48** | **94.72** | 63.80 | 84.67 |
| **ManufactuBERT** | 94.62 | 94.68 | **65.80** | **85.03** |

## General Domain Evaluation (GLUE)

The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model | GLUE Avg. |
| :--- | :---: |
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| **ManufactuBERT** | **81.78** |

## Citation

If you use ManufactuBERT in your research, please cite:

```bibtex
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
```
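## How to Use

The card does not yet include a usage snippet. A minimal sketch with the 🤗 Transformers `fill-mask` pipeline is shown below; note that `REPO_ID/ManufactuBERT` is a placeholder, not the published Hub identifier — substitute the model's actual repository id.

```python
from transformers import pipeline

# NOTE: "REPO_ID/ManufactuBERT" is a placeholder for this sketch —
# replace it with the model's actual Hugging Face Hub identifier.
fill_mask = pipeline("fill-mask", model="REPO_ID/ManufactuBERT")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
predictions = fill_mask("The CNC <mask> removes material from the workpiece.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

Because ManufactuBERT keeps the RoBERTa-base architecture, it can also be loaded with `AutoModelForTokenClassification` or `AutoModelForSequenceClassification` for fine-tuning on the NER and classification tasks reported above.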
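For readers unfamiliar with the MinHash stage of the deduplication pipeline described in the Training Data section, the idea can be sketched in pure Python. This is an illustrative toy, not the paper's implementation: the shingle size `k=3`, the 64 permutations, and all function names are choices of this sketch.

```python
import hashlib
import random

def shingles(text, k=3):
    """Character-level k-gram shingles of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(doc_shingles, num_perm=64, seed=0):
    """MinHash signature: for each of num_perm salted hash functions,
    keep the minimum hash value observed over the document's shingles."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    signature = []
    for salt in salts:
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates the Jaccard
    similarity of the two shingle sets; near-duplicates score close to 1."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds a threshold are treated as near-duplicates and pruned; SemDeDup then removes a further layer of semantic (embedding-space) duplicates that surface-level shingling misses.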