---
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
metrics:
- f1
base_model:
- FacebookAI/roberta-base
library_name: transformers
tags:
- manufacturing
---

# ManufactuBERT

ManufactuBERT is a specialized **RoBERTa-base** model continually pretrained on a high-quality corpus curated for the **manufacturing domain**. Thanks to a rigorous data-selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

## Model Details
- **Developed by:** Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- **Language:** English
- **Domain:** Manufacturing and Industrial Engineering

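Since the checkpoint is a standard `transformers` masked language model, it can be tried directly with the `fill-mask` pipeline. A minimal sketch follows; the published Hub repo id for ManufactuBERT is not listed in this card, so the stated base model `FacebookAI/roberta-base` stands in here — replace it with the actual ManufactuBERT repo id:

```python
from transformers import pipeline

# Stand-in checkpoint: swap in the ManufactuBERT Hub repo id once known.
fill_mask = pipeline("fill-mask", model="FacebookAI/roberta-base")

preds = fill_mask("The CNC <mask> removes material from the workpiece.")
for p in preds:
    print(f"{p['token_str'].strip()!r}: {p['score']:.3f}")
```

By default the pipeline returns the top 5 candidate tokens for the `<mask>` position, each with a softmax score.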
## Training Data
The first ManufactuBERT model was continually pretrained on **Manu-FineWeb**, a 10-billion-token subset of the FineWeb dataset. The final model version used a further refined version (~2 billion tokens) created through a multi-stage process using **MinHash** and **SemDeDup** to remove approximately 80% of redundant data. This process accelerated convergence by approximately 33% compared to the non-deduplicated version.

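To illustrate the first deduplication stage, here is a toy MinHash sketch (not the paper's exact pipeline): documents are shingled into word 3-grams, each shingle set is compressed into a fixed-length signature, and signature agreement estimates Jaccard similarity between documents. The semantic stage (SemDeDup, which prunes embedding-space near-duplicates) is not shown.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 128  # signature length; more hashes -> lower estimation variance

def shingles(text, n=3):
    """Set of word n-grams for one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set):
    """One min-value per salted hash function approximates a random permutation."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

docs = [
    "the milling machine removes material from the metal workpiece surface",
    "the milling machine removes material from the steel workpiece surface",  # near-dup
    "quality control inspects every finished assembly before final shipment",
]
sigs = [minhash_signature(shingles(d)) for d in docs]
for i, j in combinations(range(len(docs)), 2):
    print(i, j, round(estimated_jaccard(sigs[i], sigs[j]), 2))
```

Pairs whose estimated similarity exceeds a chosen threshold are treated as duplicates, and all but one member of each duplicate group is dropped; at corpus scale this is typically combined with locality-sensitive hashing so that only candidate pairs are compared.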
## Named Entity Recognition (NER)
| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemdNER | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | **76.50** | 83.88 | 82.10 | 72.60 | 85.88 | **92.00** | 82.16 |
| **ManufactuBERT** | 75.04 | **84.00** | **84.40** | **73.68** | **86.76** | 91.92 | **82.63** |

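NER fine-tuning attaches a token-classification head on top of the encoder. The sketch below shows the wiring with a tiny random-weight `RobertaConfig` so it runs without downloading checkpoints; in practice you would instead call `RobertaForTokenClassification.from_pretrained(<ManufactuBERT repo id>, num_labels=...)`. The label set is hypothetical.

```python
import torch
from transformers import RobertaConfig, RobertaForTokenClassification

labels = ["O", "B-MAT", "I-MAT"]  # hypothetical manufacturing entity tags

# Tiny config for a self-contained demo; real fine-tuning loads pretrained weights.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=len(labels),
)
model = RobertaForTokenClassification(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))  # one sequence of 8 tokens
out = model(input_ids=input_ids)
print(out.logits.shape)  # one logit vector per token: (batch, seq_len, num_labels)
```

Per-token logits are then trained against IOB-style tags with the usual cross-entropy objective.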
## Sentence Classification & Relation Extraction
| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
| :--- | :---: | :---: | :---: | :---: |
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | **95.48** | **94.72** | 63.80 | 84.67 |
| **ManufactuBERT** | 94.62 | 94.68 | **65.80** | **85.03** |

## General Domain Evaluation (GLUE)
The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model | GLUE Avg. |
| :--- | :---: |
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| **ManufactuBERT** | **81.78** |

## Citation
If you use ManufactuBERT in your research, please cite:
```bibtex
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
```