---
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
metrics:
- f1
base_model:
- FacebookAI/roberta-base
library_name: transformers
tags:
- manufacturing
---

# ManufactuBERT

ManufactuBERT is a **RoBERTa-base** model continually pretrained on a high-quality corpus curated for the **manufacturing domain**. Thanks to a rigorous data selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

## Model Details

- **Developed by:** Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- **Language:** English
- **Domain:** Manufacturing and Industrial Engineering

## Training Data

The first ManufactuBERT model was continually pretrained on **Manu-FineWeb**, a 10-billion-token subset of the FineWeb dataset. The final model version was trained on a further refined subset (~2 billion tokens), produced by a multi-stage deduplication pipeline combining **MinHash** and **SemDeDup** that removed approximately 80% of redundant data. This deduplication accelerated convergence by approximately 33% compared to training on the non-deduplicated corpus.

## Named Entity Recognition (NER)

Results below are F1 scores.

| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemdNER | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | **76.50** | 83.88 | 82.10 | 72.60 | 85.88 | **92.00** | 82.16 |
| **ManufactuBERT** | 75.04 | **84.00** | **84.40** | **73.68** | **86.76** | 91.92 | **82.63** |

## Sentence Classification & Relation Extraction

| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
| :--- | :---: | :---: | :---: | :---: |
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | **95.48** | **94.72** | 63.80 | 84.67 |
| **ManufactuBERT** | 94.62 | 94.68 | **65.80** | **85.03** |

## General Domain Evaluation (GLUE)

The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model | GLUE Avg. |
| :--- | :---: |
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| **ManufactuBERT** | **81.78** |

## Citation

If you use ManufactuBERT in your research, please cite:

```bibtex
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
```
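## How to Use

The card does not yet include a usage snippet. A minimal sketch with the 🤗 Transformers `fill-mask` pipeline is shown below; note that `REPO_ID/ManufactuBERT` is a placeholder, not the published Hub identifier — substitute the model's actual repository id.

```python
from transformers import pipeline

# NOTE: "REPO_ID/ManufactuBERT" is a placeholder for this sketch —
# replace it with the model's actual Hugging Face Hub identifier.
fill_mask = pipeline("fill-mask", model="REPO_ID/ManufactuBERT")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
predictions = fill_mask("The CNC <mask> removes material from the workpiece.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

Because ManufactuBERT keeps the RoBERTa-base architecture, it can also be loaded with `AutoModelForTokenClassification` or `AutoModelForSequenceClassification` for fine-tuning on the NER and classification tasks reported above.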
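For readers unfamiliar with the MinHash stage of the deduplication pipeline described in the Training Data section, the idea can be sketched in pure Python. This is an illustrative toy, not the paper's implementation: the shingle size `k=3`, the 64 permutations, and all function names are choices of this sketch.

```python
import hashlib
import random

def shingles(text, k=3):
    """Character-level k-gram shingles of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(doc_shingles, num_perm=64, seed=0):
    """MinHash signature: for each of num_perm salted hash functions,
    keep the minimum hash value observed over the document's shingles."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    signature = []
    for salt in salts:
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates the Jaccard
    similarity of the two shingle sets; near-duplicates score close to 1."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds a threshold are treated as near-duplicates and pruned; SemDeDup then removes a further layer of semantic (embedding-space) duplicates that surface-level shingling misses.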