---
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
metrics:
- f1
base_model:
- FacebookAI/roberta-base
library_name: transformers
tags:
- manufacturing
---

# ManufactuBERT

ManufactuBERT is a specialized **RoBERTa-base** model continually pretrained on a high-quality corpus curated for the **manufacturing domain**. Thanks to a rigorous data-selection and deduplication pipeline, it achieves state-of-the-art performance on industrial NLP tasks while being more efficient to train than standard domain-adapted models.

## Model Details
- **Developed by:** Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- **Language:** English
- **Domain:** Manufacturing and Industrial Engineering

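Since the checkpoint is a standard `transformers` masked language model, it can be tried directly with the `fill-mask` pipeline. A minimal sketch follows; the published Hub repo id for ManufactuBERT is not listed in this card, so the stated base model `FacebookAI/roberta-base` stands in here — replace it with the actual ManufactuBERT repo id:

```python
from transformers import pipeline

# Stand-in checkpoint: swap in the ManufactuBERT Hub repo id once known.
fill_mask = pipeline("fill-mask", model="FacebookAI/roberta-base")

preds = fill_mask("The CNC <mask> removes material from the workpiece.")
for p in preds:
    print(f"{p['token_str'].strip()!r}: {p['score']:.3f}")
```

By default the pipeline returns the top 5 candidate tokens for the `<mask>` position, each with a softmax score.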
## Training Data
The first ManufactuBERT model was continually pretrained on **Manu-FineWeb**, a 10-billion-token subset of the FineWeb dataset. The final model version used a further refined version (~2 billion tokens) created through a multi-stage process using **MinHash** and **SemDeDup** to remove approximately 80% of redundant data. This process accelerated convergence by approximately 33% compared to the non-deduplicated version.

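To illustrate the first deduplication stage, here is a toy MinHash sketch (not the paper's exact pipeline): documents are shingled into word 3-grams, each shingle set is compressed into a fixed-length signature, and signature agreement estimates Jaccard similarity between documents. The semantic stage (SemDeDup, which prunes embedding-space near-duplicates) is not shown.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 128  # signature length; more hashes -> lower estimation variance

def shingles(text, n=3):
    """Set of word n-grams for one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set):
    """One min-value per salted hash function approximates a random permutation."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

docs = [
    "the milling machine removes material from the metal workpiece surface",
    "the milling machine removes material from the steel workpiece surface",  # near-dup
    "quality control inspects every finished assembly before final shipment",
]
sigs = [minhash_signature(shingles(d)) for d in docs]
for i, j in combinations(range(len(docs)), 2):
    print(i, j, round(estimated_jaccard(sigs[i], sigs[j]), 2))
```

Pairs whose estimated similarity exceeds a chosen threshold are treated as duplicates, and all but one member of each duplicate group is dropped; at corpus scale this is typically combined with locality-sensitive hashing so that only candidate pairs are compared.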
## Named Entity Recognition (NER)
| Model | Mat. Syn. | FabNER | SOFC | SOFC Slot | MatScholar | ChemdNER | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RoBERTa | 73.12 | 82.48 | 82.54 | 69.52 | 84.04 | 90.50 | 80.37 |
| MatSciBERT | **76.50** | 83.88 | 82.10 | 72.60 | 85.88 | **92.00** | 82.16 |
| **ManufactuBERT** | 75.04 | **84.00** | **84.40** | **73.68** | **86.76** | 91.92 | **82.63** |

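NER fine-tuning attaches a token-classification head on top of the encoder. The sketch below shows the wiring with a tiny random-weight `RobertaConfig` so it runs without downloading checkpoints; in practice you would instead call `RobertaForTokenClassification.from_pretrained(<ManufactuBERT repo id>, num_labels=...)`. The label set is hypothetical.

```python
import torch
from transformers import RobertaConfig, RobertaForTokenClassification

labels = ["O", "B-MAT", "I-MAT"]  # hypothetical manufacturing entity tags

# Tiny config for a self-contained demo; real fine-tuning loads pretrained weights.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=len(labels),
)
model = RobertaForTokenClassification(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))  # one sequence of 8 tokens
out = model(input_ids=input_ids)
print(out.logits.shape)  # one logit vector per token: (batch, seq_len, num_labels)
```

Per-token logits are then trained against IOB-style tags with the usual cross-entropy objective.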
## Sentence Classification & Relation Extraction
| Model | Mat. Syn. (RE) | SOFC (SC) | Big Patent (SC) | Avg. |
| :--- | :---: | :---: | :---: | :---: |
| RoBERTa | 94.32 | 94.32 | 63.58 | 84.07 |
| MatSciBERT | **95.48** | **94.72** | 63.80 | 84.67 |
| **ManufactuBERT** | 94.62 | 94.68 | **65.80** | **85.03** |

## General Domain Evaluation (GLUE)
The model preserves general language understanding capabilities significantly better than other domain-specific models.

| Model | GLUE Avg. |
| :--- | :---: |
| RoBERTa | 86.35 |
| MatSciBERT | 77.00 |
| SciBERT | 78.13 |
| **ManufactuBERT** | **81.78** |

## Citation
If you use ManufactuBERT in your research, please cite:
```bibtex
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
```