---
license: mit
library_name: transformers
base_model: jknafou/TransBERT-bio-fr
language:
- fr
tags:
- life-sciences
- clinical
- biomedical
- bio
- medical
- biology
pipeline_tag: fill-mask
---

# TransBERT-bio-fr

TransBERT-bio-fr is a French biomedical language model pretrained exclusively on synthetically translated PubMed abstracts, using the TransCorpus framework. This model demonstrates that high-quality domain-specific language models can be built for low-resource languages using only machine-translated data.

# Model Details

- **Architecture**: BERT-base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- **Tokenizer**: SentencePiece unigram, 32k vocabulary, trained on synthetic biomedical French
- **Training Data**: 36.4GB corpus of 22M PubMed abstracts, translated from English to French, available here: [TransCorpus-bio-fr 🤗](https://huggingface.co/datasets/jknafou/TransCorpus-bio-fr)
- **Translation Model**: M2M-100 (1.2B), run with the [TransCorpus Toolkit](https://github.com/jknafou/TransCorpus)
- **Domain**: Biomedical, clinical, life sciences (French)
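
As a sanity check, the 110M figure follows from the BERT-base dimensions listed above. The arithmetic below is a sketch: it assumes the standard BERT layout (512 learned positions, 2 token types, a pooler head, and bias terms on every projection), none of which is stated explicitly in this card.

```python
# Back-of-the-envelope parameter count for BERT-base with a 32k vocabulary.
# Assumes the standard BERT layout: 512 positions, 2 token types, a pooler
# head, and biases everywhere (an assumption, not stated in the card).
V, P, T = 32_000, 512, 2      # vocab, max positions, token types
H, L, F = 768, 12, 4 * 768    # hidden size, layers, feed-forward size

embeddings = (V + P + T) * H + 2 * H          # three lookup tables + LayerNorm
attention  = 4 * (H * H + H)                  # Q, K, V, output projections
ffn        = (H * F + F) + (F * H + H)        # two dense layers
per_layer  = attention + ffn + 2 * 2 * H      # plus two LayerNorms
pooler     = H * H + H

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")       # ~110M
```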

# Motivation

The lack of large-scale, high-quality biomedical corpora in French has historically limited the development of domain-specific language models. TransBERT-bio-fr addresses this gap by leveraging recent advances in neural machine translation to generate a massive, high-quality synthetic corpus, making robust French biomedical NLP possible.

# How to Use

Load the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")
```

Run the fill-mask task:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="jknafou/TransBERT-bio-fr",
    tokenizer="jknafou/TransBERT-bio-fr",
)
results = fill_mask("L’insuline est une hormone produite par le <mask> et régule la glycémie.")
# [{'score': 0.6606941223144531,
#   'token': 486,
#   'token_str': 'foie',
#   'sequence': 'L’insuline est une hormone produite par le foie et régule la glycémie.'},
#  {'score': 0.172934889793396,
#   'token': 2642,
#   'token_str': 'pancréas',
#   'sequence': 'L’insuline est une hormone produite par le pancréas et régule la glycémie.'},
#  {'score': 0.08486421406269073,
#   'token': 488,
#   'token_str': 'cerveau',
#   'sequence': 'L’insuline est une hormone produite par le cerveau et régule la glycémie.'},
#  {'score': 0.017183693125844002,
#   'token': 2092,
#   'token_str': 'cœur',
#   'sequence': 'L’insuline est une hormone produite par le cœur et régule la glycémie.'},
#  {'score': 0.009480085223913193,
#   'token': 712,
#   'token_str': 'corps',
#   'sequence': 'L’insuline est une hormone produite par le corps et régule la glycémie.'}]
```
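
Beyond fill-mask, the base model's `last_hidden_state` can be turned into sentence embeddings. The snippet below sketches attention-mask-aware mean pooling; it runs on dummy arrays so the pooling logic is clear without downloading the model, and the shapes (batch of 2, 6 tokens, 768 dimensions) are illustrative, not prescribed by this card.

```python
import numpy as np

# Mask-aware mean pooling over token embeddings: padding tokens
# (attention_mask == 0) must not contribute to the sentence vector.
def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, hidden)
    counts = mask.sum(axis=1).clip(min=1e-9)                       # avoid divide-by-zero
    return summed / counts

# Dummy stand-ins for model(**inputs).last_hidden_state and the tokenizer's
# attention_mask; sentence 1 has two padding positions.
hidden = np.random.rand(2, 6, 768)
mask = np.array([[1, 1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 768)
```

With the real model, `hidden` would come from `model(**tokenizer(texts, return_tensors="pt", padding=True)).last_hidden_state`.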

# Key Results

TransBERT-bio-fr sets a new state of the art on the French biomedical benchmark DrBenchmark, outperforming both the general-domain CamemBERT and the previous domain-specific DrBERT on classification, NER, and POS, while remaining competitive with CamemBERT on STS.

| Task                | CamemBERT | DrBERT            | TransBERT             |
| ------------------- | --------- | ----------------- | --------------------- |
| Classification (F1) | 74.17     | 73.73             | **75.71<sup>*</sup>** |
| NER (F1)            | 81.55     | 80.88             | **83.15<sup>*</sup>** |
| POS (F1)            | 98.29     | 98.18<sup>*</sup> | **98.31**             |
| STS (R²)            | **83.38** | 73.56<sup>*</sup> | 83.04                 |

<sup>*</sup>Statistically significant (Friedman & Nemenyi test, p<0.01).

# Paper Published at EMNLP 2025

TransCorpus enables the training of state-of-the-art language models through synthetic translation; TransBERT-bio-fr, pretrained with this toolkit, is one such result. The accompanying paper was published in the Findings of EMNLP 2025. 📝 [Paper PDF](https://transbert.s3.text-analytics.ch/TransBERT.pdf)

# Why Synthetic Translation?

- **Scalable**: Enables pretraining on gigabytes of text for any language with a strong MT system.
- **Effective**: Outperforms models trained on native data on key biomedical tasks.
- **Accessible**: Makes high-quality domain-specific PLMs possible for low-resource languages.

# 🔗 Related Resources

This model was pretrained on large-scale synthetic French biomedical data generated using [TransCorpus](https://github.com/jknafou/TransCorpus), an open-source toolkit for scalable, parallel translation and preprocessing. For source code, data recipes, and reproducible pipelines, visit the [TransCorpus GitHub repository](https://github.com/jknafou/TransCorpus). If you use this model, please cite:

```text
@inproceedings{knafou-etal-2025-transbert,
    title = "{T}rans{BERT}: A Framework for Synthetic Translation in Domain-Specific Language Modeling",
    author = {Knafou, Julien and
      Mottin, Luc and
      Mottaz, Ana{\"i}s and
      Flament, Alexandre and
      Ruch, Patrick},
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1053/",
    doi = "10.18653/v1/2025.findings-emnlp.1053",
    pages = "19338--19354",
    ISBN = "979-8-89176-335-7",
    abstract = "The scarcity of non-English language data in specialized domains significantly limits the development of effective Natural Language Processing (NLP) tools. We present TransBERT, a novel framework for pre-training language models using exclusively synthetically translated text, and introduce TransCorpus, a scalable translation toolkit. Focusing on the life sciences domain in French, our approach demonstrates that state-of-the-art performance on various downstream tasks can be achieved solely by leveraging synthetically translated data. We release the TransCorpus toolkit, the TransCorpus-bio-fr corpus (36.4GB of French life sciences text), TransBERT-bio-fr, its associated pre-trained language model and reproducible code for both pre-training and fine-tuning. Our results highlight the viability of synthetic translation in a high-resource translation direction for building high-quality NLP resources in low-resource language/domain pairs."
}
```
|