---
license: mit
library_name: transformers
base_model: jknafou/TransBERT-bio-fr
language:
- fr
tags:
- life-sciences
- clinical
- biomedical
- bio
- medical
- biology
pipeline_tag: fill-mask
---
# TransBERT-bio-fr
TransBERT-bio-fr is a French biomedical language model pretrained exclusively on synthetically translated PubMed abstracts, using the TransCorpus framework. This model demonstrates that high-quality domain-specific language models can be built for low-resource languages using only machine-translated data.
# Model Details
- **Architecture**: BERT-base (12 layers, 768 hidden, 12 heads, 110M parameters)
- **Tokenizer**: SentencePiece unigram, 32k vocab, trained on synthetic biomedical French
- **Training Data**: 36.4GB corpus, 22M PubMed abstracts, translated from English to French, available here: [TransCorpus-bio-fr 🤗](https://huggingface.co/datasets/jknafou/TransCorpus-bio-fr)
- **Translation Model**: M2M-100 (1.2B) using [TransCorpus Toolkit](https://github.com/jknafou/TransCorpus)
- **Domain**: Biomedical, clinical, life sciences (French)
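A quick way to confirm these architecture details without downloading the full weights is to inspect the model configuration (a sketch; the commented values reflect the figures stated above):

```python
from transformers import AutoConfig

# Fetches only the small config file, not the model weights
config = AutoConfig.from_pretrained("jknafou/TransBERT-bio-fr")

print(config.num_hidden_layers)    # 12 layers, per the card
print(config.hidden_size)          # 768 hidden dimensions
print(config.num_attention_heads)  # 12 attention heads
```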
# Motivation
The lack of large-scale, high-quality biomedical corpora in French has historically limited the development of domain-specific language models. TransBERT-bio-fr addresses this gap by leveraging recent advances in neural machine translation to generate a massive, high-quality synthetic corpus, making robust French biomedical NLP possible.
# How to Use
Loading the model and tokenizer:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")
```
Performing the mask-filling task:
```python
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="jknafou/TransBERT-bio-fr", tokenizer="jknafou/TransBERT-bio-fr")
results = fill_mask("L’insuline est une hormone produite par le <mask> et régule la glycémie.")
# [{'score': 0.6606941223144531,
# 'token': 486,
# 'token_str': 'foie',
# 'sequence': 'L’insuline est une hormone produite par le foie et régule la glycémie.'},
# {'score': 0.172934889793396,
# 'token': 2642,
# 'token_str': 'pancréas',
# 'sequence': 'L’insuline est une hormone produite par le pancréas et régule la glycémie.'},
# {'score': 0.08486421406269073,
# 'token': 488,
# 'token_str': 'cerveau',
# 'sequence': 'L’insuline est une hormone produite par le cerveau et régule la glycémie.'},
# {'score': 0.017183693125844002,
# 'token': 2092,
# 'token_str': 'cœur',
# 'sequence': 'L’insuline est une hormone produite par le cœur et régule la glycémie.'},
# {'score': 0.009480085223913193,
# 'token': 712,
# 'token_str': 'corps',
# 'sequence': 'L’insuline est une hormone produite par le corps et régule la glycémie.'}]
```
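Beyond mask filling, the encoder can serve as a feature extractor for downstream French biomedical tasks. The sketch below mean-pools the last hidden states into one vector per sentence; the pooling strategy is an illustrative choice, not a recommendation from the paper:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")
model.eval()

sentences = [
    "Le pancréas sécrète l'insuline.",
    "L'insuline régule la glycémie.",
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to get one embedding per sentence
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```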
# Key Results
TransBERT-bio-fr sets a new state-of-the-art (SOTA) on the French biomedical benchmark DrBenchmark, outperforming both general-domain (CamemBERT) and previous domain-specific (DrBERT) models on classification, NER, POS, and STS tasks.
| Task | CamemBERT | DrBERT | TransBERT |
| -------------------- | ----------- | ----------- |----------- |
| Classification (F1) | 74.17 | 73.73 | **75.71<sup>*</sup>** |
| NER (F1) | 81.55 | 80.88 | **83.15<sup>*</sup>** |
| POS (F1) | 98.29 | 98.18<sup>*</sup> | **98.31** |
| STS (R²) | **83.38** | 73.56<sup>*</sup> | 83.04 |
<sup>*</sup>Statistically significant (Friedman & Nemenyi tests, p<0.01).
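Reproducing these numbers requires the DrBenchmark datasets and the paper's fine-tuning recipe, but the basic setup is the standard `transformers` classification head. A minimal sketch with a hypothetical two-label task (the label count is illustrative, not from the benchmark):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")

# Adds a randomly initialized classification head on top of the
# pretrained encoder; num_labels is task-specific.
model = AutoModelForSequenceClassification.from_pretrained(
    "jknafou/TransBERT-bio-fr", num_labels=2
)
```

From here, fine-tuning proceeds with the usual `Trainer` or a custom training loop over the task's labeled data.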
# Paper Published at EMNLP 2025
TransCorpus enables the training of state-of-the-art language models through synthetic translation; TransBERT, for example, achieves its performance by leveraging corpus translation with this toolkit. The paper detailing these results was published in the Findings of EMNLP 2025. 📝 [Paper (PDF)](https://transbert.s3.text-analytics.ch/TransBERT.pdf)
# Why Synthetic Translation?
- **Scalable**: Enables pretraining on gigabytes of text for any language with a strong MT system.
- **Effective**: Outperforms models trained on native data in key biomedical tasks.
- **Accessible**: Makes high-quality domain-specific PLMs possible for low-resource languages.
# 🔗 Related Resources
This model was pretrained on large-scale synthetic French biomedical data generated using [TransCorpus](https://github.com/jknafou/TransCorpus), an open-source toolkit for scalable, parallel translation and preprocessing.
For source code, data recipes, and reproducible pipelines, visit the [TransCorpus GitHub repository](https://github.com/jknafou/TransCorpus). If you use this model, please cite:
```text
@inproceedings{knafou-etal-2025-transbert,
title = "{T}rans{BERT}: A Framework for Synthetic Translation in Domain-Specific Language Modeling",
author = {Knafou, Julien and
Mottin, Luc and
Mottaz, Ana{\"i}s and
Flament, Alexandre and
Ruch, Patrick},
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1053/",
doi = "10.18653/v1/2025.findings-emnlp.1053",
pages = "19338--19354",
ISBN = "979-8-89176-335-7",
abstract = "The scarcity of non-English language data in specialized domains significantly limits the development of effective Natural Language Processing (NLP) tools. We present TransBERT, a novel framework for pre-training language models using exclusively synthetically translated text, and introduce TransCorpus, a scalable translation toolkit. Focusing on the life sciences domain in French, our approach demonstrates that state-of-the-art performance on various downstream tasks can be achieved solely by leveraging synthetically translated data. We release the TransCorpus toolkit, the TransCorpus-bio-fr corpus (36.4GB of French life sciences text), TransBERT-bio-fr, its associated pre-trained language model and reproducible code for both pre-training and fine-tuning. Our results highlight the viability of synthetic translation in a high-resource translation direction for building high-quality NLP resources in low-resource language/domain pairs."
}
```