---
license: mit
library_name: transformers
base_model: jknafou/TransBERT-bio-fr
language:
- fr
tags:
- life-sciences
- clinical
- biomedical
- bio
- medical
- biology
pipeline_tag: fill-mask
---

# TransBERT-bio-fr

TransBERT-bio-fr is a French biomedical language model pretrained exclusively on synthetically translated PubMed abstracts, using the TransCorpus framework. This model demonstrates that high-quality domain-specific language models can be built for low-resource languages using only machine-translated data.

# Model Details

- **Architecture**: BERT-base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- **Tokenizer**: SentencePiece unigram, 32k vocabulary, trained on synthetic biomedical French
- **Training Data**: 36.4GB corpus of 22M PubMed abstracts, translated from English to French, available here: [TransCorpus-bio-fr 🤗](https://huggingface.co/datasets/jknafou/TransCorpus-bio-fr)
- **Translation Model**: M2M-100 (1.2B), run with the [TransCorpus Toolkit](https://github.com/jknafou/TransCorpus)
- **Domain**: Biomedical, clinical, life sciences (French)
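
As a sanity check, the 110M figure follows from the BERT-base dimensions listed above. The arithmetic below is a sketch: it assumes the standard BERT layout (512 learned positions, 2 token types, a pooler head, and bias terms on every projection), none of which is stated explicitly in this card.

```python
# Back-of-the-envelope parameter count for BERT-base with a 32k vocabulary.
# Assumes the standard BERT layout: 512 positions, 2 token types, a pooler
# head, and biases everywhere (an assumption, not stated in the card).
V, P, T = 32_000, 512, 2      # vocab, max positions, token types
H, L, F = 768, 12, 4 * 768    # hidden size, layers, feed-forward size

embeddings = (V + P + T) * H + 2 * H          # three lookup tables + LayerNorm
attention  = 4 * (H * H + H)                  # Q, K, V, output projections
ffn        = (H * F + F) + (F * H + H)        # two dense layers
per_layer  = attention + ffn + 2 * 2 * H      # plus two LayerNorms
pooler     = H * H + H

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")       # ~110M
```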

# Motivation

The lack of large-scale, high-quality biomedical corpora in French has historically limited the development of domain-specific language models. TransBERT-bio-fr addresses this gap by leveraging recent advances in neural machine translation to generate a massive, high-quality synthetic corpus, making robust French biomedical NLP possible.

# How to Use

Load the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")
```

Run the fill-mask task:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="jknafou/TransBERT-bio-fr",
    tokenizer="jknafou/TransBERT-bio-fr",
)
results = fill_mask("L’insuline est une hormone produite par le <mask> et régule la glycémie.")
# [{'score': 0.6606941223144531,
#   'token': 486,
#   'token_str': 'foie',
#   'sequence': 'L’insuline est une hormone produite par le foie et régule la glycémie.'},
#  {'score': 0.172934889793396,
#   'token': 2642,
#   'token_str': 'pancréas',
#   'sequence': 'L’insuline est une hormone produite par le pancréas et régule la glycémie.'},
#  {'score': 0.08486421406269073,
#   'token': 488,
#   'token_str': 'cerveau',
#   'sequence': 'L’insuline est une hormone produite par le cerveau et régule la glycémie.'},
#  {'score': 0.017183693125844002,
#   'token': 2092,
#   'token_str': 'cœur',
#   'sequence': 'L’insuline est une hormone produite par le cœur et régule la glycémie.'},
#  {'score': 0.009480085223913193,
#   'token': 712,
#   'token_str': 'corps',
#   'sequence': 'L’insuline est une hormone produite par le corps et régule la glycémie.'}]
```
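
Beyond fill-mask, the base model's `last_hidden_state` can be turned into sentence embeddings. The snippet below sketches attention-mask-aware mean pooling; it runs on dummy arrays so the pooling logic is clear without downloading the model, and the shapes (batch of 2, 6 tokens, 768 dimensions) are illustrative, not prescribed by this card.

```python
import numpy as np

# Mask-aware mean pooling over token embeddings: padding tokens
# (attention_mask == 0) must not contribute to the sentence vector.
def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, hidden)
    counts = mask.sum(axis=1).clip(min=1e-9)                       # avoid divide-by-zero
    return summed / counts

# Dummy stand-ins for model(**inputs).last_hidden_state and the tokenizer's
# attention_mask; sentence 1 has two padding positions.
hidden = np.random.rand(2, 6, 768)
mask = np.array([[1, 1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 768)
```

With the real model, `hidden` would come from `model(**tokenizer(texts, return_tensors="pt", padding=True)).last_hidden_state`.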

# Key Results

TransBERT-bio-fr sets a new state of the art on the French biomedical benchmark DrBenchmark, outperforming both the general-domain CamemBERT and the previous domain-specific DrBERT on classification, NER, and POS, while remaining competitive with CamemBERT on STS.

| Task                | CamemBERT | DrBERT            | TransBERT             |
| ------------------- | --------- | ----------------- | --------------------- |
| Classification (F1) | 74.17     | 73.73             | **75.71<sup>*</sup>** |
| NER (F1)            | 81.55     | 80.88             | **83.15<sup>*</sup>** |
| POS (F1)            | 98.29     | 98.18<sup>*</sup> | **98.31**             |
| STS (R²)            | **83.38** | 73.56<sup>*</sup> | 83.04                 |

<sup>*</sup>Statistically significant (Friedman & Nemenyi test, p<0.01).

# Paper Published at EMNLP 2025

TransCorpus enables the training of state-of-the-art language models through synthetic translation; TransBERT-bio-fr, pretrained with this toolkit, is one such result. The accompanying paper was published in the Findings of EMNLP 2025. 📝 [Paper PDF](https://transbert.s3.text-analytics.ch/TransBERT.pdf)

# Why Synthetic Translation?

- **Scalable**: Enables pretraining on gigabytes of text for any language with a strong MT system.
- **Effective**: Outperforms models trained on native data on key biomedical tasks.
- **Accessible**: Makes high-quality domain-specific PLMs possible for low-resource languages.

# 🔗 Related Resources

This model was pretrained on large-scale synthetic French biomedical data generated using [TransCorpus](https://github.com/jknafou/TransCorpus), an open-source toolkit for scalable, parallel translation and preprocessing. For source code, data recipes, and reproducible pipelines, visit the [TransCorpus GitHub repository](https://github.com/jknafou/TransCorpus). If you use this model, please cite:

```text
@inproceedings{knafou-etal-2025-transbert,
    title = "{T}rans{BERT}: A Framework for Synthetic Translation in Domain-Specific Language Modeling",
    author = {Knafou, Julien and
      Mottin, Luc and
      Mottaz, Ana{\"i}s and
      Flament, Alexandre and
      Ruch, Patrick},
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1053/",
    doi = "10.18653/v1/2025.findings-emnlp.1053",
    pages = "19338--19354",
    ISBN = "979-8-89176-335-7",
    abstract = "The scarcity of non-English language data in specialized domains significantly limits the development of effective Natural Language Processing (NLP) tools. We present TransBERT, a novel framework for pre-training language models using exclusively synthetically translated text, and introduce TransCorpus, a scalable translation toolkit. Focusing on the life sciences domain in French, our approach demonstrates that state-of-the-art performance on various downstream tasks can be achieved solely by leveraging synthetically translated data. We release the TransCorpus toolkit, the TransCorpus-bio-fr corpus (36.4GB of French life sciences text), TransBERT-bio-fr, its associated pre-trained language model and reproducible code for both pre-training and fine-tuning. Our results highlight the viability of synthetic translation in a high-resource translation direction for building high-quality NLP resources in low-resource language/domain pairs."
}
```
|