---
license: mit
library_name: transformers
base_model: jknafou/TransBERT-bio-fr
language:
  - fr
tags:
  - life-sciences
  - clinical
  - biomedical
  - bio
  - medical
  - biology
pipeline_tag: fill-mask
---

# TransBERT-bio-fr

TransBERT-bio-fr is a French biomedical language model pretrained exclusively on synthetically translated PubMed abstracts, using the TransCorpus framework. This model demonstrates that high-quality domain-specific language models can be built for low-resource languages using only machine-translated data.

## Model Details

- **Architecture:** BERT-base (12 layers, 768 hidden size, 12 attention heads, 110M parameters)
- **Tokenizer:** SentencePiece unigram, 32k vocabulary, trained on the synthetic biomedical French corpus
- **Training Data:** 36.4 GB corpus of 22M PubMed abstracts translated from English to French, available here: TransCorpus-bio-fr 🤗
- **Translation Model:** M2M-100 (1.2B) via the TransCorpus toolkit
- **Domain:** Biomedical, clinical, life sciences (French)
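
A quick sanity check of these specifications using the standard `transformers` API; the expected values in the comments follow from the list above:

```python
from transformers import AutoConfig, AutoTokenizer

# Inspect the published checkpoint's geometry and vocabulary
config = AutoConfig.from_pretrained("jknafou/TransBERT-bio-fr")
print(config.num_hidden_layers)    # expected: 12
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
print(tokenizer.vocab_size)        # expected: 32000 (32k SentencePiece unigram vocab)
```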

## Motivation

The lack of large-scale, high-quality biomedical corpora in French has historically limited the development of domain-specific language models. TransBERT-bio-fr addresses this gap by leveraging recent advances in neural machine translation to generate a massive, high-quality synthetic corpus, making robust French biomedical NLP possible.

## How to Use

```python
from transformers import AutoModel, AutoTokenizer

# Load the pretrained tokenizer and encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")

# Encode a French clinical sentence and run a forward pass
text = "Le patient présente une tachycardie supraventriculaire."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state holds the contextual embeddings
```
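
Since the model was pretrained with masked-language modeling (the card's pipeline tag is `fill-mask`), it can also be queried directly through the `fill-mask` pipeline. A minimal sketch; the predictions shown are illustrative, and the mask token is read from the tokenizer rather than hard-coded:

```python
from transformers import pipeline

# Fill-mask pipeline: loads the checkpoint together with its MLM head
fill_mask = pipeline("fill-mask", model="jknafou/TransBERT-bio-fr")

# Mask the diagnosis term and let the model rank completions
text = f"Le patient présente une {fill_mask.tokenizer.mask_token} supraventriculaire."
for pred in fill_mask(text, top_k=5):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```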

## Key Results

TransBERT-bio-fr sets a new state-of-the-art (SOTA) on the French biomedical benchmark DrBenchmark, outperforming both general-domain (CamemBERT) and previous domain-specific (DrBERT) models on classification, NER, POS, and STS tasks.

| Task | CamemBERT | DrBERT | TransBERT |
|------|-----------|--------|-----------|
| Classification (F1) | 74.17 | 73.73 | 75.71* |
| NER (F1) | 81.55 | 80.88 | 83.15* |
| POS (F1) | 98.29 | 98.18* | 98.31 |
| STS (R²) | 83.38 | 73.56* | 83.04 |

*Statistically significant (Friedman & Nemenyi test, p < 0.01).
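
To evaluate on a downstream task such as classification, the encoder is loaded with a task-specific head and fine-tuned. A minimal sketch, not the exact DrBenchmark protocol; `num_labels=3` is a hypothetical value that depends on your dataset:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pretrained encoder + freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "jknafou/TransBERT-bio-fr",
    num_labels=3,  # hypothetical: set to the number of classes in your task
)
tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
# Fine-tune with transformers.Trainer or a custom training loop.
```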

## Paper to be submitted to EMNLP 2025

TransCorpus enables the training of state-of-the-art language models through synthetic translation; TransBERT, pretrained on a corpus translated with this toolkit, demonstrates the approach. A paper detailing these results will be submitted to EMNLP 2025. 📝 [Current Paper Version](https://transbert.s3.text-analytics.ch/TransBERT.pdf)

## Why Synthetic Translation?

- **Scalable:** Enables pretraining on gigabytes of text for any language with a strong MT system.
- **Effective:** Outperforms models trained on native-language data on key biomedical tasks.
- **Accessible:** Makes high-quality domain-specific PLMs possible for low-resource languages.

## 🔗 Related Resources

This model was pretrained on large-scale synthetic French biomedical data generated using TransCorpus, an open-source toolkit for scalable, parallel translation and preprocessing. For source code, data recipes, and reproducible pipelines, visit the TransCorpus GitHub repository. If you use this model, please cite:

```bibtex
@misc{knafou-transbert,
    author = {Knafou, Julien and Mottin, Luc and Mottaz, Ana\"{i}s and Flament, Alexandre and Ruch, Patrick},
    title = {TransBERT: A Framework for Synthetic Translation in Domain-Specific Language Modeling},
    year = {2025},
    note = {Submitted to EMNLP 2025. Anonymous ACL submission available at the URL below.},
    url = {https://transbert.s3.text-analytics.ch/TransBERT.pdf},
}
```