---
license: gpl-3.0
language:
  - nl
base_model:
  - CLTL/MedRoBERTa.nl
tags:
  - medical
  - healthcare
metrics:
  - perplexity
library_name: transformers
---

# CardioBERTa_base.nl

Continued, on-premise pre-training of MedRoBERTa.nl on roughly 50 GB of open Dutch corpora and English corpora translated into Dutch.
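Since the card lists `transformers` as the library, the model can presumably be queried with the standard fill-mask pipeline. A minimal sketch, assuming the model id `UMCU/CardioBERTa_base.nl` (inferred from the repository name, not stated in this card) and an illustrative example sentence:

```python
from transformers import pipeline

# Hypothetical usage sketch: the model id is assumed from the repo name.
fill = pipeline("fill-mask", model="UMCU/CardioBERTa_base.nl")

# RoBERTa-style models use the <mask> token.
for pred in fill("De patiënt werd opgenomen met <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```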

## Data statistics

Sources:

- Dutch: medical guidelines (FMS, NHG)
- Dutch: NTvG papers
- English: PubMed abstracts
- English: PMC abstracts, translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV

All remaining sources (those not translated with DeepL) were translated with a combination of Gemini Flash 1.5, GPT-4o mini, MarianMT, and NLLB-200.

- Number of tokens: 15B
- Number of documents: 27M

## Training

- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning-rate schedule: linear, with 5,000 warmup steps
- Number of epochs: ~3
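The schedule above can be sketched numerically. The total step count below is a hypothetical placeholder; the card only states the peak learning rate, the warmup length, and that the schedule is linear:

```python
# Sketch of the stated schedule: linear warmup to a peak of 2e-4 over
# 5,000 steps, then linear decay. total_steps is a placeholder, not
# a value from the card.
PEAK_LR = 2e-4
WARMUP_STEPS = 5_000

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given step under linear warmup + linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = total_steps - step
    return PEAK_LR * max(0.0, remaining / (total_steps - WARMUP_STEPS))

# Shape of the schedule over a hypothetical 100k-step run:
print(lr_at(2_500, 100_000))    # halfway through warmup -> 1e-4
print(lr_at(5_000, 100_000))    # peak -> 2e-4
print(lr_at(100_000, 100_000))  # end of training -> 0.0
```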

- Train perplexity: 3.0
- Validation perplexity: 3.0
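For reference, perplexity is the exponential of the mean cross-entropy loss, so a perplexity of 3.0 corresponds to a masked-LM loss of about 1.10 nats per predicted token:

```python
import math

# Perplexity = exp(mean cross-entropy loss); invert the reported value.
perplexity = 3.0
loss = math.log(perplexity)  # ~1.0986 nats per predicted token
print(round(loss, 4))
```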

## Acknowledgement

We gratefully acknowledge the Google TPU Research Cloud for providing the compute used to train this model.