---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, on-premise pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) on roughly 50GB of open Dutch corpora and English corpora machine-translated into Dutch.
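Since the card lists `library_name: transformers`, the checkpoint should load with the standard masked-LM classes. A minimal usage sketch, with a hypothetical repository id:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Hypothetical repository id -- replace with this model's actual id on the Hub.
model_id = "UMCU/MedRoBERTa.nl-continued"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# RoBERTa-style models use "<mask>" as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("De patiënt kreeg <mask> voorgeschreven tegen de pijn."))
```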

# Data statistics

Sources:
* Dutch: medical guidelines (FMS, NHG)
* Dutch: [NtvG](https://www.ntvg.nl/) papers
* English: PubMed abstracts
* English: PMC abstracts translated using DeepL
* English: Apollo guidelines, papers and books
* English: Meditron guidelines
* English: MIMIC-III
* English: MIMIC-CXR
* English: MIMIC-IV

All English sources not already translated with DeepL were translated using a combination of Gemini Flash 1.5, GPT-4o mini, MarianNMT, and NLLB200.

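As an illustration of that machine-translation step, here is a minimal sketch using the publicly available distilled NLLB200 checkpoint via `transformers`; the exact checkpoints and settings used to build the corpus are not stated in this card:

```python
from transformers import pipeline

# Illustrative only: the distilled 600M NLLB200 checkpoint, translating
# English ("eng_Latn") into Dutch ("nld_Latn").
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

result = translator("The patient was prescribed antibiotics for pneumonia.")
print(result[0]["translation_text"])
```
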
* Number of tokens: 15B
* Number of documents: 27M

# Training

* Effective batch size: 5120
* Learning rate: 2e-4
* Weight decay: 1e-3
* Learning schedule: linear, with 5,000 warmup steps
* Num epochs: ~3

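A minimal sketch of how these hyperparameters could map onto a `transformers` masked-LM training setup; the per-device batch size / gradient-accumulation split, the toy dataset, and the output directory are assumptions, not taken from this card:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Toy stand-in for the 27M-document corpus.
corpus = Dataset.from_dict({"text": ["De patiënt heeft koorts.", "De bloeddruk is verhoogd."]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="medroberta-nl-continued",
    per_device_train_batch_size=64,   # assumption: with 80 accumulation steps this
    gradient_accumulation_steps=80,   # yields the stated effective batch size of 5120
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer),
)
trainer.train()
```
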
* Train perplexity: 3.0
* Validation perplexity: 3.0

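For reference, perplexity here is the exponential of the masked-LM cross-entropy loss, so a value of 3.0 corresponds to roughly 1.1 nats per predicted token:

```python
import math

# perplexity = exp(loss), so the reported 3.0 implies:
print(math.log(3.0))  # ≈ 1.0986 nats per masked token
```
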
# Acknowledgement

We are grateful to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing the compute used to train this model.