---
license: gpl-3.0
language:
- nl
base_model:
- CLTL/MedRoBERTa.nl
tags:
- medical
- healthcare
metrics:
- perplexity
library_name: transformers
---

Continued, on-premise pre-training of [MedRoBERTa.nl](https://huggingface.co/CLTL/MedRoBERTa.nl) on roughly 50GB of open Dutch corpora and English corpora machine-translated into Dutch.
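Since the card lists `library_name: transformers`, the checkpoint should load with the standard masked-LM classes. A minimal usage sketch, with a hypothetical repository id:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Hypothetical repository id -- replace with this model's actual id on the Hub.
model_id = "UMCU/MedRoBERTa.nl-continued"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# RoBERTa-style models use "<mask>" as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("De patiënt kreeg <mask> voorgeschreven tegen de pijn."))
```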

# Data statistics

Sources:
* Dutch: medical guidelines (FMS, NHG)
* Dutch: [NtvG](https://www.ntvg.nl/) papers
* English: PubMed abstracts
* English: PMC abstracts translated using DeepL
* English: Apollo guidelines, papers and books
* English: Meditron guidelines
* English: MIMIC-III
* English: MIMIC-CXR
* English: MIMIC-IV

All English sources not already translated with DeepL were translated using a combination of Gemini Flash 1.5, GPT-4o mini, MarianNMT, and NLLB200.

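As an illustration of that machine-translation step, here is a minimal sketch using the publicly available distilled NLLB200 checkpoint via `transformers`; the exact checkpoints and settings used to build the corpus are not stated in this card:

```python
from transformers import pipeline

# Illustrative only: the distilled 600M NLLB200 checkpoint, translating
# English ("eng_Latn") into Dutch ("nld_Latn").
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

result = translator("The patient was prescribed antibiotics for pneumonia.")
print(result[0]["translation_text"])
```
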
* Number of tokens: 15B
* Number of documents: 27M

# Training

* Effective batch size: 5120
* Learning rate: 2e-4
* Weight decay: 1e-3
* Learning schedule: linear, with 5,000 warmup steps
* Num epochs: ~3

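A minimal sketch of how these hyperparameters could map onto a `transformers` masked-LM training setup; the per-device batch size / gradient-accumulation split, the toy dataset, and the output directory are assumptions, not taken from this card:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Toy stand-in for the 27M-document corpus.
corpus = Dataset.from_dict({"text": ["De patiënt heeft koorts.", "De bloeddruk is verhoogd."]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="medroberta-nl-continued",
    per_device_train_batch_size=64,   # assumption: with 80 accumulation steps this
    gradient_accumulation_steps=80,   # yields the stated effective batch size of 5120
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer),
)
trainer.train()
```
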
* Train perplexity: 3.0
* Validation perplexity: 3.0

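For reference, perplexity here is the exponential of the masked-LM cross-entropy loss, so a value of 3.0 corresponds to roughly 1.1 nats per predicted token:

```python
import math

# perplexity = exp(loss), so the reported 3.0 implies:
print(math.log(3.0))  # ≈ 1.0986 nats per masked token
```
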
# Acknowledgement

We are grateful to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing the compute used to train this model.