Model Card for DOKTERBERT
DOKTERBERT is a Dutch clinical language model pretrained with a SNOMED CT-grounded contrastive objective that aligns contextual span representations to SNOMED concept anchors, organising clinical concept representations against the ontology rather than treating terms in isolation.
Model Details
Model Description
DOKTERBERT (Dutch Ontology-grounded Knowledge-injected Text Encoder for Representations using BERT) is built on MedRoBERTa.nl and continues pretraining with a structure-aware contrastive loss. Contextual span representations are aligned to SNOMED CT concept anchors, with contrastive pressure between negatives weighted by graph distance in the SNOMED IS-A hierarchy, concentrating discriminative pressure on semantically adjacent concepts. The result is a representation space whose geometry reflects clinical concept structure rather than only linguistic co-occurrence.
- Developed by: Gijs Danoe, Matthijs S. Berends, Andreas Voss, Axel Hamprecht
- Shared by: Gijs Danoe
- Model type: RoBERTa-based clinical text encoder (continued pretraining)
- Language(s) (NLP): Dutch (nl)
- License: MIT
- Finetuned from model: CLTL/MedRoBERTa.nl
Model Sources
- Repository: https://github.com/gijsdanoe/DOKTERBERT
- Paper: Danoe et al. (2026), DOKTERBERT at #SMM4H–HeaRD 2026: Ontology-Grounded Contextual Representations for Dutch Clinical NLP
Uses
Direct Use
DOKTERBERT is designed for tasks that operate directly on the representation space, without task-specific fine-tuning: similarity-based retrieval, clustering, and anomaly detection over clinical text. It is most useful where labelled data is scarce and downstream systems depend on embedding geometry. Span representations should be obtained by mean-pooling the final-layer hidden states over the tokens of a span in its sentence context, as used in training.
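The span pooling described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the helper names `span_token_indices` and `mean_pool_span` are hypothetical, and selecting tokens by character overlap is an assumption. It works on plain Python lists so it runs without the model; with Transformers, the offsets would come from a fast tokenizer called with `return_offsets_mapping=True` and the hidden states from `out.last_hidden_state[0].tolist()`.

```python
from typing import List, Sequence, Tuple


def span_token_indices(
    offsets: Sequence[Tuple[int, int]], start: int, end: int
) -> List[int]:
    """Indices of tokens whose character range overlaps [start, end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_end > tok_start  # skip special tokens, which map to (0, 0)
        and tok_start < end
        and tok_end > start
    ]


def mean_pool_span(
    hidden: Sequence[Sequence[float]],
    offsets: Sequence[Tuple[int, int]],
    start: int,
    end: int,
) -> List[float]:
    """Average the per-token vectors of the tokens covering the span."""
    idx = span_token_indices(offsets, start, end)
    if not idx:
        raise ValueError("span does not overlap any token")
    dim = len(hidden[0])
    return [sum(hidden[i][d] for i in idx) / len(idx) for d in range(dim)]


# Toy stand-in for one encoded sentence: <s>, "koorts", "en", "hoesten", </s>
# with made-up 2-dimensional hidden states (real states are much wider).
text = "koorts en hoesten"
offsets = [(0, 0), (0, 6), (7, 9), (10, 17), (0, 0)]
hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]

start = text.index("hoesten")
span_vec = mean_pool_span(hidden, offsets, start, start + len("hoesten"))
```

The whole sentence is still encoded in one pass, so the span vector keeps its sentence context; only the pooling is restricted to the span's tokens.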
Downstream Use
The model can be fine-tuned for supervised tasks such as named entity recognition. On supervised NER it performs comparably to its baselines (see Evaluation); its advantage is in representation quality rather than fine-tuned task performance.
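For NER fine-tuning, word-level labels have to be aligned to subword tokens. The sketch below assumes a BIO tagging scheme over the three entity types named in the Evaluation section (DISEASE, PROCEDURE, SYMPTOM); the scheme, the label order, and the helper `align_labels` are illustrative assumptions, not the authors' setup. It follows the common convention of labelling only the first subword of each word and masking the rest with -100, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Optional, Sequence

ENTITY_TYPES = ["DISEASE", "PROCEDURE", "SYMPTOM"]
# Assumed BIO label inventory: O plus B-/I- tags per entity type.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def align_labels(
    word_labels: Sequence[str],
    word_ids: Sequence[Optional[int]],
    ignore_index: int = -100,
) -> List[int]:
    """Map word-level labels onto subword tokens.

    `word_ids` has one entry per token: the index of the word the token
    belongs to, or None for special tokens (the shape returned by a fast
    tokenizer's `word_ids()`).
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)  # special token
        elif wid != previous:
            aligned.append(label2id[word_labels[wid]])  # first subword of a word
        else:
            aligned.append(ignore_index)  # continuation subword
        previous = wid
    return aligned


# Two words, the second split into two subwords, wrapped in special tokens.
example = align_labels(["O", "B-SYMPTOM"], [None, 0, 1, 1, None])
```

The resulting `id2label` and `label2id` maps can be passed to `AutoModelForTokenClassification.from_pretrained("gijsdanoe/DOKTERBERT", num_labels=len(LABELS), id2label=id2label, label2id=label2id)`.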
Out-of-Scope Use
DOKTERBERT is a research artifact and is not a medical device. It must not be used to inform clinical decisions about individual patients. It is trained on Dutch primary-care and health-information text; use on other languages or clinical registers is out of scope.
Bias, Risks, and Limitations
The contrastive training signal depends on the quality of the span-to-concept linker, which introduces errors at both the exact-match and similarity-fallback stages. The representational evaluations use SNOMED concept identity as ground truth, which is the same structure used in the training objective, so they measure whether the model encodes SNOMED structure, not whether the geometry transfers to label schemes that differ from SNOMED. The contrastive objective saw roughly 30,000 SNOMED concepts in training, a small fraction of the full ontology; generalization to unseen concepts is untested. As a statistical model of clinical language, DOKTERBERT may reflect biases present in its training data.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Outputs should not be relied on for clinical decisions and should be validated against the intended downstream task.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("gijsdanoe/DOKTERBERT")
model = AutoModel.from_pretrained("gijsdanoe/DOKTERBERT")

text = "Patiënt presenteert met koorts en hoesten."
enc = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    out = model(**enc)

# Mean-pool the final-layer hidden states over all tokens of the sentence
emb = out.last_hidden_state.mean(dim=1)  # shape: (1, hidden_size)
Training Details
Training Data
2.34 GB of Dutch clinical text: GP consultation notes, medical journal articles, clinical and pharmacological guidelines, and patient-facing health information. The corpus extends MedRoBERTa.nl's hospital-based foundation into primary care and the broader Dutch health-information ecosystem.
Training Procedure
Preprocessing
Candidate medical spans are extracted from the corpus with spaCy dependency parsing, then linked to SNOMED CT concepts by exact string match against the Dutch SNOMED term set, with a SapBERT similarity fallback (cosine threshold 0.85). This yields 11.4M linked spans covering 30,408 unique SNOMED concepts.
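The two-stage linking step can be sketched roughly as below. Everything beyond the exact-match-then-fallback order and the 0.85 cosine threshold is an assumption made for illustration: the function `link_span`, the lowercasing normalisation, the inclusive threshold, the brute-force search, and the placeholder concept IDs `C1` and `C2` (which are not real SNOMED CT codes). In the actual pipeline the vectors would come from SapBERT; here they are toy 2-dimensional vectors.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_span(
    span: str,
    term_to_concept: Dict[str, str],
    span_vec: List[float],
    concept_vecs: Dict[str, List[float]],
    threshold: float = 0.85,
) -> Optional[Tuple[str, str]]:
    """Return (concept_id, stage), or None if the span cannot be linked."""
    # Stage 1: exact string match against the term set (normalisation assumed).
    key = span.strip().lower()
    if key in term_to_concept:
        return term_to_concept[key], "exact"

    # Stage 2: similarity fallback, accepting the best concept above threshold.
    best_id, best_sim = None, -1.0
    for concept_id, vec in concept_vecs.items():
        sim = cosine(span_vec, vec)
        if sim > best_sim:
            best_id, best_sim = concept_id, sim
    if best_id is not None and best_sim >= threshold:
        return best_id, "similarity"
    return None


# Hypothetical term set and concept embeddings with placeholder IDs.
term_to_concept = {"koorts": "C1", "hoest": "C2"}
concept_vecs = {"C1": [1.0, 0.0], "C2": [0.0, 1.0]}

exact = link_span("Koorts", term_to_concept, [0.0, 1.0], concept_vecs)
fallback = link_span("hoesten", term_to_concept, [0.1, 0.99], concept_vecs)
unlinked = link_span("fiets", term_to_concept, [1.0, 1.0], concept_vecs)
```

Spans that match no term and fall below the threshold stay unlinked, which is one way linker errors and gaps can arise at either stage.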
Training Hyperparameters
- Training regime:
- Initialized from MedRoBERTa.nl; 1 epoch of continued pretraining
- Objective: masked language modelling + distance-weighted InfoNCE contrastive loss
- Contrastive weight α = 0.2, temperature τ = 0.07, graph-distance decay σ = 15
- Optimizer AdamW, learning rate 2e-5, 1,000 warmup steps, weight decay 0.01
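To make the roles of α, τ, and σ concrete, here is one plausible reading of the objective, written for a single span. The card does not give the exact formula, so the functional form is an assumption: each negative concept is weighted by exp(-d/σ), where d is its graph distance from the positive concept in the IS-A hierarchy, and the contrastive term is added to the MLM loss with weight α. The function names are hypothetical.

```python
import math
from typing import Sequence


def distance_weight(graph_distance: float, sigma: float = 15.0) -> float:
    """Assumed decay: negatives close in the hierarchy get weights near 1."""
    return math.exp(-graph_distance / sigma)


def weighted_info_nce(
    pos_sim: float,
    neg_sims: Sequence[float],
    neg_distances: Sequence[float],
    tau: float = 0.07,
    sigma: float = 15.0,
) -> float:
    """InfoNCE for one span, with distance-weighted negatives (assumed form).

    Similarities are cosine values in [-1, 1], so exp(sim / tau) stays
    within floating-point range without a log-sum-exp trick.
    """
    pos = math.exp(pos_sim / tau)
    neg = sum(
        distance_weight(d, sigma) * math.exp(s / tau)
        for s, d in zip(neg_sims, neg_distances)
    )
    return -math.log(pos / (pos + neg))


def total_loss(mlm_loss: float, contrastive_loss: float, alpha: float = 0.2) -> float:
    """Assumed combination: MLM loss plus alpha times the contrastive loss."""
    return mlm_loss + alpha * contrastive_loss


# Same similarity, different graph distance: the nearby negative costs more.
near = weighted_info_nce(0.8, [0.6], [1])
far = weighted_info_nce(0.8, [0.6], [30])
```

Under this reading, a negative one edge away from the positive contributes almost fully to the denominator while a distant one is discounted, which matches the stated aim of concentrating discriminative pressure on semantically adjacent concepts.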
Evaluation
Testing Data, Factors & Metrics
Testing Data
MultiClinNER-nl, the Dutch subtask of the MultiClinAI shared task at the SMM4H/HeaRD workshop.
Factors
Evaluation disaggregates by entity type (DISEASE, PROCEDURE, SYMPTOM) for supervised NER, and by representational task for the unsupervised analysis.
Metrics
Supervised NER: per-type and macro F1. Representational analysis: nearest-neighbour same-concept retrieval, clustering alignment with SNOMED concepts (NMI, ARI), concept discrimination gap, and intra/inter-concept similarity ratio.
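The card does not define the representational metrics formally, so the sketch below uses assumed definitions to show the general idea: the discrimination gap as mean same-concept cosine similarity minus mean different-concept cosine similarity, the intra/inter ratio as their quotient, and retrieval as the share of spans whose nearest neighbour has the same concept. The function names and the toy data are illustrative only.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_inter_similarity(
    vectors: Sequence[Sequence[float]], labels: Sequence[str]
) -> Tuple[float, float]:
    """Mean cosine over same-concept pairs and over different-concept pairs.

    Assumes the data contains at least one pair of each kind.
    """
    intra: List[float] = []
    inter: List[float] = []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = intra if labels[i] == labels[j] else inter
        bucket.append(cosine(vectors[i], vectors[j]))
    return sum(intra) / len(intra), sum(inter) / len(inter)


def nn_same_concept_rate(
    vectors: Sequence[Sequence[float]], labels: Sequence[str]
) -> float:
    """Fraction of items whose nearest other item shares their concept."""
    hits = 0
    for i, vec in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = max(others, key=lambda j: cosine(vec, vectors[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(vectors)


# Toy span embeddings for two concepts, "A" and "B".
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B"]

intra, inter = intra_inter_similarity(vectors, labels)
gap = intra - inter      # assumed "concept discrimination gap"
ratio = intra / inter    # assumed "intra/inter-concept similarity ratio"
retrieval = nn_same_concept_rate(vectors, labels)
```

NMI and ARI are standard clustering metrics and are available in scikit-learn as `normalized_mutual_info_score` and `adjusted_rand_score`.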
Results
On supervised NER, DOKTERBERT, RobBERT, MedRoBERTa.nl, and MedRoBERTa.nl-SapBERT all fall within a narrow band (macro F1 ≈ 0.69–0.70). On the unsupervised representational analysis, DOKTERBERT separates clearly from every baseline, with a concept discrimination gap of +0.592 versus +0.170 for the next-best model, and leads on retrieval, clustering, and intra/inter-concept separation.
Summary
Standard fine-tuning evaluation hides differences in representation quality that arise during pretraining; representation analysis exposes them. On that analysis, DOKTERBERT's ontology grounding produces a measurably more clinically structured embedding space.
Technical Specifications
Model Architecture and Objective
RoBERTa-base encoder (≈125M parameters), initialized from MedRoBERTa.nl. Training objective combines masked language modelling with a distance-weighted InfoNCE contrastive loss aligning contextual span embeddings to SNOMED concept anchors.
Software
PyTorch, Hugging Face Transformers, FAISS, spaCy, NetworkX.
Citation
BibTeX:
@inproceedings{danoe2026dokterbert,
title = {DOKTERBERT at \#SMM4H--HeaRD 2026: Ontology-Grounded Contextual Representations for Dutch Clinical NLP},
author = {Danoe, Gijs and Berends, Matthijs S. and Voss, Andreas and Hamprecht, Axel},
booktitle = {Proceedings of the SMM4H/HeaRD Workshop},
year = {2026}
}
APA:
Danoe, G., Berends, M. S., Voss, A., & Hamprecht, A. (2026). DOKTERBERT at #SMM4H–HeaRD 2026: Ontology-grounded contextual representations for Dutch clinical NLP. In Proceedings of the SMM4H/HeaRD Workshop.
Model Card Authors
Gijs Danoe
Model Card Contact