Model Card for DOKTERBERT

DOKTERBERT is a Dutch clinical language model pretrained with a SNOMED CT-grounded contrastive objective that aligns contextual span representations to SNOMED concept anchors, organising clinical concept representations against the ontology rather than treating terms in isolation.

Model Details

Model Description

DOKTERBERT (Dutch Ontology-grounded Knowledge-injected Text Encoder for Representations using BERT) is built on MedRoBERTa.nl and continues pretraining with a structure-aware contrastive loss. Contextual span representations are aligned to SNOMED CT concept anchors, and negatives are weighted by their graph distance in the SNOMED IS-A hierarchy so that discriminative pressure concentrates on semantically adjacent concepts. The result is a representation space whose geometry reflects clinical concept structure rather than only linguistic co-occurrence.

Developed by: Gijs Danoe, Matthijs S. Berends, Andreas Voss, Axel Hamprecht
Shared by: Gijs Danoe
Model type: RoBERTa-based clinical text encoder (continued pretraining)
Language(s) (NLP): Dutch (nl)
License: MIT
Finetuned from model: CLTL/MedRoBERTa.nl

Model Sources

Repository: https://github.com/gijsdanoe/DOKTERBERT
Paper: Danoe et al. (2026), DOKTERBERT at #SMM4H–HeaRD 2026: Ontology-Grounded Contextual Representations for Dutch Clinical NLP

Uses

Direct Use

DOKTERBERT is designed for tasks that operate directly on the representation space, without task-specific fine-tuning: similarity-based retrieval, clustering, and anomaly detection over clinical text. It is most useful where labelled data is scarce and downstream systems depend on embedding geometry. Span representations should be obtained by mean-pooling the final-layer hidden states over the tokens of a span in its sentence context, as used in training.
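The span pooling described above can be sketched as follows. This is a minimal illustration, assuming the final-layer hidden states for one sentence and the per-token character offsets are already available (for example from a fast tokenizer called with `return_offsets_mapping=True`). The helper name `span_mean_pool` and the toy arrays are hypothetical and not part of the released code.

```python
import numpy as np

def span_mean_pool(hidden, offsets, span_start, span_end):
    """Mean-pool token vectors whose character offsets overlap [span_start, span_end).

    hidden:  (num_tokens, dim) final-layer hidden states for one sentence.
    offsets: list of (char_start, char_end) per token; special tokens are
             assumed to carry (0, 0), as Hugging Face fast tokenizers emit.
    """
    hidden = np.asarray(hidden, dtype=float)
    keep = [
        i for i, (s, e) in enumerate(offsets)
        if e > s and s < span_end and e > span_start  # skip empty/special tokens
    ]
    if not keep:
        raise ValueError("no tokens overlap the requested span")
    return hidden[keep].mean(axis=0)

# Toy sentence "koorts en hoesten": <s>, "koorts", "en", "hoesten", </s>, dim 2.
toy_hidden = np.array([[9.0, 9.0], [1.0, 3.0], [5.0, 5.0], [3.0, 1.0], [9.0, 9.0]])
toy_offsets = [(0, 0), (0, 6), (7, 9), (10, 17), (0, 0)]
span_vec = span_mean_pool(toy_hidden, toy_offsets, 10, 17)  # the span "hoesten"
```

Because the hidden states come from encoding the whole sentence, the pooled vector still carries sentence context even though only the span's tokens are averaged.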

Downstream Use

The model can be fine-tuned for supervised tasks such as named entity recognition. On supervised NER it performs comparably to its baselines (see Evaluation); its advantage is in representation quality rather than fine-tuned task performance.

Out-of-Scope Use

DOKTERBERT is a research artifact and is not a medical device. It must not be used to inform clinical decisions about individual patients. It is trained on Dutch primary-care and health-information text; use on other languages or clinical registers is out of scope.

Bias, Risks, and Limitations

The contrastive training signal depends on the quality of the span-to-concept linker, which can introduce errors at both the exact-match and similarity-fallback stages. The representational evaluations use SNOMED concept identity as ground truth, which is the same structure used in the training objective, so they measure whether the model encodes SNOMED structure, not whether the geometry transfers to label schemes that differ from SNOMED. The contrastive objective saw roughly 30,000 SNOMED concepts in training, a small fraction of the full ontology; generalization to unseen concepts is untested. As a statistical model of clinical language, DOKTERBERT may reflect biases present in its training data.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Outputs should not be relied on for clinical decisions and should be validated against the intended downstream task.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("gijsdanoe/DOKTERBERT")
model = AutoModel.from_pretrained("gijsdanoe/DOKTERBERT")

text = "Patiënt presenteert met koorts en hoesten."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

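# Sentence-level embedding: mean-pool over real tokens only. The mask keeps
# padding out of the average when several texts are encoded with padding=True.
# For span representations, pool only the tokens of the span instead.
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # shape: (batch, hidden)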

Training Details

Training Data

2.34 GB of Dutch clinical text: GP consultation notes, medical journal articles, clinical and pharmacological guidelines, and patient-facing health information. The corpus extends MedRoBERTa.nl's hospital-based foundation into primary care and the broader Dutch health-information ecosystem.

Training Procedure

Preprocessing

Candidate medical spans are extracted from the corpus with spaCy dependency parsing, then linked to SNOMED CT concepts by exact string match against the Dutch SNOMED term set, with a SapBERT similarity fallback (cosine threshold 0.85). This yields 11.4M linked spans covering 30,408 unique SNOMED concepts.
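The two-stage linking step can be sketched roughly as below. This is an illustration under stated assumptions, not the project's linker: the function name `link_span`, the case-insensitive exact match, the brute-force cosine search, and the toy vectors (standing in for SapBERT term embeddings) are all hypothetical. Only the order of the stages and the 0.85 threshold come from the description above.

```python
import numpy as np

def link_span(span_text, span_vec, term_to_concept, term_vecs, threshold=0.85):
    """Return (concept_id, stage), or (None, None) if no link is made.

    term_to_concept: dict mapping a lower-cased term string to a concept ID.
    term_vecs:       dict mapping the same term strings to embedding vectors
                     (stand-ins for SapBERT term embeddings).
    """
    key = span_text.strip().lower()
    if key in term_to_concept:                      # stage 1: exact string match
        return term_to_concept[key], "exact"

    terms = list(term_vecs)
    mat = np.array([term_vecs[t] for t in terms], dtype=float)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    q = np.asarray(span_vec, dtype=float)
    q = q / np.linalg.norm(q)
    sims = mat @ q                                  # cosine similarity to every term
    best = int(np.argmax(sims))
    if sims[best] >= threshold:                     # stage 2: similarity fallback
        return term_to_concept[terms[best]], "similarity"
    return None, None

# Toy term set with made-up concept IDs (not real SNOMED CT codes).
term_to_concept = {"koorts": "C1", "hoest": "C2"}
term_vecs = {"koorts": [1.0, 0.0], "hoest": [0.0, 1.0]}

exact = link_span("Koorts", [0.0, 0.0], term_to_concept, term_vecs)
fallback = link_span("hoesten", [0.1, 1.0], term_to_concept, term_vecs)
unlinked = link_span("onbekend", [1.0, 1.0], term_to_concept, term_vecs)
```

Spans that pass neither stage stay unlinked and contribute no contrastive signal in this sketch; a lower threshold links more spans but lets more linking errors through.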

Training Hyperparameters

Training regime:
Initialized from MedRoBERTa.nl; 1 epoch of continued pretraining
Objective: masked language modelling + distance-weighted InfoNCE contrastive loss
Contrastive weight α = 0.2, temperature τ = 0.07, graph-distance decay σ = 15
Optimizer AdamW, learning rate 2e-5, 1,000 warmup steps, weight decay 0.01
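To make the roles of α, τ, and σ concrete, here is a minimal sketch of one possible distance-weighted InfoNCE term. The card does not give the exact formula, so two things here are assumptions: the exponential weighting exp(-d / σ) of negatives by IS-A graph distance d, and the additive combination L = L_MLM + α · L_contrastive. The function names are hypothetical.

```python
import numpy as np

def distance_weighted_info_nce(span_vec, anchors, pos_idx, graph_dist,
                               tau=0.07, sigma=15.0):
    """InfoNCE over concept anchors with negatives weighted by exp(-d / sigma).

    ASSUMPTION: the exponential weighting is an illustrative choice; the
    card only states that negatives are weighted by IS-A graph distance.
    """
    span = np.asarray(span_vec, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    span = span / np.linalg.norm(span)
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)

    logits = anchors @ span / tau                    # cosine similarity / temperature
    weights = np.exp(-np.asarray(graph_dist, dtype=float) / sigma)
    weights[pos_idx] = 1.0                           # the positive is never down-weighted

    # log of sum_j w_j * exp(logit_j), shifted by the max logit for stability
    shifted = logits - logits.max()
    log_denominator = np.log(np.sum(weights * np.exp(shifted))) + logits.max()
    return float(log_denominator - logits[pos_idx])

def total_loss(mlm_loss, contrastive_loss, alpha=0.2):
    """ASSUMPTION: additive combination, L = L_MLM + alpha * L_contrastive."""
    return mlm_loss + alpha * contrastive_loss

anchors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]       # toy concept anchors
dist_to_positive = [0, 1, 12]                        # hops from the positive concept
loss_near = distance_weighted_info_nce([1.0, 0.05], anchors, 0, dist_to_positive)
loss_far = distance_weighted_info_nce([1.0, 0.05], anchors, 0, [0, 12, 1])
```

In this toy setup the second anchor is the hard negative. Placing it one hop from the positive (`loss_near`) gives it more weight than placing it twelve hops away (`loss_far`), so the loss is larger when the confusable concept is an ontological neighbour, which is the behaviour the objective is described as encouraging.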

Evaluation

Testing Data, Factors & Metrics

Testing Data

MultiClinNER-nl, the Dutch subtask of the MultiClinAI shared task at the SMM4H/HeaRD workshop.

Factors

Evaluation disaggregates by entity type (DISEASE, PROCEDURE, SYMPTOM) for supervised NER, and by representational task for the unsupervised analysis.

Metrics

Supervised NER: per-type and macro F1. Representational analysis: nearest-neighbour same-concept retrieval, clustering alignment with SNOMED concepts (NMI, ARI), concept discrimination gap, and intra/inter-concept similarity ratio.
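The representational metrics can be illustrated with a small sketch. The definitions used here are assumptions, since the card does not spell them out: the discrimination gap is taken as mean same-concept cosine similarity minus mean different-concept cosine similarity, the ratio as intra divided by inter, and retrieval as top-1 nearest-neighbour agreement. The function names and toy embeddings are hypothetical.

```python
import numpy as np

def concept_separation(embeddings, concept_ids):
    """Return (intra_mean, inter_mean, gap, ratio) over all unordered span pairs."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    ids = np.asarray(concept_ids)
    same = ids[:, None] == ids[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)   # each pair once, no self-pairs
    intra = sims[same & upper].mean()
    inter = sims[~same & upper].mean()
    # The ratio is only meaningful when the inter-concept mean is positive.
    return float(intra), float(inter), float(intra - inter), float(intra / inter)

def top1_same_concept_rate(embeddings, concept_ids):
    """Fraction of spans whose nearest other span shares their concept."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)                         # exclude the query itself
    nearest = sims.argmax(axis=1)
    ids = np.asarray(concept_ids)
    return float((ids[nearest] == ids).mean())

toy_emb = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
toy_ids = ["A", "A", "B", "B"]
intra, inter, gap, ratio = concept_separation(toy_emb, toy_ids)
hit_rate = top1_same_concept_rate(toy_emb, toy_ids)
```

Both functions need only span embeddings and their linked concept IDs, so they can be run on a frozen encoder without any fine-tuning.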

Results

On supervised NER, DOKTERBERT, RobBERT, MedRoBERTa.nl, and MedRoBERTa.nl-SapBERT all fall within a narrow band (macro F1 ≈ 0.69–0.70). On the unsupervised representational analysis, DOKTERBERT separates clearly from every baseline, with a concept discrimination gap of +0.592 versus +0.170 for the next-best model, and leads on retrieval, clustering, and intra/inter-concept separation.

Summary

Standard fine-tuning evaluation can obscure pretraining-level differences in representation quality that representational analysis exposes. On supervised NER the models are nearly indistinguishable, whereas DOKTERBERT's ontology grounding yields an embedding space that is measurably more aligned with SNOMED concept structure.

Technical Specifications

Model Architecture and Objective

RoBERTa-base encoder (≈125M parameters), initialized from MedRoBERTa.nl. Training objective combines masked language modelling with a distance-weighted InfoNCE contrastive loss aligning contextual span embeddings to SNOMED concept anchors.

Software

PyTorch, Hugging Face Transformers, FAISS, spaCy, NetworkX.

Citation

BibTeX:

@inproceedings{danoe2026dokterbert,
  title     = {DOKTERBERT at \#SMM4H--HeaRD 2026: Ontology-Grounded Contextual Representations for Dutch Clinical NLP},
  author    = {Danoe, Gijs and Berends, Matthijs S. and Voss, Andreas and Hamprecht, Axel},
  booktitle = {Proceedings of the SMM4H/HeaRD Workshop},
  year      = {2026}
}

APA:

Danoe, G., Berends, M. S., Voss, A., & Hamprecht, A. (2026). DOKTERBERT at #SMM4H–HeaRD 2026: Ontology-grounded contextual representations for Dutch clinical NLP. In Proceedings of the SMM4H/HeaRD Workshop.

Model Card Authors

Gijs Danoe

Model Card Contact

g.danoe@umcg.nl
