GHMSCHRFT-v1

GHMSCHRFT (from Dutch geheimschrift, "cipher") is a fine-tuned token classification model for detecting protected health information (PHI) in Dutch clinical free-text reports. It is the PHI detection component of the GHMSCHRFT de-identification pipeline, which combines this model with a Dutch hiding-in-plain-sight (HIPS) surrogate replacement module.

Model Details

Developed by: Luc Builtjes, Alessa Hering, and colleagues at Radboud University Medical Center (Radboudumc)
Model type: Token classification (NER) for PHI detection
Language: Dutch (nl)
License: CC BY-NC 4.0
Fine-tuned from: joeranbosma/dragon-roberta-large-mixed-domain — RoBERTa-large pretrained on mixed-domain Dutch medical text

Model Sources

Pipeline repository: https://github.com/DIAGNijmegen/GHMSCHRFT
Training & evaluation code: https://github.com/DIAGNijmegen/GHMSCHRFT-training
HIPS surrogate module: https://github.com/DIAGNijmegen/dutch-med-hips

Uses

Direct Use

This model detects and labels PHI spans in Dutch clinical free-text using 24 categories in BIO format (see tag inventory below). It is intended as part of the GHMSCHRFT pipeline, where detected spans are subsequently replaced with realistic Dutch surrogates by the dutch-med-hips module.

The recommended way to run the full pipeline (detection + surrogate replacement) is via the Docker image provided in the GHMSCHRFT repository. The model can also be used standalone for PHI span detection:

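from transformers import pipeline

# Load the fine-tuned PHI detector; "simple" aggregation groups adjacent
# tokens that share a label into a single entity span.
ner = pipeline(
    "token-classification",
    model="LMMasters/GHMSCHRFT-v1",
    aggregation_strategy="simple",
)

text = "Patiënt Jan de Vries, geboren op 15-03-1985, werd verwezen door dr. Smit."
results = ner(text)

# Each result is a dict holding the PHI category, character offsets, and a score.
for ent in results:
    print(ent["entity_group"], text[ent["start"]:ent["end"]], round(float(ent["score"]), 3))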

Note: in the full pipeline, input sequences exceeding 512 tokens are handled with a strided sliding-window approach. The standalone pipeline above truncates by default, so PHI outside the retained 512-token window goes undetected.
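As a rough illustration of how detected spans might be consumed downstream, the sketch below replaces each span with a `<LABEL>` placeholder using character offsets. It assumes entities in the aggregated token-classification format (`entity_group`, `start`, `end`). The helper names `make_span` and `mask_spans` are hypothetical, the example entities are written by hand rather than produced by the model, and placeholder masking is a stand-in here, not the surrogate replacement performed by dutch-med-hips.

```python
from typing import Dict, List


def make_span(text: str, surface: str, label: str) -> Dict:
    """Build a hand-written entity dict for the first occurrence of `surface`."""
    start = text.index(surface)
    return {"entity_group": label, "start": start, "end": start + len(surface)}


def mask_spans(text: str, entities: List[Dict]) -> str:
    """Replace every entity span with a <LABEL> placeholder."""
    masked = text
    # Process spans from right to left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[: ent["start"]] + f"<{ent['entity_group']}>" + masked[ent["end"]:]
    return masked


text = "Patiënt Jan de Vries, geboren op 15-03-1985, werd verwezen door dr. Smit."

# Hand-written example entities (NOT real model output).
example_entities = [
    make_span(text, "Jan de Vries", "PERSOON"),
    make_span(text, "15-03-1985", "DATUM"),
    make_span(text, "Smit", "PERSOON"),
]

masked = mask_spans(text, example_entities)
print(masked)
```

Working from the last span backwards avoids recomputing offsets after each substitution; overlapping spans are not handled in this sketch.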

Out-of-Scope Use

Non-Dutch text: The model was trained on Dutch clinical reports and is not intended for other languages.
Non-clinical text: Performance outside clinical radiology, pathology, and outpatient letter domains is unknown.
Standalone de-identification without review: A detection recall of ~0.985 means that roughly 1.5% of PHI entities go undetected. The model should be used as part of a broader de-identification workflow, not as a sole safeguard.
Re-identification: This model must not be used to attempt re-identification of individuals from de-identified records.

PHI Tag Inventory

The model predicts 24 PHI categories:

Group	Tags
Person & demographics	`PERSOON`, `PERSOONAFKORTING`, `LEEFTIJD`
Institution & location	`ZIEKENHUIS`, `ADRES`, `PLAATS`, `BEDRIJF`
Study-related	`STUDIE_NAAM`
Temporal	`DATUM`, `TIJD`
Contact & web	`TELEFOONNUMMER`, `EMAIL`, `URL`
Institutional identifiers	`PATIENTNUMMER`, `ZNUMMER`, `RAPPORT_ID`, `DOCUMENTID`, `DOCUMENTNUMMER`, `ACCREDITATIE_NUMMER`, `AGBNUMMER`, `BIGNUMMER`, `BSN`, `IBAN`, `PHINUMMER`
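To make the relationship between the 24 categories and the BIO label space concrete, here is a minimal sketch that expands the tag inventory into token-level labels. The ordering of labels and the resulting ids are assumptions for illustration; the authoritative mapping is whatever the model's own configuration defines.

```python
# The 24 PHI categories from the tag inventory.
PHI_CATEGORIES = [
    "PERSOON", "PERSOONAFKORTING", "LEEFTIJD",
    "ZIEKENHUIS", "ADRES", "PLAATS", "BEDRIJF",
    "STUDIE_NAAM",
    "DATUM", "TIJD",
    "TELEFOONNUMMER", "EMAIL", "URL",
    "PATIENTNUMMER", "ZNUMMER", "RAPPORT_ID", "DOCUMENTID", "DOCUMENTNUMMER",
    "ACCREDITATIE_NUMMER", "AGBNUMMER", "BIGNUMMER", "BSN", "IBAN", "PHINUMMER",
]


def build_bio_labels(categories):
    """Return ["O", "B-X", "I-X", ...] for the given categories."""
    labels = ["O"]
    for category in categories:
        labels.extend([f"B-{category}", f"I-{category}"])
    return labels


labels = build_bio_labels(PHI_CATEGORIES)
# Assumed id assignment; the real model config may order labels differently.
id2label = dict(enumerate(labels))

print(len(PHI_CATEGORIES), len(labels))
```

Under BIO tagging each category contributes a begin and an inside label, so 24 categories plus the outside label `O` give 49 token-level classes.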

Bias, Risks, and Limitations

Institutional bias: The model was trained on data from two Dutch hospitals, Radboud University Medical Center (RUMC) and Ziekenhuisgroep Twente (ZGT). Performance may differ for report types or institutions with different documentation conventions.
Rare categories: PHI categories with few training examples (e.g. company names, study names, addresses) have lower F1-scores despite high detection recall.
Ambiguous entities: Isolated first names in clinical letters and numbers that could be either ages or measurements are a known source of both false negatives and false positives.
Training data privacy: The underlying clinical training data cannot be shared. The model does not ship with patient records, but transformer models can memorize fragments of their training data, so users should treat the weights with corresponding care.
Regulatory compliance: This model is a technical component. Compliance with GDPR or other applicable regulations remains the responsibility of the deploying organization.

How to Get Started

For the full de-identification pipeline (PHI detection + Dutch surrogate replacement), use the Docker image:

docker run --rm --gpus all \
    -v /path/to/input:/input \
    -v /path/to/output:/output \
    lmmasters/ghmschrft-inference:latest \
    python process.py --input /input/docker_input.jsonl --output /output/

Input format: JSONL with {"uid": "...", "text": "..."} per line.
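A minimal sketch of preparing the input file follows. The file name `docker_input.jsonl` is taken from the command above; the `uid` values, the report texts, and the `write_jsonl` helper are hypothetical, and writing UTF-8 without ASCII escaping is an assumption about what the pipeline accepts.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical reports; the uid values are made up for illustration.
records = [
    {"uid": "report-001", "text": "Patiënt Jan de Vries, geboren op 15-03-1985."},
    {"uid": "report-002", "text": "Verwezen door dr. Smit."},
]

input_dir = Path(tempfile.mkdtemp())  # stands in for /path/to/input
input_file = input_dir / "docker_input.jsonl"
write_jsonl(records, input_file)

# Read the file back to confirm that each line parses to a uid/text object.
with open(input_file, encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle]
print(loaded[0]["uid"], len(loaded))
```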

For standalone token classification, see the code snippet under Direct Use above.

Training Details

Training Data

3,941 annotated Dutch clinical reports from two hospitals:
- Radboud University Medical Center (RUMC): radiology and pathology reports
- Ziekenhuisgroep Twente (ZGT): clinical and outpatient letters across 11 specialties
Availability: the data are not publicly available due to patient privacy constraints and GDPR.
Augmentation: each training report was duplicated with PHI replaced by HIPS surrogates (dutch-med-hips); 50 fully synthetic reports were generated with Qwen3 to address class imbalance for rare PHI types.

Training Procedure

Framework: Hugging Face Transformers via the DRAGON baseline framework
Learning rate: 1×10⁻⁵
Effective batch size: 8 (batch size 1, 8 gradient-accumulation steps)
Gradient checkpointing: enabled
Input truncation: left-truncation to 512 tokens during training; strided sliding-window at inference
Checkpoint selection: best validation performance
Training regime: fp32
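The two truncation strategies named above can be sketched in plain Python on token indices. This is an illustration under stated assumptions, not the pipeline's implementation: the overlap of 128 tokens is an arbitrary choice (the card does not state the stride), special tokens are ignored, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

MAX_LEN = 512  # model input limit stated in the card


def left_truncate(token_ids: Sequence[int], max_len: int = MAX_LEN) -> Sequence[int]:
    """Drop tokens from the left so that the last `max_len` tokens remain."""
    return token_ids[-max_len:]


def sliding_windows(n_tokens: int, max_len: int = MAX_LEN, stride: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) windows; consecutive windows overlap by `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must be in [0, max_len)")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        # Advance by the window size minus the overlap.
        start += max_len - stride
    return windows


truncated = left_truncate(list(range(600)))
windows = sliding_windows(1000)
print(len(truncated), truncated[0], windows)
```

With overlapping windows every token is seen with some surrounding context, and predictions in the overlap regions need to be merged; how the GHMSCHRFT pipeline merges them is not described here. Recent Transformers releases also accept a `stride` argument on the token-classification pipeline, though whether that reproduces the full pipeline's behavior is not stated in this card.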

Evaluation

Testing Data

Internal test set: 980 reports (RUMC + ZGT; same institutions as training, held-out split)
External test set: 340 reports from Jeroen Bosch Ziekenhuis (JBZ), an institution unseen during training; the reports cover radiology, pathology, and emergency care

Metrics

Micro-averaged F1 (exact span and type match): the principal summary metric
Detection recall: a gold entity counts as detected when every token in the gold span receives any non-O label, regardless of type or span extent; its complement (1 − detection recall) is a direct proxy for residual disclosure risk
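The detection recall definition above can be written out as a short function. This is a sketch of the stated definition, assuming gold spans are given as token index ranges with an exclusive end; the function name and the toy example are made up for illustration and are not taken from the evaluation code.

```python
from typing import List, Sequence, Tuple


def detection_recall(gold_spans: Sequence[Tuple[int, int]], pred_labels: List[str]) -> float:
    """Fraction of gold spans whose tokens ALL carry a non-O predicted label."""
    if not gold_spans:
        raise ValueError("detection recall is undefined without gold entities")
    detected = sum(
        all(pred_labels[i] != "O" for i in range(start, end))
        for start, end in gold_spans
    )
    return detected / len(gold_spans)


# Toy example: tokens and gold spans (token indices, end exclusive).
tokens = ["Patiënt", "Jan", "de", "Vries", "geboren", "op", "15-03-1985"]
gold_spans = [(1, 4), (6, 7)]  # "Jan de Vries" and "15-03-1985"

# Made-up predictions: the name is only partly covered, the date has the wrong type.
pred_labels = ["O", "B-PERSOON", "I-PERSOON", "O", "O", "O", "B-LEEFTIJD"]

recall = detection_recall(gold_spans, pred_labels)
print(recall)
```

In the toy example the date counts as detected despite the wrong category, because any non-O label suffices, while the name is missed because "Vries" is labeled O. Exact-match F1 would count both as errors.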

All headline metrics include 95% confidence intervals from document-level bootstrap resampling (5,000 iterations).
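Document-level bootstrap resampling can be sketched as follows: documents are drawn with replacement, per-document counts are pooled, and the metric is recomputed for each replicate. The percentile interval used here is an assumption, since the card does not state which interval method was applied, and the per-document counts are invented toy data. Detection recall serves as the example metric; the same scheme applies to precision, recall, and F1.

```python
import random
from typing import List, Tuple


def pooled_recall(doc_counts: List[Tuple[int, int]]) -> float:
    """Pool (detected, total) counts across documents into a single recall."""
    detected = sum(d for d, _ in doc_counts)
    total = sum(t for _, t in doc_counts)
    return detected / total


def bootstrap_ci(doc_counts, n_iter=5000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap over documents."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_iter):
        # Resample whole documents with replacement.
        sample = rng.choices(doc_counts, k=len(doc_counts))
        if sum(t for _, t in sample) > 0:  # skip replicates without gold entities
            stats.append(pooled_recall(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return pooled_recall(doc_counts), lower, upper


# Invented toy data: (detected entities, gold entities) per document.
doc_counts = [(3, 3), (2, 2), (4, 5), (1, 1), (5, 5), (2, 3)]
point, lower, upper = bootstrap_ci(doc_counts)
print(point, lower, upper)
```

Resampling documents rather than individual entities keeps entities from the same report together, which respects the correlation between errors within a document.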

Results

Split	Precision	Recall	F1	Detection recall
Internal (RUMC + ZGT, N=980)	0.941	0.944	0.943	0.985
External (JBZ, N=340)	0.949	0.952	0.950	0.986

On detection recall, GHMSCHRFT (0.985 / 0.986 on the internal / external test sets) substantially outperforms rule-based baselines evaluated on the same sets: DEDUCE v3 (0.549 / 0.682), RRA v2 (0.386 / 0.295), and the OpenAI Privacy Filter (0.339 / 0.498).

Citation

A paper describing this model is currently under review. Citation information will be added upon publication. In the meantime, please cite this model directly:

@misc{builtjes2026ghmschrft,
  author    = {Builtjes, Luc and Bosma, Joeran S. and Lefkes, Judith and
               Veldhuis, Anouk and Geerdink, Jeroen and de Haas, Tijmen and
               Hoitsma, Fabian R. and Weber, Rianne and de Boer, Sarah and
               Buser, Myrthe and Kendall, Harry J. and Sanches, Delano and
               van Ginneken, Bram and Huisman, Henkjan and Smit, Ewoud J. and
               Prokop, Mathias and Hering, Alessa},
  title     = {{GHMSCHRFT-v1}: PHI Detection Model for Dutch Clinical Reports},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/LMMasters/GHMSCHRFT-v1},
}


Model size: 0.6B params (Safetensors, tensor type F32)
