Danish XLM-R NER Large (Two-Stage)

A Danish Named Entity Recognition model based on xlm-roberta-large (560M parameters), fine-tuned in two stages for high-recall PII detection in Danish text.

Model Description

This model detects three entity types relevant for GDPR-compliant PII processing:

  • PER - Person names
  • ORG - Organizations (companies, institutions, government bodies)
  • LOC - Locations (addresses, cities, countries)

MISC is intentionally excluded to reduce noise and focus on actionable PII entities.

Training

Two-stage fine-tuning approach:

  1. Stage 1 (broad NER): DANSK + DaNE + NorNE, 10 epochs, LR 2e-5
  2. Stage 2 (domain adaptation): DANSK-only, 1 epoch, LR 5e-6

This approach achieves the best balance between multi-domain generalization and Danish-specific performance.
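The schedule above can be summarized as a small configuration fragment (illustrative shorthand for this card, not the actual training script):

```python
# Illustrative summary of the two-stage schedule; dict keys and values
# mirror the description above, not real training code.
STAGES = [
    {"stage": 1, "datasets": ["DANSK", "DaNE", "NorNE"], "epochs": 10, "learning_rate": 2e-5},
    {"stage": 2, "datasets": ["DANSK"], "epochs": 1, "learning_rate": 5e-6},
]
```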

Datasets

| Dataset | Role | Train size | Domains |
|---------|------|------------|---------|
| DANSK | Primary | 11.7K | Web, News, Wiki, Legal, Dannet, Conversation, Social Media |
| DaNE | Supplementary | 4.4K | News |
| NorNE | Stage 1 only | ~20K | News (Norwegian Bokmål + Nynorsk) |

Evaluation Results

DANSK (primary benchmark, multi-domain)

| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 88.0 | 85.3 | 90.3 | 87.6 | 86.6 | 88.7 |
| Test | 84.8 | 84.6 | 90.3 | 86.5 | 85.4 | 87.5 |

DaNE (secondary benchmark, news domain)

| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 97.5 | 85.1 | 92.9 | 93.0 | 93.2 | 92.9 |
| Test | 94.2 | 79.7 | 87.8 | 87.7 | 88.1 | 87.3 |

GPI Legal Documents (independent evaluation, Danish legal domain)

Evaluated on 30 human-corrected documents (contracts, invoices, case briefs, client letters):

| Entity | Precision | Recall | Notes |
|--------|-----------|--------|-------|
| PER | 0.76 | 1.00 | Perfect recall; false positives are email addresses misclassified as PER |
| ORG | 0.94 | 0.96 | Near-perfect |
| LOC | 0.52 | 0.51 | Boundary errors (detects the street, misses the house number); detection rate is near-perfect |

LOC score reflects strict span matching. The model consistently detects location entities but predicts shorter spans (e.g., "Gothersgade" instead of "Gothersgade 81"). A post-processing step to extend LOC spans to include adjacent numbers resolves this.
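A minimal sketch of such a span-extension step, assuming character offsets as produced by the pipeline output (the regex is a heuristic for house numbers, not the exact rule used in production):

```python
import re

# Matches an adjacent house number such as " 81" or " 42B" (heuristic).
HOUSE_NUMBER = re.compile(r"\s+\d+[A-Za-z]?")

def extend_loc_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Extend a LOC span to include a house number directly after it,
    e.g. 'Gothersgade' -> 'Gothersgade 81'."""
    m = HOUSE_NUMBER.match(text[end:])
    if m:
        end += m.end()
    return start, end
```

For example, applied to the span covering "Gothersgade" in "bor på Gothersgade 81", the returned span covers "Gothersgade 81"; spans with no trailing number are returned unchanged.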

Usage

```python
from transformers import pipeline

model_name = "thomasbeste/danish-xlmr-ner-large"
nlp = pipeline("ner", model=model_name, aggregation_strategy="simple")

text = "Anders Jensen fra Danske Bank bor på Vestergade 42 i København."
entities = nlp(text)

for ent in entities:
    print(f"  {ent['entity_group']}: {ent['word']} (score: {ent['score']:.3f})")
```

ONNX Deployment

For production use, export to ONNX INT8 for ~3x CPU speedup:

```bash
pip install "optimum[onnxruntime]"

# Export to ONNX
optimum-cli export onnx --model thomasbeste/danish-xlmr-ner-large ./model-onnx --task token-classification

# Quantize to INT8
python -c "
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
q = ORTQuantizer.from_pretrained('./model-onnx')
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
q.quantize(save_dir='./model-onnx-int8', quantization_config=qconfig)
"
```

Label Scheme

IOB2 format with 7 labels:

| ID | Label |
|----|-------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
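The same mapping, together with a small IOB2 decoder, can be expressed in code (a sketch; the model's `config.json` carries the identical `id2label` mapping):

```python
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

def ids_to_spans(label_ids):
    """Decode a sequence of label ids into (entity_type, start, end)
    token spans under IOB2: 'B-' opens an entity, 'I-' continues it
    only when the type matches and the tokens are contiguous."""
    spans, current = [], None
    for i, lid in enumerate(label_ids):
        label = ID2LABEL[lid]
        if label.startswith("B-"):
            current = [label[2:], i, i + 1]
            spans.append(current)
        elif label.startswith("I-") and current and current[0] == label[2:] and current[2] == i:
            current[2] = i + 1
        else:
            current = None  # 'O' or an ill-formed 'I-' tag breaks the entity
    return [tuple(s) for s in spans]
```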

Intended Use

Designed for GDPR-compliant PII detection in Danish enterprise document processing pipelines. Optimized for recall over precision — a missed entity (false negative) is a compliance risk, while over-detection (false positive) is safe.

Limitations

  • Optimized for Danish text. May work on other Scandinavian languages (Norwegian, Swedish) but not evaluated.
  • LOC boundary detection tends to predict shorter spans than the full address. Post-processing recommended.
  • Email addresses are sometimes misclassified as PER. Downstream validation (reject names containing @) is recommended.
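A downstream validation pass of the kind described above could be sketched as follows (the entity dict format matches the pipeline output; the email regex is a loose heuristic, not part of the model):

```python
import re

# Loose email pattern; good enough to reject obvious addresses (heuristic).
EMAIL = re.compile(r"\S+@\S+\.\S+")

def reject_email_persons(entities):
    """Drop PER predictions that look like email addresses, a known
    false-positive mode of this model."""
    return [
        e for e in entities
        if not (e["entity_group"] == "PER" and EMAIL.search(e["word"]))
    ]
```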