SDVM Multilingual NER: Refined

An XLM-RoBERTa-base model fine-tuned for Named Entity Recognition on the refined (cleaned) PAN-X.de dataset from the XTREME benchmark.

This model is part of a paired experiment by SDVM to demonstrate the impact of data quality on NER performance. Compare with SDVM/multilingual-ner-original, which was trained on uncleaned data.

Training Details

  • Base model: xlm-roberta-base
  • Dataset: SDVM/xtreme-PAN-X.de, using the cleaned tokens_refined and ner_tags_refined columns
  • Training: 3 epochs, batch size 8, learning rate 2e-5, weight decay 0.01
  • Task: Token classification with IOB2 tags
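The hyperparameters above could be passed to the Hugging Face Trainer roughly as follows. This is a minimal sketch, not the authors' training script: the output directory name is hypothetical, and mapping "batch size 8" to per_device_train_batch_size is an assumption, since the card does not say whether the figure is per device or global.

```python
# Hyperparameters as listed in this card.
# Assumption: "batch size 8" means the per-device training batch size.
HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}


def build_training_args(output_dir="xlmr-panx-de-refined"):
    """Build TrainingArguments from the card's settings.

    The default output_dir is a hypothetical name chosen for illustration.
    """
    # Imported lazily so the dictionary above can be inspected
    # without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

All other settings (warmup, evaluation schedule, seed) are left at library defaults here because the card does not specify them.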

Data Refinement

The refined dataset was produced by removing Wikipedia markup noise (~8.5% of the data):

  • Bold/italic markers (**, '')
  • Template and link brackets ({{, }}, [[, ]])
  • Section headers (==, ===)
  • German Wikipedia redirect tokens (WEITERLEITUNG)
  • Embedded markup stripped from tokens
  • B-/I- tag continuity repaired after token removal
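The refinement steps listed above can be illustrated with a short sketch. The actual cleaning script is not published in this card, so the regular expressions and the function names refine and repair_iob2 are hypothetical; they only mirror the bullet points: drop markup-only tokens, strip markup embedded in tokens, then repair B-/I- continuity.

```python
import re

# Hypothetical patterns modeled on the bullet list above (not the authors' rules).
MARKUP_ONLY = re.compile(r"^(\*{2,}|'{2,}|\{\{|\}\}|\[\[|\]\]|={2,}|#?WEITERLEITUNG)$")
EMBEDDED = re.compile(r"\*{2,}|'{2,}|\{\{|\}\}|\[\[|\]\]|={2,}")


def repair_iob2(tags):
    """Promote I-X to B-X whenever it does not continue an entity of type X."""
    repaired = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]
        repaired.append(tag)
        prev = tag
    return repaired


def refine(tokens, tags):
    """Drop markup-only tokens, strip embedded markup, then repair the tags."""
    kept_tokens, kept_tags = [], []
    for token, tag in zip(tokens, tags):
        if MARKUP_ONLY.match(token):
            continue  # pure markup: remove the token together with its tag
        cleaned = EMBEDDED.sub("", token)
        if not cleaned:
            continue  # nothing left once the markup is stripped
        kept_tokens.append(cleaned)
        kept_tags.append(tag)
    return kept_tokens, repair_iob2(kept_tags)


tokens = ["''", "Angela", "Merkel", "''", "besuchte", "[[Hamburg]]"]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "B-LOC"]
clean_tokens, clean_tags = refine(tokens, tags)
print(clean_tokens)  # ['Angela', 'Merkel', 'besuchte', 'Hamburg']
print(clean_tags)    # ['B-PER', 'I-PER', 'O', 'B-LOC']
```

In this example the opening '' token carried the B-PER tag, so removing it leaves an entity that starts with I-PER; the repair step promotes that tag to B-PER to keep the sequence valid IOB2.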

Labels

ID Tag
0 O
1 B-PER
2 I-PER
3 B-ORG
4 I-ORG
5 B-LOC
6 I-LOC
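For working with raw model outputs, the table above can be written as a pair of Python mappings. The names ID2LABEL, LABEL2ID, and decode are chosen here for illustration; the mapping itself is taken directly from the table.

```python
# Label mapping copied from the table above.
ID2LABEL = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(label_ids):
    """Convert a sequence of predicted label IDs into IOB2 tag strings."""
    return [ID2LABEL[i] for i in label_ids]


print(decode([1, 2, 0, 0, 5, 0]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']
```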

Usage

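from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub
ner = pipeline("token-classification", model="SDVM/multilingual-ner-refined")

# Returns one prediction per (sub)token, each with its tag and score
result = ner("Angela Merkel wurde in Hamburg geboren.")
print(result)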

Context

Removing Wikipedia markup artifacts from the training data is intended to give the model cleaner token inputs and more reliable entity predictions. To measure the effect of data refinement, compare this model's F1 score with that of SDVM/multilingual-ner-original.
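The card does not state how F1 is computed. A common choice for NER is entity-level F1, where a predicted entity counts only if its type and boundaries both match the gold annotation exactly. The sketch below assumes that definition; extract_spans and entity_f1 are illustrative names, and stray I- tags that follow no open entity are simply ignored, which may differ from what evaluation libraries do.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in an IOB2 sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Compute micro-averaged entity-level precision, recall, and F1."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```

Running both models on the same held-out split and passing their predictions through the same function keeps the comparison like for like.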
