SDVM Multilingual NER — Refined

An XLM-RoBERTa-base model fine-tuned for Named Entity Recognition on the refined (cleaned) PAN-X.de dataset from the XTREME benchmark.

This model is part of a paired experiment by SDVM to demonstrate the impact of data quality on NER performance. Compare with SDVM/multilingual-ner-original, which was trained on uncleaned data.

Training Details

Base model: xlm-roberta-base
Dataset: SDVM/xtreme-PAN-X.de — tokens_refined and ner_tags_refined columns (cleaned)
Training: 3 epochs, batch size 8, learning rate 2e-5, weight decay 0.01
Task: Token classification with IOB2 tags
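
The hyperparameters above can be written down as a plain dictionary. This is a minimal sketch, not the authors' training script: the keys borrow the keyword names used by Hugging Face TrainingArguments, and reading "batch size 8" as a per-device value is an assumption. The helper total_optimizer_steps and the example size of 1,000 sentences are hypothetical and serve only to show how the settings interact.

```python
import math

# Hyperparameters from the list above. Key names follow the keyword
# arguments of transformers.TrainingArguments.
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,  # assumption: "batch size 8" is per device
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}


def total_optimizer_steps(num_examples, config, num_devices=1):
    """Hypothetical helper: optimizer steps for a full run, without gradient accumulation."""
    effective_batch = config["per_device_train_batch_size"] * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * config["num_train_epochs"]


# Illustrative size only; this is not the size of the PAN-X.de training split.
steps = total_optimizer_steps(1000, training_config)
```

With the Trainer API, the same four values could be passed as TrainingArguments(output_dir=..., **training_config).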

Data Refinement

Refinement removed Wikipedia markup noise amounting to ~8.5% of the original data:

Bold/italic markers (**, '')
Template and link brackets ({{, }}, [[, ]])
Section headers (==, ===)
German Wikipedia redirect tokens (WEITERLEITUNG)
Embedded markup stripped from tokens
B-/I- tag continuity repaired after token removal
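
The steps listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the cleaning script used to build the dataset: the regular expression, the function names refine and repair_iob2, and the choice to promote an orphaned I- tag to B- are all assumptions made for this example.

```python
import re

# Markup fragments named in the list above (assumed patterns, not the authors' exact rules).
MARKUP = re.compile(r"\*\*|''|\{\{|\}\}|\[\[|\]\]|={2,}")
REDIRECT_TOKEN = "WEITERLEITUNG"


def repair_iob2(tags):
    """Turn any I-X that does not continue an X entity into B-X."""
    repaired = []
    prev_type = None
    for tag in tags:
        if tag == "O":
            prev_type = None
            repaired.append(tag)
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "I" and prev_type != ent_type:
            tag = "B-" + ent_type  # entity lost its first token, so restart it here
        prev_type = ent_type
        repaired.append(tag)
    return repaired


def refine(tokens, tags):
    """Strip embedded markup, drop tokens that become empty, then repair the tags."""
    kept_tokens, kept_tags = [], []
    for token, tag in zip(tokens, tags):
        cleaned = MARKUP.sub("", token)
        if not cleaned or cleaned == REDIRECT_TOKEN:
            continue  # pure markup or redirect token: remove together with its tag
        kept_tokens.append(cleaned)
        kept_tags.append(tag)
    return kept_tokens, repair_iob2(kept_tags)


tokens = ["''", "Angela", "Merkel", "''", "in", "[[Hamburg]]"]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "B-LOC"]
clean_tokens, clean_tags = refine(tokens, tags)
```

In this example the opening '' carried the B-PER tag, so removing it leaves the entity starting with I-PER; the repair step restores a valid IOB2 sequence by relabeling "Angela" as B-PER.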

Labels

ID	Tag
0	O
1	B-PER
2	I-PER
3	B-ORG
4	I-ORG
5	B-LOC
6	I-LOC
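
As a small illustration of how these IDs are used, the sketch below decodes a sequence of label IDs into IOB2 tags and groups them into entity spans. The mapping is copied from the table above; the helper names decode and extract_entities, and the decision to ignore an I- tag that does not continue an open entity, are assumptions made for this example.

```python
# Mapping copied from the label table above.
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(label_ids):
    """Convert integer label IDs to IOB2 tag strings."""
    return [ID2LABEL[i] for i in label_ids]


def extract_entities(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity (ignored in this sketch)
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Angela", "Merkel", "wurde", "in", "Hamburg", "geboren", "."]
label_ids = [1, 2, 0, 0, 5, 0, 0]
entities = extract_entities(tokens, decode(label_ids))
```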

Usage

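from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub (downloaded on first use).
ner = pipeline("token-classification", model="SDVM/multilingual-ner-refined")

# Run NER on a German sentence.
result = ner("Angela Merkel wurde in Hamburg geboren.")

# Prints a list of dicts, one per sub-word token tagged as an entity,
# each with the predicted tag, a confidence score, and character offsets.
print(result)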

Context

Because Wikipedia markup artifacts were removed from its training data, this model is expected to learn cleaner token representations and produce more reliable entity predictions. To see the impact of data refinement, compare its F1 score with that of SDVM/multilingual-ner-original.
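
For readers who want to reproduce such a comparison, here is a minimal sketch of an entity-level F1 computation over IOB2 sequences. It assumes the reported score is entity-level, as in the seqeval-based evaluation of the referenced chapter; the card itself does not state this. The functions spans and entity_f1 are written for this example, and because they ignore I- tags that do not continue an open entity, their results can differ slightly from seqeval on malformed sequences.

```python
def spans(tags):
    """Return the set of (entity_type, start, end) spans in one IOB2 sequence."""
    found = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if ent_type is not None and not continues:
            found.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return found


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 over exact span matches (type and boundaries must agree)."""
    true_pos = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = spans(gold), spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Running both models on the same test sentences and passing their predictions through the same scoring function keeps the comparison like for like.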

Reference

Based on Chapter 4 of Natural Language Processing with Transformers
Part of the SDVM data quality demonstration series


Model size

0.3B params, stored as Safetensors with tensor type F32

Evaluation results

F1 on PAN-X.de (Refined), test set, self-reported: 0.870