SDVM/xtreme-PAN-X.de
An XLM-RoBERTa-base model fine-tuned for Named Entity Recognition on the refined (cleaned) PAN-X.de dataset from the XTREME benchmark.
This model is part of a paired experiment by SDVM to demonstrate the impact of data quality on NER performance. Compare with SDVM/multilingual-ner-original, which was trained on uncleaned data.
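Training used the `tokens_refined` and `ner_tags_refined` columns (cleaned). The refined dataset had ~8.5% Wikipedia markup noise removed:

- Emphasis markers (`**`, `''`)
- Template and link brackets (`{{`, `}}`, `[[`, `]]`)
- Heading markers (`==`, `===`)
- Redirect keyword (`WEITERLEITUNG`)

Label mapping:

| ID | Tag |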
|---|---|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
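
As a rough illustration of how the label IDs in the table map back to entities, the sketch below turns a sequence of tag IDs into entity spans using the BIO scheme. The helper name `decode_entities` and the example sentence tags are assumptions made for illustration; only the ID-to-tag mapping itself comes from this card.

```python
# ID-to-tag mapping copied from the table in this card.
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_entities(tokens, tag_ids):
    """Group word-level BIO tag IDs into (entity_type, text) spans.

    Hypothetical helper for illustration; not part of the released model.
    """
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the span collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag_id in zip(tokens, tag_ids):
        label = ID2LABEL[tag_id]
        if label == "O":
            flush()
            current_type, current_tokens = None, []
        elif label.startswith("B-") or label[2:] != current_type:
            # A B- tag (or an I- tag with no matching open span) starts a new span.
            flush()
            current_type, current_tokens = label[2:], [token]
        else:
            # An I- tag that continues the open span of the same type.
            current_tokens.append(token)
    flush()
    return entities


tokens = ["Angela", "Merkel", "wurde", "in", "Hamburg", "geboren", "."]
tag_ids = [1, 2, 0, 0, 5, 0, 0]  # assumed gold tags for this sentence
print(decode_entities(tokens, tag_ids))
# [('PER', 'Angela Merkel'), ('LOC', 'Hamburg')]
```

The helper works on word-level tags; subword predictions from the tokenizer would first need to be aligned back to words.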
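```python
from transformers import pipeline

# Load the fine-tuned model as a token-classification (NER) pipeline.
ner = pipeline("token-classification", model="SDVM/multilingual-ner-refined")

# Run NER on a German sentence and print the per-token predictions.
result = ner("Angela Merkel wurde in Hamburg geboren.")
print(result)
```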
Removing Wikipedia markup artifacts from the training data is intended to give this model cleaner token representations and more reliable entity predictions. To see the impact of data refinement, compare its F1 score with that of the original model, SDVM/multilingual-ner-original.
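
To make the idea of markup removal concrete, here is a minimal sketch of how tokens consisting purely of Wikipedia markup might be dropped together with their tags. This is not the actual cleaning procedure used for the dataset: the regular expression, the function names `drop_markup_tokens` and `repair_bio`, and the tag-repair step are all assumptions, built only from the markup categories named in this card.

```python
import re

# Assumed pattern: matches tokens that consist only of one of the markup
# artifacts named in this card (emphasis markers, template/link brackets,
# heading markers, the German redirect keyword).
MARKUP = re.compile(r"\*\*|''+|\{\{|\}\}|\[\[|\]\]|={2,}|#?WEITERLEITUNG")


def repair_bio(tags):
    """Turn an I- tag into B- when no span of the same type is open."""
    repaired = []
    prev_type = None
    for tag in tags:
        if tag.startswith("I-") and prev_type != tag[2:]:
            tag = "B-" + tag[2:]
        prev_type = None if tag == "O" else tag[2:]
        repaired.append(tag)
    return repaired


def drop_markup_tokens(tokens, tags):
    """Remove pure-markup tokens and keep the tag sequence aligned.

    Hypothetical helper; a removed token may have carried a B- tag, so the
    remaining tags are passed through repair_bio afterwards.
    """
    kept = [(tok, tag) for tok, tag in zip(tokens, tags)
            if not MARKUP.fullmatch(tok)]
    clean_tokens = [tok for tok, _ in kept]
    clean_tags = repair_bio([tag for _, tag in kept])
    return clean_tokens, clean_tags


tokens = ["''", "Berlin", "''", "ist", "[[", "Hauptstadt", "]]"]
tags = ["B-LOC", "I-LOC", "I-LOC", "O", "O", "O", "O"]
print(drop_markup_tokens(tokens, tags))
# (['Berlin', 'ist', 'Hauptstadt'], ['B-LOC', 'O', 'O'])
```

The sketch only handles markup that appears as a standalone token; markup embedded inside a longer token would need a separate substitution step.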