Training dataset: SDVM/xtreme-PAN-X.de
An XLM-RoBERTa-base model fine-tuned for Named Entity Recognition on the original (unrefined) PAN-X.de dataset from the XTREME benchmark.
This model is part of a paired experiment by SDVM to demonstrate the impact of data quality on NER performance. Compare with SDVM/multilingual-ner-refined, which was trained on cleaned data.
The dataset provides tokens and ner_tags columns (original, uncleaned). Tag IDs map to labels as follows:

| ID | Tag |
|---|---|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
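
The tag table above can be expressed as a lookup for decoding the integer ner_tags column. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `decode_tags` are illustrative, and the example row is hand-written rather than taken from the dataset.

```python
# Tag inventory copied from the table above.
ID2LABEL = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_tags(tag_ids):
    """Convert integer ner_tags into their string labels."""
    return [ID2LABEL[i] for i in tag_ids]


# Hand-written example row (an assumption, not an actual dataset record).
tokens = ["Angela", "Merkel", "wurde", "in", "Hamburg", "geboren", "."]
ner_tags = [1, 2, 0, 0, 5, 0, 0]

for token, tag in zip(tokens, decode_tags(ner_tags)):
    print(f"{token}\t{tag}")
```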
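from transformers import pipeline

# Load the fine-tuned checkpoint as a token-classification (NER) pipeline
ner = pipeline("token-classification", model="SDVM/multilingual-ner-original")

# Run NER on a German sentence; the result is a list of per-token predictions
result = ner("Angela Merkel wurde in Hamburg geboren.")
print(result)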
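
By default the pipeline returns one prediction per token, so multi-token names such as "Angela Merkel" arrive as separate B-/I- entries. The sketch below shows one way to merge them into entity spans using the `entity`, `start`, and `end` keys of each prediction. The helper `merge_entities` and the sample predictions are hypothetical: the sample list is hand-written to mimic the output shape and is not real model output. The sketch also ignores subword pieces that repeat a B- tag. The pipeline's own `aggregation_strategy="simple"` argument offers built-in grouping and is likely the better choice in practice.

```python
def merge_entities(text, predictions, max_gap=1):
    """Merge per-token B-/I- predictions into (type, surface text) spans.

    An I- prediction extends the previous span when it has the same entity
    type and starts at most `max_gap` characters after that span ends.
    """
    spans = []
    for pred in predictions:
        prefix, _, ent_type = pred["entity"].partition("-")
        if not ent_type:  # "O" or any label without a B-/I- prefix
            continue
        last = spans[-1] if spans else None
        if (
            prefix == "I"
            and last is not None
            and last["type"] == ent_type
            and pred["start"] - last["end"] <= max_gap
        ):
            last["end"] = pred["end"]  # extend the current span
        else:
            spans.append(
                {"type": ent_type, "start": pred["start"], "end": pred["end"]}
            )
    return [(s["type"], text[s["start"]:s["end"]].strip()) for s in spans]


text = "Angela Merkel wurde in Hamburg geboren."
# Hand-written predictions in the pipeline's output shape (assumption).
sample_predictions = [
    {"entity": "B-PER", "start": 0, "end": 6},
    {"entity": "I-PER", "start": 7, "end": 13},
    {"entity": "B-LOC", "start": 23, "end": 30},
]
print(merge_entities(text, sample_predictions))
```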
This model was trained on the original PAN-X.de data, which contains ~8.5% Wikipedia markup noise tokens (bold markers, quote marks, redirect tags, etc.). These artifacts can confuse the model during both training and inference.
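
To make the noise description concrete, the sketch below estimates the share of markup tokens in a set of rows and drops them together with their tags. The pattern in `NOISE_PATTERN` is an assumption about what bold markers, quote marks, and redirect tags look like as tokens. It is not the cleaning rule used to build the refined dataset, and the example row is hand-written.

```python
import re

# Hypothetical markup patterns (assumption): runs of two or more apostrophes
# (wiki bold/italic markers), tokens made only of quote characters, and
# redirect directives.
NOISE_PATTERN = re.compile(
    r"^(?:'{2,}|[\"„“”«»]+|#(?:REDIRECT|WEITERLEITUNG).*)$",
    re.IGNORECASE,
)


def is_noise(token):
    """Return True if the token looks like Wikipedia markup residue."""
    return bool(NOISE_PATTERN.match(token))


def noise_ratio(rows):
    """Fraction of noise tokens across an iterable of token lists."""
    rows = list(rows)
    total = sum(len(tokens) for tokens in rows)
    noisy = sum(is_noise(t) for tokens in rows for t in tokens)
    return noisy / total if total else 0.0


def strip_noise(tokens, ner_tags):
    """Drop noise tokens and their tags, keeping the two lists aligned.

    Caveat: if a dropped token carried a B- tag, a following I- tag is left
    without its opener; a fuller cleaner would repair that.
    """
    kept = [(t, tag) for t, tag in zip(tokens, ner_tags) if not is_noise(t)]
    return [t for t, _ in kept], [tag for _, tag in kept]


# Hand-written example row (assumption, not an actual dataset record).
tokens = ["'''", "Hamburg", "'''", "ist", "eine", "Stadt", "."]
ner_tags = [0, 5, 0, 0, 0, 0, 0]

print(f"noise ratio: {noise_ratio([tokens]):.3f}")
print(strip_noise(tokens, ner_tags))
```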
For a cleaner alternative, see SDVM/multilingual-ner-refined.