ai4privacy/open-pii-masking-500k-ai4privacy
A multilingual transformer model (xlm-roberta-base) fine-tuned for Named Entity Recognition (NER) to detect and mask Personally Identifiable Information (PII) in text across English, German, Italian, and French.
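from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hugging Face Hub
model_id = "Ar86Bat/multilang-pii-ner"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run the pipeline on a sample sentence and print the detected entities
text = "John Doe was born on 12/12/1990 and lives in Berlin."
results = nlp(text)
print(results)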
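The snippet above only prints the detected entities. To actually mask the text, each detected span can be replaced with a placeholder. The following is a minimal sketch, not part of the repository: the helper name `mask_pii` and the `[LABEL]` placeholder format are illustrative choices. It assumes each entity is a dict with `entity_group`, `start`, and `end` keys, which is the shape the `transformers` NER pipeline returns when an aggregation strategy is set and a fast tokenizer is used. The sample spans are hand-written for illustration and are not real model output.

```python
from typing import Dict, List


def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder (hypothetical helper)."""
    masked = text
    # Process spans from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


text = "John Doe was born on 12/12/1990 and lives in Berlin."

# Hand-written spans for illustration only; NOT actual model output.
sample_entities = [
    {"entity_group": "GIVENNAME", "start": 0, "end": 4},
    {"entity_group": "SURNAME", "start": 5, "end": 8},
    {"entity_group": "DATE", "start": 21, "end": 31},
    {"entity_group": "CITY", "start": 45, "end": 51},
]

masked_text = mask_pii(text, sample_entities)
print(masked_text)
```

With a loaded pipeline, `mask_pii(text, nlp(text))` should work the same way, provided the returned spans do not overlap.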
AGE, BUILDINGNUM, CITY, DATE, EMAIL, GIVENNAME, STREET, TELEPHONENUM, TIME
EMAIL and DATE (F1 ≈ 0.999)
DRIVERLICENSENUM (F1 ≈ 0.85), GENDER (F1 ≈ 0.83), PASSPORTNUM (F1 ≈ 0.88), SURNAME (F1 ≈ 0.85), SEX (F1 ≈ 0.84)
model/ directory.
num_train_epochs=2  # Total number of training epochs
per_device_train_batch_size=32  # Batch size for training
per_device_eval_batch_size=32  # Batch size for evaluation
If you use this model, please cite the repository:
@misc{ar86bat_multilang_pii_ner_2025,
  author = {Arif Hizlan},
  title = {Multilingual PII NER},
  year = {2025},
  howpublished = {\url{https://huggingface.co/Ar86Bat/multilang-pii-ner}}
}
https://github.com/Ar86Bat/multilang-pii-ner
MIT License