roberta-pii-ner-en

Fine-tuned roberta-base for detecting Personally Identifiable Information (PII) and Payment Card Industry (PCI) data in English text.

GitHub: rakmohan/pii-ner-en

Model Performance

Metric           Score
Micro avg F1     0.95
Macro avg F1     0.94
Weighted avg F1  0.95

Per-entity metrics are available in classification_report.txt.
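
The three averages above weight entity types differently, which matters when some types are rare. The sketch below shows how each is computed from per-entity counts. The counts are hypothetical, chosen only for illustration; they are not taken from classification_report.txt.

```python
# Hypothetical per-entity counts: (true positives, false positives, false negatives).
# These numbers are invented for illustration and are NOT this model's results.
counts = {
    "EMAIL": (95, 5, 5),
    "CREDIT_CARD": (45, 5, 5),
    "CURRENCY": (6, 2, 4),
}


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_entity = {label: f1(*c) for label, c in counts.items()}

# Micro average: pool the counts across all entity types, then compute one F1.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1(total_tp, total_fp, total_fn)

# Macro average: unweighted mean of per-entity F1, so rare types count equally.
macro_f1 = sum(per_entity.values()) / len(per_entity)

# Weighted average: mean of per-entity F1 weighted by support (tp + fn).
support = {label: c[0] + c[2] for label, c in counts.items()}
weighted_f1 = sum(per_entity[l] * support[l] for l in counts) / sum(support.values())

print(f"micro={micro_f1:.3f} macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

In this toy data the weak, rare CURRENCY type pulls the macro average down more than the other two, which is the usual reason a macro score sits below the micro and weighted scores.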

Usage

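from transformers import pipeline

# Load the fine-tuned model. "simple" aggregation groups adjacent tokens
# of the same entity type into a single span.
ner = pipeline(
    "token-classification",
    model="rm0013/roberta-pii-ner-en",
    aggregation_strategy="simple",
)

text = "Send the invoice to john.smith@acme.com, card 4111-1111-1111-1111 CVV 123."

# Each result is a dict with the matched text, its entity type, and a confidence score.
for entity in ner(text):
    print(f"{entity['word']:30s} → {entity['entity_group']} ({entity['score']:.2f})")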
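
A common next step is to mask the detected spans. The sketch below is one way to do that, assuming each result also carries `start` and `end` character offsets, as the transformers token-classification pipeline returns with a fast tokenizer. The `redact` helper and the `min_score` threshold of 0.5 are hypothetical and not part of this model's repository. The entities are hand-written in the pipeline's output shape so the example runs without downloading the model.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is a list of dicts shaped like the pipeline output, with
    `entity_group`, `score`, `start`, and `end` keys. `min_score` is an
    illustrative threshold, not a recommended value.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from the end of the string so earlier offsets stay valid.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


sample = "Send the invoice to john.smith@acme.com, CVV 123."
email_start = sample.index("john.smith@acme.com")
cvv_start = sample.index("123")

# Hand-written stand-ins for `ner(sample)`; the scores are made up.
entities = [
    {"entity_group": "EMAIL", "score": 0.99,
     "start": email_start, "end": email_start + len("john.smith@acme.com")},
    {"entity_group": "CREDIT_CARD_CVV", "score": 0.97,
     "start": cvv_start, "end": cvv_start + len("123")},
]

redacted = redact(sample, entities)
print(redacted)
```

With a loaded pipeline, `redact(text, ner(text))` would apply the same masking to real predictions.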

Supported Entities (54 types)

PII: PERSON_NAME EMAIL PHONE_NUMBER SSN ADDRESS SECONDARYADDRESS DATE_OF_BIRTH DATE TIME AGE GENDER USERNAME PASSWORD IP_ADDRESS URL API_KEY PASSPORT_NUMBER DRIVER_LICENSE ORGANIZATION COMPANYNAME ACCOUNTNAME JOBAREA JOBTITLE JOBTYPE HEIGHT EYECOLOR ORDINALDIRECTION GPS_COORDINATES NEARBYGPSCOORDINATE USERAGENT DEVICE_ID VEHICLE_ID VEHICLEVIN VEHICLEVRM PHONEIMEI

PCI / Financial: CREDIT_CARD CREDIT_CARD_CVV CREDIT_CARD_EXPIRY PIN BANK_ACCOUNT BANK_ROUTING BIC AMOUNT CURRENCY CURRENCYCODE CURRENCYNAME CURRENCYSYMBOL MASKEDNUMBER BITCOINADDRESS ETHEREUMADDRESS LITECOINADDRESS

Training Data

  1. ai4privacy/pii-masking-200k — 200k annotated PII examples
  2. Synthetic PCI data generated with Faker for rare entities (CVV, PIN, routing numbers)
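
The synthetic examples in item 2 can be pictured as templated sentences whose slots are filled with generated values and labelled automatically. The sketch below illustrates that idea under several assumptions: the `build_example` helper, the template format, and the word-level BIO tagging scheme are hypothetical, and the slot values are hard-coded to keep the example dependency-free, where the setup described above would draw them from Faker.

```python
def build_example(template, values):
    """Fill `{LABEL}` slots in a template and emit word-level BIO tags.

    Slots must be standalone, whitespace-separated words such as `{PIN}`.
    Returns (tokens, tags) with one tag per token.
    """
    tokens, tags = [], []
    for word in template.split():
        if word.startswith("{") and word.endswith("}"):
            label = word[1:-1]
            # A multi-word value gets B- on its first word and I- on the rest.
            for i, value_word in enumerate(values[label].split()):
                tokens.append(value_word)
                tags.append(("B-" if i == 0 else "I-") + label)
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


# Hard-coded stand-ins for generated values (illustrative only).
values = {"CREDIT_CARD": "4111 1111 1111 1111", "PIN": "4821"}
tokens, tags = build_example("My card is {CREDIT_CARD} and the PIN is {PIN}", values)

for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

Because the label positions are known from the template, no manual annotation is needed, which is what makes this approach practical for rare entity types.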

Training Details

Parameter            Value
Base model           roberta-base
Epochs               10 (early stopping, patience 3)
Batch size           32
Learning rate        2e-5
Max sequence length  256
Mixed precision      FP16
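
The epoch setting means training runs for at most 10 epochs but halts sooner if the monitored metric stops improving. The sketch below shows the patience logic in plain Python. It assumes the metric is a validation score evaluated once per epoch where higher is better, which the table does not state; the `stopping_epoch` helper and the scores are hypothetical.

```python
def stopping_epoch(scores, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive evaluations fail to beat
    the best score seen so far.
    """
    best = float("-inf")
    stale = 0  # evaluations since the last improvement
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Invented validation scores for up to 10 epochs (not this model's training log).
val_scores = [0.80, 0.88, 0.91, 0.90, 0.91, 0.905, 0.90, 0.90, 0.90, 0.90]
stop_epoch = stopping_epoch(val_scores, patience=3)
print(stop_epoch)
```

Note that matching the previous best, as in epoch 5 here, does not reset the counter in this sketch; only a strict improvement does.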

Limitations

  • English text only
  • Performance varies by entity type — currency-related entities (CURRENCY, CURRENCYNAME) have lower accuracy due to limited training signal
  • Not tested on non-standard text formats (code, structured data)

License

MIT
