roberta-pii-ner-en
Fine-tuned roberta-base for detecting Personally Identifiable Information (PII) and Payment Card Industry (PCI) data in English text.
GitHub: rakmohan/pii-ner-en
Model Performance
| Metric | Score |
|---|---|
| Micro avg F1 | 0.95 |
| Macro avg F1 | 0.94 |
| Weighted avg F1 | 0.95 |
Per-entity metrics are available in classification_report.txt.
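The three averages in the table weight entity types differently, which matters when rare entities score lower than common ones. The sketch below shows how each average is derived from per-entity counts. The counts are hypothetical and chosen only for illustration; they are not taken from classification_report.txt, and `f1_averages` is an illustrative helper, not part of this repository.

```python
from typing import Dict, Tuple

# Hypothetical per-entity counts: (true positives, false positives, false negatives).
# Illustrative numbers only; not the model's actual results.
counts: Dict[str, Tuple[int, int, int]] = {
    "EMAIL": (90, 10, 10),
    "CREDIT_CARD": (40, 0, 10),
    "CURRENCY": (5, 5, 5),
}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_averages(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    per_class = {label: f1(*c) for label, c in counts.items()}
    # Support = number of gold instances of each entity type (TP + FN).
    support = {label: tp + fn for label, (tp, _fp, fn) in counts.items()}

    # Micro: pool all counts first, then compute a single F1.
    tp_sum = sum(c[0] for c in counts.values())
    fp_sum = sum(c[1] for c in counts.values())
    fn_sum = sum(c[2] for c in counts.values())
    micro = f1(tp_sum, fp_sum, fn_sum)

    # Macro: unweighted mean of per-entity F1, so rare types count equally.
    macro = sum(per_class.values()) / len(per_class)

    # Weighted: mean of per-entity F1 weighted by support.
    weighted = sum(per_class[l] * support[l] for l in per_class) / sum(support.values())

    return {"micro": micro, "macro": macro, "weighted": weighted}


if __name__ == "__main__":
    for name, value in f1_averages(counts).items():
        print(f"{name:9s} {value:.3f}")
```

In this toy example the low-scoring, low-support CURRENCY type pulls the macro average down much more than the micro or weighted averages.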
Usage
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="rm0013/roberta-pii-ner-en",
    aggregation_strategy="simple",
)

result = ner("Send the invoice to john.smith@acme.com, card 4111-1111-1111-1111 CVV 123.")
for entity in result:
    print(f"{entity['word']:30s} → {entity['entity_group']} ({entity['score']:.2f})")
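A common next step is to mask the detected spans rather than print them. The following is a minimal sketch, assuming each aggregated entity carries `start` and `end` character offsets alongside `entity_group` (the usual shape of aggregated token-classification pipeline output) and that spans do not overlap. The `redact` helper and the sample entity are hypothetical, not part of this model's repository.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right-to-left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entity in the pipeline's output shape, for illustration only;
# this is not an actual prediction from the model.
sample = "Email john.smith@acme.com now."
value = "john.smith@acme.com"
start = sample.index(value)
sample_entities = [
    {"entity_group": "EMAIL", "word": value, "score": 0.99,
     "start": start, "end": start + len(value)},
]

if __name__ == "__main__":
    print(redact(sample, sample_entities))
```

With real model output, `redact(text, ner(text))` would apply the same masking to whatever spans the pipeline returns.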
Supported Entities (54 types)
PII: PERSON_NAME EMAIL PHONE_NUMBER SSN ADDRESS SECONDARYADDRESS DATE_OF_BIRTH DATE TIME AGE GENDER USERNAME PASSWORD IP_ADDRESS URL API_KEY PASSPORT_NUMBER DRIVER_LICENSE ORGANIZATION COMPANYNAME ACCOUNTNAME JOBAREA JOBTITLE JOBTYPE HEIGHT EYECOLOR ORDINALDIRECTION GPS_COORDINATES NEARBYGPSCOORDINATE USERAGENT DEVICE_ID VEHICLE_ID VEHICLEVIN VEHICLEVRM PHONEIMEI
PCI / Financial: CREDIT_CARD CREDIT_CARD_CVV CREDIT_CARD_EXPIRY PIN BANK_ACCOUNT BANK_ROUTING BIC AMOUNT CURRENCY CURRENCYCODE CURRENCYNAME CURRENCYSYMBOL MASKEDNUMBER BITCOINADDRESS ETHEREUMADDRESS LITECOINADDRESS
Training Data
- ai4privacy/pii-masking-200k — 200k annotated PII examples
- Synthetic PCI data generated with Faker for rare entities (CVV, PIN, routing numbers)
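The synthetic-data idea can be sketched as filling a sentence template with generated values and recording the character span of each one. The example below is an assumption-laden illustration, not the author's generation script: it uses Python's standard `random` module as a stand-in for Faker so that it runs without extra dependencies, and the template, `make_example`, and `random_card` are hypothetical. The resulting character spans would still need to be aligned to tokenizer output before training.

```python
import random
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes partial + digit pass the Luhn checksum."""
    total = 0
    # Walk right-to-left; the rightmost digit of `partial` is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def random_card(rng: random.Random) -> str:
    """Generate a 16-digit, Luhn-valid number formatted in groups of four."""
    partial = "4" + "".join(rng.choice("0123456789") for _ in range(14))
    digits = partial + luhn_check_digit(partial)
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))


def make_example(rng: random.Random) -> Tuple[str, List[Span]]:
    """Fill a fixed template and record where each entity value lands."""
    parts = [
        "Charge card ",
        ("CREDIT_CARD", random_card(rng)),
        " with CVV ",
        ("CREDIT_CARD_CVV", f"{rng.randrange(1000):03d}"),
        " and PIN ",
        ("PIN", f"{rng.randrange(10000):04d}"),
        ".",
    ]
    text, spans = "", []
    for part in parts:
        if isinstance(part, tuple):
            label, value = part
            spans.append((len(text), len(text) + len(value), label))
            text += value
        else:
            text += part
    return text, spans


if __name__ == "__main__":
    text, spans = make_example(random.Random(0))
    print(text)
    for start, end, label in spans:
        print(f"{label:16s} {text[start:end]}")
```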
Training Details
| Parameter | Value |
|---|---|
| Base model | roberta-base |
| Epochs | Up to 10 (early stopping, patience 3) |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Max sequence length | 256 |
| Mixed precision | FP16 |
Limitations
- English text only
- Performance varies by entity type — currency-related entities (CURRENCY, CURRENCYNAME) have lower accuracy due to limited training signal
- Not tested on non-standard text formats (code, structured data)
License
MIT
Evaluation results
- Micro F1 on ai4privacy/pii-masking-200k (self-reported): 0.950