rail-v1 — PII Detection NER

rail-v1 is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of GuardRailAI, where it runs alongside eight checksum-validating Presidio recognisers (Luhn, US SSN, Singapore NRIC, China Resident ID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).

Performance

Evaluated on a held-out validation set of 1,572 documents (681 from the Kaggle PII validation split + 891 from the nvidia/Nemotron-PII corporate validation split):

Metric               Value
Overall F_β5 (β=5)   0.994
Overall recall       0.995
Overall precision    0.964

Per-entity recall (gated at ≥ 0.95 unless noted):

Entity                           Recall   Precision
PERSON_NAME                      0.995    0.956
EMAIL                            1.000    0.982
USERNAME                         1.000    1.000
ID_NUM                           1.000    0.961
URL_PERSONAL                     0.997    0.991
COMPANY_NAME (gated at 0.85)     0.998    0.968
PHONE_NUM (observability)        0.980    0.909
STREET_ADDRESS (observability)   0.978    0.973

F_β5 is the F_β score with β=5: a weighted harmonic mean of precision and recall in which recall carries β² = 25 times the weight of precision. The model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying.
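As a quick sanity check, the reported overall score can be reproduced from the overall precision and recall using the standard F_β definition. This is a minimal sketch; the function name `f_beta` is illustrative and not part of the rail-v1 repo.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta: harmonic mean with recall weighted by beta**2."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall as reported in the Performance table.
score = f_beta(precision=0.964, recall=0.995, beta=5)
print(f"{score:.3f}")  # 0.994
```

Because recall dominates the denominator, the result sits much closer to the recall (0.995) than to the precision (0.964).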

Label schema

37 BIO labels covering 18 entity types:

  • Personal: PERSON_NAME, EMAIL, USERNAME, ID_NUM, PHONE_NUM, URL_PERSONAL, STREET_ADDRESS
  • Singapore / China: NAME_ZH_BILINGUAL, ADDRESS_SG, ADDRESS_CN, SG_PHONE, CN_PHONE, BANK_ACCOUNT_SG, BANK_ACCOUNT_CN
  • Corporate: COMPANY_NAME, TAX_ID, EMPLOYEE_ID, LICENSE_NUM
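The count of 37 follows from the BIO scheme: one O tag plus a B- and an I- tag for each of the 18 entity types. The sketch below rebuilds that list from the names above. The ordering is an assumption; for a Transformers checkpoint the authoritative mapping is normally the `id2label` field in the model's `config.json`.

```python
# Entity types as listed above, grouped the same way.
ENTITY_TYPES = [
    # Personal
    "PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM", "PHONE_NUM",
    "URL_PERSONAL", "STREET_ADDRESS",
    # Singapore / China
    "NAME_ZH_BILINGUAL", "ADDRESS_SG", "ADDRESS_CN", "SG_PHONE",
    "CN_PHONE", "BANK_ACCOUNT_SG", "BANK_ACCOUNT_CN",
    # Corporate
    "COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM",
]

# One "O" (outside) tag plus B- (begin) and I- (inside) tags per entity type.
# This ordering is illustrative and may not match the model's label ids.
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

print(len(ENTITY_TYPES), len(labels))  # 18 37
```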

Training data

Source                                                       Documents
Kaggle Learning Agency Lab PII                                   6,126
Hard negatives (Presidio false positives, 3× oversampled)          639
NeMo SG/CN financial PII synthetic                                 640
Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS)        600
Nemotron-PII corporate (filtered to 17 business domains)        16,944
Total training documents                                        24,949

Validation: 681 Kaggle docs + 891 Nemotron corporate docs = 1,572 documents.

Quick start

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ohhsj/rail-v1",
    aggregation_strategy="simple",
)

text = "Hi, I'm Jane Doe — email jane@acme.com or call +1 415 555 0143."
for span in ner(text):
    print(f"{span['entity_group']:<15} {span['word']!r:<25} score={span['score']:.3f}")

For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, and China USCCs.
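The recognisers themselves live in the parent repo and are not reproduced here. To illustrate what checksum validation adds over a bare pattern match, here is a minimal, self-contained Luhn check of the kind used for credit card numbers. The function name `luhn_valid` is hypothetical, and the parent repo's implementation may differ.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the ASCII digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True  (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (last digit altered)
```

A pattern match that fails the checksum can be discarded, which filters out arbitrary 16-digit strings that merely look like card numbers.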

Intended use

  • Batch redaction of logs and exported datasets for GDPR/CCPA compliance
  • Real-time PII screening of chat / email / form data
  • Pre-training corpus filtering for LLM teams
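For the batch-redaction use case, the spans returned by the pipeline can be applied to the text by character offset. The sketch below assumes each span is a dict with `start`, `end`, and `entity_group` keys, which is what the Transformers token-classification pipeline returns when an aggregation strategy is set. The helper name `redact` and the `[LABEL]` placeholder format are illustrative choices, not part of the model or the parent repo, and overlapping spans are not handled.

```python
from typing import Iterable, Mapping


def redact(text: str, spans: Iterable[Mapping]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Apply replacements right to left so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[: span["start"]] + f"[{span['entity_group']}]" + out[span["end"]:]
    return out


# Hand-written spans standing in for pipeline output, so the example runs
# without downloading the model.
text = "Contact Jane Doe at jane@acme.com."
spans = [
    {"entity_group": "PERSON_NAME", "start": 8, "end": 16},
    {"entity_group": "EMAIL", "start": 20, "end": 33},
]
print(redact(text, spans))  # Contact [PERSON_NAME] at [EMAIL].
```

With the model loaded as in the quick start, the equivalent call would be `redact(text, ner(text))`.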

Limitations

  • English and Chinese only. Other languages will pass through largely unredacted.
  • Optimised for recall over precision at β=5; expect more false positives than a precision-tuned model would produce (e.g. non-PII tokens redacted as PERSON_NAME). Overall validation precision is 0.964.
  • Pattern-handled entities like full street addresses can be over-redacted at sub-word level depending on tokenization. Use the parent repo's Presidio pipeline for normalised span boundaries.
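If you work with the raw pipeline output instead of the parent repo's pipeline, one possible post-processing step is to merge touching or overlapping spans that share a label, so that an entity split into fragments becomes a single span. This is a generic sketch, not the normalisation used in the parent repo; `merge_spans` and its `max_gap` parameter are hypothetical.

```python
def merge_spans(spans, max_gap=1):
    """Merge same-label spans that overlap or sit within `max_gap` characters."""
    merged = []
    for span in sorted(spans, key=lambda s: (s["start"], s["end"])):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity_group"] == span["entity_group"]
            and span["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous span instead of starting a new one.
            prev["end"] = max(prev["end"], span["end"])
        else:
            merged.append({
                "entity_group": span["entity_group"],
                "start": span["start"],
                "end": span["end"],
            })
    return merged


# Invented offsets for illustration: two address fragments one character apart.
fragments = [
    {"entity_group": "STREET_ADDRESS", "start": 10, "end": 13},
    {"entity_group": "STREET_ADDRESS", "start": 14, "end": 25},
    {"entity_group": "EMAIL", "start": 40, "end": 53},
]
print(merge_spans(fragments))
```

Here the two STREET_ADDRESS fragments collapse into one span covering characters 10 to 25, and the EMAIL span is left unchanged.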

Architecture

  • Base model: distilbert/distilbert-base-uncased (66M params)
  • Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
  • Hardware: NVIDIA A100 (~11 min total)
  • Hyperparameters: see training_args.bin in this repo
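The card states only that the loss is weighted cross-entropy with a maximum class weight of 50; how the weights were derived is not documented here. One common scheme, shown below purely as an assumption, is inverse-frequency weighting normalised so that the most frequent label has weight 1, then capped. The function name and the toy counts are invented for illustration.

```python
def capped_inverse_frequency_weights(label_counts, max_weight=50.0):
    """Weight each label by (most common count / its count), capped at max_weight."""
    most_common = max(label_counts.values())
    return {
        label: min(most_common / count, max_weight) if count > 0 else max_weight
        for label, count in label_counts.items()
    }


# Toy token counts, invented for illustration only.
counts = {"O": 1_000_000, "B-PERSON_NAME": 40_000, "B-EMAIL": 500}
weights = capped_inverse_frequency_weights(counts)
print(weights)  # {'O': 1.0, 'B-PERSON_NAME': 25.0, 'B-EMAIL': 50.0}
```

In PyTorch, weights like these would be ordered by label id and passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`. The cap keeps very rare labels (here B-EMAIL, whose uncapped weight would be 2000) from dominating the loss.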

Citation

If you use this model, please cite the parent project:

GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii