rail-v1 — PII Detection NER

rail-v1 is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of GuardRailAI, where it runs alongside eight checksum-validating Presidio recognisers (Luhn, US SSN, Singapore NRIC, China Resident ID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).

Performance

Evaluated on a held-out validation set of 1,572 documents (681 from the Kaggle PII val split + 891 from the nvidia/Nemotron-PII corporate val split):

Metric	Value
Overall F_β5 (β=5)	0.994
Overall recall	0.995
Overall precision	0.964

Per-entity recall and precision (recall gated at ≥ 0.95 unless noted):

Entity	Recall	Precision
PERSON_NAME	0.995	0.956
EMAIL	1.000	0.982
USERNAME	1.000	1.000
ID_NUM	1.000	0.961
URL_PERSONAL	0.997	0.991
COMPANY_NAME (gated at 0.85)	0.998	0.968
PHONE_NUM (observability)	0.980	0.909
STREET_ADDRESS (observability)	0.978	0.973
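
The gating rule in the table above can be read as a simple threshold check. The sketch below assumes that "gated" means a minimum recall that must be met, and that "observability" entities are reported without a threshold. Both readings, and the names `RECALL`, `GATES`, and `failed_gates`, are illustrative assumptions rather than the project's actual evaluation code.

```python
# Recall values copied from the per-entity table.
RECALL = {
    "PERSON_NAME": 0.995,
    "EMAIL": 1.000,
    "USERNAME": 1.000,
    "ID_NUM": 1.000,
    "URL_PERSONAL": 0.997,
    "COMPANY_NAME": 0.998,
    "PHONE_NUM": 0.980,
    "STREET_ADDRESS": 0.978,
}

# Assumed thresholds: 0.95 by default, 0.85 for COMPANY_NAME,
# None for entities marked "observability" (reported, not gated).
GATES = {name: 0.95 for name in RECALL}
GATES["COMPANY_NAME"] = 0.85
GATES["PHONE_NUM"] = None
GATES["STREET_ADDRESS"] = None


def failed_gates(recall, gates):
    """Return the entities whose recall falls below their gate threshold."""
    return [
        name
        for name, threshold in gates.items()
        if threshold is not None and recall[name] < threshold
    ]


print(failed_gates(RECALL, GATES))  # [] (every gated entity passes)
```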

F_β with β=5 gives recall β² = 25 times the weight of precision in the weighted harmonic mean. The model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying.
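
As a sanity check, the overall score can be recomputed from the reported precision and recall using the standard definition F_β = (1 + β²)·P·R / (β²·P + R). The helper name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta: weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Overall precision and recall from the Performance table.
print(round(f_beta(0.964, 0.995), 3))  # 0.994
```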

Label schema

37 BIO labels covering 18 entity types:

Personal: PERSON_NAME, EMAIL, USERNAME, ID_NUM, PHONE_NUM, URL_PERSONAL, STREET_ADDRESS
Singapore / China: NAME_ZH_BILINGUAL, ADDRESS_SG, ADDRESS_CN, SG_PHONE, CN_PHONE, BANK_ACCOUNT_SG, BANK_ACCOUNT_CN
Corporate: COMPANY_NAME, TAX_ID, EMPLOYEE_ID, LICENSE_NUM
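
The count of 37 follows from the BIO scheme: one B- and one I- tag per entity type, plus a single O tag (18 × 2 + 1 = 37). The sketch below rebuilds such a label list. The ordering and the exact `B-`/`I-` strings are assumptions; the authoritative mapping should be the `id2label` field in the model's `config.json`.

```python
ENTITY_TYPES = [
    # Personal
    "PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM",
    "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS",
    # Singapore / China
    "NAME_ZH_BILINGUAL", "ADDRESS_SG", "ADDRESS_CN", "SG_PHONE",
    "CN_PHONE", "BANK_ACCOUNT_SG", "BANK_ACCOUNT_CN",
    # Corporate
    "COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM",
]


def bio_labels(entity_types):
    """Build an O tag plus B-/I- tags for every entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


LABELS = bio_labels(ENTITY_TYPES)
print(len(ENTITY_TYPES), len(LABELS))  # 18 37
```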

Training data

Source	Documents
Kaggle Learning Agency Lab PII	6,126
Hard negatives (Presidio false positives, 3× oversampled)	639
NeMo SG/CN financial PII synthetic	640
Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS)	600
Nemotron-PII corporate (filtered to 17 business domains)	16,944
Total training documents	24,949

Validation: 681 Kaggle docs + 891 Nemotron corporate docs = 1,572 documents.

Quick start

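from transformers import pipeline

# Load rail-v1 from the Hugging Face Hub as a token-classification pipeline.
# aggregation_strategy="simple" groups token-level predictions into entity spans.
ner = pipeline(
    "token-classification",
    model="ohhsj/rail-v1",
    aggregation_strategy="simple",
)

text = "Hi, I'm Jane Doe, email jane@acme.com or call +1 415 555 0143."
# Each span is a dict with the entity label, the matched text, and a confidence score.
for span in ner(text):
    print(f"{span['entity_group']:<15} {span['word']!r:<25} score={span['score']:.3f}")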

For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, and China USCCs.
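
Checksum validation is what lets a pattern recogniser reject digit strings that merely look like an identifier. As an illustration, here is a generic Luhn check of the kind used for credit card numbers. It is a textbook sketch, not the parent repo's recogniser, and the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    # Keep digits only, so spaces and dashes used as separators are ignored.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test card number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A regex match that fails this check can be discarded before it is ever reported as PII, which is how checksum recognisers keep precision high on numeric identifiers.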

Intended use

Batch redaction of logs and exported datasets for GDPR/CCPA compliance
Real-time PII screening of chat / email / form data
Pre-training corpus filtering for LLM teams
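
For the redaction use cases above, the spans returned by the model still have to be applied to the text. A minimal sketch, assuming each span is a dict with `entity_group`, `start`, and `end` keys (the shape the Transformers token-classification pipeline returns when an aggregation strategy is set). The `redact` helper and the `[LABEL]` placeholder format are illustrative choices, not part of the released pipeline.

```python
def redact(text, spans):
    """Replace each detected character span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    # Process spans left to right.
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor:
            # Overlaps the previous span: extend the redacted region instead
            # of emitting a second placeholder.
            cursor = max(cursor, span["end"])
            continue
        out.append(text[cursor:span["start"]])
        out.append(f"[{span['entity_group']}]")
        cursor = span["end"]
    out.append(text[cursor:])
    return "".join(out)


# Hand-written spans standing in for real model output.
text = "Contact Jane Doe at jane@acme.com"
spans = [
    {"entity_group": "PERSON_NAME", "start": 8, "end": 16},
    {"entity_group": "EMAIL", "start": 20, "end": 33},
]
print(redact(text, spans))  # Contact [PERSON_NAME] at [EMAIL]
```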

Limitations

English and Chinese only. Other languages will pass through largely unredacted.
Optimised for recall over precision at β=5; expect some false positives (overall precision is 0.964), e.g. non-PII tokens redacted as PERSON_NAME.
Pattern-handled entities like full street addresses can be over-redacted at sub-word level depending on tokenisation. Use the parent repo's Presidio pipeline for normalised span boundaries.
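
If false positives are a problem for a given workload, one possible mitigation is to drop low-confidence spans before redacting, at the cost of some recall. The sketch below is an example under stated assumptions, not part of the released pipeline: `filter_spans`, the default threshold, and the per-entity override are hypothetical and would need tuning on your own data.

```python
def filter_spans(spans, default_threshold=0.5, per_entity=None):
    """Keep spans whose score meets the threshold for their entity type."""
    per_entity = per_entity or {}
    kept = []
    for span in spans:
        threshold = per_entity.get(span["entity_group"], default_threshold)
        if span["score"] >= threshold:
            kept.append(span)
    return kept


# Hand-written spans standing in for real model output.
spans = [
    {"entity_group": "PERSON_NAME", "word": "jane doe", "score": 0.99},
    {"entity_group": "PERSON_NAME", "word": "amazon", "score": 0.62},
    {"entity_group": "EMAIL", "word": "jane@acme.com", "score": 0.97},
]
strict = filter_spans(spans, per_entity={"PERSON_NAME": 0.8})
print([s["word"] for s in strict])  # ['jane doe', 'jane@acme.com']
```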

Architecture

Base model: distilbert/distilbert-base-uncased (66M params)
Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
Hardware: NVIDIA A100 (~11 min total)
Hyperparameters: see training_args.bin in this repo
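
The card does not say how the class weights were derived. One common scheme, shown here purely as an assumption, is inverse-frequency weighting capped at a maximum (50, matching the value above), which keeps rare entity tags from being swamped by the O tag. The function names and the label counts are made up for illustration.

```python
import math


def capped_class_weights(label_counts, max_weight=50.0):
    """Inverse-frequency weights, total / (n_classes * count), capped at max_weight."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {
        label: min(max_weight, total / (n_classes * count))
        for label, count in label_counts.items()
    }


def weighted_cross_entropy(logits, target, weights):
    """Weighted cross-entropy for one token: -w[target] * log_softmax(logits)[target]."""
    peak = max(logits.values())  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return -weights[target] * (logits[target] - log_norm)


# Hypothetical token counts: the O tag dominates, B-EMAIL is rare.
counts = {"O": 9000, "B-EMAIL": 10}
weights = capped_class_weights(counts)
print(weights["B-EMAIL"])  # 50.0 (the uncapped value would be 450.5)
```

With these weights, a misclassified rare tag contributes far more to the loss than a misclassified O token, which pushes training towards recall on the entity classes.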

Citation

If you use this model, please cite the parent project:

GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii

Model size: 66.4M params (Safetensors, tensor type F32)