rail-v2: PII Detection NER

rail-v2 is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of GuardRailAI, where it runs alongside nine checksum-validating Presidio recognisers (Luhn, US SSN, Singapore NRIC, China Resident ID, Hong Kong HKID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).

What's new vs rail-v1

Five new entity types covering Hong Kong business documents and Singapore GST registration:

  • HK_ID: Hong Kong Identity Card (with a weighted mod-11 check-digit validator in the pipeline)
  • ADDRESS_HK: Hong Kong street addresses
  • HK_PHONE: Hong Kong phone numbers (+852 prefix)
  • BANK_ACCOUNT_HK: HK bank account numbers (HSBC, BOC HK, SCB, Hang Seng)
  • GST_REG_NUM: Singapore GST registration numbers (M2-NNNNNNNN-N)
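
As an illustration of the kind of check-digit validation applied to HK_ID, here is a minimal sketch of the commonly documented HKID scheme: the letter prefix and six digits are weighted 9 down to 2, summed, and reduced modulo 11. It is not the pipeline's own validator, and the function names `hkid_check_char` and `is_valid_hkid` are hypothetical.

```python
import re

_HKID_BODY = re.compile(r"[A-Z]{1,2}\d{6}")
_HKID_FULL = re.compile(r"([A-Z]{1,2}\d{6})\(?([0-9A])\)?")


def hkid_check_char(body: str) -> str:
    """Return the check character for an HKID body such as 'A123456'."""
    body = body.upper()
    if not _HKID_BODY.fullmatch(body):
        raise ValueError("expected 1-2 letters followed by 6 digits")
    # Single-letter prefixes are left-padded with a space, which counts as 36.
    padded = body.rjust(8)

    def value(ch: str) -> int:
        if ch == " ":
            return 36
        if ch.isdigit():
            return int(ch)
        return ord(ch) - ord("A") + 10  # A=10 ... Z=35

    # Weights run 9, 8, ..., 2 across the eight positions.
    total = sum(value(ch) * w for ch, w in zip(padded, range(9, 1, -1)))
    check = (11 - total % 11) % 11
    return "A" if check == 10 else str(check)


def is_valid_hkid(candidate: str) -> bool:
    """Accept forms like 'A123456(3)' or 'A1234563'."""
    match = _HKID_FULL.fullmatch(candidate.strip().upper())
    if match is None:
        return False
    body, check = match.groups()
    return hkid_check_char(body) == check
```

The HKID used in the quick-start text, `A123456(3)`, passes this check.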

Performance

Evaluated on a held-out validation set (Kaggle PII val + Nemotron corporate val + 148 DataDesigner HK/SG/CN business docs):

Metric               Value
Overall F_β (β=5)    0.986
Overall recall       0.988
Overall precision    0.949

Per-entity recall and precision:

Entity             Recall   Precision
HK_ID              1.000    0.980
ADDRESS_HK         1.000    0.940
HK_PHONE           1.000    1.000
BANK_ACCOUNT_HK    1.000    1.000
GST_REG_NUM        1.000    1.000
URL_PERSONAL       1.000    0.980
ID_NUM             1.000    0.933
EMAIL              0.998    0.967
COMPANY_NAME       0.994    0.953
PHONE_NUM          0.980    0.915
USERNAME           0.976    0.976
PERSON_NAME        0.978    0.932
STREET_ADDRESS     0.951    0.938

F_β with β=5 weights recall 25× more than precision (β² = 25). The model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying. All gated entities clear the 0.95 recall threshold; the five new HK/SG entities all reach perfect recall on the validation set thanks to deterministic checksum generation in the synthetic data.
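
For reference, the standard F_β formula reproduces the overall score from the reported precision and recall. The sketch below uses the rounded values from the table, so it is only an approximate cross-check; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta: recall is weighted beta**2 times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall as reported in the table above.
print(round(f_beta(0.949, 0.988), 3))  # 0.986
```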

Label schema

47 BIO labels covering 23 entity types:

  • Personal: PERSON_NAME, EMAIL, USERNAME, ID_NUM, PHONE_NUM, URL_PERSONAL, STREET_ADDRESS
  • Singapore: ADDRESS_SG, SG_PHONE, BANK_ACCOUNT_SG, GST_REG_NUM (new)
  • Hong Kong: HK_ID (new), ADDRESS_HK (new), HK_PHONE (new), BANK_ACCOUNT_HK (new)
  • China: NAME_ZH_BILINGUAL, ADDRESS_CN, CN_PHONE, BANK_ACCOUNT_CN
  • Corporate: COMPANY_NAME, TAX_ID, EMPLOYEE_ID, LICENSE_NUM
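
The label count follows from the BIO scheme: each of the 23 entity types gets a `B-` and an `I-` tag, plus a single `O` tag, giving 2 × 23 + 1 = 47. A small sketch that rebuilds the label list from the groups above; the ordering and the `label2id` mapping here are assumptions, and the authoritative mapping is whatever the model's config ships with.

```python
ENTITY_TYPES = {
    "personal": ["PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM",
                 "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS"],
    "singapore": ["ADDRESS_SG", "SG_PHONE", "BANK_ACCOUNT_SG", "GST_REG_NUM"],
    "hong_kong": ["HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK"],
    "china": ["NAME_ZH_BILINGUAL", "ADDRESS_CN", "CN_PHONE", "BANK_ACCOUNT_CN"],
    "corporate": ["COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM"],
}

ALL_TYPES = [entity for group in ENTITY_TYPES.values() for entity in group]

# One "O" tag plus a B-/I- pair per entity type.
LABELS = ["O"] + [f"{prefix}-{entity}" for entity in ALL_TYPES for prefix in ("B", "I")]

# Illustrative mapping only; the real ids may be ordered differently.
label2id = {label: idx for idx, label in enumerate(LABELS)}

print(len(ALL_TYPES), len(LABELS))  # 23 47
```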

Training data

Source                                                         Documents
Kaggle Learning Agency Lab PII                                 6,126
Hard negatives (Presidio false positives, 3× oversampled)      639
NeMo SG/CN financial PII synthetic                             640
Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS)    600
Nemotron-PII corporate (filtered to 17 business domains)       16,944
DataDesigner HK business (HKID + BR/CR + addresses + banks)    950
DataDesigner SG expanded (+ GST registration)                  950
DataDesigner CN expanded (CJK-aware tokenization)              949
Total training documents                                       27,798

Validation: 681 Kaggle docs + 891 Nemotron + 148 DataDesigner = 1,720 documents.

The DataDesigner corpora are generated with NVIDIA NeMo DataDesigner: identifiers (HKID, GST reg num, USCC, phone numbers, bank accounts) are pre-computed in Python so every synthetic example passes the pipeline's own checksum validators. The LLM only generates surrounding prose, never identifiers themselves. An LLM-as-judge column filters out incoherent rows (quality < 4/5 dropped).
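
One practical consequence of pre-computing identifiers is that span labels can be derived exactly, since the position of each inserted value is known. The sketch below illustrates the idea with a plain placeholder template standing in for the LLM-written prose. It does not use the DataDesigner API, and `fill_template` and the `{LABEL}` placeholder syntax are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template: str, values: dict[str, str]) -> tuple[str, list[tuple[int, int, str]]]:
    """Substitute {LABEL} placeholders; return the text and (start, end, label) spans."""
    parts: list[str] = []
    spans: list[tuple[int, int, str]] = []
    cursor = 0   # position in the template
    out_len = 0  # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        label = match.group(1)
        value = values[label]  # pre-computed, checksum-valid identifier
        spans.append((out_len, out_len + len(value), label))
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


text, spans = fill_template(
    "Director holds HKID {HK_ID} and can be reached at {HK_PHONE}.",
    {"HK_ID": "A123456(3)", "HK_PHONE": "+852 9123 4567"},
)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```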

Quick start

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ohhsj/rail-v2",
    aggregation_strategy="simple",
)

text = (
    "Director Jason Wong (HKID A123456(3)) of Dragon Pearl Holdings Limited "
    "can be reached at +852 9123 4567 or jason@dragonpearl.com.hk."
)
for span in ner(text):
    print(f"{span['entity_group']:<15} {span['word']!r:<30} score={span['score']:.3f}")

For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, China USCCs, and Hong Kong HKID/BR/CR numbers.
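
How the parent repo combines the two sources is not described here. As a rough illustration only, one plausible policy is to keep every checksum-validated pattern hit and add NER spans that do not overlap anything already kept. `Span`, `overlaps`, and `merge_spans` are hypothetical names, not part of Presidio or GuardRailAI.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str
    source: str  # "pattern" or "ner"


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open character ranges intersect."""
    return a.start < b.end and b.start < a.end


def merge_spans(pattern_spans: list[Span], ner_spans: list[Span]) -> list[Span]:
    """Keep all pattern hits; add NER spans that do not overlap a kept span."""
    merged = list(pattern_spans)
    for span in ner_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged, key=lambda s: s.start)


pattern_hits = [Span(26, 36, "HK_ID", "pattern")]
ner_hits = [Span(9, 19, "PERSON_NAME", "ner"), Span(27, 35, "ID_NUM", "ner")]
for span in merge_spans(pattern_hits, ner_hits):
    print(span)
```

In this example the NER `ID_NUM` span is dropped because it overlaps the validated `HK_ID` hit, while the non-overlapping `PERSON_NAME` span is kept.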

Intended use

  • Batch redaction of logs and exported datasets for GDPR/CCPA/PDPO compliance
  • Real-time PII screening of chat / email / form data
  • Pre-training corpus filtering for LLM teams
  • Hong Kong corporate document workflows (BR registration, payroll, bank statements)
  • Singapore GST-registered entity onboarding
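
For the redaction use cases, the aggregated spans returned by the `transformers` pipeline carry `start`, `end`, and `entity_group` keys, which is enough to mask the text. A minimal sketch, assuming the spans do not overlap; `redact` is an illustrative helper, not part of the model or the parent repo.

```python
def redact(text: str, spans: list[dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[: span["start"]] + f"[{span['entity_group']}]" + out[span["end"]:]
    return out


example = "Call +852 9123 4567 or mail jason@dragonpearl.com.hk."
detected = [
    {"entity_group": "HK_PHONE", "start": 5, "end": 19},
    {"entity_group": "EMAIL", "start": 28, "end": 52},
]
print(redact(example, detected))
# Call [HK_PHONE] or mail [EMAIL].
```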

Limitations

  • English and Chinese only. Other languages will pass through largely unredacted.
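  • Optimised for recall over precision at β=5, so expect some false positives: overall precision on the validation set is 0.949, and per-entity precision is as low as 0.915 (PHONE_NUM).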
  • The 5 new HK/SG entities are still in "ramp" territory (gated at 0.85 recall rather than the default 0.95) because they were introduced in this release. Real-world recall on out-of-distribution HK documents may be lower than the validation numbers suggest. Treat the DataDesigner-only entities as defence-in-depth alongside the deterministic Presidio pattern recognisers.
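
The recall gates described above can be expressed as a simple per-entity check. This is a sketch of the idea only; the constant and function names are hypothetical, and the actual release gating code is not shown in this card.

```python
DEFAULT_GATE = 0.95
RAMP_GATE = 0.85
RAMP_ENTITIES = {"HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK", "GST_REG_NUM"}


def failed_gates(recall_by_entity: dict[str, float]) -> dict[str, float]:
    """Return the entities whose recall falls below their gate."""
    failures = {}
    for entity, recall in recall_by_entity.items():
        gate = RAMP_GATE if entity in RAMP_ENTITIES else DEFAULT_GATE
        if recall < gate:
            failures[entity] = recall
    return failures


# A few validation recalls from the Performance table.
reported = {"HK_ID": 1.000, "PHONE_NUM": 0.980, "STREET_ADDRESS": 0.951}
print(failed_gates(reported))  # {}
```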

Architecture

  • Base model: distilbert/distilbert-base-uncased (66M params)
  • Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, cu128)
  • Hyperparameters: see training_args.bin in this repo
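
The card states only that the loss is weighted cross-entropy with a maximum class weight of 50; the weighting formula itself is not given. The sketch below assumes inverse-frequency weights relative to the most common label, capped at 50, purely as an illustration. `capped_class_weights` and the example counts are hypothetical.

```python
def capped_class_weights(label_counts: dict[str, int], max_weight: float = 50.0) -> dict[str, float]:
    """Inverse-frequency weights relative to the most common label, capped at max_weight."""
    most_common = max(label_counts.values())
    weights = {}
    for label, count in label_counts.items():
        # Labels never seen in training get the cap rather than a division by zero.
        raw = most_common / count if count > 0 else max_weight
        weights[label] = min(raw, max_weight)
    return weights


# Made-up token counts, for illustration only.
counts = {"O": 1000, "B-EMAIL": 100, "B-HK_ID": 10}
print(capped_class_weights(counts))
# {'O': 1.0, 'B-EMAIL': 10.0, 'B-HK_ID': 50.0}
```

Weights of this shape would typically be passed, in label-id order, as the `weight` tensor of a cross-entropy loss.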

Citation

If you use this model, please cite the parent project:

GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii