rail-v2 — PII Detection NER

rail-v2 is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of GuardRailAI, where it runs alongside nine checksum-validating Presidio recognisers (credit-card Luhn, US SSN, Singapore NRIC, China Resident ID, Hong Kong HKID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).

What's new vs rail-v1

Five new entity types covering Hong Kong business documents and Singapore GST registration:

HK_ID: Hong Kong Identity Card (with a weighted mod-11 check-digit validator in the pipeline)
ADDRESS_HK: Hong Kong street addresses
HK_PHONE: Hong Kong phone numbers (+852 prefix)
BANK_ACCOUNT_HK: Hong Kong bank account numbers (HSBC, BOC HK, SCB, Hang Seng)
GST_REG_NUM: Singapore GST registration numbers (M2-NNNNNNNN-N)
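
As a rough illustration of what an HKID check-digit validator does, here is a minimal sketch of the commonly documented scheme: letters map to 10 to 35, a missing first letter counts as 36, the eight body positions are weighted 9 down to 2, and the check character makes the total divisible by 11. The names `hkid_check_char` and `is_valid_hkid` are hypothetical, and the pipeline's own validator may differ in detail.

```python
import re

# One or two prefix letters, six digits, then a check character (0-9 or A),
# optionally wrapped in parentheses as printed on the card.
HKID_RE = re.compile(r"^([A-Z]{1,2})(\d{6})\(?([0-9A])\)?$")


def hkid_check_char(prefix: str, digits: str) -> str:
    """Compute the check character for an HKID body such as ('A', '123456')."""
    # A single-letter prefix is padded with a space, conventionally valued 36.
    values = [36] if len(prefix) == 1 else []
    values += [ord(c) - ord("A") + 10 for c in prefix]  # A=10 ... Z=35
    values += [int(d) for d in digits]
    # Weights 9, 8, ..., 2 across the eight body positions.
    total = sum(w * v for w, v in zip(range(9, 1, -1), values))
    check = (11 - total % 11) % 11
    return "A" if check == 10 else str(check)


def is_valid_hkid(text: str) -> bool:
    """Return True if the text looks like an HKID and its check character matches."""
    match = HKID_RE.match(text.strip().upper().replace(" ", ""))
    if match is None:
        return False
    prefix, digits, check = match.groups()
    return hkid_check_char(prefix, digits) == check


print(is_valid_hkid("A123456(3)"))  # the sample HKID used in the quick start
```

A validator like this only confirms that a candidate string is internally consistent; it says nothing about whether the number has been issued.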

Performance

Evaluated on a held-out validation set (Kaggle PII val + Nemotron corporate val + 148 DataDesigner HK/SG/CN business docs):

Metric	Value
Overall F_β5 (β=5)	0.986
Overall recall	0.988
Overall precision	0.949

Per-entity recall and precision:

Entity	Recall	Precision
HK_ID	1.000	0.980
ADDRESS_HK	1.000	0.940
HK_PHONE	1.000	1.000
BANK_ACCOUNT_HK	1.000	1.000
GST_REG_NUM	1.000	1.000
URL_PERSONAL	1.000	0.980
ID_NUM	1.000	0.933
EMAIL	0.998	0.967
COMPANY_NAME	0.994	0.953
PHONE_NUM	0.980	0.915
USERNAME	0.976	0.976
PERSON_NAME	0.978	0.932
STREET_ADDRESS	0.951	0.938

F_β with β=5 weights recall β² = 25 times as heavily as precision: the model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying. Every entity in the table clears the default 0.95 recall gate; the five new HK/SG entities all reach perfect recall on the validation set thanks to deterministic checksum generation in the synthetic data.
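
To make the weighting concrete, the standard definition is F_β = (1 + β²)·P·R / (β²·P + R). The sketch below applies it to the overall precision and recall reported above; `f_beta` is an illustrative helper, not part of the repo.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta score; beta > 1 favours recall over precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Overall precision and recall from the table above.
score = f_beta(precision=0.949, recall=0.988, beta=5.0)
print(f"{score:.3f}")  # 0.986
```

With β=5 the score sits very close to recall: a five-point drop in precision moves it far less than a one-point drop in recall would.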

Label schema

47 BIO labels covering 23 entity types:

Personal: PERSON_NAME, EMAIL, USERNAME, ID_NUM, PHONE_NUM, URL_PERSONAL, STREET_ADDRESS
Singapore: ADDRESS_SG, SG_PHONE, BANK_ACCOUNT_SG, GST_REG_NUM (new)
Hong Kong: HK_ID (new), ADDRESS_HK (new), HK_PHONE (new), BANK_ACCOUNT_HK (new)
China: NAME_ZH_BILINGUAL, ADDRESS_CN, CN_PHONE, BANK_ACCOUNT_CN
Corporate: COMPANY_NAME, TAX_ID, EMPLOYEE_ID, LICENSE_NUM
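
The label count follows from the BIO scheme: one `O` label plus a `B-` and an `I-` label per entity type, so 1 + 2 × 23 = 47. The sketch below rebuilds such a list from the groups above. The ordering is an assumption; the authoritative mapping is the `id2label` entry in the model's config.

```python
ENTITY_GROUPS = {
    "Personal": ["PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM", "PHONE_NUM",
                 "URL_PERSONAL", "STREET_ADDRESS"],
    "Singapore": ["ADDRESS_SG", "SG_PHONE", "BANK_ACCOUNT_SG", "GST_REG_NUM"],
    "Hong Kong": ["HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK"],
    "China": ["NAME_ZH_BILINGUAL", "ADDRESS_CN", "CN_PHONE", "BANK_ACCOUNT_CN"],
    "Corporate": ["COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM"],
}


def build_bio_labels(groups: dict) -> list:
    """Return 'O' followed by B-/I- labels for every entity type."""
    labels = ["O"]
    for names in groups.values():
        for name in names:
            labels += [f"B-{name}", f"I-{name}"]
    return labels


LABELS = build_bio_labels(ENTITY_GROUPS)
# Illustrative ordering only; do not use in place of the config's id2label.
label2id = {label: idx for idx, label in enumerate(LABELS)}
print(len(LABELS))  # 47
```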

Training data

Source	Documents
Kaggle Learning Agency Lab PII	6,126
Hard negatives (Presidio false positives, 3× oversampled)	639
NeMo SG/CN financial PII synthetic	640
Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS)	600
Nemotron-PII corporate (filtered to 17 business domains)	16,944
DataDesigner HK business (HKID + BR/CR + addresses + banks)	950
DataDesigner SG expanded (+ GST registration)	950
DataDesigner CN expanded (CJK-aware tokenization)	949
Total training documents	27,798

Validation: 681 Kaggle docs + 891 Nemotron + 148 DataDesigner = 1,720 documents.

The DataDesigner corpora are generated with NVIDIA NeMo DataDesigner: identifiers (HKID, GST reg num, USCC, phone numbers, bank accounts) are pre-computed in Python so every synthetic example passes the pipeline's own checksum validators. The LLM only generates surrounding prose, never identifiers themselves. An LLM-as-judge column filters out incoherent rows (quality < 4/5 dropped).
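
The division of labour described above can be sketched roughly as follows: Python pre-computes the identifier, the prose carries only a placeholder, and filling the placeholder yields exact character spans for labelling. Everything here is hypothetical (the `{HK_PHONE}` placeholder syntax, the helper names, the assumed phone format, the `quality` field); it does not use the DataDesigner API.

```python
import random
import re

PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def make_hk_phone(rng: random.Random) -> str:
    """Pre-compute a +852 number (assumed format: 8 digits, leading 5, 6 or 9)."""
    first = rng.choice("569")
    rest = "".join(rng.choice("0123456789") for _ in range(7))
    return f"+852 {first}{rest[:3]} {rest[3:]}"


def fill_template(template: str, values: dict):
    """Substitute {LABEL} placeholders and record each inserted character span."""
    parts, spans, cursor, out_len = [], [], 0, 0
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        label = match.group(1)
        value = values[label]
        spans.append({"label": label, "start": out_len, "end": out_len + len(value)})
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


rng = random.Random(0)
phone = make_hk_phone(rng)
# In the real pipeline the surrounding prose would come from the LLM.
template = "Please call our Kowloon office on {HK_PHONE} before Friday."
text, spans = fill_template(template, {"HK_PHONE": phone})

# Judge filter: keep rows scoring at least 4 out of 5 (scores here are made up).
rows = [{"text": text, "spans": spans, "quality": 5},
        {"text": "incoherent row", "spans": [], "quality": 3}]
kept = [row for row in rows if row["quality"] >= 4]
```

Because the identifier never passes through the LLM, its span boundaries are known exactly and its checksum cannot be corrupted by generation.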

Quick start

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ohhsj/rail-v2",
    aggregation_strategy="simple",
)

text = (
    "Director Jason Wong (HKID A123456(3)) of Dragon Pearl Holdings Limited "
    "can be reached at +852 9123 4567 or jason@dragonpearl.com.hk."
)
for span in ner(text):
    print(f"{span['entity_group']:<15} {span['word']!r:<30} score={span['score']:.3f}")

For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, China USCCs, and Hong Kong HKID/BR/CR numbers.
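
One way to combine the two sources is to take the union of their spans and let a checksum-validated pattern match win whenever it overlaps an NER span. The sketch below shows that policy under stated assumptions: the precedence rule is illustrative rather than taken from the parent repo, `merge_spans` is a hypothetical helper, and the scores are made up. The dict keys mirror what the aggregated `transformers` pipeline returns (`entity_group`, `start`, `end`, `score`).

```python
def merge_spans(ner_spans: list, pattern_spans: list) -> list:
    """Union of both span lists; an overlapping pattern span replaces the NER span."""
    merged = list(pattern_spans)
    for span in ner_spans:
        overlaps = any(
            span["start"] < p["end"] and p["start"] < span["end"]
            for p in pattern_spans
        )
        if not overlaps:
            merged.append(span)
    return sorted(merged, key=lambda s: s["start"])


text = "Director Jason Wong (HKID A123456(3))"
name_start = text.index("Jason Wong")
id_start = text.index("A123456(3)")

# Illustrative inputs, not real model or recogniser output.
ner_spans = [
    {"entity_group": "PERSON_NAME", "start": name_start, "end": name_start + 10, "score": 0.99},
    {"entity_group": "ID_NUM", "start": id_start, "end": id_start + 7, "score": 0.80},
]
pattern_spans = [
    {"entity_group": "HK_ID", "start": id_start, "end": id_start + 10, "score": 1.0},
]

merged = merge_spans(ner_spans, pattern_spans)
for span in merged:
    print(span["entity_group"], repr(text[span["start"]:span["end"]]))
```

Here the partial `ID_NUM` span from the NER is dropped in favour of the full, validated `HK_ID` match, while the name span passes through unchanged.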

Intended use

Batch redaction of logs and exported datasets for GDPR/CCPA/PDPO compliance
Real-time PII screening of chat / email / form data
Pre-training corpus filtering for LLM teams
Hong Kong corporate document workflows (BR registration, payroll, bank statements)
Singapore GST-registered entity onboarding
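
For the batch-redaction use case, detected spans can be turned into redacted text by replacing each span with a tag, working from the end of the string so earlier offsets stay valid. This is a minimal sketch assuming non-overlapping spans in the pipeline's output shape; `redact` and the `<ENTITY>` tag format are illustrative choices.

```python
def redact(text: str, spans: list) -> str:
    """Replace each span with <ENTITY_GROUP>, processing right to left."""
    out = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:span["start"]] + f"<{span['entity_group']}>" + out[span["end"]:]
    return out


line = "Call +852 9123 4567 or mail jason@dragonpearl.com.hk today"
phone, email = "+852 9123 4567", "jason@dragonpearl.com.hk"
spans = [
    {"entity_group": "HK_PHONE", "start": line.index(phone), "end": line.index(phone) + len(phone)},
    {"entity_group": "EMAIL", "start": line.index(email), "end": line.index(email) + len(email)},
]
print(redact(line, spans))  # Call <HK_PHONE> or mail <EMAIL> today
```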

Limitations

English and Chinese only. Other languages will pass through largely unredacted.
Optimised for recall over precision at β=5; expect more false positives than a precision-balanced model would produce (overall precision is 0.949).
The five new HK/SG entities are still in "ramp" territory: because they were introduced in this release, they are gated at 0.85 recall rather than the default 0.95. Real-world recall on out-of-distribution HK documents may be lower than the validation numbers suggest. Treat the DataDesigner-only entities as defence-in-depth alongside the deterministic Presidio pattern recognisers.
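
The two-tier recall gate can be expressed as a small check. The thresholds come from the text above and the recall figures from the per-entity table; the function and constant names are hypothetical and do not reflect how the release process is actually implemented.

```python
DEFAULT_GATE = 0.95
RAMP_GATE = 0.85
RAMP_ENTITIES = {"HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK", "GST_REG_NUM"}

# Validation recall per entity, copied from the Performance table.
VAL_RECALL = {
    "HK_ID": 1.000, "ADDRESS_HK": 1.000, "HK_PHONE": 1.000,
    "BANK_ACCOUNT_HK": 1.000, "GST_REG_NUM": 1.000, "URL_PERSONAL": 1.000,
    "ID_NUM": 1.000, "EMAIL": 0.998, "COMPANY_NAME": 0.994,
    "PHONE_NUM": 0.980, "USERNAME": 0.976, "PERSON_NAME": 0.978,
    "STREET_ADDRESS": 0.951,
}


def recall_gate(entity: str) -> float:
    """Ramp entities get the relaxed gate; everything else uses the default."""
    return RAMP_GATE if entity in RAMP_ENTITIES else DEFAULT_GATE


def failing_entities(recall_by_entity: dict) -> dict:
    """Return the entities whose recall falls below their gate."""
    return {e: r for e, r in recall_by_entity.items() if r < recall_gate(e)}


print(failing_entities(VAL_RECALL))  # {} on the reported validation numbers
```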

Architecture

Base model: distilbert/distilbert-base-uncased (66M params)
Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, cu128)
Hyperparameters: see training_args.bin in this repo
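
The card gives the weight cap (50) but not how the class weights are derived. One common choice, shown here purely as an assumption, is inverse-frequency weighting clipped at the cap, so that rare `B-`/`I-` labels are up-weighted without letting a handful of tokens dominate the loss. The label counts below are invented for illustration.

```python
def capped_class_weights(label_counts: dict, max_weight: float = 50.0) -> dict:
    """Inverse-frequency class weights, clipped at max_weight.

    Assumes every count is positive. This weighting formula is an assumption;
    the source only states that the maximum class weight is 50.
    """
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {
        label: min(max_weight, total / (n_classes * count))
        for label, count in label_counts.items()
    }


# Made-up token counts, dominated by the 'O' label as is typical for NER.
counts = {"O": 9000, "B-HK_ID": 10, "I-HK_ID": 990}
weights = capped_class_weights(counts)
for label, weight in weights.items():
    print(f"{label:<10} {weight:.3f}")
```

The resulting values would typically be passed as the per-class weight tensor of a cross-entropy loss, in the same order as the model's label ids.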

Citation

If you use this model, please cite the parent project:

GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii
