rail-v2: PII Detection NER
rail-v2 is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of GuardRailAI, where it runs alongside nine checksum-validating Presidio recognisers (Luhn, US SSN, Singapore NRIC, China Resident ID, Hong Kong HKID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).
What's new vs rail-v1
Five new entity types covering Hong Kong business documents and Singapore GST registration:
- HK_ID: Hong Kong Identity Card (with ISO 7064 Mod 11-2 checksum validator in the pipeline)
- ADDRESS_HK: Hong Kong street addresses
- HK_PHONE: Hong Kong phone numbers (+852 prefix)
- BANK_ACCOUNT_HK: HK bank account numbers (HSBC, BOC HK, SCB, Hang Seng)
- GST_REG_NUM: Singapore GST registration numbers (M2-NNNNNNNN-N)
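To illustrate what a checksum validator adds on top of the NER tag, here is a minimal sketch of an HKID check. It assumes the commonly documented weighted mod-11 scheme for HKID check characters; it is not the validator shipped in the GuardRailAI pipeline, which may be implemented differently, and the names `hkid_check_digit` and `is_valid_hkid` are hypothetical.

```python
import re

# One or two prefix letters, six digits, check character (brackets optional).
_HKID_RE = re.compile(r"^([A-Z]{1,2})(\d{6})\(?([0-9A])\)?$")


def hkid_check_digit(prefix: str, digits: str) -> str:
    """Return the expected check character for an HKID body (assumed scheme)."""
    # A single-letter prefix is padded with a leading space, which counts as 36.
    chars = prefix.rjust(2) + digits  # always 8 characters
    total = 0
    for weight, ch in zip(range(9, 1, -1), chars):  # weights 9, 8, ..., 2
        if ch == " ":
            value = 36
        elif ch.isalpha():
            value = ord(ch) - ord("A") + 10  # A=10, B=11, ...
        else:
            value = int(ch)
        total += weight * value
    check = (11 - total % 11) % 11
    return "A" if check == 10 else str(check)


def is_valid_hkid(text: str) -> bool:
    """True if the text looks like an HKID and its check character matches."""
    m = _HKID_RE.match(text.strip().upper())
    if not m:
        return False
    prefix, digits, check = m.groups()
    return hkid_check_digit(prefix, digits) == check


print(is_valid_hkid("A123456(3)"))  # True
print(is_valid_hkid("A123456(4)"))  # False
```

A span tagged HK_ID by the model that fails a check like this is a candidate false positive, which is why the model is meant to run next to deterministic recognisers.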
Performance
Evaluated on a held-out validation set (Kaggle PII val + Nemotron corporate val + 148 DataDesigner HK/SG/CN business docs):
| Metric | Value |
|---|---|
| Overall F_β (β=5) | 0.986 |
| Overall recall | 0.988 |
| Overall precision | 0.949 |
Per-entity recall and precision:
| Entity | Recall | Precision |
|---|---|---|
| HK_ID | 1.000 | 0.980 |
| ADDRESS_HK | 1.000 | 0.940 |
| HK_PHONE | 1.000 | 1.000 |
| BANK_ACCOUNT_HK | 1.000 | 1.000 |
| GST_REG_NUM | 1.000 | 1.000 |
| URL_PERSONAL | 1.000 | 0.980 |
| ID_NUM | 1.000 | 0.933 |
| EMAIL | 0.998 | 0.967 |
| COMPANY_NAME | 0.994 | 0.953 |
| PHONE_NUM | 0.980 | 0.915 |
| USERNAME | 0.976 | 0.976 |
| PERSON_NAME | 0.978 | 0.932 |
| STREET_ADDRESS | 0.951 | 0.938 |
F_β (β=5) weights recall 25× more than precision: the model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying. All gated entities clear the 0.95 recall threshold; the five new HK/SG entities all reach perfect recall on the validation set thanks to deterministic checksum generation in the synthetic data.
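The overall score can be reproduced from the reported precision and recall with the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R). The sketch below is a plain-Python illustration; the helper name `f_beta` is hypothetical and not part of the repo.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """F-beta score: harmonic mean that weights recall beta**2 times precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2  # 25 when beta = 5
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall from the table above.
overall = f_beta(precision=0.949, recall=0.988, beta=5)
print(f"{overall:.3f}")  # 0.986
```

With β=5 the result stays close to recall: swapping in a much lower precision moves the score far less than the same drop in recall would.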
Label schema
47 BIO labels covering 23 entity types:
- Personal: `PERSON_NAME`, `EMAIL`, `USERNAME`, `ID_NUM`, `PHONE_NUM`, `URL_PERSONAL`, `STREET_ADDRESS`
- Singapore: `ADDRESS_SG`, `SG_PHONE`, `BANK_ACCOUNT_SG`, `GST_REG_NUM` (new)
- Hong Kong: `HK_ID` (new), `ADDRESS_HK` (new), `HK_PHONE` (new), `BANK_ACCOUNT_HK` (new)
- China: `NAME_ZH_BILINGUAL`, `ADDRESS_CN`, `CN_PHONE`, `BANK_ACCOUNT_CN`
- Corporate: `COMPANY_NAME`, `TAX_ID`, `EMPLOYEE_ID`, `LICENSE_NUM`
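The 47 figure follows from the BIO scheme: one `B-` and one `I-` label per entity type, plus a single `O` label (23 × 2 + 1). A minimal sketch of that arithmetic is below. The `B-`/`I-` prefix spelling and the label ordering are assumptions for illustration; the authoritative mapping is the `id2label` entry in the model's config.

```python
ENTITY_TYPES = [
    # Personal
    "PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM", "PHONE_NUM",
    "URL_PERSONAL", "STREET_ADDRESS",
    # Singapore
    "ADDRESS_SG", "SG_PHONE", "BANK_ACCOUNT_SG", "GST_REG_NUM",
    # Hong Kong
    "HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK",
    # China
    "NAME_ZH_BILINGUAL", "ADDRESS_CN", "CN_PHONE", "BANK_ACCOUNT_CN",
    # Corporate
    "COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM",
]


def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for ent in entity_types:
        labels += [f"B-{ent}", f"I-{ent}"]
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
label2id = {label: i for i, label in enumerate(LABELS)}  # illustrative order only

print(len(ENTITY_TYPES), len(LABELS))  # 23 47
```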
Training data
| Source | Documents |
|---|---|
| Kaggle Learning Agency Lab PII | 6,126 |
| Hard negatives (Presidio false positives, 3× oversampled) | 639 |
| NeMo SG/CN financial PII synthetic | 640 |
| Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS) | 600 |
| Nemotron-PII corporate (filtered to 17 business domains) | 16,944 |
| DataDesigner HK business (HKID + BR/CR + addresses + banks) | 950 |
| DataDesigner SG expanded (+ GST registration) | 950 |
| DataDesigner CN expanded (CJK-aware tokenization) | 949 |
| Total training documents | 27,798 |
Validation: 681 Kaggle docs + 891 Nemotron + 148 DataDesigner = 1,720 documents.
The DataDesigner corpora are generated with NVIDIA NeMo DataDesigner: identifiers (HKID, GST reg num, USCC, phone numbers, bank accounts) are pre-computed in Python so every synthetic example passes the pipeline's own checksum validators. The LLM only generates surrounding prose, never identifiers themselves. An LLM-as-judge column filters out incoherent rows (quality < 4/5 dropped).
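One way to picture the "identifiers first, prose second" approach is slot filling: the prose carries placeholders, the pre-computed identifiers are substituted in, and character spans for the labels fall out of the substitution. The sketch below is an assumption about how such a step could look, not DataDesigner's actual mechanism; the `{LABEL}` placeholder syntax and the `fill_template` helper are hypothetical.

```python
import re


def fill_template(template: str, values: dict[str, str]):
    """Replace {LABEL} placeholders and record (start, end, label) spans."""
    out, spans = [], []
    pos = 0      # read position in the template
    cursor = 0   # length of the output text emitted so far
    for m in re.finditer(r"\{([A-Z_]+)\}", template):
        literal = template[pos:m.start()]
        out.append(literal)
        cursor += len(literal)
        label = m.group(1)
        value = values[label]
        spans.append((cursor, cursor + len(value), label))
        out.append(value)
        cursor += len(value)
        pos = m.end()
    out.append(template[pos:])
    return "".join(out), spans


# Prose as an LLM might write it, with slots instead of real identifiers.
template = "Director {PERSON_NAME} (HKID {HK_ID}) can be reached at {HK_PHONE}."
# Identifiers pre-computed elsewhere so that they pass the checksum validators.
values = {
    "PERSON_NAME": "Jason Wong",
    "HK_ID": "A123456(3)",
    "HK_PHONE": "+852 9123 4567",
}

text, spans = fill_template(template, values)
print(text)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because the spans are derived from the substitution itself, the labels stay aligned with the text without any post-hoc annotation pass.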
Quick start
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="ohhsj/rail-v2",
        aggregation_strategy="simple",
    )

    text = (
        "Director Jason Wong (HKID A123456(3)) of Dragon Pearl Holdings Limited "
        "can be reached at +852 9123 4567 or jason@dragonpearl.com.hk."
    )

    for span in ner(text):
        print(f"{span['entity_group']:<15} {span['word']!r:<30} score={span['score']:.3f}")
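With an aggregation strategy set, each span the pipeline returns also carries `start` and `end` character offsets, which is enough to redact the input. The following is a minimal sketch; `redact` and its `min_score` parameter are hypothetical helpers, the example spans are hand-written rather than real model output, and overlapping spans are not handled.

```python
def redact(text: str, spans: list[dict], min_score: float = 0.0) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    kept = [s for s in spans if s["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for s in sorted(kept, key=lambda s: s["start"], reverse=True):
        text = text[: s["start"]] + f"[{s['entity_group']}]" + text[s["end"]:]
    return text


# Hand-written spans in the shape the pipeline returns (illustration only).
example_text = "Call Jason Wong at +852 9123 4567."
example_spans = [
    {"entity_group": "PERSON_NAME", "score": 0.99, "start": 5, "end": 15},
    {"entity_group": "HK_PHONE", "score": 0.98, "start": 19, "end": 33},
]

print(redact(example_text, example_spans))
# Call [PERSON_NAME] at [HK_PHONE].
```

Given the recall-first tuning, a low `min_score` is the safer default; raising it trades missed PII for fewer over-redactions.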
For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, China USCCs, and Hong Kong HKID/BR/CR numbers.
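As a rough picture of what combining the two layers means, the sketch below pairs a Luhn-validated card pattern with NER spans and lets the checksum-validated hit win on overlap. It is plain Python for illustration, not the Presidio API or the parent repo's code; the regex, the `CREDIT_CARD` label, and the helpers `luhn_valid`, `find_cards`, and `merge_spans` are all assumptions.

```python
import re


def luhn_valid(number: str) -> bool:
    """Luhn check over the digits in the string (separators are ignored)."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Double every second digit, counting from the right.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# 13 to 19 digits, optionally separated by single spaces or hyphens.
_CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def find_cards(text: str) -> list[dict]:
    """Return pattern hits that also pass the Luhn check."""
    return [
        {"entity_group": "CREDIT_CARD", "start": m.start(), "end": m.end(), "score": 1.0}
        for m in _CARD_RE.finditer(text)
        if luhn_valid(m.group())
    ]


def merge_spans(ner_spans: list[dict], pattern_spans: list[dict]) -> list[dict]:
    """Keep all checksum-validated hits, plus NER spans that do not overlap them."""
    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    kept = [s for s in ner_spans if not any(overlaps(s, p) for p in pattern_spans)]
    return sorted(kept + pattern_spans, key=lambda s: s["start"])


sample = "Card 4111 1111 1111 1111 on file, ref 1234 5678 9012 3456."
for hit in find_cards(sample):
    print(sample[hit["start"]:hit["end"]])  # 4111 1111 1111 1111
```

The second number in the sample matches the pattern but fails the Luhn check, so only the model's own prediction (if any) would cover it.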
Intended use
- Batch redaction of logs and exported datasets for GDPR/CCPA/PDPO compliance
- Real-time PII screening of chat / email / form data
- Pre-training corpus filtering for LLM teams
- Hong Kong corporate document workflows (BR registration, payroll, bank statements)
- Singapore GST-registered entity onboarding
Limitations
- English and Chinese only. Other languages will pass through largely unredacted.
- Optimised for recall over precision at β=5; expect some false positives (overall precision is 0.949, so roughly 1 in 20 predicted spans is spurious).
- The 5 new HK/SG entities are still in "ramp" territory (gated at 0.85 recall rather than the default 0.95) because they were introduced in this release. Real-world recall on out-of-distribution HK documents may be lower than the validation numbers suggest. Treat the DataDesigner-only entities as defence-in-depth alongside the deterministic Presidio pattern recognisers.
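The recall gates mentioned above amount to a per-entity threshold check. A minimal sketch, assuming a release check that compares validation recall against 0.95 by default and 0.85 for the ramp entities; the function `failed_gates` and the constant names are hypothetical, and the recall values are a few rows copied from the per-entity table.

```python
DEFAULT_GATE = 0.95
RAMP_GATE = 0.85
RAMP_ENTITIES = {"HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK", "GST_REG_NUM"}


def failed_gates(recall_by_entity: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {entity: (recall, gate)} for every entity below its gate."""
    failures = {}
    for entity, recall in recall_by_entity.items():
        gate = RAMP_GATE if entity in RAMP_ENTITIES else DEFAULT_GATE
        if recall < gate:
            failures[entity] = (recall, gate)
    return failures


# A few rows from the per-entity table.
validation_recall = {
    "HK_ID": 1.000,
    "ADDRESS_HK": 1.000,
    "PHONE_NUM": 0.980,
    "PERSON_NAME": 0.978,
    "STREET_ADDRESS": 0.951,
}

print(failed_gates(validation_recall))  # {}
```

Passing a looser gate on synthetic validation data says little about out-of-distribution documents, which is the caveat the limitation above is making.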
Architecture
- Base model: `distilbert/distilbert-base-uncased` (66M params)
- Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
- Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, cu128)
- Hyperparameters: see `training_args.bin` in this repo
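The card does not say how the class weights were derived, only that they are capped at 50. As one plausible reading, the sketch below uses inverse-frequency ("balanced") weights clipped to that cap and applies them to a per-token cross-entropy. Both the weighting heuristic and the helper names `capped_class_weights` and `weighted_cross_entropy` are assumptions, not the training code.

```python
import math
from collections import Counter


def capped_class_weights(label_ids, num_classes, max_weight=50.0):
    """Inverse-frequency weights, total / (num_classes * count), capped at max_weight."""
    counts = Counter(label_ids)
    total = len(label_ids)
    weights = []
    for c in range(num_classes):
        n = counts.get(c, 0)
        # Unseen classes get the cap instead of dividing by zero.
        w = total / (num_classes * n) if n else max_weight
        weights.append(min(w, max_weight))
    return weights


def weighted_cross_entropy(logits, target, weights):
    """Weighted negative log-likelihood for a single token."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_z)


# Toy label distribution: class 0 plays the role of the dominant "O" tag.
toy_labels = [0] * 9980 + [1] * 15 + [2] * 5
weights = capped_class_weights(toy_labels, num_classes=3)
print([round(w, 3) for w in weights])  # [0.334, 50.0, 50.0]
```

Without the cap, the two rare classes in this toy example would receive weights in the hundreds, which is the kind of imbalance a maximum class weight guards against.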
Citation
If you use this model, please cite the parent project:
GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii