---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: token-classification
tags:
- pii-detection
- ner
- presidio
- gdpr
- ccpa
- distilbert
- hong-kong
- singapore
- china
base_model: distilbert/distilbert-base-uncased
---

# rail-v2: PII Detection NER

`rail-v2` is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of [GuardRailAI](https://github.com/ohhsj/guardrail-pii), where it runs alongside nine checksum-validating Presidio recognisers (Luhn, US SSN, Singapore NRIC, China Resident ID, Hong Kong HKID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).

## What's new vs rail-v1

Five new entity types covering Hong Kong business documents and Singapore GST registration:

- **HK_ID**: Hong Kong Identity Card (with a weighted mod-11 check-digit validator in the pipeline)
- **ADDRESS_HK**: Hong Kong street addresses
- **HK_PHONE**: Hong Kong phone numbers (+852 prefix)
- **BANK_ACCOUNT_HK**: HK bank account numbers (HSBC, BOC HK, SCB, Hang Seng)
- **GST_REG_NUM**: Singapore GST registration numbers (format `M2-NNNNNNNN-N`)
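
The HKID check digit can be illustrated with a short sketch. It follows the publicly documented weighted mod-11 scheme for HKID numbers. The function name `hkid_is_valid` and the regular expression are hypothetical, and the validator in the parent repo may be implemented differently.

```python
import re

# Hypothetical helper, not taken from the parent repo.
# Accepts forms such as "A123456(3)" or "AB9876543" (brackets optional).
_HKID_RE = re.compile(r"^([A-Z]{1,2})(\d{6})\(?([0-9A])\)?$")


def hkid_is_valid(hkid: str) -> bool:
    """Return True if the string has an HKID shape and a matching check digit."""
    match = _HKID_RE.match(hkid.strip().upper())
    if match is None:
        return False
    prefix, digits, check = match.groups()
    # A one-letter prefix is left-padded with a space, which carries the value 36.
    prefix = prefix.rjust(2)
    values = [36 if ch == " " else ord(ch) - ord("A") + 10 for ch in prefix]
    values += [int(d) for d in digits]
    # Weights run 9, 8, ..., 2 across the two prefix slots and the six digits.
    total = sum(v * w for v, w in zip(values, range(9, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "A" if expected == 10 else str(expected)
    return check == expected_char


print(hkid_is_valid("A123456(3)"))  # sample ID from the quick start text
```

A validator of this kind only confirms that the check digit is consistent; it says nothing about whether the ID was ever issued.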

## Performance

Evaluated on a held-out validation set (Kaggle PII val + Nemotron corporate val + 148 DataDesigner HK/SG/CN business docs):

| Metric | Value |
|---|---|
| Overall F_β (β=5) | **0.986** |
| Overall recall | 0.988 |
| Overall precision | 0.949 |

Per-entity recall and precision:

| Entity | Recall | Precision |
|---|---|---|
| HK_ID | 1.000 | 0.980 |
| ADDRESS_HK | 1.000 | 0.940 |
| HK_PHONE | 1.000 | 1.000 |
| BANK_ACCOUNT_HK | 1.000 | 1.000 |
| GST_REG_NUM | 1.000 | 1.000 |
| URL_PERSONAL | 1.000 | 0.980 |
| ID_NUM | 1.000 | 0.933 |
| EMAIL | 0.998 | 0.967 |
| COMPANY_NAME | 0.994 | 0.953 |
| PHONE_NUM | 0.980 | 0.915 |
| USERNAME | 0.976 | 0.976 |
| PERSON_NAME | 0.978 | 0.932 |
| STREET_ADDRESS | 0.951 | 0.938 |

F_β with β=5 weights recall β² = 25 times as heavily as precision in the weighted harmonic mean. The model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying. All gated entities clear the 0.95 recall threshold. The five new HK/SG entities all reach perfect recall on the validation set, which reflects the deterministic checksum generation used for their synthetic data.
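
As a quick check on the headline number, the standard F_β formula can be applied to the overall precision and recall reported above. This is a minimal sketch; the helper name `f_beta` is illustrative and not part of the project.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Weighted harmonic mean of precision and recall (recall weighted by beta**2)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall from the table above.
overall = f_beta(precision=0.949, recall=0.988)
print(f"{overall:.3f}")  # 0.986
```

With β=5 the score sits very close to recall, so a drop in precision moves it far less than an equal drop in recall.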

## Label schema

47 BIO labels covering 23 entity types:

- **Personal**: `PERSON_NAME`, `EMAIL`, `USERNAME`, `ID_NUM`, `PHONE_NUM`, `URL_PERSONAL`, `STREET_ADDRESS`
- **Singapore**: `ADDRESS_SG`, `SG_PHONE`, `BANK_ACCOUNT_SG`, `GST_REG_NUM` *(new)*
- **Hong Kong**: `HK_ID` *(new)*, `ADDRESS_HK` *(new)*, `HK_PHONE` *(new)*, `BANK_ACCOUNT_HK` *(new)*
- **China**: `NAME_ZH_BILINGUAL`, `ADDRESS_CN`, `CN_PHONE`, `BANK_ACCOUNT_CN`
- **Corporate**: `COMPANY_NAME`, `TAX_ID`, `EMPLOYEE_ID`, `LICENSE_NUM`
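
The count of 47 follows from the BIO scheme: one `B-` and one `I-` label per entity type, plus a single `O` label. The sketch below rebuilds such a label list from the 23 types listed above. The ordering is an assumption made for illustration; the authoritative mapping is the `id2label` entry in the model's config.

```python
ENTITY_TYPES = [
    # Personal
    "PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM", "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS",
    # Singapore
    "ADDRESS_SG", "SG_PHONE", "BANK_ACCOUNT_SG", "GST_REG_NUM",
    # Hong Kong
    "HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK",
    # China
    "NAME_ZH_BILINGUAL", "ADDRESS_CN", "CN_PHONE", "BANK_ACCOUNT_CN",
    # Corporate
    "COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
# Illustrative id assignment only; the real ids live in the model config.
label2id = {label: idx for idx, label in enumerate(labels)}
print(len(ENTITY_TYPES), len(labels))  # 23 47
```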

## Training data

| Source | Documents |
|---|---|
| Kaggle Learning Agency Lab PII | 6,126 |
| Hard negatives (Presidio false positives, 3× oversampled) | 639 |
| NeMo SG/CN financial PII synthetic | 640 |
| Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS) | 600 |
| Nemotron-PII corporate (filtered to 17 business domains) | 16,944 |
| DataDesigner HK business (HKID + BR/CR + addresses + banks) | 950 |
| DataDesigner SG expanded (+ GST registration) | 950 |
| DataDesigner CN expanded (CJK-aware tokenization) | 949 |
| **Total training documents** | **27,798** |

Validation: 681 Kaggle docs + 891 Nemotron + 148 DataDesigner = 1,720 documents.

The DataDesigner corpora are generated with [NVIDIA NeMo DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner). Identifiers (HKID, GST registration number, USCC, phone numbers, bank accounts) are pre-computed in Python so that every synthetic example passes the pipeline's own checksum validators. The LLM generates only the surrounding prose, never the identifiers themselves. An LLM-as-judge column filters out incoherent rows: any row scoring below 4 out of 5 is dropped.
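
One way to picture this division of labour: the identifier is computed in Python, spliced into LLM-written prose at a placeholder, and the character offsets of the splice become the gold span. The sketch below is an illustration only. The function `fill_template`, the `{HK_ID}` placeholder syntax, and the sample sentence are hypothetical and are not taken from the actual DataDesigner configuration.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type) in character offsets

# Hypothetical placeholder syntax: an entity type in braces, e.g. {HK_ID}.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template: str, identifiers: Dict[str, str]) -> Tuple[str, List[Span]]:
    """Substitute pre-computed identifiers and record where each one landed."""
    parts: List[str] = []
    spans: List[Span] = []
    cursor = 0    # read position in the template
    out_len = 0   # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        entity = match.group(1)
        value = identifiers[entity]
        spans.append((out_len, out_len + len(value), entity))
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


identifiers = {"HK_ID": "A123456(3)", "HK_PHONE": "+852 9123 4567"}
text, spans = fill_template(
    "Director Jason Wong (HKID {HK_ID}) can be reached at {HK_PHONE}.",
    identifiers,
)
for start, end, entity in spans:
    print(entity, repr(text[start:end]))
```

Because the spans are derived from the substitution itself, the labels cannot drift from the text the way they might if an LLM were asked to both write and annotate an identifier.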

## Quick start

```python
from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner = pipeline(
    "token-classification",
    model="ohhsj/rail-v2",
    aggregation_strategy="simple",
)

text = (
    "Director Jason Wong (HKID A123456(3)) of Dragon Pearl Holdings Limited "
    "can be reached at +852 9123 4567 or jason@dragonpearl.com.hk."
)

# Print one line per detected span: entity type, matched text, confidence.
for span in ner(text):
    print(f"{span['entity_group']:<15} {span['word']!r:<30} score={span['score']:.3f}")
```

For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, China USCCs, and Hong Kong HKID/BR/CR numbers.
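
Once spans are available, whether from this model alone or merged with pattern-recogniser hits, redaction reduces to replacing character ranges. The sketch below assumes span dictionaries with `start`, `end`, and `entity_group` keys, as returned by the Transformers pipeline when an aggregation strategy is set. The `redact` helper and the hand-written spans are illustrative; no model is loaded, and overlapping spans are not handled.

```python
from typing import Dict, List


def redact(text: str, spans: List[Dict]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<{span['entity_group']}>"
        text = text[: span["start"]] + placeholder + text[span["end"]:]
    return text


# Hand-written spans standing in for pipeline output.
sample = "Call Jason Wong on +852 9123 4567."
fake_spans = [
    {"entity_group": "PERSON_NAME", "start": 5, "end": 15},
    {"entity_group": "HK_PHONE", "start": 19, "end": 33},
]
print(redact(sample, fake_spans))  # Call <PERSON_NAME> on <HK_PHONE>.
```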

## Intended use

- Batch redaction of logs and exported datasets for GDPR/CCPA/PDPO compliance
- Real-time PII screening of chat / email / form data
- Pre-training corpus filtering for LLM teams
- Hong Kong corporate document workflows (BR registration, payroll, bank statements)
- Singapore GST-registered entity onboarding

## Limitations

- English and Chinese only. Other languages will pass through largely unredacted.
- Optimised for **recall over precision** at β=5, so expect some over-redaction: overall validation precision is 0.949, meaning roughly one predicted span in twenty is a false positive.
- The five new HK/SG entities are still in a "ramp" phase. Because they were introduced in this release, they are gated at 0.85 recall rather than the default 0.95. Real-world recall on out-of-distribution HK documents may be lower than the validation numbers suggest. Treat the DataDesigner-only entities as defence-in-depth alongside the deterministic Presidio pattern recognisers.
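
The two gating thresholds can be expressed as a simple release check. The recall values below are copied from the per-entity table in this card. The function `failed_gates` and the constant names are hypothetical, and the project's actual gating code is not shown in this card.

```python
DEFAULT_GATE = 0.95  # default recall gate stated in this card
RAMP_GATE = 0.85     # relaxed gate for entities introduced in this release
RAMP_ENTITIES = {"HK_ID", "ADDRESS_HK", "HK_PHONE", "BANK_ACCOUNT_HK", "GST_REG_NUM"}

# Validation recall per entity, from the Performance section.
VALIDATION_RECALL = {
    "HK_ID": 1.000, "ADDRESS_HK": 1.000, "HK_PHONE": 1.000,
    "BANK_ACCOUNT_HK": 1.000, "GST_REG_NUM": 1.000, "URL_PERSONAL": 1.000,
    "ID_NUM": 1.000, "EMAIL": 0.998, "COMPANY_NAME": 0.994,
    "PHONE_NUM": 0.980, "USERNAME": 0.976, "PERSON_NAME": 0.978,
    "STREET_ADDRESS": 0.951,
}


def failed_gates(recall_by_entity):
    """Return the entities whose recall falls below their gate."""
    failures = []
    for entity, recall in recall_by_entity.items():
        gate = RAMP_GATE if entity in RAMP_ENTITIES else DEFAULT_GATE
        if recall < gate:
            failures.append(entity)
    return failures


print(failed_gates(VALIDATION_RECALL))  # []
```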

## Architecture

- Base model: `distilbert/distilbert-base-uncased` (66M params)
- Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
- Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, cu128)
- Hyperparameters: see `training_args.bin` in this repo
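
This card does not say how the class weights are derived, only that the largest is 50. One common scheme, shown here purely as an assumption, is inverse-frequency weighting with a cap. The function name and the toy label counts are made up for illustration and are not the training statistics.

```python
from typing import Dict

MAX_CLASS_WEIGHT = 50.0  # cap stated in this card


def capped_inverse_frequency(counts: Dict[str, int], cap: float = MAX_CLASS_WEIGHT) -> Dict[str, float]:
    """Weight each label by total / (n_classes * count), clipped at `cap`.

    Assumes every count is positive.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {
        label: min(cap, total / (n_classes * count))
        for label, count in counts.items()
    }


# Toy token counts, invented for this example.
toy_counts = {"O": 9000, "B-PERSON_NAME": 900, "B-HK_ID": 10}
weights = capped_inverse_frequency(toy_counts)
for label, weight in weights.items():
    print(f"{label:<15} {weight:.2f}")
```

Values like these could be passed, in label-id order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`. The cap keeps very rare labels from dominating the gradient.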

## Citation

If you use this model, please cite the parent project:

```
GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii
```