rail-v1 — PII Detection NER
rail-v1 is a DistilBERT model fine-tuned for token classification on personally identifiable information (PII). It is the NER backbone of GuardRailAI, where it runs alongside eight checksum-validating Presidio recognisers (Luhn, US SSN, Singapore NRIC, China Resident ID, SWIFT BIC, US ABA routing, Singapore UEN, China USCC).
Performance
Evaluated on a held-out validation set (Kaggle PII val split + 891 docs from nvidia/Nemotron-PII corporate val):
| Metric | Value |
|---|---|
| Overall F_β5 (β=5) | 0.994 |
| Overall recall | 0.995 |
| Overall precision | 0.964 |
Per-entity recall and precision (recall gated at ≥ 0.95 unless noted):
| Entity | Recall | Precision |
|---|---|---|
| PERSON_NAME | 0.995 | 0.956 |
| EMAIL | 1.000 | 0.982 |
| USERNAME | 1.000 | 1.000 |
| ID_NUM | 1.000 | 0.961 |
| URL_PERSONAL | 0.997 | 0.991 |
| COMPANY_NAME (gated at 0.85) | 0.998 | 0.968 |
| PHONE_NUM (observability) | 0.980 | 0.909 |
| STREET_ADDRESS (observability) | 0.978 | 0.973 |
F_β5 (β=5) weights recall β² = 25× more heavily than precision: the model is tuned for compliance use cases where missing PII is catastrophic and over-redacting is merely annoying.
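The overall score in the table can be reproduced from the reported precision and recall with the standard F-beta formula. The sketch below is illustrative only; the function name `f_beta` is an assumption and is not taken from this repo.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta score: recall counts beta**2 times as much as precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Overall precision and recall as reported in the Performance table.
overall = f_beta(precision=0.964, recall=0.995, beta=5)
print(f"{overall:.3f}")  # 0.994
```

Because β² = 25 sits on the precision term of the denominator, a drop in recall lowers the score far more than an equal drop in precision.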
Label schema
37 BIO labels covering 18 entity types:
- Personal: `PERSON_NAME`, `EMAIL`, `USERNAME`, `ID_NUM`, `PHONE_NUM`, `URL_PERSONAL`, `STREET_ADDRESS`
- Singapore / China: `NAME_ZH_BILINGUAL`, `ADDRESS_SG`, `ADDRESS_CN`, `SG_PHONE`, `CN_PHONE`, `BANK_ACCOUNT_SG`, `BANK_ACCOUNT_CN`
- Corporate: `COMPANY_NAME`, `TAX_ID`, `EMPLOYEE_ID`, `LICENSE_NUM`
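The label count follows from the BIO scheme: one `O` label plus a `B-`/`I-` pair for each of the 18 entity types. The sketch below assumes that conventional layout; the actual label order and ids are defined by the model's config (`id2label`) and may differ.

```python
ENTITY_TYPES = [
    # Personal
    "PERSON_NAME", "EMAIL", "USERNAME", "ID_NUM",
    "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS",
    # Singapore / China
    "NAME_ZH_BILINGUAL", "ADDRESS_SG", "ADDRESS_CN", "SG_PHONE",
    "CN_PHONE", "BANK_ACCOUNT_SG", "BANK_ACCOUNT_CN",
    # Corporate
    "COMPANY_NAME", "TAX_ID", "EMPLOYEE_ID", "LICENSE_NUM",
]

# Assumed layout: "O" first, then B-/I- for each type.
# The real id order lives in the model config and may not match this.
labels = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]

print(len(ENTITY_TYPES), len(labels))  # 18 37
```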
Training data
| Source | Documents |
|---|---|
| Kaggle Learning Agency Lab PII | 6,126 |
| Hard negatives (Presidio false positives, 3× oversampled) | 639 |
| NeMo SG/CN financial PII synthetic | 640 |
| Rare-entity synthetic (PHONE_NUM, USERNAME, STREET_ADDRESS) | 600 |
| Nemotron-PII corporate (filtered to 17 business domains) | 16,944 |
| Total training documents | 24,949 |
Validation: 681 Kaggle docs + 891 Nemotron corporate docs = 1,572 documents.
Quick start
```python
from transformers import pipeline

# aggregation_strategy="simple" groups sub-word tokens into whole entity spans.
ner = pipeline(
    "token-classification",
    model="ohhsj/rail-v1",
    aggregation_strategy="simple",
)

text = "Hi, I'm Jane Doe — email jane@acme.com or call +1 415 555 0143."
for span in ner(text):
    print(f"{span['entity_group']:<15} {span['word']!r:<25} score={span['score']:.3f}")
```
For production use, combine this NER with the Presidio pattern recognisers in the parent repo. The full pipeline includes checksum-validated detection for credit cards, SSNs, NRICs, China Resident IDs, SWIFT BICs, ABA routing numbers, Singapore UENs, and China USCCs.
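To illustrate what "checksum-validated" means for one of these entity types, here is a minimal sketch of the Luhn check used for credit card numbers. It is a generic textbook implementation, not the parent repo's recogniser; the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    # Ignore spaces, dashes, and any other non-digit separators.
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A checksum of this kind lets a pattern recogniser reject digit strings that merely look like card numbers, which complements the NER model's recall-oriented tuning.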
Intended use
- Batch redaction of logs and exported datasets for GDPR/CCPA compliance
- Real-time PII screening of chat / email / form data
- Pre-training corpus filtering for LLM teams
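For the batch redaction use case, detected spans can be replaced with placeholders using the `start` and `end` character offsets that the Transformers pipeline returns when an aggregation strategy is set. The helper below is a minimal sketch that assumes non-overlapping spans; `redact` is a hypothetical name, and the spans are hand-written here so the example runs without downloading the model.

```python
def redact(text: str, spans: list[dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Apply replacements right to left so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['entity_group']}]"
        out = out[: span["start"]] + placeholder + out[span["end"]:]
    return out


text = "Contact Jane Doe at jane@acme.com."
# Hand-written stand-ins for pipeline output (real output also has score and word).
spans = [
    {"entity_group": "PERSON_NAME", "start": 8, "end": 16},
    {"entity_group": "EMAIL", "start": 20, "end": 33},
]
print(redact(text, spans))  # Contact [PERSON_NAME] at [EMAIL].
```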
Limitations
- English and Chinese only. Other languages will pass through largely unredacted.
- Optimised for recall over precision at β=5; expect a small but non-zero rate of false positives (overall precision is 0.964), e.g. non-PII tokens redacted as `PERSON_NAME`.
- Pattern-handled entities like full street addresses can be over-redacted at sub-word level depending on tokenization. Use the parent repo's Presidio pipeline for normalised span boundaries.
Architecture
- Base model: `distilbert/distilbert-base-uncased` (66M params)
- Training: 5 epochs, batch size 16, lr 2e-5, weighted cross-entropy loss (max class weight 50)
- Hardware: NVIDIA A100 (~11 min total)
- Hyperparameters: see `training_args.bin` in this repo
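The card states only that training used weighted cross-entropy with a maximum class weight of 50; it does not say how the weights were derived. As an illustration, the sketch below assumes inverse-frequency weights normalised to the most common class and capped at 50. The function names and the toy label counts are hypothetical, and the actual training code may differ.

```python
import math
from collections import Counter


def capped_class_weights(label_ids, num_classes, max_weight=50.0):
    """Inverse-frequency weights (assumed scheme), capped at max_weight."""
    counts = Counter(label_ids)
    most_common = max(counts.values())
    weights = []
    for c in range(num_classes):
        n = counts.get(c, 0)
        # Unseen classes get the cap rather than an infinite weight.
        w = most_common / n if n else max_weight
        weights.append(min(w, max_weight))
    return weights


def weighted_cross_entropy(logits, target, weights):
    """Weighted negative log-likelihood for one token, from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return weights[target] * (log_sum_exp - logits[target])


# Toy token labels: class 0 ("O") dominates, class 2 is rare.
label_ids = [0] * 1000 + [1] * 100 + [2] * 5
weights = capped_class_weights(label_ids, num_classes=3)
print(weights)  # [1.0, 10.0, 50.0]
```

The cap keeps very rare labels from dominating the loss: without it, the rare class in the toy example would be weighted 200× rather than 50×.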
Citation
If you use this model, please cite the parent project:
GuardRailAI: Context-aware PII detection extending Microsoft Presidio.
https://github.com/ohhsj/guardrail-pii