---
language:
- en
- ga
license: apache-2.0
library_name: transformers
tags:
- pii
- token-classification
- healthcare
- de-identification
- ireland
- ppsn
base_model:
- OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1
datasets:
- nvidia/Nemotron-PII
- joelniklaus/mapa
model-index:
- name: OpenMed-PPSN-v5
  results:
  - task:
      type: token-classification
      name: PPSN detection (Irish regression set)
    dataset:
      name: irish_ppsn_regression_v5
      type: custom
    metrics:
    - type: f1
      value: 0.8235
      name: Raw model F1
    - type: f1
      value: 1.0000
      name: Hybrid strict F1
---
# OpenMed PPSN v5
A token-classification model derived from `OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1`, extended with `B-PPSN` / `I-PPSN` labels to detect Irish Personal Public Service Numbers (PPSNs).
## What this release contains
- Full model weights (`model.safetensors`) with original OpenMed labels + PPSN labels.
- `label_meta.json` with label mapping and provenance.
- Repro/eval artifacts:
  - `irish_ppsn_regression_v5.jsonl`
  - `eval_manual_irish_v5_raw.json`
  - `eval_hybrid_v5_no_plausible.json`
  - `ab_non_ppsn_v5_baseline.json`
## Key results
On `irish_ppsn_regression_v5`:
- Raw model-only PPSN performance: `P=0.7778 R=0.8750 F1=0.8235`.
- Recommended strict hybrid mode (model + checksum + span normalization, no plausible fallback): `P=1.0000 R=1.0000 F1=1.0000`.
Non-PPSN retention check vs base OpenMed (A/B):
- Real-text behavior agreement F1: `1.0000`.
- Entity density unchanged in the sampled real-text proxy set.
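
As a sanity check on the raw-model figures, the sketch below recomputes precision, recall, and F1 from entity counts. The counts used (7 true positives, 2 false positives, 1 false negative) are an inference: they are the smallest integers consistent with the reported ratios, not counts stated in this release, and the helper name `prf` is hypothetical.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from entity-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts (inferred from the reported ratios, not published):
# 7 correct PPSN spans, 2 spurious predictions, 1 missed span.
p, r, f1 = prf(tp=7, fp=2, fn=1)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.7778 R=0.8750 F1=0.8235
```

Any multiple of these counts gives the same ratios, so the per-example files listed above remain the authoritative source for the actual totals.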
## Recommended production usage
For best PPSN reliability, run this model with a strict hybrid post-processor:
1. Keep model detections for all labels.
2. For PPSN specifically, normalize/expand spans to full PPSN candidates.
3. Validate PPSN with checksum.
4. Disable the broad "plausible" fallback in production (`no-plausible-ppsn`).
This repository includes only the model artifacts; hybrid scripts are in the companion codebase used to train/evaluate this release.
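
The four steps above can be sketched as follows. This is a minimal illustration, not the companion codebase's implementation. It assumes the publicly documented PPSN shape (seven digits, a check letter, an optional second letter) and the modulus-23 check-character scheme, and it assumes entities arrive as dicts with `entity_group`, `start`, and `end` keys, as a `transformers` token-classification pipeline produces when an aggregation strategy is set. The names `PPSN_RE`, `ppsn_checksum_ok`, `normalize_span`, and `strict_ppsn_filter` are hypothetical.

```python
import re

# Assumed PPSN shape: 7 digits, a check letter, an optional second letter,
# not embedded in a longer alphanumeric run. The second-letter range is
# deliberately permissive here.
PPSN_RE = re.compile(
    r"(?<![0-9A-Za-z])(\d{7})([A-Wa-w])([A-Wa-w]?)(?![0-9A-Za-z])"
)
CHECK_LETTERS = "WABCDEFGHIJKLMNOPQRSTUV"  # remainder 0 -> W, 1 -> A, ... 22 -> V


def ppsn_checksum_ok(candidate: str) -> bool:
    """Validate a candidate with the modulus-23 check-character scheme."""
    m = PPSN_RE.fullmatch(candidate.strip())
    if m is None:
        return False
    digits, check, second = m.group(1), m.group(2).upper(), m.group(3).upper()
    # Digits are weighted 8, 7, 6, 5, 4, 3, 2 from left to right.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    # A second letter contributes its alphabet position times 9 (W or absent = 0).
    if second and second != "W":
        total += (ord(second) - ord("A") + 1) * 9
    return check == CHECK_LETTERS[total % 23]


def normalize_span(text: str, start: int, end: int):
    """Expand a (possibly partial) model span to the full overlapping candidate."""
    for m in PPSN_RE.finditer(text):
        if m.start() < end and start < m.end():
            return m.start(), m.end()
    return None


def strict_ppsn_filter(text: str, entities: list[dict]) -> list[dict]:
    """Keep non-PPSN entities as-is; keep PPSN only if normalized and checksum-valid."""
    kept, seen = [], set()
    for ent in entities:
        if ent["entity_group"] != "PPSN":
            kept.append(ent)  # step 1: other labels pass through untouched
            continue
        span = normalize_span(text, ent["start"], ent["end"])  # step 2
        if span is None or span in seen:
            continue  # step 4: no "plausible" fallback for unmatched spans
        if ppsn_checksum_ok(text[span[0]:span[1]]):  # step 3
            seen.add(span)
            kept.append({**ent, "start": span[0], "end": span[1],
                         "word": text[span[0]:span[1]]})
    return kept
```

In this sketch a PPSN prediction that cannot be expanded to a checksum-valid candidate is dropped rather than kept as a "plausible" match, which trades recall on malformed input for precision. Candidates containing spaces or separators are not handled and would need extra normalization.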
## Limitations
- The Irish regression set is small and targeted; additional domain-specific validation is required before regulated deployment.
- PPSN detection in noisy OCR/ASR or heavily malformed text may require extra hardening.
## License and attribution
- This derivative release is distributed under Apache-2.0, consistent with the base model license tag.
- Base model: `OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1`.
- Training/evaluation used synthetic and augmented data sources, including `nvidia/Nemotron-PII` and `joelniklaus/mapa`.
- See `NOTICE` for attribution details.
## QA quick check
- Load this model as a standard `AutoModelForTokenClassification` checkpoint.
- Run against `irish_ppsn_regression_v5.jsonl`.
- Confirm results in `eval_manual_irish_v5_raw.json` (raw) and `eval_hybrid_v5_no_plausible.json` (strict hybrid).
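
A minimal loading sketch for the first step, assuming the `transformers` library is installed. `MODEL_ID` is a placeholder for a local path or Hub repo id of this checkpoint, and the helper names `load_detector` and `ppsn_entities` are hypothetical. The schema of `irish_ppsn_regression_v5.jsonl` is not described here, so iterating over that file is left out.

```python
MODEL_ID = "path/to/OpenMed-PPSN-v5"  # placeholder: local path or Hub repo id


def load_detector(model_id: str = MODEL_ID):
    """Build a token-classification pipeline that merges B-/I- pieces into spans."""
    # Imported lazily so the helper below can be used without transformers.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def ppsn_entities(entities: list[dict]) -> list[dict]:
    """Select aggregated entities whose label group is PPSN."""
    return [e for e in entities if e["entity_group"] == "PPSN"]


# Usage (reads or downloads the checkpoint, so it is left commented out):
# detector = load_detector()
# print(ppsn_entities(detector("My PPS number is 1234567FA.")))
```

With `aggregation_strategy="simple"` the pipeline strips the `B-` / `I-` prefixes, so PPSN spans appear under the `entity_group` value `PPSN`.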