---
language:
- en
- ga
license: apache-2.0
library_name: transformers
tags:
- pii
- token-classification
- healthcare
- de-identification
- ireland
- ppsn
base_model:
- OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1
datasets:
- nvidia/Nemotron-PII
- joelniklaus/mapa
model-index:
- name: OpenMed-PPSN-v5
  results:
  - task:
      type: token-classification
      name: PPSN detection (Irish regression set)
    dataset:
      name: irish_ppsn_regression_v5
      type: custom
    metrics:
    - type: f1
      value: 0.8235
      name: Raw model F1
    - type: f1
      value: 1.0000
      name: Hybrid strict F1
---

# OpenMed PPSN v5

Token classification model derived from `OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1` with added `B-PPSN` / `I-PPSN` support for Irish PPS Number detection.

## What this release contains

- Full model weights (`model.safetensors`) with original OpenMed labels + PPSN labels.
- `label_meta.json` with label mapping and provenance.
- Repro/eval artifacts:
  - `irish_ppsn_regression_v5.jsonl`
  - `eval_manual_irish_v5_raw.json`
  - `eval_hybrid_v5_no_plausible.json`
  - `ab_non_ppsn_v5_baseline.json`

## Key results

On `irish_ppsn_regression_v5`:

- Raw model-only PPSN performance: `P=0.7778 R=0.8750 F1=0.8235`.
- Recommended strict hybrid mode (model + checksum + span normalization, no plausible fallback):
  - `P=1.0000 R=1.0000 F1=1.0000`.

Non-PPSN retention check vs base OpenMed (A/B):

- Real-text behavior agreement F1: `1.0000`.
- Entity density unchanged in the sampled real-text proxy set.

## Recommended production usage

For best PPSN reliability, run this model with a strict hybrid post-processor:

1. Keep model detections for all labels.
2. For PPSN specifically, normalize/expand spans to full PPSN candidates.
3. Validate PPSN with checksum.
4. Disable broad "plausible" fallback in production (`no-plausible-ppsn`).

This repository includes only the model artifacts; hybrid scripts are in the companion codebase used to train/evaluate this release.
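
Because the hybrid scripts are not shipped with this repository, the following is only a minimal sketch of the PPSN-specific steps (span normalization, checksum validation, no plausible fallback), not the companion code. It assumes the publicly documented PPSN check: seven digits weighted 8 down to 2, an optional second letter weighted 9 (A=1, B=2, and so on, with W counting as 0), and the sum modulo 23 mapped to the check letter (0 is W, 1 is A, up to 22 is V). The names `PPSN_RE`, `is_valid_ppsn`, `normalize_ppsn_span`, and `strict_ppsn_filter` are hypothetical, as is the decision to accept any A-W second letter.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]

# Hypothetical candidate pattern: 7 digits, a check letter, optional second letter.
PPSN_RE = re.compile(r"\b\d{7}[A-Wa-w][A-Wa-w]?\b")
CHECK_LETTERS = "WABCDEFGHIJKLMNOPQRSTUV"  # index = weighted sum mod 23


def is_valid_ppsn(candidate: str) -> bool:
    """Return True if the candidate passes the assumed mod-23 check."""
    s = re.sub(r"[\s-]", "", candidate).upper()
    if not re.fullmatch(r"\d{7}[A-W][A-W]?", s):
        return False
    # Digits are weighted 8, 7, 6, 5, 4, 3, 2.
    total = sum(int(d) * w for d, w in zip(s[:7], range(8, 1, -1)))
    # An optional second letter is weighted 9; W contributes nothing.
    if len(s) == 9 and s[8] != "W":
        total += (ord(s[8]) - ord("A") + 1) * 9
    return s[7] == CHECK_LETTERS[total % 23]


def normalize_ppsn_span(text: str, start: int, end: int) -> Optional[Span]:
    """Expand a (possibly partial) model span to the full overlapping candidate."""
    for m in PPSN_RE.finditer(text):
        if m.start() < end and start < m.end():
            return m.start(), m.end()
    return None


def strict_ppsn_filter(text: str, spans: List[Span]) -> List[Span]:
    """Keep only model PPSN spans that normalize to a checksum-valid candidate.

    There is no "plausible" fallback: anything that fails the check is dropped.
    """
    kept: List[Span] = []
    for start, end in spans:
        full = normalize_ppsn_span(text, start, end)
        if full and full not in kept and is_valid_ppsn(text[full[0]:full[1]]):
            kept.append(full)
    return kept
```

In this sketch only PPSN spans pass through `strict_ppsn_filter`; detections for every other label would be kept unchanged, as in step 1. A partial model span covering the first digits of `1234567T` is expanded to the whole token and kept, while a span over `1234567A` is discarded because its check letter does not match.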
## Limitations

- The Irish regression set is small and targeted; additional domain-specific validation is required before regulated deployment.
- PPSN detection in noisy OCR/ASR or heavily malformed text may require extra hardening.

## License and attribution

- This derivative release is distributed under Apache-2.0, consistent with the base model license tag.
- Base model: `OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1`.
- Training/evaluation used synthetic and augmented data sources, including `nvidia/Nemotron-PII` and `joelniklaus/mapa`.
- See `NOTICE` for attribution details.

## QA quick check

- Load this model as a standard `AutoModelForTokenClassification` checkpoint.
- Run against `irish_ppsn_regression_v5.jsonl`.
- Confirm results in `eval_manual_irish_v5_raw.json` (raw) and `eval_hybrid_v5_no_plausible.json` (strict hybrid).
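
The quick check above could be scripted roughly as follows. This is a sketch under assumptions, not the evaluation code behind the reported numbers: `MODEL_ID` is a placeholder for wherever this checkpoint lives, the helper names are hypothetical, and scoring by exact `(start, end)` match is an assumed definition of "strict". The schema of `irish_ppsn_regression_v5.jsonl` is not described in this card, so reading gold spans from it is left out.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]

MODEL_ID = "path/to/this/checkpoint"  # placeholder: repo id or local directory


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline for the checkpoint."""
    # Imported lazily so the scoring helpers below work without transformers.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    # aggregation_strategy="simple" merges B-/I- tokens into entity groups.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def ppsn_spans(entities: Iterable[dict]) -> Set[Span]:
    """Extract (start, end) offsets of PPSN entity groups from pipeline output."""
    return {
        (e["start"], e["end"])
        for e in entities
        if e.get("entity_group") == "PPSN"
    }


def strict_prf1(pred: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with exact span matching (assumed definition)."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With a real checkpoint path, `ppsn_spans(load_tagger()(text))` yields the raw predicted spans for one text; accumulating predicted and gold spans across the regression set and passing them to `strict_prf1` gives figures to compare against the shipped eval JSON files.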