metadata

language:
  - en
  - ga
license: apache-2.0
library_name: transformers
tags:
  - pii
  - token-classification
  - healthcare
  - de-identification
  - ireland
  - ppsn
base_model:
  - OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1
datasets:
  - nvidia/Nemotron-PII
  - joelniklaus/mapa
model-index:
  - name: OpenMed-PPSN-v5
    results:
      - task:
          type: token-classification
          name: PPSN detection (Irish regression set)
        dataset:
          name: irish_ppsn_regression_v5
          type: custom
        metrics:
          - type: f1
            value: 0.8235
            name: Raw model F1
          - type: f1
            value: 1
            name: Hybrid strict F1

OpenMed PPSN v5

Token-classification model derived from OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1, with added B-PPSN / I-PPSN labels for detecting Irish Personal Public Service Numbers (PPSNs).
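
One quick way to confirm the added labels is to inspect the checkpoint's label mapping before running inference. The sketch below assumes a local clone of this repository (the path argument is a placeholder) and that the labels appear in the config's id2label mapping under the names B-PPSN and I-PPSN; both helper names are illustrative and not part of this release.

```python
def has_ppsn_labels(id2label: dict) -> bool:
    """Return True if both BIO labels for PPSN are present in the mapping."""
    labels = set(id2label.values())
    return {"B-PPSN", "I-PPSN"} <= labels


def check_checkpoint(model_dir: str) -> bool:
    """Load only the config (no weights) and check for the PPSN labels."""
    # Imported lazily so has_ppsn_labels stays usable without transformers.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_dir)
    return has_ppsn_labels(config.id2label)
```

Loading only the config keeps the check cheap, since the weights are not read.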

What this release contains

- Full model weights (model.safetensors) covering the original OpenMed labels plus the PPSN labels.
- label_meta.json with the label mapping and provenance.
- Reproduction and evaluation artifacts:
  - irish_ppsn_regression_v5.jsonl
  - eval_manual_irish_v5_raw.json
  - eval_hybrid_v5_no_plausible.json
  - ab_non_ppsn_v5_baseline.json

Key results

On irish_ppsn_regression_v5:

- Raw model-only PPSN performance: P=0.7778, R=0.8750, F1=0.8235.
- Recommended strict hybrid mode (model + checksum + span normalization, no plausible fallback): P=1.0000, R=1.0000, F1=1.0000.
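
The raw figures are internally consistent, since F1 is the harmonic mean of precision and recall. The short check below reproduces them from one set of counts that happens to match (7 true positives, 2 false positives, 1 false negative). These counts are an assumption for illustration only; the actual counts are in eval_manual_irish_v5_raw.json and are not stated in this card.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from span-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only because they reproduce the reported raw figures.
p, r, f1 = prf1(tp=7, fp=2, fn=1)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.7778 R=0.8750 F1=0.8235
```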

Non-PPSN retention check vs base OpenMed (A/B):

- Real-text behavior agreement F1: 1.0000.
- Entity density unchanged in the sampled real-text proxy set.

Recommended production usage

For best PPSN reliability, run this model with a strict hybrid post-processor:

- Keep model detections for all labels.
- For PPSN specifically, normalize/expand spans to full PPSN candidates.
- Validate each PPSN candidate with the checksum.
- Disable the broad "plausible" fallback in production (no-plausible-ppsn).
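
The steps above can be sketched as follows. This is not the companion codebase's implementation, which is not included in this repository. It assumes that model output has already been aggregated into spans with a "label" key equal to "PPSN", that candidates are written compactly (7 digits, a check letter, an optional second letter, no internal spaces), and that validation uses the publicly documented modulus-23 check character. The names PPSN_RE, ppsn_checksum_ok, and strict_hybrid are illustrative.

```python
import re

# Assumed compact surface form: 7 digits, check letter, optional second letter.
PPSN_RE = re.compile(r"(?<![A-Za-z0-9])\d{7}[A-Za-z]{1,2}(?![A-Za-z0-9])")
CHECK_LETTERS = "WABCDEFGHIJKLMNOPQRSTUV"  # remainder 0 -> W, 1 -> A, ..., 22 -> V


def ppsn_checksum_ok(candidate: str) -> bool:
    """Validate a compact PPSN with the modulus-23 check character."""
    c = candidate.upper()
    if not re.fullmatch(r"\d{7}[A-W][A-W]?", c):
        return False
    # Digits are weighted 8, 7, 6, 5, 4, 3, 2 from left to right.
    total = sum(int(d) * w for d, w in zip(c[:7], range(8, 1, -1)))
    if len(c) == 9:
        # The optional second letter is weighted 9, with A=1 ... and W=0.
        total += 9 * CHECK_LETTERS.index(c[8])
    return CHECK_LETTERS[total % 23] == c[7]


def strict_hybrid(text: str, spans: list[dict]) -> list[dict]:
    """Keep non-PPSN spans; snap PPSN spans to checksum-valid candidates."""
    out, seen = [], set()
    candidates = list(PPSN_RE.finditer(text))
    for span in spans:
        if span["label"] != "PPSN":
            out.append(span)  # all other labels pass through unchanged
            continue
        for m in candidates:
            overlaps = m.start() < span["end"] and span["start"] < m.end()
            if overlaps and m.span() not in seen and ppsn_checksum_ok(m.group()):
                seen.add(m.span())
                out.append({"label": "PPSN", "start": m.start(), "end": m.end()})
        # No "plausible" fallback: a PPSN span without a valid candidate is dropped.
    return out


text = "PPSN 1234567T on file; note 7654321A was mistyped."
spans = [
    {"label": "PPSN", "start": 5, "end": 12},   # model tagged only the digits
    {"label": "PPSN", "start": 28, "end": 36},  # fails the checksum
]
print(strict_hybrid(text, spans))  # [{'label': 'PPSN', 'start': 5, 'end': 13}]
```

In this sketch the first span is expanded to include the check letter and the second is discarded, which is the trade the strict mode makes: a model detection that cannot be matched to a checksum-valid candidate is dropped instead of being kept as a plausible match.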

This repository includes only the model artifacts; the hybrid post-processing scripts live in the companion codebase used to train and evaluate this release.

Limitations

- The Irish regression set is small and targeted; additional domain-specific validation is required before regulated deployment.
- PPSN detection in noisy OCR/ASR output or heavily malformed text may require extra hardening.

License and attribution

This derivative release is distributed under Apache-2.0, consistent with the base model license tag.
Base model: OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1.
Training/evaluation used synthetic and augmented data sources, including nvidia/Nemotron-PII and joelniklaus/mapa.
See NOTICE for attribution details.

QA quick check

- Load this model as a standard AutoModelForTokenClassification checkpoint.
- Run it against irish_ppsn_regression_v5.jsonl.
- Compare the results with eval_manual_irish_v5_raw.json (raw) and eval_hybrid_v5_no_plausible.json (strict hybrid).
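
A minimal loading sketch for the first step, using the standard transformers token-classification pipeline. The model path is a placeholder for a local clone or Hub id, the helper names are illustrative, and the record layout of irish_ppsn_regression_v5.jsonl is not described in this card, so reading that file is left out.

```python
def load_token_classifier(model_id_or_path: str):
    """Build a token-classification pipeline with entity-level aggregation."""
    # Imported lazily so ppsn_spans below can be used without transformers.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id_or_path)
    model = AutoModelForTokenClassification.from_pretrained(model_id_or_path)
    # "simple" aggregation merges B-/I- tagged tokens into one entity group.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def ppsn_spans(entities: list[dict]) -> list[tuple[int, int]]:
    """Extract (start, end) character offsets of PPSN entities."""
    return [(e["start"], e["end"]) for e in entities if e["entity_group"] == "PPSN"]


# Example usage (requires the checkpoint to be available locally or on the Hub):
# nlp = load_token_classifier("path/to/OpenMed-PPSN-v5")
# print(ppsn_spans(nlp("My PPS number is 1234567T.")))
```

The offsets returned by ppsn_spans are raw model output; they correspond to the raw evaluation, not the strict hybrid one.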