---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- pii
- phi
- token-classification
- healthcare
- de-identification
- ppsn
datasets:
- nvidia/Nemotron-PII
base_model:
- OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1
---
# OpenMed-PPSN-v4
## Model Summary
This model extends `OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1` with Irish PPSN detection (`B-PPSN` / `I-PPSN`) while preserving baseline behavior on non-PPSN entities.
## Intended Use
- Clinical and administrative de-identification pipelines.
- Hybrid usage is recommended: model detection + deterministic PPSN checksum validation.
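The deterministic half of the hybrid pattern can be a standalone checksum check. The sketch below implements the publicly documented PPSN check-character scheme (seven digits weighted 8..2, an optional second letter weighted 9, sum modulo 23, remainder 0 mapping to `W`); the function name and the permissive second-letter pattern are illustrative choices, not part of this release.

```python
import re
import string

# Weights applied to the seven digits, per the documented modulus-23 scheme.
DIGIT_WEIGHTS = (8, 7, 6, 5, 4, 3, 2)


def ppsn_is_valid(ppsn: str) -> bool:
    """Validate an Irish PPSN: 7 digits, a check letter A-W, and an
    optional trailing second letter that also contributes to the sum."""
    m = re.fullmatch(r"(\d{7})([A-W])([A-Z]?)", ppsn.strip().upper())
    if not m:
        return False
    digits, check, extra = m.groups()
    total = sum(int(d) * w for d, w in zip(digits, DIGIT_WEIGHTS))
    if extra:
        # Second letter contributes its alphabet position (A=1, ...) times 9.
        total += (string.ascii_uppercase.index(extra) + 1) * 9
    remainder = total % 23
    # Remainder 0 maps to 'W'; remainders 1..22 map to 'A'..'V'.
    expected = "W" if remainder == 0 else string.ascii_uppercase[remainder - 1]
    return check == expected
```

In a masking service this runs *after* model detection, so a span is only redacted when both the classifier and the checksum agree, which suppresses false positives on PPSN-shaped strings.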
## Out-of-Scope
- Not a legal/compliance determination system.
- Not validated for every institution, note format, OCR stream, or jurisdiction.
## Training Data
- Synthetic PPSN span-labeled data generated with the official checksum rules.
- Additional benchmarking reference: `nvidia/Nemotron-PII` (CC-BY-4.0).
## Evaluation Snapshot
| Metric | Base | Candidate |
|---|---|---|
| Synthetic non-PPSN F1 | `0.8701` | `0.8420` |
| Entities per 1k chars | `11.4124` | `11.1231` |

- Candidate-vs-base agreement F1 on the real-text proxy set: `0.9814`
## Usage
Load with the `transformers` token-classification pipeline, or integrate it with a PPSN checksum validator in your masking service.
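A minimal sketch of that integration is below. The masking helper operates on entity dicts in the format the `transformers` pipeline emits with `aggregation_strategy="simple"` (`entity_group`, `start`, `end`); the `validator` callable would typically be a PPSN checksum check. The model id in `build_pipeline` is a hypothetical repo name, so adjust it to the actual published id.

```python
def mask_ppsn(text, entities, validator, mask="[PPSN]"):
    """Redact spans the model tagged as PPSN *and* the validator confirms.

    `entities` follows the transformers token-classification output with
    aggregation_strategy="simple": dicts carrying entity_group/start/end.
    """
    spans = [
        (e["start"], e["end"])
        for e in entities
        if e["entity_group"] == "PPSN" and validator(text[e["start"]:e["end"]])
    ]
    # Replace right-to-left so earlier offsets stay valid after each splice.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


def build_pipeline(model_id="OpenMed/OpenMed-PPSN-v4"):  # hypothetical repo id
    from transformers import pipeline  # requires `pip install transformers`

    # Downloads model weights on first use; run where network access exists.
    return pipeline("token-classification", model=model_id,
                    aggregation_strategy="simple")
```

Typical call shape: `mask_ppsn(note, build_pipeline()(note), ppsn_is_valid)`. Keeping the validator as an injected callable lets the same masking code serve both the checksum path and a stricter institution-specific rule.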
## Limitations
- The real-text drift check used a proxy general-text set.
- Production deployment requires local validation on your own data.
## Licensing and Attribution
- Base model is tagged Apache-2.0.
- This release is Apache-2.0.
- Dataset attribution for CC-BY-4.0 sources is preserved in `NOTICE`.