temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4

Fourth QA release candidate for Irish core PII detection with OpenMed mLiteClinical.

This repository should be evaluated against:

  • current public candidate: temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc3
  • stable public release: temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1
  • this repository: temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4

This RC improves the current public candidate on the main release-gating suites without reopening the PPSN false positives that were addressed after v2-rc3.

Included Variants

Variant               Artifact                   Backend                    Recommended Thresholds   Intended Use
Full checkpoint       repo root                  transformers               ppsn=0.71, other=0.50    highest-fidelity evaluation and deployment
Quantized checkpoint  onnx/model_quantized.onnx  ONNX Runtime dynamic int8  ppsn=0.71, other=0.60    CPU-oriented deployment
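The per-entity thresholds above can be applied as a simple post-filter over model output. A minimal sketch, assuming predictions arrive as dicts with "entity_group" and "score" fields (the shape a transformers token-classification pipeline produces with aggregation enabled); the helper name is illustrative, not part of this repo:

```python
# Illustrative post-filter for the recommended per-entity thresholds.
# Assumes each prediction is a dict with "entity_group" and "score",
# as produced by a transformers token-classification pipeline with
# aggregation enabled. Names here are examples, not repo APIs.

THRESHOLDS = {"ppsn": 0.71}   # full checkpoint: ppsn=0.71
DEFAULT_MIN_SCORE = 0.50      # full checkpoint: other=0.50

def filter_entities(predictions, thresholds=THRESHOLDS, default=DEFAULT_MIN_SCORE):
    kept = []
    for pred in predictions:
        min_score = thresholds.get(pred["entity_group"].lower(), default)
        if pred["score"] >= min_score:
            kept.append(pred)
    return kept

preds = [
    {"entity_group": "ppsn", "score": 0.68, "word": "1234567T"},
    {"entity_group": "phone_number", "score": 0.55, "word": "0851234567"},
]
# The low-confidence PPSN hit is dropped; the phone number is kept.
print(filter_entities(preds))
```

For the quantized artifact, swap the default to 0.60 per the table above.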

Coverage

  • ppsn
  • account_number
  • bank_routing_number
  • credit_debit_card
  • passport_number
  • postcode
  • phone_number
  • email
  • first_name
  • last_name
  • swift_bic

The main focus is English and Irish Gaelic handling for Irish administrative, citizen-support, and HSE-style text.

Recommended Inference

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4 \
  --ppsn-min-score 0.71 \
  --other-min-score 0.50 \
  --text "My PPSN is 1234567T and my sort code is 90-00-17." \
  --json

Fast CPU path with the bundled ONNX q8 artifact:

uv run python inference_mask_onnx.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4 \
  --ppsn-min-score 0.71 \
  --other-min-score 0.60 \
  --text "Please provide your passport: NN5123456." \
  --json

If you prefer plain python3, install the dependencies from pyproject.toml first.
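As a rough sketch of what the masking step over detected spans looks like, the snippet below replaces each entity span with a placeholder label. The (start, end, label) span format and the bracketed placeholder style are assumptions for illustration, not necessarily the exact output schema of inference_mask.py:

```python
# Illustrative span masking: replace each detected entity span with a
# placeholder label. The (start, end, label) span format is an assumption
# for this sketch, not necessarily the schema used by inference_mask.py.

def mask_spans(text, spans):
    # Apply right-to-left so earlier offsets stay valid after replacement.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text

text = "My PPSN is 1234567T and my sort code is 90-00-17."
spans = [(11, 19, "ppsn"), (40, 48, "bank_routing_number")]
# -> "My PPSN is [PPSN] and my sort code is [BANK_ROUTING_NUMBER]."
print(mask_spans(text, spans))
```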

What Improved Versus temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc3

Full checkpoint:

  • core suite F1: 0.9806 -> 0.9870
  • overlap suite F1: 0.9429 -> 1.0000
  • strict remaining IoU=1.0 F1: 0.4444 -> 0.6000
  • multilingual PPSN-only F1: 0.9333 -> 0.9545
  • matches the current public candidate on edge, numeric, gap, and user PPSN regression F1

Bundled ONNX q8:

  • core suite F1: 0.9677 -> 0.9804
  • multilingual PPSN-only F1: 0.9333 -> 0.9600
  • keeps the overlap and strict remaining gains from v2-rc3
  • q8 remains weaker than the full checkpoint on the small edge suite: 0.9474 vs 1.0000

Benchmark Table

All rows are temsa/OpenMed-mLiteClinical-IrishCorePII-135M releases.

Variant         Core    Edge    Numeric  Gap     User PPSN  Overlap  Strict Remaining IoU=1.0  Multilingual PPSN
v2-rc3 full     0.9806  1.0000  0.9333   0.9167  1.0000     0.9429   0.4444                    0.9333
v2-rc4 full     0.9870  1.0000  0.9333   0.9167  1.0000     1.0000   0.6000                    0.9545
v2-rc3 ONNX q8  0.9677  1.0000  0.9333   0.9167  1.0000     1.0000   0.6667                    0.9333
v2-rc4 ONNX q8  0.9804  0.9474  0.9333   0.9167  1.0000     1.0000   0.6667                    0.9600

Quantized Artifact

The bundled quantized artifact is:

  • onnx/model_quantized.onnx

For this release line, the promoted q8 recipe remains the standard ONNX Runtime dynamic int8 export with per-channel quantization over MatMul, Gemm, and Attention.

Two alternatives were reviewed and not promoted for this model family:

  • QAT in this DistilBERT token-classification stack
  • Mezzanine's recent Qwen-focused weight transforms

Known Limits

This is still a raw token-classification release candidate with no hybrid rule logic. QA should still exercise cases like the following carefully:

  • Passport PA 1234567 was used to board the flight.
  • Usaideadh pas PA 1234567 chun dul ar bord an eitilt. (Irish: "Passport PA 1234567 was used to board the flight.")
  • Call me on 0851234567 tomorrow.
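Hybrid rule logic is out of scope for this RC, but QA can flag cases like the above with simple pattern checks alongside model output. A minimal sketch; the regexes are rough illustrative approximations, not validated Irish document formats:

```python
import re

# Illustrative regex pre-checks for the tricky QA cases above. These are
# rough approximations for flagging, not validated Irish document formats,
# and no such rule layer ships with this release candidate.
PATTERNS = {
    # passport written with an internal space, e.g. "PA 1234567"
    "passport_number": re.compile(r"\b[A-Z]{2}\s?\d{7}\b"),
    # Irish mobile numbers, e.g. "0851234567"
    "phone_number": re.compile(r"\b08[35679]\d{7}\b"),
}

def flag_candidates(text):
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

print(flag_candidates("Passport PA 1234567 was used to board the flight."))
print(flag_candidates("Call me on 0851234567 tomorrow."))
```

Disagreements between such flags and the model's spans are exactly the cases worth adding to the regression suites.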

Included Files

  • full transformers checkpoint in the repo root
  • dynamic int8 ONNX artifact in onnx/model_quantized.onnx
  • inference_mask.py
  • inference_mask_onnx.py
  • qa_config.json
  • training_sources.json
  • benchmark summaries in eval/

License And Attribution

  • release license: Apache-2.0
  • base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • upstream attributed data: joelniklaus/mapa, gretelai/synthetic_pii_finance_multilingual
  • synthetic Irish training and replay data created in this workspace

See NOTICE for attribution details.
