OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5

Token-classification checkpoint for Irish core PII in English and Irish Gaelic.

Included Variants

  • Full transformers checkpoint in the repo root
  • Unquantized ONNX export in onnx/model.onnx
  • Dynamic q8 ONNX artifact in onnx/model_quantized.onnx
  • inference_mask.py for the full checkpoint
  • inference_mask_onnx.py for the ONNX q8 artifact
  • benchmark files in eval/

Coverage

  • PPSN
  • ACCOUNT_NUMBER
  • BANK_ROUTING_NUMBER
  • CREDIT_DEBIT_CARD
  • PASSPORT_NUMBER
  • POSTCODE
  • PHONE_NUMBER
  • EMAIL
  • FIRST_NAME
  • LAST_NAME
  • SWIFT_BIC

What Changed From rc4

rc5 keeps the same fine-tuned checkpoint weights as temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc4, but changes the shipped inference stack:

  • recommended PPSN threshold lowered from 0.71 to 0.55
  • recommended decoder is now the Irish core label-aware repair decoder for both full and q8 inference
  • bundled q8 artifact is rebuilt from a preprocessed ONNX export before dynamic int8 quantization

The fix is confined to the inference stack because the new QA misses on Gaelic weak-context PPSN text were calibration and decoding failures, not weight-quality failures.
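To illustrate why a threshold change alone can recover a missed PPSN, here is a minimal sketch of per-label score filtering. The `Span` type, the `filter_spans` helper, and the example scores are hypothetical and are not taken from `inference_mask.py`; only the threshold values 0.71, 0.55, and 0.50 come from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical container for one predicted entity span."""
    start: int
    end: int
    label: str
    score: float


def filter_spans(spans, ppsn_min_score=0.55, other_min_score=0.50):
    """Keep spans whose score meets the threshold for their label."""
    kept = []
    for span in spans:
        threshold = ppsn_min_score if span.label == "PPSN" else other_min_score
        if span.score >= threshold:
            kept.append(span)
    return kept


# Illustrative scores only: a weak-context PPSN scored between the two thresholds.
candidates = [
    Span(22, 30, "PPSN", 0.62),
    Span(40, 52, "PHONE_NUMBER", 0.48),
]

print([s.label for s in filter_spans(candidates)])                       # ['PPSN']
print([s.label for s in filter_spans(candidates, ppsn_min_score=0.71)])  # []
```

Under these assumed scores, the same prediction is dropped at the rc4 threshold and kept at the rc5 threshold, with no change to the weights.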

Recommended Inference

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 \
  --ppsn-min-score 0.55 \
  --other-min-score 0.50 \
  --text "Duradh liom mo uimhir 1234567T a sholatar agus me ag denamh iarratais." \
  --json

Dynamic q8 ONNX:

uv run python inference_mask_onnx.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5 \
  --onnx-file onnx/model_quantized.onnx \
  --ppsn-min-score 0.55 \
  --other-min-score 0.50 \
  --text "Is e mo upsp na 1234567tw agus teastaionn uaim eolas faoi liuntas curamora." \
  --json

The bundled pyproject.toml is intended for uv. Use uv run so onnxruntime is available for the q8 script.
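As a rough picture of what masking means once spans have been detected, the sketch below replaces character spans with label placeholders. The `mask_text` helper and the `[LABEL]` placeholder format are assumptions for illustration; the actual output format of the bundled scripts is not documented here.

```python
def mask_text(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    out = text
    # Process spans right to left so earlier offsets stay valid after each edit.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


text = "Duradh liom mo uimhir 1234567T a sholatar agus me ag denamh iarratais."
start = text.index("1234567T")
# Hypothetical detection result for the example sentence above.
spans = [(start, start + len("1234567T"), "PPSN")]

print(mask_text(text, spans))
# Duradh liom mo uimhir [PPSN] a sholatar agus me ag denamh iarratais.
```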

Key Benchmarks

Fix For The Reported Gaelic PPSN Regression

| Variant | QA Gaelic weak-context PPSN F1 |
|---|---|
| rc4 full published defaults | 0.0000 |
| rc4 q8 published defaults | 0.6667 |
| rc5 full | 1.0000 |
| rc5 q8 | 1.0000 |

Base OpenMed vs rc5

| Suite | Base OpenMed | rc5 full | rc5 ONNX q8 |
|---|---|---|---|
| Irish core manual | 0.6119 | 0.9737 | 0.9669 |
| Irish PPSN/phone edge | 0.0769 | 0.9744 | 0.9744 |
| Remaining gaps | n/a | 1.0000 | 0.8889 |
| Phone/passport/finance | n/a | 0.9600 | 0.9362 |
| Finance boundary repair | n/a | 0.9143 | 0.8750 |
| Multilingual PPSN | 0.0000 | 0.9333 | 0.9333 |
| User PPSN regressions | n/a | 1.0000 | 1.0000 |
| Irish PPSN overlap | n/a | 1.0000 | 1.0000 |

Core Label Breakdown

| Label | Base OpenMed | rc5 full | rc5 ONNX q8 |
|---|---|---|---|
| PPSN | 0.0000 | 0.9231 | 0.9231 |
| PHONE_NUMBER | 0.0000 | 0.9565 | 0.9565 |
| POSTCODE | 0.0000 | 1.0000 | 0.8571 |
| PASSPORT_NUMBER | 0.0000 | 1.0000 | 1.0000 |
| ACCOUNT_NUMBER | 0.4000 | 0.8571 | 0.8571 |
| BANK_ROUTING_NUMBER | 0.0000 | 1.0000 | 1.0000 |
| EMAIL | 1.0000 | 1.0000 | 1.0000 |
| FIRST_NAME | 0.8947 | 1.0000 | 1.0000 |
| LAST_NAME | 0.8889 | 1.0000 | 1.0000 |

Dynamic q8 Artifact

Artifact paths:

  • unquantized: onnx/model.onnx
  • quantized: onnx/model_quantized.onnx

Quantization recipe used in this repo:

  • ONNX pre-processing before quantization
  • ONNX Runtime dynamic int8
  • qint8
  • per_channel=true
  • op_types=MatMul,Gemm,Attention
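The recipe above maps fairly directly onto ONNX Runtime's quantization tooling. The following is a sketch of how it might be reproduced, not the script used to build the shipped artifact; the intermediate file name `onnx/model.preprocessed.onnx` and the `quantize` wrapper are assumptions.

```python
# Settings taken from the recipe listed in this card.
QUANT_KWARGS = {
    "per_channel": True,
    "weight_type": "QInt8",
    "op_types_to_quantize": ["MatMul", "Gemm", "Attention"],
}


def quantize(src="onnx/model.onnx",
             dst="onnx/model_quantized.onnx",
             tmp="onnx/model.preprocessed.onnx"):  # tmp path is hypothetical
    """Pre-process the ONNX export, then apply dynamic int8 quantization."""
    # Imported lazily so this module still loads where onnxruntime is absent.
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.shape_inference import quant_pre_process

    # Step 1: shape inference and graph optimization before quantization.
    quant_pre_process(src, tmp)

    # Step 2: dynamic quantization with the recipe's settings.
    kwargs = dict(QUANT_KWARGS)
    kwargs["weight_type"] = getattr(QuantType, kwargs["weight_type"])
    quantize_dynamic(tmp, dst, **kwargs)


if __name__ == "__main__":
    quantize()
```

Running it requires `onnxruntime` and the unquantized export on disk, for example via `uv run` in the bundled environment.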

This q8 path keeps the same F1 as the best prior q8 recipe on the sampled comparison suites while improving CPU throughput on the manual Irish-core suites.

CPU Throughput

| Suite | Base OpenMed | rc5 full | rc5 ONNX q8 |
|---|---|---|---|
| Irish core manual | 15.79 | 6.70 | 34.43 |
| Irish PPSN/phone edge | 16.60 | 16.50 | 36.56 |
| Multilingual PPSN | 121.08 | 125.30 | 289.49 |
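For a quick read of the table, the relative speedup of the q8 artifact over the full rc5 checkpoint can be computed directly from the figures above. The snippet is only arithmetic on the reported numbers; the `speedup` helper is illustrative, and the ratio is unitless, so it does not depend on the throughput unit.

```python
# (rc5 full, rc5 ONNX q8) throughput per suite, copied from the table above.
throughput = {
    "Irish core manual": (6.70, 34.43),
    "Irish PPSN/phone edge": (16.50, 36.56),
    "Multilingual PPSN": (125.30, 289.49),
}


def speedup(full, q8):
    """Return q8 throughput as a multiple of full-checkpoint throughput."""
    return round(q8 / full, 2)


for suite, (full, q8) in throughput.items():
    print(f"{suite}: {speedup(full, q8):.2f}x")
# Irish core manual: 5.14x
# Irish PPSN/phone edge: 2.22x
# Multilingual PPSN: 2.31x
```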

Limits

  • The full checkpoint still outperforms q8 on the finance-boundary repair suite (0.9143 vs 0.8750).
  • The q8 artifact also trails the full checkpoint on the strict remaining-gaps suite (0.8889 vs 1.0000).
  • Grouped credit/debit-card boundary cases remain the main shared weakness and should still be QA tested.

License And Attribution

  • Release license: Apache-2.0
  • Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • See NOTICE and training_sources.json for attribution and release details.