OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6

Token-classification release for detecting core Irish PII in English and Irish (Gaeilge) text.

Included Variants

  • Full transformers checkpoint in the repo root
  • Unquantized ONNX export in onnx/model.onnx
  • Dynamic q8 ONNX artifact in onnx/model_quantized.onnx
  • inference_mask.py for the full checkpoint
  • inference_mask_onnx.py for the ONNX q8 artifact
  • benchmark files in eval/

Coverage

  • PPSN
  • ACCOUNT_NUMBER
  • BANK_ROUTING_NUMBER
  • CREDIT_DEBIT_CARD
  • PASSPORT_NUMBER
  • POSTCODE
  • PHONE_NUMBER
  • EMAIL
  • FIRST_NAME
  • LAST_NAME
  • SWIFT_BIC

What Changed From rc5

rc6 keeps the same checkpoint weights and the same bundled ONNX q8 artifact as temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5.

The improvement is in the shipped inference stack:

  • regex-guided span repair for numeric and boundary-heavy labels
  • stronger Irish phone validation, including +353 (01) ... forms
  • Irish IBAN plausibility repair that accepts placeholder IBANs built on real Irish bank codes while rejecting obviously fake values such as IE00TEST...
  • PPSN structure filtering that blocks invalid shapes like 12345678T
  • exact boundary trimming for grouped card and passport formats
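
The PPSN structure filter can be illustrated with a short sketch. This is not the code shipped in inference_mask.py; it assumes the commonly documented PPSN layout (seven digits, a check letter, and an optional second letter) and the public mod-23 check-letter rule. The name is_plausible_ppsn is hypothetical.

```python
import re

# Assumed shape: 7 digits, a check letter, and an optional second letter.
PPSN_SHAPE = re.compile(r"^(\d{7})([A-W])([A-IW]?)$")


def is_plausible_ppsn(candidate: str) -> bool:
    """Return True when the candidate has a PPSN shape and a matching check letter."""
    match = PPSN_SHAPE.match(candidate.strip().upper())
    if match is None:
        return False  # e.g. 12345678T has eight digits and fails the shape test
    digits, check_letter, second_letter = match.groups()
    # Weights 8 down to 2 apply to the seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    # An optional second letter contributes with weight 9 (A=1 ... I=9, W=0).
    if second_letter and second_letter != "W":
        total += (ord(second_letter) - ord("A") + 1) * 9
    remainder = total % 23
    # Remainder 0 maps to W; 1..22 map to A..V.
    expected = "W" if remainder == 0 else chr(ord("A") + remainder - 1)
    return check_letter == expected
```

Under these assumptions, 1234567T passes and 12345678T is rejected on shape alone. The checksum step goes further than the "structure filtering" named in the list, so treat it as an optional extra rather than a description of the release.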

The weights were left unchanged because the remaining rc5 misses were mostly decoding and boundary failures rather than weaknesses in the checkpoint itself.
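
As a rough illustration of what fixing a boundary failure can look like, the sketch below snaps a predicted character span to the nearest overlapping regex match. It is a minimal example under stated assumptions: snap_span and CARD_PATTERN are hypothetical names, and the pattern is a simplified grouped 16-digit card shape, not the one used by the shipped scripts.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for a grouped 16-digit card number.
CARD_PATTERN = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")


def snap_span(
    text: str, start: int, end: int, pattern: re.Pattern
) -> Optional[Tuple[int, int]]:
    """Return the bounds of the first pattern match overlapping [start, end), or None."""
    for match in pattern.finditer(text):
        # Two half-open intervals overlap when each starts before the other ends.
        if match.start() < end and start < match.end():
            return match.start(), match.end()
    return None


text = "Card: 4111 1111 1111 1111."
# Suppose the model predicted a span that drops the first group and keeps the full stop.
repaired = snap_span(text, 11, 26, CARD_PATTERN)
```

Here the repaired span covers exactly the four digit groups and excludes the trailing full stop. A span with no overlapping match returns None, leaving the caller to decide whether to keep or drop the prediction.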

Recommended Inference

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6 \
  --ppsn-min-score 0.55 \
  --other-min-score 0.50 \
  --text "Please provide your passport: NN5123456 and call me on 0851234567." \
  --json
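
The two score flags suggest a per-label confidence threshold applied before masking. The sketch below shows one way such a filter could work. The entity dictionary layout (label, start, end, score) and the [LABEL] placeholder are assumptions for illustration, not the documented output of inference_mask.py, and spans are assumed not to overlap.

```python
from typing import Dict, List

PPSN_MIN_SCORE = 0.55   # mirrors --ppsn-min-score
OTHER_MIN_SCORE = 0.50  # mirrors --other-min-score


def mask_text(text: str, entities: List[Dict]) -> str:
    """Replace each sufficiently confident span with a [LABEL] placeholder."""
    kept = [
        e for e in entities
        if e["score"] >= (PPSN_MIN_SCORE if e["label"] == "PPSN" else OTHER_MIN_SCORE)
    ]
    # Replace from right to left so earlier offsets stay valid.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['label']}]" + text[e["end"]:]
    return text


sample = "PPSN 1234567T, phone 0851234567."
predictions = [
    {"label": "PPSN", "start": 5, "end": 13, "score": 0.52},
    {"label": "PHONE_NUMBER", "start": 21, "end": 31, "score": 0.52},
]
masked = mask_text(sample, predictions)
```

With both predictions scored at 0.52, only the phone number is masked, because the PPSN falls below its stricter 0.55 threshold.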

Dynamic q8 ONNX:

uv run python inference_mask_onnx.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6 \
  --onnx-file onnx/model_quantized.onnx \
  --ppsn-min-score 0.55 \
  --other-min-score 0.50 \
  --text "My IBAN is IE29 AIBK 9311 5212 345678 and my PPSN is 1234567T." \
  --json
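
The IBAN in the command above can be checked independently of the model. The following sketch assumes the standard Irish IBAN layout (IE, two check digits, a four-letter bank code, then 14 digits) and the usual mod-97 test. The function name is hypothetical, and the shipped plausibility repair may use different or additional rules, such as a list of known bank codes.

```python
import re

# Irish IBAN shape: IE, two check digits, four-letter bank code, 14 digits.
IE_IBAN_SHAPE = re.compile(r"^IE(\d{2})[A-Z]{4}\d{14}$")


def is_plausible_irish_iban(candidate: str) -> bool:
    """Return True when the candidate has the Irish IBAN shape and passes mod-97."""
    compact = re.sub(r"\s+", "", candidate).upper()
    match = IE_IBAN_SHAPE.match(compact)
    if match is None:
        return False
    # Check digits produced by the standard algorithm fall in 02..98.
    if not 2 <= int(match.group(1)) <= 98:
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps digits to 0..9 and letters A..Z to 10..35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

Under these assumptions the grouped value IE29 AIBK 9311 5212 345678 is accepted, while any candidate with check digits 00 is rejected before the mod-97 step.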

Key Benchmarks

rc5 vs rc6

Suite                         rc5 full   rc6 full   rc5 ONNX q8   rc6 ONNX q8
Irish core manual             0.9737     1.0000     0.9669        0.9934
Irish PPSN / phone edge       0.9744     0.9744     0.9744        1.0000
Phone / passport / finance    0.9600     1.0000     0.9362        1.0000
Finance boundary repair       0.9143     1.0000     0.8750        1.0000
QA Gaelic weak-context PPSN   1.0000     1.0000     1.0000        1.0000
HFW contextual smoke          0.7490     0.7712     n/a           0.7804

Core Label Breakdown

Label                 rc6 full   rc6 ONNX q8
PPSN                  1.0000     1.0000
PHONE_NUMBER          1.0000     1.0000
POSTCODE              1.0000     0.8571
PASSPORT_NUMBER       1.0000     1.0000
ACCOUNT_NUMBER        1.0000     1.0000
BANK_ROUTING_NUMBER   1.0000     1.0000
EMAIL                 1.0000     1.0000
FIRST_NAME            1.0000     1.0000
LAST_NAME             1.0000     1.0000

Dynamic q8 Artifact

Artifact paths:

  • unquantized: onnx/model.onnx
  • quantized: onnx/model_quantized.onnx

Quantization recipe used in this repo:

  • ONNX pre-processing before quantization
  • ONNX Runtime dynamic int8
  • weight type qint8
  • per_channel=true
  • op_types=MatMul,Gemm,Attention
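
The recipe above maps fairly directly onto the ONNX Runtime quantization API. The sketch below is a plausible reconstruction, not the script used to build the shipped artifact: quantize_q8 and the intermediate file name model_preprocessed.onnx are hypothetical, and any options not listed in the recipe are left at their library defaults.

```python
from pathlib import Path

# Settings copied from the recipe above.
OP_TYPES = ["MatMul", "Gemm", "Attention"]
PER_CHANNEL = True


def quantize_q8(
    src: str = "onnx/model.onnx", dst: str = "onnx/model_quantized.onnx"
) -> Path:
    """Run ONNX Runtime pre-processing, then dynamic int8 quantization."""
    # Imported lazily so the module can be inspected without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.shape_inference import quant_pre_process

    # Hypothetical intermediate file written next to the source model.
    preprocessed = str(Path(src).with_name("model_preprocessed.onnx"))
    quant_pre_process(src, preprocessed)
    quantize_dynamic(
        model_input=preprocessed,
        model_output=dst,
        op_types_to_quantize=OP_TYPES,
        per_channel=PER_CHANNEL,
        weight_type=QuantType.QInt8,
    )
    return Path(dst)
```

Calling quantize_q8() from the repo root would read the unquantized export and write the q8 file to the path listed under Artifact paths. Whether the result is byte-identical to the shipped artifact depends on the ONNX Runtime version and on options this recipe does not state.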

Limits

  • POSTCODE remains the main q8 gap on the manual Irish core suite.
  • The HFW contextual smoke suite still contains first/last-name false positives inherited from the underlying checkpoint.
  • This release is a stronger inference stack over rc5; it is not a new fine-tuned checkpoint.

License And Attribution

  • Release license: Apache-2.0
  • Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • See NOTICE and training_sources.json for attribution and release details.