OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6
Token-classification release for Irish core PII in English and Irish Gaelic.
Included Variants
- Full `transformers` checkpoint in the repo root
- Unquantized ONNX export in `onnx/model.onnx`
- Dynamic q8 ONNX artifact in `onnx/model_quantized.onnx`
- `inference_mask.py` for the full checkpoint
- `inference_mask_onnx.py` for the ONNX q8 artifact
- benchmark files in `eval/`
Coverage
- `PPSN`
- `ACCOUNT_NUMBER`
- `BANK_ROUTING_NUMBER`
- `CREDIT_DEBIT_CARD`
- `PASSPORT_NUMBER`
- `POSTCODE`
- `PHONE_NUMBER`
- `EMAIL`
- `FIRST_NAME`
- `LAST_NAME`
- `SWIFT_BIC`
What Changed From rc5
rc6 keeps the same checkpoint weights and the same bundled ONNX q8 artifact as temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc5.
The improvement is in the shipped inference stack:
- regex-guided span repair for numeric and boundary-heavy labels
- stronger Irish phone validation, including `+353 (01) ...` forms
- Irish IBAN plausibility repair that catches real Irish bank-code placeholders while rejecting obvious fake values like `IE00TEST...`
- PPSN structure filtering that blocks invalid shapes like `12345678T`
- exact boundary trimming for grouped card and passport formats
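The PPSN structure filter and the IBAN plausibility check listed above can be illustrated with a minimal sketch. This is not the code shipped in `inference_mask.py`; the function names, the exact PPSN letter ranges, and the choice to apply the commonly documented mod-23 (PPSN) and mod-97 (IBAN) checks are all assumptions made for illustration.

```python
import re

# Assumed PPSN shape: 7 digits, a check letter A-W, and an optional second letter.
PPSN_SHAPE = re.compile(r"\d{7}[A-W][A-IW]?")
# Assumed Irish IBAN shape: IE, 2 check digits, 4-letter bank code, 14 digits.
IE_IBAN_SHAPE = re.compile(r"IE\d{2}[A-Z]{4}\d{14}")


def ppsn_shape_ok(candidate: str) -> bool:
    """Reject shapes such as 8 digits followed by a letter."""
    return PPSN_SHAPE.fullmatch(candidate.strip().upper()) is not None


def ppsn_checksum_ok(candidate: str) -> bool:
    """Apply the commonly documented mod-23 check on top of the shape test."""
    value = candidate.strip().upper()
    if not ppsn_shape_ok(value):
        return False
    # Weights 8..2 over the seven digits.
    total = sum(int(d) * w for d, w in zip(value[:7], range(8, 1, -1)))
    # A second letter other than W contributes its alphabet position times 9.
    if len(value) == 9 and value[8] != "W":
        total += (ord(value[8]) - ord("A") + 1) * 9
    remainder = total % 23
    expected = "W" if remainder == 0 else chr(ord("A") + remainder - 1)
    return value[7] == expected


def iban_plausible(candidate: str) -> bool:
    """Shape test plus the standard IBAN mod-97 check."""
    value = re.sub(r"\s+", "", candidate).upper()
    if IE_IBAN_SHAPE.fullmatch(value) is None:
        return False
    # Computed check digits always fall in 02..98, so values like 00 are fake.
    if not 2 <= int(value[2:4]) <= 98:
        return False
    rearranged = value[4:] + value[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(digits) % 97 == 1
```

Under these assumptions, `1234567T` passes both PPSN checks while `12345678T` fails the shape test, and `IE29 AIBK 9311 5212 345678` passes the IBAN check while any value with `00` check digits is rejected.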
This is the right change because the remaining misses were mostly decoding and boundary failures, not weight quality failures.
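A boundary failure of this kind can be repaired without touching the weights. The sketch below snaps a predicted span to the regex match that overlaps it; `repair_span` and `IBAN_PATTERN` are hypothetical names, and the pattern is an illustrative assumption, not the one used in this release.

```python
import re

# Hypothetical pattern for an Irish IBAN written with optional space grouping.
IBAN_PATTERN = re.compile(r"\bIE\d{2}(?: ?[A-Z0-9]{4}){4} ?[A-Z0-9]{2}\b")


def repair_span(text: str, start: int, end: int, pattern: re.Pattern) -> tuple[int, int]:
    """Return the bounds of the first regex match overlapping [start, end).

    If no match overlaps the predicted span, the span is returned unchanged.
    """
    for match in pattern.finditer(text):
        if match.start() < end and start < match.end():
            return match.start(), match.end()
    return start, end


text = "My IBAN is IE29 AIBK 9311 5212 345678 and my PPSN is 1234567T."
# Suppose the model only tagged the middle groups of the IBAN.
partial_start = text.index("AIBK")
partial_end = text.index("5212") + len("5212")
fixed_start, fixed_end = repair_span(text, partial_start, partial_end, IBAN_PATTERN)
repaired = text[fixed_start:fixed_end]
```

Because the repair only moves span boundaries, it can also trim a span that overshoots into neighbouring text, provided the pattern is strict enough.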
Recommended Inference
Full checkpoint:
uv run python inference_mask.py \
--model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6 \
--ppsn-min-score 0.55 \
--other-min-score 0.50 \
--text "Please provide your passport: NN5123456 and call me on 0851234567." \
--json
Dynamic q8 ONNX:
uv run python inference_mask_onnx.py \
--model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6 \
--onnx-file onnx/model_quantized.onnx \
--ppsn-min-score 0.55 \
--other-min-score 0.50 \
--text "My IBAN is IE29 AIBK 9311 5212 345678 and my PPSN is 1234567T." \
--json
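The two score flags suggest a per-label confidence threshold: one cut-off for `PPSN` and another for every other label. A minimal sketch of that idea follows, assuming each predicted span is a dict with `label` and `score` keys; that span format and the `keep_span` helper are hypothetical and may not match what the scripts emit.

```python
def keep_span(span: dict, ppsn_min_score: float = 0.55, other_min_score: float = 0.50) -> bool:
    """Keep a span only if its score clears the threshold for its label."""
    threshold = ppsn_min_score if span["label"] == "PPSN" else other_min_score
    return span["score"] >= threshold


# Hypothetical predictions, for illustration only.
spans = [
    {"label": "PPSN", "score": 0.52},
    {"label": "PPSN", "score": 0.91},
    {"label": "PHONE_NUMBER", "score": 0.52},
]
kept = [s for s in spans if keep_span(s)]
```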
Key Benchmarks
rc5 vs rc6
| Suite | rc5 full | rc6 full | rc5 ONNX q8 | rc6 ONNX q8 |
|---|---|---|---|---|
| Irish core manual | 0.9737 | 1.0000 | 0.9669 | 0.9934 |
| Irish PPSN / phone edge | 0.9744 | 0.9744 | 0.9744 | 1.0000 |
| Phone / passport / finance | 0.9600 | 1.0000 | 0.9362 | 1.0000 |
| Finance boundary repair | 0.9143 | 1.0000 | 0.8750 | 1.0000 |
| QA Gaelic weak-context PPSN | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| HFW contextual smoke | 0.7490 | 0.7712 | n/a | 0.7804 |
Core Label Breakdown
| Label | rc6 full | rc6 ONNX q8 |
|---|---|---|
| PPSN | 1.0000 | 1.0000 |
| PHONE_NUMBER | 1.0000 | 1.0000 |
| POSTCODE | 1.0000 | 0.8571 |
| PASSPORT_NUMBER | 1.0000 | 1.0000 |
| ACCOUNT_NUMBER | 1.0000 | 1.0000 |
| BANK_ROUTING_NUMBER | 1.0000 | 1.0000 |
| EMAIL | 1.0000 | 1.0000 |
| FIRST_NAME | 1.0000 | 1.0000 |
| LAST_NAME | 1.0000 | 1.0000 |
Dynamic q8 Artifact
Artifact paths:
- unquantized: `onnx/model.onnx`
- quantized: `onnx/model_quantized.onnx`
Quantization recipe used in this repo:
- ONNX pre-processing before quantization
- ONNX Runtime dynamic int8: `qint8`, `per_channel=true`, `op_types=MatMul,Gemm,Attention`
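For readers who want to reproduce a similar artifact, the recipe above maps roughly onto ONNX Runtime's quantization tooling as sketched below. This is an assumption about how the settings translate, not the script used for this release; `quantize_q8`, `RECIPE`, and the intermediate file name are hypothetical.

```python
from pathlib import Path

# Recipe settings from this card, expressed as quantize_dynamic keyword arguments.
RECIPE = {
    "per_channel": True,
    "op_types_to_quantize": ["MatMul", "Gemm", "Attention"],
}


def quantize_q8(src: str, dst: str, work_dir: str = ".") -> None:
    """Pre-process an ONNX model, then apply dynamic int8 quantization."""
    # Imported lazily so the recipe can be inspected without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.shape_inference import quant_pre_process

    prepared = Path(work_dir) / "model.preprocessed.onnx"  # hypothetical name
    # Shape inference and graph optimization before quantization.
    quant_pre_process(str(src), str(prepared))
    quantize_dynamic(
        str(prepared),
        str(dst),
        weight_type=QuantType.QInt8,
        **RECIPE,
    )
```

A call such as `quantize_q8("onnx/model.onnx", "onnx/model_quantized.onnx")` would then write the q8 file. If symbolic shape inference fails on a given export, passing `skip_symbolic_shape=True` to `quant_pre_process` is one option to try.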
Limits
- `POSTCODE` remains the main q8 gap on the manual Irish core suite.
- The HFW contextual smoke suite still contains first/last-name false positives inherited from the underlying checkpoint.
- This release is a stronger inference stack over `rc5`; it is not a new fine-tuned checkpoint.
License And Attribution
- Release license: Apache-2.0
- Base model: `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1`
- See `NOTICE` and `training_sources.json` for attribution and release details.