OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7
Token-classification release for Irish core PII in English and Irish Gaelic.
Included Variants
- Full transformers checkpoint in the repo root
- Unquantized ONNX export in onnx/model.onnx
- Dynamic q8 ONNX artifact in onnx/model_quantized.onnx
- inference_mask.py for the full checkpoint
- inference_mask_onnx.py for the ONNX q8 artifact
- benchmark files in eval/
Coverage
- PPSN
- ACCOUNT_NUMBER
- BANK_ROUTING_NUMBER
- CREDIT_DEBIT_CARD
- PASSPORT_NUMBER
- POSTCODE
- PHONE_NUMBER
- EMAIL
- FIRST_NAME
- LAST_NAME
- SWIFT_BIC
What Changed From rc6
rc7 keeps the same checkpoint weights and the same bundled ONNX q8 artifact as temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6.
The change is the scanner implementation:
- candidate extraction is now defined in a human-readable spec: scanner_spec.yaml
- generated runtime data lives in irish_core_generated_scanner_spec.py
- regeneration is explicit: python generate_scanner_spec.py
- semantic validation remains procedural in ppsn.py, eircode.py, and irish_core_decoder.py
- release Python files no longer depend on the third-party regex package
- the scanner layer no longer uses regex-based candidate extraction on untrusted text
- PPSN candidate extraction no longer consumes a following word-initial letter after whitespace, so cases like 1234567T a ... stay bounded to 1234567T
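The PPSN boundary rule in the last bullet can be illustrated with a small hand-written scanner. This is only a sketch under stated assumptions: scan_ppsn_candidates is a hypothetical name, the lexical shape (7 digits, a check letter, an optional second letter) is assumed, and none of this is the shipped logic from scanner_spec.yaml or the generated module.

```python
from typing import List, Tuple

ASCII_DIGITS = "0123456789"
ASCII_LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def scan_ppsn_candidates(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, candidate) spans shaped like 7 digits + 1-2 letters.

    Hypothetical helper: a single left-to-right pass with no regex engine.
    Letters are taken only when they touch the digits directly, so a word
    that follows whitespace is never absorbed into the candidate.
    """
    spans = []
    i, n = 0, len(text)
    while i < n:
        # A candidate must start at a digit that does not continue a token.
        starts_token = i == 0 or not text[i - 1].isalnum()
        if text[i] in ASCII_DIGITS and starts_token:
            j = i
            while j < n and text[j] in ASCII_DIGITS:
                j += 1
            if j - i == 7 and j < n and text[j] in ASCII_LETTERS:
                k = j + 1  # mandatory check letter
                if k < n and text[k] in ASCII_LETTERS:
                    k += 1  # optional second letter, adjacent only
                # Reject when the token keeps going (e.g. "1234567Tab").
                if k == n or not text[k].isalnum():
                    spans.append((i, k, text[i:k]))
            i = j  # resume after the digit run
        else:
            i += 1
    return spans
```

On the input 1234567T a friend, this sketch returns a single span covering 1234567T; the following word is left out because it is separated by whitespace. Checksum validation is deliberately not part of this step.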
This is not a pure PEG grammar because some labels need semantic checks after lexical scanning:
- PPSN: checksum and suffix plausibility
- ACCOUNT_NUMBER: Irish IBAN structure and mod-97
- CREDIT_DEBIT_CARD: Luhn and grouped-card plausibility
- POSTCODE: Eircode routing-key and unique-identifier validation
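To make the semantic checks concrete, here is a minimal sketch of three of them using the publicly known algorithms: the PPSN mod-23 check character, the IBAN mod-97 check, and the Luhn check. The function names are hypothetical and the code is not taken from ppsn.py or irish_core_decoder.py. Suffix plausibility, grouped-card plausibility, and Eircode validation are not modelled here.

```python
DIGITS = "0123456789"
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def ppsn_checksum_ok(ppsn: str) -> bool:
    """PPSN mod-23 check: 7 digits, a check letter, optional second letter."""
    s = ppsn.strip().upper()
    if len(s) not in (8, 9):
        return False
    if any(c not in DIGITS for c in s[:7]) or any(c not in LETTERS for c in s[7:]):
        return False
    # Digits are weighted 8, 7, ..., 2 from left to right.
    total = sum(int(d) * w for d, w in zip(s[:7], range(8, 1, -1)))
    # A second letter contributes (A=1, B=2, ...) * 9; a legacy "W" counts as 0.
    if len(s) == 9 and s[8] != "W":
        total += (LETTERS.index(s[8]) + 1) * 9
    # Remainder 0 maps to W, 1 to A, 2 to B, and so on.
    return s[7] == "WABCDEFGHIJKLMNOPQRSTUV"[total % 23]


def irish_iban_ok(iban: str) -> bool:
    """Irish IBAN shape (IE + 2 digits + 4 letters + 14 digits) and mod-97."""
    s = "".join(iban.split()).upper()
    if len(s) != 22 or not s.startswith("IE"):
        return False
    if any(c not in DIGITS for c in s[2:4] + s[8:]):
        return False
    if any(c not in LETTERS for c in s[4:8]):
        return False
    # Move the first four characters to the end, map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    as_digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(as_digits) % 97 == 1


def luhn_ok(number: str) -> bool:
    """Luhn check over the digits; spaces and hyphens are ignored."""
    if any(c not in DIGITS + " -" for c in number):
        return False
    digits = [int(c) for c in number if c in DIGITS]
    if not 12 <= len(digits) <= 19:  # assumed plausible card lengths
        return False
    total = 0
    for pos, d in enumerate(reversed(digits)):
        if pos % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0
```

With these definitions, the PPSN 1234567T and the IBAN IE29 AIBK 9311 5212 345678 from the usage examples in this README both pass their checks. A lexical candidate that fails its check would be rejected at this stage, which is why these rules cannot live in the scan spec alone.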
So the public contract is:
- human-readable lexical scan spec in scanner_spec.yaml
- generated scanner configuration in irish_core_generated_scanner_spec.py
- deterministic semantic validators in code
How To Use It
If you want the published rc7 behavior, use the bundled inference stack.
A plain transformers token-classification pipeline will not reproduce the release behavior on numeric/boundary-heavy labels.
Full checkpoint:
uv run python inference_mask.py \
--model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7 \
--ppsn-min-score 0.55 \
--other-min-score 0.50 \
--text "Please provide your passport: NN5123456 and call me on 0851234567." \
--json
Dynamic q8 ONNX:
uv run python inference_mask_onnx.py \
--model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7 \
--onnx-file onnx/model_quantized.onnx \
--ppsn-min-score 0.55 \
--other-min-score 0.50 \
--text "My IBAN is IE29 AIBK 9311 5212 345678 and my PPSN is 1234567T." \
--json
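The two threshold flags suggest a per-label score cutoff, with a stricter value for PPSN than for the other labels. The sketch below shows one way such a filter and a masking step could look. Everything in it is an assumption for illustration: the span dictionaries (label, start, end, score), the inclusive comparison, and the [LABEL] replacement format are not the documented --json output or behavior of the bundled scripts.

```python
from typing import Dict, List


def filter_spans(spans: List[Dict], ppsn_min_score: float = 0.55,
                 other_min_score: float = 0.50) -> List[Dict]:
    """Keep spans whose score reaches the cutoff for their label."""
    kept = []
    for span in spans:
        cutoff = ppsn_min_score if span["label"] == "PPSN" else other_min_score
        if span["score"] >= cutoff:  # inclusive comparison is an assumption
            kept.append(span)
    return kept


def mask_text(text: str, spans: List[Dict]) -> str:
    """Replace each span with a [LABEL] placeholder (hypothetical format)."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:span["start"]] + f"[{span['label']}]" + out[span["end"]:]
    return out


if __name__ == "__main__":
    text = "PPSN 1234567T, phone 0851234567."
    spans = [
        {"label": "PPSN", "start": 5, "end": 13, "score": 0.52},
        {"label": "PHONE_NUMBER", "start": 21, "end": 31, "score": 0.52},
    ]
    print(mask_text(text, filter_spans(spans)))
```

With the illustrative scores above, the PPSN span falls below 0.55 and is dropped, while the phone span clears 0.50 and is masked.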
Key Benchmarks
Benchmarked with the spec-driven scanner stack shipped in this repo.
Full Checkpoint
| Suite | F1 |
|---|---|
| Irish core manual | 1.0000 |
| Phone / passport / finance | 1.0000 |
| Finance boundary repair | 1.0000 |
| QA Gaelic weak-context PPSN | 1.0000 |
ONNX q8
| Suite | F1 |
|---|---|
| Irish core manual | 0.9934 |
| Phone / passport / finance | 1.0000 |
| Finance boundary repair | 1.0000 |
| QA Gaelic weak-context PPSN | 1.0000 |
Irish Core Label Breakdown
| Label | Full | ONNX q8 |
|---|---|---|
| PPSN | 1.0000 | 1.0000 |
| PHONE_NUMBER | 1.0000 | 1.0000 |
| POSTCODE | 1.0000 | 0.8571 |
| PASSPORT_NUMBER | 1.0000 | 1.0000 |
| ACCOUNT_NUMBER | 1.0000 | 1.0000 |
| BANK_ROUTING_NUMBER | 1.0000 | 1.0000 |
| EMAIL | 1.0000 | 1.0000 |
| FIRST_NAME | 1.0000 | 1.0000 |
| LAST_NAME | 1.0000 | 1.0000 |
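For readers interpreting these scores, the sketch below shows how F1 follows from true positives, false positives, and false negatives. The counts are invented purely to show the arithmetic and are not taken from eval/. Whether the benchmark scores spans or tokens is also not stated here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative counts only: on a small label set, one miss moves F1 a lot.
    print(round(f1_score(tp=3, fp=0, fn=1), 4))
    print(round(f1_score(tp=30, fp=0, fn=1), 4))
```

The point of the example is that on a label with few gold spans, a single missed or spurious span can shift F1 by more than ten points, so small-suite scores are best read alongside the underlying counts.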
Dynamic q8 Artifact
Artifact paths:
- unquantized: onnx/model.onnx
- quantized: onnx/model_quantized.onnx
Quantization recipe used in this repo:
- ONNX pre-processing before quantization
- ONNX Runtime dynamic int8
- qint8
- per_channel=true
- op_types=MatMul,Gemm,Attention
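A recipe like the one above can be expressed with ONNX Runtime's quantization helpers. The following is a sketch, not the script used to build this release: the function names and the intermediate file path are hypothetical, and it assumes onnxruntime is installed with its quantization module available.

```python
from pathlib import Path


def q8_recipe_kwargs() -> dict:
    """Keyword arguments mirroring the listed recipe settings."""
    return {
        "per_channel": True,
        "op_types_to_quantize": ["MatMul", "Gemm", "Attention"],
    }


def quantize_q8(src: str = "onnx/model.onnx",
                dst: str = "onnx/model_quantized.onnx",
                work: str = "onnx/model.preprocessed.onnx") -> Path:
    """Pre-process, then apply dynamic int8 quantization with QInt8 weights.

    The `work` path is a hypothetical scratch file for the pre-processed model.
    """
    # Imported lazily so this module can be read without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.shape_inference import quant_pre_process

    # Shape inference and graph optimization before quantization.
    quant_pre_process(src, work)
    # Dynamic quantization: weights become int8, activations are
    # quantized at run time.
    quantize_dynamic(work, dst, weight_type=QuantType.QInt8, **q8_recipe_kwargs())
    return Path(dst)
```

Restricting quantization to MatMul, Gemm, and Attention leaves other operators in float, which is a common way to limit accuracy loss in dynamic quantization.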
Limits
- POSTCODE remains the main q8 gap on the manual Irish core suite.
- The release contract includes the bundled decoder; raw pipeline(...) output is not the release behavior.
- The scanner spec covers lexical extraction; semantic validators remain in code by design.
License And Attribution
- Release license: Apache-2.0
- Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
- See NOTICE and training_sources.json for attribution and release details.