temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1
Full transformers checkpoint derived from OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1 and tuned for Irish core PII:
- `PPSN`
- `account_number`
- `bank_routing_number`
- `credit_debit_card`
- `PASSPORT_NUMBER`
- `postcode`
- `phone_number`
- `email`
- `first_name`
- `last_name`
- `swift_bic`
The main focus is English + Irish Gaelic (ga) handling for Irish administrative, citizen-support, and HSE-style text.
Included Artifacts
- Full `transformers` model files in the repo root
- Dynamic int8 ONNX export in `onnx/model_quantized.onnx`
- `inference_mask.py` for the full model
- `inference_mask_onnx.py` for the ONNX int8 artifact
- clean benchmark summaries in `eval/`
Recommended Inference
Highest accuracy:
python3 inference_mask.py \
--text "My PPSN is 1234567TW and call me on 087 123 4567." \
--json
Fast CPU path:
python3 inference_mask_onnx.py \
--text "My PPSN is 1234567TW and call me on 087 123 4567." \
--json
Dedicated Eircode example:
python3 inference_mask.py \
--text "My Eircode is D02 X285." \
--json
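For callers who would rather stay in Python than shell out to the bundled scripts, the sketch below shows one way the checkpoint might be loaded with the generic `transformers` token-classification pipeline and its spans turned into a masked string. It assumes the generic pipeline is compatible with this repo's weights and label set; `inference_mask.py` may apply additional logic, so its output can differ. `load_tagger` and `mask_spans` are hypothetical helper names, not files or functions shipped in the repo.

```python
from typing import Iterable, Mapping

MODEL_ID = "temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v1"


def load_tagger(model_id: str = MODEL_ID):
    """Build a generic token-classification pipeline (compatibility is assumed)."""
    # Imported lazily so mask_spans stays usable without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into entity spans
    # that carry "entity_group", "start" and "end" keys.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def mask_spans(text: str, spans: Iterable[Mapping]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    masked = text
    # Work right to left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['entity_group']}]"
        masked = masked[: span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Usage (downloads the checkpoint on first call):
#   tagger = load_tagger()
#   sample = "My PPSN is 1234567TW and call me on 087 123 4567."
#   print(mask_spans(sample, tagger(sample)))
```

Keeping the masking step separate from model loading means the same helper can be reused with spans from any backend that reports character offsets.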
Benchmarks
Reference comparison (F1 per label) on the manual Irish core suite and PPSN regression suites:
| Label | Base OpenMed | Previous Public Model | This Release | ONNX Q8 |
|---|---|---|---|---|
| `PPSN` | 0.0000 | 0.0800 | 0.8000 | 0.7273 |
| `account_number` | 0.3333 | 0.3333 | 1.0000 | 1.0000 |
| `bank_routing_number` | 0.0000 | 0.0000 | 1.0000 | 1.0000 |
| `credit_debit_card` | 0.1538 | 0.1818 | 1.0000 | 0.3333 |
| `PASSPORT_NUMBER` | 0.0000 | 0.0000 | 1.0000 | 1.0000 |
| `postcode` | 0.0000 | 0.0000 | 1.0000 | 1.0000 |
| `phone_number` | 0.0000 | 0.0000 | 0.8571 | 0.8571 |
| `email` | 0.7059 | 1.0000 | 1.0000 | 1.0000 |
| `first_name` | 0.8947 | 0.8947 | 1.0000 | 1.0000 |
| `last_name` | 0.8889 | 0.8889 | 1.0000 | 1.0000 |
| `swift_bic` | 0.0000 | 0.0000 | 1.0000 | 1.0000 |
Edge and multilingual PPSN checks:
| Suite | Base OpenMed | Previous Public Model | This Release | ONNX Q8 |
|---|---|---|---|---|
| `edge_ppsn` | 0.0000 | 0.4211 | 0.5000 | 0.4000 |
| `edge_phone_number` | 0.1429 | 0.1429 | 0.6316 | 0.5000 |
| `multilingual_ppsn` | 0.0000 | 0.9704 | 0.9940 | 0.9882 |
Multilingual PPSN throughput on CPU (eval/multilingual_ppsn_v1_all.jsonl):
- Base OpenMed: 42.30 examples/s
- Previous public PPSN model: 42.63 examples/s
- This release: 41.18 examples/s
- ONNX Q8: 81.99 examples/s
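To put the throughput figures on a common scale, the short sketch below divides each one by the full-checkpoint number for this release. The values are copied from the list above; the `speedup` helper is a hypothetical name used only for illustration.

```python
# Throughput in examples/s, copied from the list above.
throughput = {
    "Base OpenMed": 42.30,
    "Previous public PPSN model": 42.63,
    "This release": 41.18,
    "ONNX Q8": 81.99,
}


def speedup(candidate: str, baseline: str = "This release") -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return throughput[candidate] / throughput[baseline]


for name in throughput:
    print(f"{name}: {speedup(name):.2f}x")
```

On these numbers the ONNX Q8 export processes roughly twice as many examples per second as the full checkpoint, while the three full-precision models sit within a few percent of each other.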
Practical Reading Of The Benchmarks
- This release is materially better than the previous public PPSN-only model on Irish phone numbers, Eircodes, account details, passport numbers, and names.
- The bundled ONNX int8 export is useful for CPU speed, but it is not accuracy-identical to the full checkpoint.
- The largest ONNX drops are on `credit_debit_card` and some PPSN edge cases. Use the full model when those matter.
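The trade-off in these notes can be made concrete with a small sketch that computes the per-label score drop between the full checkpoint and the ONNX Q8 export, using the "This Release" and "ONNX Q8" columns of the reference comparison table. The `pick_backend` helper and its `max_drop=0.05` tolerance are hypothetical choices for illustration, not a recommendation from the model authors.

```python
# Scores copied from the "This Release" column of the reference table.
FULL = {
    "PPSN": 0.8000, "account_number": 1.0000, "bank_routing_number": 1.0000,
    "credit_debit_card": 1.0000, "PASSPORT_NUMBER": 1.0000, "postcode": 1.0000,
    "phone_number": 0.8571, "email": 1.0000, "first_name": 1.0000,
    "last_name": 1.0000, "swift_bic": 1.0000,
}
# Scores copied from the "ONNX Q8" column of the same table.
ONNX_Q8 = dict(FULL, PPSN=0.7273, credit_debit_card=0.3333)


def onnx_drop(label: str) -> float:
    """Score lost on `label` when switching from the full model to ONNX Q8."""
    return round(FULL[label] - ONNX_Q8[label], 4)


def pick_backend(labels, max_drop: float = 0.05) -> str:
    """Return "onnx" if every label of interest loses at most `max_drop`."""
    return "onnx" if all(onnx_drop(l) <= max_drop for l in labels) else "full"


print({label: onnx_drop(label) for label in FULL if onnx_drop(label) > 0})
print(pick_backend(["email", "postcode"]))
print(pick_backend(["credit_debit_card"]))
```

Only `PPSN` and `credit_debit_card` differ between the two columns, so a workload that needs neither label could plausibly take the faster ONNX path.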
License And Attribution
- Model weights in this repo are distributed under Apache-2.0.
- Base model: `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1`
- Training data included synthetic Irish data plus attributed upstream data from:
  - `joelniklaus/mapa` (cc-by-4.0)
  - `gretelai/synthetic_pii_finance_multilingual` (apache-2.0)
- See `NOTICE` for attribution details.
Evaluation results (self-reported)
- PPSN F1 on `irish_core_pii_v1`: 0.800
- phone_number F1 on `irish_core_pii_v1`: 0.857
- postcode F1 on `irish_core_pii_v1`: 1.000
- PPSN F1 on `multilingual_ppsn_v1_all`: 0.994