OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7

Token-classification release for Irish core PII in English and Irish Gaelic.

Included Variants

  • Full transformers checkpoint in the repo root
  • Unquantized ONNX export in onnx/model.onnx
  • Dynamic q8 ONNX artifact in onnx/model_quantized.onnx
  • inference_mask.py for the full checkpoint
  • inference_mask_onnx.py for the ONNX q8 artifact
  • benchmark files in eval/

Coverage

  • PPSN
  • ACCOUNT_NUMBER
  • BANK_ROUTING_NUMBER
  • CREDIT_DEBIT_CARD
  • PASSPORT_NUMBER
  • POSTCODE
  • PHONE_NUMBER
  • EMAIL
  • FIRST_NAME
  • LAST_NAME
  • SWIFT_BIC

What Changed From rc6

rc7 keeps the same checkpoint weights and the same bundled ONNX q8 artifact as temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc6.

The change is the scanner implementation:

  • candidate extraction is now defined in a human-readable spec: scanner_spec.yaml
  • generated runtime data lives in irish_core_generated_scanner_spec.py
  • regeneration is explicit: python generate_scanner_spec.py
  • semantic validation remains procedural in ppsn.py, eircode.py, and irish_core_decoder.py
  • release Python files no longer depend on the third-party regex package
  • the scanner layer no longer uses regex-based candidate extraction on untrusted text
  • PPSN candidate extraction no longer consumes a following word-initial letter after whitespace, so an input like "1234567T a ..." stays bounded to "1234567T"
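The boundary behavior in the last bullet can be illustrated with a minimal sketch. This is not the release scanner; `ppsn_candidates` is a hypothetical helper showing regex-free extraction of "7 digits plus 1-2 trailing letters" that never skips whitespace to grab a following word-initial letter:

```python
def ppsn_candidates(text: str) -> list[str]:
    """Sketch only: collect 7-digit runs with 1-2 immediately adjacent letters."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isdigit():
            j = i
            while j < n and text[j].isdigit():
                j += 1
            if j - i == 7:  # exactly seven digits
                k = j
                # consume at most two adjacent letters; whitespace ends the candidate
                while k < n and k - j < 2 and text[k].isalpha():
                    k += 1
                if k > j:
                    out.append(text[i:k])
            i = j
        else:
            i += 1
    return out
```

On "1234567T a tomorrow" this yields ["1234567T"]: the space after "T" stops the scan, so the following "a" is never absorbed.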

This is not a pure PEG grammar because some labels need semantic checks after lexical scanning:

  • PPSN: checksum and suffix plausibility
  • ACCOUNT_NUMBER: Irish IBAN structure and mod-97
  • CREDIT_DEBIT_CARD: Luhn and grouped-card plausibility
  • POSTCODE: Eircode routing-key and unique-identifier validation
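Two of the checks above are standard algorithms and can be sketched briefly. These are illustrative stand-ins, not the validators shipped in ppsn.py or irish_core_decoder.py:

```python
def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for pos, ch in enumerate(reversed(digits)):
        d = int(ch)
        if pos % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def iban_mod97_ok(iban: str) -> bool:
    """IBAN mod-97: rotate the first four chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

The sample IBAN used later in this card, IE29 AIBK 9311 5212 345678, passes the mod-97 check; the release validators additionally check the Irish IBAN structure (IE + 2 check digits + 4-letter bank code + 14 digits).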

So the public contract is:

  1. human-readable lexical scan spec in scanner_spec.yaml
  2. generated scanner configuration in irish_core_generated_scanner_spec.py
  3. deterministic semantic validators in code

How To Use It

If you want the published rc7 behavior, use the bundled inference stack. A plain transformers token-classification pipeline will not reproduce the release behavior on numeric/boundary-heavy labels.

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7 \
  --ppsn-min-score 0.55 \
  --other-min-score 0.50 \
  --text "Please provide your passport: NN5123456 and call me on 0851234567." \
  --json

Dynamic q8 ONNX:

uv run python inference_mask_onnx.py \
  --model temsa/OpenMed-mLiteClinical-IrishCorePII-135M-v2-rc7 \
  --onnx-file onnx/model_quantized.onnx \
  --ppsn-min-score 0.55 \
  --other-min-score 0.50 \
  --text "My IBAN is IE29 AIBK 9311 5212 345678 and my PPSN is 1234567T." \
  --json

Key Benchmarks

Benchmarked with the spec-driven scanner stack shipped in this repo.

Full Checkpoint

Suite                          F1
Irish core manual              1.0000
Phone / passport / finance     1.0000
Finance boundary repair        1.0000
QA Gaelic weak-context PPSN    1.0000

ONNX q8

Suite                          F1
Irish core manual              0.9934
Phone / passport / finance     1.0000
Finance boundary repair        1.0000
QA Gaelic weak-context PPSN    1.0000

Irish Core Label Breakdown

Label                  Full     ONNX q8
PPSN                   1.0000   1.0000
PHONE_NUMBER           1.0000   1.0000
POSTCODE               1.0000   0.8571
PASSPORT_NUMBER        1.0000   1.0000
ACCOUNT_NUMBER         1.0000   1.0000
BANK_ROUTING_NUMBER    1.0000   1.0000
EMAIL                  1.0000   1.0000
FIRST_NAME             1.0000   1.0000
LAST_NAME              1.0000   1.0000

Dynamic q8 Artifact

Artifact paths:

  • unquantized: onnx/model.onnx
  • quantized: onnx/model_quantized.onnx

Quantization recipe used in this repo:

  • ONNX pre-processing before quantization
  • ONNX Runtime dynamic int8
  • qint8
  • per_channel=true
  • op_types=MatMul,Gemm,Attention
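The recipe above maps onto ONNX Runtime's dynamic quantizer roughly as follows. This is a hedged sketch, not the exact release tooling; the pre-processed input path is an assumption (pre-processing itself is done separately, e.g. via python -m onnxruntime.quantization.preprocess):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="onnx/model-prep.onnx",       # assumed output of the pre-processing step
    model_output="onnx/model_quantized.onnx",
    weight_type=QuantType.QInt8,              # qint8
    per_channel=True,                         # per_channel=true
    op_types_to_quantize=["MatMul", "Gemm", "Attention"],
)
```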

Limits

  • POSTCODE remains the main q8 gap on the manual Irish core suite.
  • The release contract includes the bundled decoder; raw pipeline(...) output is not the release behavior.
  • The scanner spec covers lexical extraction; semantic validators remain in code by design.

License And Attribution

  • Release license: Apache-2.0
  • Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • See NOTICE and training_sources.json for attribution and release details.