IrishCore-DiffMask-135M-v1-rc3
IrishCore-DiffMask-135M-v1-rc3 is a raw-only Irish PII masking model derived from OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.
It is a small, scanner-free span extractor tuned for:
`PPSN`, `ACCOUNT_NUMBER`, `BANK_ROUTING_NUMBER`, `CREDIT_DEBIT_CARD`, `PASSPORT_NUMBER`, `POSTCODE`, `PHONE_NUMBER`, `EMAIL`, `FIRST_NAME`, `LAST_NAME`, `SWIFT_BIC`
The main target is English plus Irish Gaelic text in citizen-support, public-sector, and HSE-style flows. The repo ships both the full transformers checkpoint and a dynamic q8 ONNX artifact for CPU deployment.
What "DiffMask" Means Here
This release is not a generative diffusion language model. It is a compact discriminative token-span model trained with a diffusion-style denoising schedule.
The short version:
- Base OpenMed: plain BIO token classification
- DiffMask: token-span extraction with token-presence and boundary heads
- DiffMask training: repeated masked denoising over the same sentence
- DiffMask inference: one forward pass, no iterative refinement, no text generation
Concretely:
- The encoder starts from the DistilBERT-family weights inside `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1`.
- The model adds three task heads over the encoder hidden states:
- a per-label token-presence head
- a typed start-boundary head
- a typed end-boundary head
- During training, each input sentence is corrupted multiple times by replacing a random fraction of visible tokens with `[MASK]`.
- The corruption level follows a short noise schedule from heavy masking to light masking.
- The same gold spans are learned at every noise level, and the losses are averaged across the denoising passes.
- At inference time there is no diffusion loop and no rewrite step: the model runs once and a score-only span decoder reconstructs spans from token scores plus typed boundaries.
So the "DLLM" aspect here is the training recipe: repeated masked denoising over text, not autoregressive generation.
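The corruption step above can be sketched in a few lines. This is a minimal illustration, not the release training code: the function name, the `MASK_ID` value, and the specific schedule fractions are all hypothetical.

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id for illustration

def noised_copies(token_ids, schedule=(0.6, 0.3, 0.1), seed=0):
    """Return one corrupted copy of the sentence per noise level,
    going from heavy masking to light masking."""
    rng = random.Random(seed)
    copies = []
    for mask_frac in schedule:
        copy = [MASK_ID if rng.random() < mask_frac else t for t in token_ids]
        copies.append(copy)
    return copies
```

Each copy keeps the same gold spans as its target, so the model sees the same supervision under several corruption strengths.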
What It Is Not
This model is not a full discrete diffusion language model in the LLaDA sense.
A true DLLM would usually have:
- timestep or noise conditioning inside the model
- iterative denoising at inference time
- multi-step sequence refinement at runtime
- text generation or full-sequence reconstruction as a first-class objective
This release does not do that.
Instead, it uses the diffusion idea only as a training-time robustness trick:
- corrupt the sentence with `[MASK]` at several noise levels
- train on the same target spans each time
- average those losses
At runtime, it behaves like a normal fast discriminative extractor.
Architecture
- Encoder: DistilBERT-size encoder from the OpenMed mLiteClinical 135M base
- Heads:
- token presence per released label
- typed start boundary per released label
- typed end boundary per released label
- Decoder:
- score-only span decoding from offsets, token continuity, label-specific thresholds, and typed boundaries
- no regex candidate extractor
- no checksum validator
- no scanner layer
The release behavior is fully defined by the weights plus the bundled decoder in `common.py`.
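The score-only decoding idea can be illustrated with a toy, single-label sketch. This is not the bundled decoder in `common.py`; it simply shows how spans can be recovered from per-token presence scores plus typed boundary scores, with the threshold value chosen arbitrarily.

```python
def decode_spans(presence, start, end, threshold=0.5):
    """Toy score-only decoding for one label: take contiguous tokens whose
    presence score clears the threshold, then snap each run's edges to the
    strongest start/end boundary scores inside the run."""
    spans, i, n = [], 0, len(presence)
    while i < n:
        if presence[i] < threshold:
            i += 1
            continue
        j = i
        while j + 1 < n and presence[j + 1] >= threshold:
            j += 1
        s = max(range(i, j + 1), key=lambda k: start[k])
        e = max(range(i, j + 1), key=lambda k: end[k])
        if s <= e:
            spans.append((s, e))
        i = j + 1
    return spans
```

The released decoder additionally uses character offsets, token continuity, and label-specific thresholds, but the core shape is the same: scores in, spans out, no candidate regexes.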
Training And Inference Flow
Training:
- tokenize a sentence with gold BIO spans
- convert spans into:
- token-presence targets
- typed start targets
- typed end targets
- create several noised copies of the same tokenized sentence by masking random visible tokens
- run the same encoder+heads on each noised copy
- average the losses across those denoising passes
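The span-to-target conversion in the second step can be sketched as follows, using plain 0/1 lists in place of tensors; the function and argument names are illustrative only.

```python
def span_targets(num_tokens, labels, gold_spans):
    """Convert gold (label, start_token, end_token) spans into the three
    per-label targets: token presence, typed start, and typed end."""
    presence = {lab: [0] * num_tokens for lab in labels}
    starts = {lab: [0] * num_tokens for lab in labels}
    ends = {lab: [0] * num_tokens for lab in labels}
    for lab, s, e in gold_spans:
        for t in range(s, e + 1):
            presence[lab][t] = 1  # every token inside the span is "present"
        starts[lab][s] = 1        # typed start boundary
        ends[lab][e] = 1          # typed end boundary
    return presence, starts, ends
```

The same targets are reused for every noised copy of the sentence, and the per-copy losses are averaged.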
Inference:
- tokenize the raw text once
- run a single forward pass
- predict:
- which labels are present on each token
- where each labeled span starts
- where each labeled span ends
- decode spans with label-aware thresholds and boundary rules
- replace the detected spans with placeholders such as `[PII:PPSN]`
There is no multi-step refinement loop in deployment.
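The final placeholder-replacement step is straightforward string surgery. A minimal sketch, assuming spans arrive as `(label, char_start, char_end)` triples with end-exclusive offsets:

```python
def mask_text(text, spans):
    """Replace detected (label, char_start, char_end) spans with
    [PII:LABEL] placeholders, working right to left so earlier
    character offsets stay valid as the text changes length."""
    for label, s, e in sorted(spans, key=lambda x: x[1], reverse=True):
        text = text[:s] + f"[PII:{label}]" + text[e:]
    return text
```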
How It Differs From The Original OpenMed Model
The original `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1` is a standard `DistilBertForTokenClassification` model:
- one encoder
- one token-classification head
- BIO labels such as `B-email`, `I-email`, `B-phone_number`
- generic token aggregation to recover spans
DiffMask changes two things:
Different supervision
- base OpenMed learns only BIO token labels
- DiffMask learns token presence plus typed span boundaries
Different training recipe
- base OpenMed is trained as a standard token classifier
- DiffMask is trained on multiple masked-noised views of the same sentence
That makes DiffMask better suited to structured Irish identifiers and mixed PII masking, while still keeping a small encoder and a fast CPU path.
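For a toy tokenisation of "call 087 123 4567", the two supervision schemes look roughly like this (the exact label names and tokenisation are illustrative):

```python
tokens = ["call", "087", "123", "4567"]

# Base OpenMed: one BIO tag per token
bio = ["O", "B-phone_number", "I-phone_number", "I-phone_number"]

# DiffMask: per-label token presence plus typed boundaries
presence_phone = [0, 1, 1, 1]  # PHONE_NUMBER present on tokens 1-3
start_phone = [0, 1, 0, 0]     # typed start boundary
end_phone = [0, 0, 0, 1]       # typed end boundary
```

The boundary heads give the decoder an explicit signal for where a structured identifier begins and ends, instead of inferring it from B/I tag transitions alone.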
How It Differs From rc5 And rc8
| Model | Core idea | External scanner/validator | Runtime shape |
|---|---|---|---|
| rc5 | token classifier + repair logic | yes | heavier, decoder-assisted |
| rc8 | raw-only token-span model | no | one pass + span decoder |
| DiffMask | raw-only token-span model + denoising training | no | one pass + span decoder |
So DiffMask is closest to rc8 operationally, but it uses a stronger training recipe.
Why This Exists
The older rc5 release still depended on a repair-oriented decoder stack. The public rc8 release removed that external logic, but it regressed on several structured Irish identifiers. This release keeps the raw-only deployment shape while re-hardening the model on Irish numeric and mixed-PII cases.
rc3 is the next candidate after rc2. It keeps the stronger focusv3 checkpoint selected during local iteration, then applies a small decoder-profile retune for the published config:
- lower the `EMAIL` token extend threshold to keep contiguous mailbox fragments together
- lower the `PASSPORT_NUMBER` q8 threshold slightly to recover a mixed-message passport miss after dynamic quantization
The weights remain raw-only and scanner-free. The rc3 change is the checkpoint plus a stricter release-time decoder profile in `config.json`.
References
Direct implementation references:
- Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805)
- Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (https://arxiv.org/abs/1910.01108)
- Zhu and Li, Boundary Smoothing for Named Entity Recognition (https://aclanthology.org/2022.acl-long.490/)
- Fu et al., SpanNER: Named Entity Re-/Recognition as Span Prediction (https://aclanthology.org/2021.acl-long.558/)
Conceptual diffusion-style training references:
- Nie et al., LLaDA 2.0: Scaling Up Diffusion Language Models to 100B (https://arxiv.org/abs/2512.15745)
- Gong et al., Scaling Diffusion Language Models via Adaptation from Autoregressive Models (https://arxiv.org/abs/2410.17891)
These diffusion papers were used as architectural inspiration for the masked noising schedule. This release does not implement a generative text diffusion runtime.
Included Artifacts
- Full `transformers` checkpoint in the repo root
- Dynamic q8 ONNX export in `onnx/model_quantized.onnx`
- Unquantized ONNX export in `onnx/model.onnx`
- `inference_mask.py` for the full checkpoint
- `inference_mask_onnx.py` for the ONNX q8 path
- `common.py`, `model.py`, and `multitask_model.py` implementing the release decoder
- benchmark files in `eval/`
Artifact sizes:
- Full checkpoint: 514 MB (`model.safetensors`)
- Dynamic q8 ONNX: 393 MB (`onnx/model_quantized.onnx`)
How To Use It
Full checkpoint:
```bash
uv run python inference_mask.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc3 \
  --min-score 0.5 \
  --text "My PPSN is 1234567TW, my Eircode is D02 X285, and my phone is 087 123 4567." \
  --json
```
Dynamic q8 ONNX:
```bash
uv run python inference_mask_onnx.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc3 \
  --min-score 0.5 \
  --text "Please provide your passport NN5123456 and call me on 0851234567." \
  --json
```
Both scripts emit explicit placeholders like `[PII:PPSN]` in `masked_text`.
Q8 Comparison
Deployment-relevant comparison on CPU:
| Model | Core F1 | Edge F1 | Finance F1 | Finance-boundary F1 | User PPSN F1 | GA weak PPSN F1 | Multilingual PPSN F1 | Hardening F1 |
|---|---|---|---|---|---|---|---|---|
| rc5 ONNX q8 | 0.9669 | 0.9744 | 0.9362 | 0.8750 | 1.0000 | 1.0000 | 0.9333 | - |
| rc8 ONNX q8 | 0.9737 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9176 | 0.7059 |
| IrishCore-DiffMask-135M-v1-rc3 ONNX q8 | 0.9664 | 1.0000 | 1.0000 | 1.0000 | 0.8571 | 1.0000 | 0.9591 | 1.0000 |
UAT replay exact suite used for the recent hardening pass:
| Model | UAT replay exact F1 | Precision | Recall |
|---|---|---|---|
| IrishCore-DiffMask-135M-v1-rc1 ONNX q8 | 0.4545 | 1.0000 | 0.2941 |
| IrishCore-DiffMask-135M-v1-rc2 ONNX q8 | 0.8276 | 1.0000 | 0.7059 |
| rc8 ONNX q8 | 0.3636 | 0.3750 | 0.3529 |
| IrishCore-DiffMask-135M-v1-rc3 ONNX q8 | 0.9032 | 1.0000 | 0.8235 |
CPU throughput references:
| Suite | rc5 q8 | rc8 q8 | IrishCore-DiffMask-135M-v1-rc3 q8 |
|---|---|---|---|
| Irish core short-text path | 33.6193 ex/s | 257.3756 ex/s | 29.9676 ex/s |
| Multilingual PPSN short-text path | 35.5561 ex/s | 230.5181 ex/s | 54.2219 ex/s |
| Runtime profile source | 23.8338 ex/s | 179.4708 ex/s | 46.1519 ex/s |
Notes:
- The rc5 speed references come from its published q8 end-to-end inference stack, which includes its older repair decoder.
- The rc8 and IrishCore-DiffMask-135M-v1-rc3 numbers use the same raw-only token-span ONNX path.
- A weight-only q4 ONNX experiment was also tried during development, but it was slower than q8 on this CPU and is not shipped.
- The `user_raw_regression_cases_v1` suite is a legacy PPSN-only regression set. In rc3, the single counted false positive is `0871234567`, which is now intentionally masked as `PHONE_NUMBER` rather than misread as `PPSN`.
Additional Training Data Used For This RC
Published training sources:
- `temsa/OpenMed-Irish-CorePII-TrainMix-v1`
- `temsa/OpenMed-Irish-PPSN-Eircode-Spec-v1`
- `joelniklaus/mapa`
- `gretelai/synthetic_pii_finance_multilingual`
Additional local synthetic hardening and replay sets used during checkpoint selection:
- `irish_core_diffmask_v5_mix`
- `dllm_uat_replay_v1`
- `dllm_gap_patch_v4`
- `dllm_uat_patch_v3`
- `irish_core_diffmask_focus_v3`
rc3 is based on the locally selected focusv3 checkpoint and then retuned with a narrower decoder profile for the public config.
Limits
- This is still a compact model. The hardest remaining errors are multilingual PPSN near-miss cases rather than Irish core numeric formats.
- The release path is intentionally scanner-free. If you need deterministic validation of individual identifier types, add that in your application layer.
- If you rely on release behavior, use the bundled inference scripts or import `decode_token_presence_segments` from `common.py`.
- Known remaining misses on the current UAT replay suite are the second phone number in the long Client Identity Services sentence (`071 967 2616`), `R93 EC57` inside the longer allocation-centre block, and `EPStamp4@enterprise.gov.ie`.
License And Attribution
- Release license: Apache-2.0
- Base model: `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1`
- The derivative release remains subject to the attribution terms of the upstream datasets listed above.
- See `NOTICE`, `training_sources.json`, and `eval/benchmark_summary.json` for provenance and benchmark details.