IrishCore-DiffMask-135M-v1-rc1

IrishCore-DiffMask-135M-v1-rc1 is a raw-only Irish PII masking model derived from OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.

It is a small, scanner-free span extractor tuned for:

  • PPSN
  • ACCOUNT_NUMBER
  • BANK_ROUTING_NUMBER
  • CREDIT_DEBIT_CARD
  • PASSPORT_NUMBER
  • POSTCODE
  • PHONE_NUMBER
  • EMAIL
  • FIRST_NAME
  • LAST_NAME
  • SWIFT_BIC

The main target is English plus Irish Gaelic text in citizen-support, public-sector, and HSE-style flows. The repo ships both the full transformers checkpoint and a dynamic q8 ONNX artifact for CPU deployment.

What "DiffMask" Means Here

This release is not a generative diffusion language model. It is a compact discriminative token-span model trained with a diffusion-style denoising schedule.

The short version:

  • Base OpenMed: plain BIO token classification
  • DiffMask: token-span extraction with token-presence and boundary heads
  • DiffMask training: repeated masked denoising over the same sentence
  • DiffMask inference: one forward pass, no iterative refinement, no text generation

Concretely:

  • The encoder starts from the DistilBERT-family weights inside OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1.
  • The model adds three task heads over the encoder hidden states:
    • a per-label token-presence head
    • a typed start-boundary head
    • a typed end-boundary head
  • During training, each input sentence is corrupted multiple times by replacing a random fraction of visible tokens with [MASK].
  • The corruption level follows a short noise schedule from heavy masking to light masking.
  • The same gold spans are learned at every noise level, and the losses are averaged across the denoising passes.
  • At inference time there is no diffusion loop and no rewrite step: the model runs once and a score-only span decoder reconstructs spans from token scores plus typed boundaries.

So the "DLLM" (diffusion language model) aspect here is the training recipe: repeated masked denoising over text, not autoregressive generation.
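The corruption step described above can be sketched as follows. This is an illustrative simplification, not the release training code: the `[MASK]` id, the linear heavy-to-light schedule values, and the function names are all assumptions.

```python
import random

MASK_ID = 103  # illustrative [MASK] token id, not necessarily the release tokenizer's

def noised_views(token_ids, special_mask, schedule=(0.5, 0.3, 0.1), seed=0):
    """Return one corrupted copy of token_ids per noise level.

    special_mask[i] is True for tokens that must never be masked
    (e.g. [CLS]/[SEP]); the schedule runs from heavy to light masking,
    and the same gold spans are later supervised on every view.
    """
    rng = random.Random(seed)
    visible = [i for i, s in enumerate(special_mask) if not s]
    views = []
    for rate in schedule:
        n_mask = max(1, int(len(visible) * rate))
        picked = set(rng.sample(visible, n_mask))
        views.append([MASK_ID if i in picked else t
                      for i, t in enumerate(token_ids)])
    return views
```

Each returned view feeds the same encoder and heads, and the per-view losses are averaged, as described in the training flow below.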

What It Is Not

This model is not a full discrete diffusion language model in the LLaDA sense.

A true DLLM would usually have:

  • timestep or noise conditioning inside the model
  • iterative denoising at inference time
  • multi-step sequence refinement at runtime
  • text generation or full-sequence reconstruction as a first-class objective

This release does not do that.

Instead, it uses the diffusion idea only as a training-time robustness trick:

  • corrupt the sentence with [MASK] at several noise levels
  • train on the same target spans each time
  • average those losses

At runtime, it behaves like a normal fast discriminative extractor.

Architecture

  • Encoder: DistilBERT-size encoder from the OpenMed mLiteClinical 135M base
  • Heads:
    • token presence per released label
    • typed start boundary per released label
    • typed end boundary per released label
  • Decoder:
    • score-only span decoding from offsets, token continuity, label-specific thresholds, and typed boundaries
    • no regex candidate extractor
    • no checksum validator
    • no scanner layer

The release behavior is fully defined by the weights plus the bundled decoder in common.py.
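The three-head layout over the encoder hidden states can be sketched as below. This is a minimal illustration, not the release implementation in multitask_model.py; the hidden size and label count are assumed values.

```python
import torch
import torch.nn as nn

class DiffMaskHeads(nn.Module):
    """Sketch of the three task heads applied to encoder hidden states."""

    def __init__(self, hidden_size=768, num_labels=11):
        super().__init__()
        self.presence = nn.Linear(hidden_size, num_labels)  # per-label token presence
        self.start = nn.Linear(hidden_size, num_labels)     # typed start boundary
        self.end = nn.Linear(hidden_size, num_labels)       # typed end boundary

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the DistilBERT-family encoder
        return (self.presence(hidden_states),
                self.start(hidden_states),
                self.end(hidden_states))
```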

Training And Inference Flow

Training:

  1. tokenize a sentence with gold BIO spans
  2. convert spans into:
    • token-presence targets
    • typed start targets
    • typed end targets
  3. create several noised copies of the same tokenized sentence by masking random visible tokens
  4. run the same encoder+heads on each noised copy
  5. average the losses across those denoising passes
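Steps 3 through 5 can be sketched as a single loss function. The choice of binary cross-entropy and the interface of `model` are assumptions for illustration; only the shape of the computation (shared targets, losses averaged across noised views) comes from the description above.

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, views, presence_t, start_t, end_t):
    """Average the span losses across several noised views of one sentence.

    model maps a noised view to (presence, start, end) logits; the three
    targets are shared across all views, so every noise level is trained
    on the same gold spans.
    """
    losses = []
    for view in views:
        p, s, e = model(view)
        loss = (F.binary_cross_entropy_with_logits(p, presence_t)
                + F.binary_cross_entropy_with_logits(s, start_t)
                + F.binary_cross_entropy_with_logits(e, end_t))
        losses.append(loss)
    return torch.stack(losses).mean()
```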

Inference:

  1. tokenize the raw text once
  2. run a single forward pass
  3. predict:
    • which labels are present on each token
    • where each labeled span starts
    • where each labeled span ends
  4. decode spans with label-aware thresholds and boundary rules
  5. replace the detected spans with placeholders such as [PII:PPSN]

There is no multi-step refinement loop in deployment.
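Step 4 of the inference flow, score-only span decoding, can be sketched for a single label as follows. This is a deliberate simplification of the bundled decoder in common.py, which additionally handles character offsets, token continuity, and label-specific thresholds; the greedy rule here is illustrative only.

```python
def decode_spans(presence, starts, ends, threshold=0.5):
    """Greedy single-label sketch of score-only span decoding.

    presence/starts/ends are per-token probabilities for one label.
    A span opens where both presence and start clear the threshold,
    and extends until the end score fires or presence drops.
    """
    spans, i, n = [], 0, len(presence)
    while i < n:
        if presence[i] >= threshold and starts[i] >= threshold:
            j = i
            while j + 1 < n and presence[j + 1] >= threshold and ends[j] < threshold:
                j += 1
            spans.append((i, j))  # inclusive token indices
            i = j + 1
        else:
            i += 1
    return spans
```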

How It Differs From The Original OpenMed Model

The original OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1 is a standard DistilBertForTokenClassification model:

  • one encoder
  • one token-classification head
  • BIO labels such as B-email, I-email, B-phone_number
  • generic token aggregation to recover spans
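For reference, the generic span recovery that a plain BIO token classifier like the base model requires can be sketched as below; this is a standard BIO-aggregation routine, not code from either repository.

```python
def bio_to_spans(tags):
    """Recover (label, start, end_exclusive) token spans from BIO tags."""
    spans, cur = [], None  # cur = [label, start]
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if cur:
                spans.append((cur[0], cur[1], i))
            cur = [tag[2:], i]
        elif tag.startswith("I-") and cur and tag[2:] == cur[0]:
            continue  # span continues
        else:
            if cur:
                spans.append((cur[0], cur[1], i))
            cur = None
    if cur:
        spans.append((cur[0], cur[1], len(tags)))
    return spans
```

DiffMask replaces this tag-walking step with typed presence and boundary scores decoded directly into spans.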

DiffMask changes two things:

  1. Different supervision

    • base OpenMed learns only BIO token labels
    • DiffMask learns token presence plus typed span boundaries
  2. Different training recipe

    • base OpenMed is trained as a standard token classifier
    • DiffMask is trained on multiple masked-noised views of the same sentence

That makes DiffMask better suited to structured Irish identifiers and mixed PII masking, while still keeping a small encoder and a fast CPU path.

How It Differs From rc5 And rc8

| Model    | Core idea                                      | External scanner/validator | Runtime shape             |
|----------|------------------------------------------------|----------------------------|---------------------------|
| rc5      | token classifier + repair logic                | yes                        | heavier, decoder-assisted |
| rc8      | raw-only token-span model                      | no                         | one pass + span decoder   |
| DiffMask | raw-only token-span model + denoising training | no                         | one pass + span decoder   |

So DiffMask is closest to rc8 operationally, but it uses a stronger training recipe.

Why This Exists

The older rc5 release still depended on a repair-oriented decoder stack. The public rc8 release removed that external logic, but it regressed on several structured Irish identifiers. This release keeps the raw-only deployment shape while re-hardening the model on Irish numeric and mixed-PII cases.

References

Direct implementation references:

Conceptual diffusion-style training references:

These diffusion papers were used as architectural inspiration for the masked noising schedule. This release does not implement a generative text diffusion runtime.

Included Artifacts

  • Full transformers checkpoint in the repo root
  • Dynamic q8 ONNX export in onnx/model_quantized.onnx
  • Unquantized ONNX export in onnx/model.onnx
  • inference_mask.py for the full checkpoint
  • inference_mask_onnx.py for the ONNX q8 path
  • common.py, model.py, and multitask_model.py implementing the release decoder
  • benchmark files in eval/

Artifact sizes:

  • Full checkpoint: 514 MB (model.safetensors)
  • Dynamic q8 ONNX: 393 MB (onnx/model_quantized.onnx)

How To Use It

Full checkpoint:

uv run python inference_mask.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc1 \
  --min-score 0.5 \
  --text "My PPSN is 1234567TW, my Eircode is D02 X285, and my phone is 087 123 4567." \
  --json

Dynamic q8 ONNX:

uv run python inference_mask_onnx.py \
  --model temsa/IrishCore-DiffMask-135M-v1-rc1 \
  --min-score 0.5 \
  --text "Please provide your passport NN5123456 and call me on 0851234567." \
  --json

Both scripts emit explicit placeholders like [PII:PPSN] in masked_text.
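The placeholder substitution those scripts perform can be sketched as below. This mirrors the shape of the `masked_text` output but is a simplified illustration, not the bundled implementation.

```python
def mask_spans(text, spans):
    """Replace detected character spans with [PII:LABEL] placeholders.

    spans is a list of (start, end, label) character offsets over the
    raw text, assumed non-overlapping.
    """
    out, pos = [], 0
    for start, end, label in sorted(spans):
        out.append(text[pos:start])
        out.append(f"[PII:{label}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)
```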

Q8 Comparison

Deployment-relevant comparison on CPU:

| Model | Core F1 | Edge F1 | Finance F1 | Finance-boundary F1 | User PPSN F1 | GA weak PPSN F1 | Multilingual PPSN F1 | Hardening F1 |
|---|---|---|---|---|---|---|---|---|
| rc5 ONNX q8 | 0.9669 | 0.9744 | 0.9362 | 0.8750 | 1.0000 | 1.0000 | 0.9333 | - |
| rc8 ONNX q8 | 0.9737 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9176 | 0.7059 |
| IrishCore-DiffMask-135M-v1-rc1 ONNX q8 | 0.9934 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9412 | 0.9744 |

CPU throughput references:

| Suite | rc5 q8 | rc8 q8 | IrishCore-DiffMask-135M-v1-rc1 q8 |
|---|---|---|---|
| Irish core short-text path | 33.6193 ex/s | 257.3756 ex/s | 251.2358 ex/s |
| Multilingual PPSN short-text path | 35.5561 ex/s | 230.5181 ex/s | 256.0768 ex/s |
| Runtime profile source | 23.8338 ex/s | 179.4708 ex/s | 184.2930 ex/s |

Notes:

  • The rc5 speed references come from its published q8 end-to-end inference stack, which includes its older repair decoder.
  • The rc8 and IrishCore-DiffMask-135M-v1-rc1 numbers use the same raw-only token-span ONNX path.
  • A weight-only q4 ONNX experiment was also tried during development, but it was slower than q8 on this CPU and is not shipped.

Limits

  • This is still a compact model. The hardest remaining errors are multilingual PPSN near-miss cases rather than Irish core numeric formats.
  • The release path is intentionally scanner-free. If you need deterministic validation of individual identifier types, add that in your application layer.
  • If you rely on release behavior, use the bundled inference scripts or import decode_token_presence_segments from common.py.

License And Attribution

  • Release license: Apache-2.0
  • Base model: OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  • The derivative release remains subject to the attribution terms of the upstream datasets listed above.
  • See NOTICE, training_sources.json, and eval/benchmark_summary.json for provenance and benchmark details.