DiffMask Eval Harness

Result

The deployment-aligned harness is the clean single-pass token-span path.

`experiments/irish_core_span_raw_only/benchmark_multitask.py` and `scripts/eval_dllm_release.py --inference-mode clean_single_pass` produce identical scores on all four checked suites.
The old diffusion-style eval path (`diffusion_last_pass`) is not deployment-aligned. It scores lower on two of the four suites (`fresh_holdout` and `multilingual_ppsn`), ties on `uat_exact`, and is marginally higher on `irish_core`.

| Dataset | `benchmark_multitask` | `eval_clean_single_pass` | `eval_diffusion_last_pass` |
| --- | --- | --- | --- |
| `fresh_holdout` | 0.7170 | 0.7170 | 0.6545 |
| `uat_exact` | 0.9032 | 0.9032 | 0.9032 |
| `irish_core` | 0.9733 | 0.9733 | 0.9737 |
| `multilingual_ppsn` | 0.9274 | 0.9274 | 0.8966 |
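
To make the comparison concrete, here is a minimal sketch that recomputes the per-suite gap from the numbers in the table. The names `SCORES` and `summarize` are illustrative only and are not part of the repository; the values are copied by hand from the table, not read from the eval scripts' output.

```python
# Scores copied from the results table, per suite:
# (benchmark_multitask, eval_clean_single_pass, eval_diffusion_last_pass)
SCORES = {
    "fresh_holdout": (0.7170, 0.7170, 0.6545),
    "uat_exact": (0.9032, 0.9032, 0.9032),
    "irish_core": (0.9733, 0.9733, 0.9737),
    "multilingual_ppsn": (0.9274, 0.9274, 0.8966),
}


def summarize(scores):
    """Return, per suite, whether the two clean paths agree and the
    diffusion_last_pass score minus the clean_single_pass score."""
    rows = {}
    for name, (bench, clean, diffusion) in scores.items():
        rows[name] = {
            "paths_match": bench == clean,
            # Rounded to 4 decimals to match the table's precision.
            "diffusion_delta": round(diffusion - clean, 4),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(SCORES).items():
        print(f"{name:18s} match={row['paths_match']} "
              f"delta={row['diffusion_delta']:+.4f}")
```

A negative delta means `diffusion_last_pass` scored below the clean single-pass path on that suite. With the table's values, that is the case for `fresh_holdout` and `multilingual_ppsn` only.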