DiffMask Eval Harness
- Model: `models/irishcore-diffmask-135m-v1-rc6b-focusv10-e012-b48w0`
Result
- The deployment-aligned harness is the clean single-pass token-span path.
- `experiments/irish_core_span_raw_only/benchmark_multitask.py` and `scripts/eval_dllm_release.py --inference-mode clean_single_pass` match exactly on the checked suites.
- The old diffusion-style eval path (`diffusion_last_pass`) is not deployment-aligned and lowers scores on two of the four checked suites (`fresh_holdout` and `multilingual_ppsn`); `uat_exact` is unchanged and `irish_core` differs by 0.0004.
Comparison
| Dataset | benchmark_multitask | eval_clean_single_pass | eval_diffusion_last_pass |
|---|---|---|---|
| `fresh_holdout` | 0.7170 | 0.7170 | 0.6545 |
| `uat_exact` | 0.9032 | 0.9032 | 0.9032 |
| `irish_core` | 0.9733 | 0.9733 | 0.9737 |
| `multilingual_ppsn` | 0.9274 | 0.9274 | 0.8966 |
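
The comparison can be re-derived from the table with a few lines of Python. This is a minimal sketch, not part of the harness: the scores are copied from the table above, and the helper names `parity_failures` and `mode_deltas` are hypothetical.

```python
# Scores copied from the comparison table (4 decimal places, as reported).
SCORES = {
    "fresh_holdout": {"benchmark_multitask": 0.7170, "clean_single_pass": 0.7170, "diffusion_last_pass": 0.6545},
    "uat_exact": {"benchmark_multitask": 0.9032, "clean_single_pass": 0.9032, "diffusion_last_pass": 0.9032},
    "irish_core": {"benchmark_multitask": 0.9733, "clean_single_pass": 0.9733, "diffusion_last_pass": 0.9737},
    "multilingual_ppsn": {"benchmark_multitask": 0.9274, "clean_single_pass": 0.9274, "diffusion_last_pass": 0.8966},
}


def parity_failures(scores, a="benchmark_multitask", b="clean_single_pass"):
    """Return the suites where the two paths disagree at the reported precision."""
    return [name for name, row in scores.items() if round(row[a], 4) != round(row[b], 4)]


def mode_deltas(scores, base="clean_single_pass", other="diffusion_last_pass"):
    """Return other - base per suite; negative means `other` scores lower."""
    return {name: round(row[other] - row[base], 4) for name, row in scores.items()}


if __name__ == "__main__":
    print("parity failures:", parity_failures(SCORES))
    for name, delta in mode_deltas(SCORES).items():
        print(f"{name:20s} {delta:+.4f}")
```

An empty parity list corresponds to the "match exactly" claim, and the negative deltas identify the suites where `diffusion_last_pass` scores lower.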
Conclusion
- Use `clean_single_pass` for release gating and model comparison.
- Keep `diffusion_last_pass` only as a training diagnostic if needed.
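
To keep the gating mode from drifting, the eval command can be assembled in one place. The sketch below is illustrative only: the script path and the `--inference-mode` flag come from this report, but `build_eval_command` is a hypothetical helper, it is an assumption that `diffusion_last_pass` is also passed through `--inference-mode`, and any other arguments the script requires are not shown here.

```python
import shlex
import sys

# Known from this report: the script path and the --inference-mode flag.
EVAL_SCRIPT = "scripts/eval_dllm_release.py"
# Assumption: both names are accepted values of --inference-mode.
KNOWN_MODES = ("clean_single_pass", "diffusion_last_pass")


def build_eval_command(mode="clean_single_pass", extra_args=()):
    """Build the argv list for the eval script; defaults to the release-gating mode."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown inference mode: {mode!r}")
    # extra_args is a placeholder for whatever else the script needs (not covered here).
    return [sys.executable, EVAL_SCRIPT, "--inference-mode", mode, *extra_args]


if __name__ == "__main__":
    # Print the command (without the interpreter path) instead of running it.
    print(shlex.join(build_eval_command()[1:]))
```

Defaulting to `clean_single_pass` means a release gate that omits the mode still uses the deployment-aligned path, while the diagnostic mode has to be requested explicitly.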