---
title: DiffusionGemma vs Gemma-4 Post-OCR Correction
emoji: 📰
colorFrom: yellow
colorTo: red
sdk: gradio
sdk_version: "6.17.3"
app_file: app.py
pinned: false
license: apache-2.0
short_description: Diffusion vs autoregressive LLM on historical OCR cleanup
models:
  - google/diffusiongemma-26B-A4B-it
  - google/gemma-4-E4B-it
---

# DiffusionGemma vs Gemma-4: post-OCR correction

A pragmatic first-pass comparison of Google's **experimental diffusion LLM**
[DiffusionGemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it)
(released 2026-06-10; 26B MoE, 3.8B active; generates 256-token blocks by iterative
denoising) against an autoregressive baseline,
[Gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it) (~4.5B effective),
on **post-OCR correction of 19th-century English newspaper text**.

**Hypothesis**: a diffusion LM treats correction as denoising, so it may be
(a) faster and (b) less prone to *over-correction* — rewriting text that was
already correct — than an autoregressive model, possibly at some accuracy cost.

## Method (v1, pragmatic)

- 75 passages from [BLN600](https://doi.org/10.15131/shef.data.25439023)
  (19th-c British Library newspapers, aligned OCR + human gold transcription),
  align-trimmed to ≤220 Gemma tokens so outputs fit DiffusionGemma's single
  256-token block. Identical prompt for both models; thinking mode off; bf16;
  batch size 1; A100-80GB.
- Gemma-4 decodes greedily. DiffusionGemma uses its generation-config default
  entropy sampler (**no greedy equivalent exists** for the diffusion sampler —
  this is an unavoidable asymmetry, not a tuning choice).
- **Over-correction rate**: of input characters that were already correct
  (per input↔gold character alignment), the fraction the model changed
  (per input↔output alignment). **Fix rate**: of input characters that were
  wrong, the fraction the model changed. Text NFC-normalized, whitespace
  collapsed, before all metrics. CER/WER via jiwer.
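
The two rates defined above can be sketched with the standard library alone. This is a minimal illustration, not the benchmark's actual code: the function names (`normalize`, `unchanged_mask`, `correction_rates`) are hypothetical, and `difflib.SequenceMatcher` is an assumed stand-in for whatever character aligner the benchmark uses. Only characters present in the OCR input are counted, so characters the OCR dropped entirely do not enter either denominator.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """NFC-normalize, collapse whitespace runs to one space, strip the ends."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()


def unchanged_mask(source: str, other: str) -> list[bool]:
    """For each character of `source`, True if the alignment matches it to an
    identical character in `other` (assumed aligner: difflib)."""
    mask = [False] * len(source)
    matcher = SequenceMatcher(None, source, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = True
    return mask


def correction_rates(ocr: str, gold: str, output: str) -> tuple[float, float]:
    """Return (over_correction_rate, fix_rate) over the characters of `ocr`."""
    ocr, gold, output = normalize(ocr), normalize(gold), normalize(output)
    correct = unchanged_mask(ocr, gold)   # input chars that were already correct
    kept = unchanged_mask(ocr, output)    # input chars the model left unchanged
    n_correct = sum(correct)
    n_wrong = len(ocr) - n_correct
    # Over-correction: already-correct input characters the model changed.
    over = sum(c and not k for c, k in zip(correct, kept))
    # Fix rate counts wrong input characters the model changed, whether or
    # not the change matches the gold text, mirroring the definition above.
    changed_wrong = sum((not c) and (not k) for c, k in zip(correct, kept))
    over_rate = over / n_correct if n_correct else 0.0
    fix_rate = changed_wrong / n_wrong if n_wrong else 0.0
    return over_rate, fix_rate
```

For instance, with OCR `tbe cat`, gold `the cat`, and a model output of `tbe dog`, this sketch reports an over-correction rate of 0.5 (three of six correct characters changed) and a fix rate of 0.0 (the wrong `b` was left alone).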

## Limitations

- n=75, a single prompt, and one run (no seeds, no significance testing).
- The 256-token block caps passage length.
- Tokens/sec for DiffusionGemma is computed over denoising the whole block.
- DiffusionGemma is experimental and was one day old at benchmark time.
- Live demo examples are from ICDAR2019 post-OCR (CC-BY-4.0) because BLN600's
  CC-BY-NC license doesn't permit redistribution here; benchmark passage texts
  are likewise not republished, only per-passage metrics.