---
title: DiffusionGemma vs Gemma-4 — Post-OCR Correction
emoji: 📰
colorFrom: yellow
colorTo: red
sdk: gradio
sdk_version: "6.17.3"
app_file: app.py
pinned: false
license: apache-2.0
short_description: Diffusion vs autoregressive LLM on historical OCR cleanup
models:
- google/diffusiongemma-26B-A4B-it
- google/gemma-4-E4B-it
---

# DiffusionGemma vs Gemma-4: post-OCR correction

A pragmatic first-pass comparison of Google's **experimental diffusion LLM**
[DiffusionGemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it)
(released 2026-06-10; 26B MoE, 3.8B active; generates 256-token blocks by iterative
denoising) against an autoregressive baseline,
[Gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it) (~4.5B effective),
on **post-OCR correction of 19th-century English newspaper text**.

**Hypothesis**: a diffusion LM treats correction as denoising, so it may be
(a) faster and (b) less prone to *over-correction*, i.e. rewriting text that was
already correct, than an autoregressive model, possibly at some accuracy cost.

## Method (v1, pragmatic)

- 75 passages from [BLN600](https://doi.org/10.15131/shef.data.25439023)
  (19th-century British Library newspapers, aligned OCR + human gold
  transcription), align-trimmed to ≤220 Gemma tokens so outputs fit
  DiffusionGemma's single 256-token block. Both models get an identical prompt,
  with thinking mode off, bf16 weights, batch size 1, on an A100-80GB.
- Gemma-4 decodes greedily. DiffusionGemma uses its generation-config default
  entropy sampler (**no greedy equivalent exists** for the diffusion sampler;
  this is an unavoidable asymmetry, not a tuning choice).
- **Over-correction rate**: of input characters that were already correct
  (per input↔gold character alignment), the fraction the model changed
  (per input↔output alignment). **Fix rate**: of input characters that were
  wrong, the fraction the model changed. Text is NFC-normalized and
  whitespace-collapsed before all metrics are computed. CER and WER are
  computed with jiwer.
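
The two rate definitions above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: it assumes `difflib.SequenceMatcher` as the character aligner (the aligner used in the benchmark is not stated here), and the helper names `normalize`, `kept_mask`, and `correction_rates` are hypothetical.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()


def kept_mask(source: str, target: str) -> list[bool]:
    """mask[i] is True when source[i] lies in a block matched against target."""
    mask = [False] * len(source)
    # Assumption: difflib stands in for whatever aligner the benchmark uses.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = True
    return mask


def correction_rates(ocr: str, gold: str, output: str) -> tuple[float, float]:
    """Return (over_correction_rate, fix_rate) over input characters."""
    ocr, gold, output = normalize(ocr), normalize(gold), normalize(output)
    correct = kept_mask(ocr, gold)   # input chars that already agree with gold
    kept = kept_mask(ocr, output)    # input chars the model left untouched
    n_correct = sum(correct)
    n_wrong = len(ocr) - n_correct
    over = sum(c and not k for c, k in zip(correct, kept))
    changed_wrong = sum(not c and not k for c, k in zip(correct, kept))
    over_rate = over / n_correct if n_correct else 0.0
    fix_rate = changed_wrong / n_wrong if n_wrong else 0.0
    return over_rate, fix_rate
```

Both rates are counted over input characters only, so characters missing from the OCR (pure insertions in the gold text) do not enter either denominator in this sketch. Following the definition above, the fix rate counts wrong characters the model changed, whether or not the change matches the gold text.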

## Limitations

- n=75, a single prompt, and one run (no seeds, no significance testing).
- The 256-token block caps passage length.
- Tokens/sec for DiffusionGemma is computed over denoising the whole block.
- DiffusionGemma is experimental and was one day old at benchmark time.

Live demo examples are from ICDAR2019 post-OCR (CC-BY-4.0) because
BLN600's CC-BY-NC license doesn't permit redistribution here; benchmark passage
texts are likewise not republished, only per-passage metrics.
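
On the tokens/sec caveat: a diffusion model denoises a fixed-size block however short the answer is, so the throughput figure depends on which token count goes in the numerator. A minimal sketch of the two readings follows. The timing and token counts are hypothetical, chosen only for illustration, and the benchmark's exact accounting is not specified here beyond being computed over the whole block.

```python
BLOCK_TOKENS = 256  # DiffusionGemma block size, as stated in the description


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput as a token count divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


# Hypothetical numbers, not benchmark results.
elapsed_s = 2.0       # wall-clock time to denoise one block
useful_tokens = 200   # tokens of actual corrected text in that block

whole_block_rate = tokens_per_second(BLOCK_TOKENS, elapsed_s)
useful_rate = tokens_per_second(useful_tokens, elapsed_s)
```

The gap between the two rates grows as passages get shorter, which is one reason the diffusion and autoregressive throughput numbers are not directly comparable.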