davanstrien HF Staff

Upload README.md with huggingface_hub

90c2600 verified about 23 hours ago

	---
	title: DiffusionGemma vs Gemma-4 — Post-OCR Correction
	emoji: 📰
	colorFrom: yellow
	colorTo: red
	sdk: gradio
	sdk_version: "6.17.3"
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: Diffusion vs autoregressive LLM on historical OCR cleanup
	models:
	- google/diffusiongemma-26B-A4B-it
	- google/gemma-4-E4B-it
	---

	# DiffusionGemma vs Gemma-4: post-OCR correction

	A pragmatic first-pass comparison of Google's experimental diffusion LLM
	[DiffusionGemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it)
	(released 2026-06-10; 26B MoE, 3.8B active; generates 256-token blocks by iterative
	denoising) against an autoregressive baseline,
	[Gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it) (~4.5B effective),
	on post-OCR correction of 19th-century English newspaper text.

	Hypothesis: a diffusion LM treats correction as denoising, so it may be
	(a) faster and (b) less prone to over-correction — rewriting text that was
	already correct — than an autoregressive model, possibly at some accuracy cost.

	## Method (v1, pragmatic)

	- 75 passages from [BLN600](https://doi.org/10.15131/shef.data.25439023)
	(19th-century British Library newspapers, aligned OCR + human gold transcription),
	align-trimmed to ≤220 Gemma tokens so outputs fit DiffusionGemma's single
	256-token block. Identical prompt for both models; thinking mode off; bf16;
	batch size 1; A100-80GB.
	- Gemma-4 decodes greedily. DiffusionGemma uses its generation-config default
	entropy sampler (no greedy equivalent exists for the diffusion sampler —
	this is an unavoidable asymmetry, not a tuning choice).
	- Over-correction rate: of input characters that were already correct
	(per input↔gold character alignment), the fraction the model changed
	(per input↔output alignment).
	- Fix rate: of input characters that were wrong, the fraction the model
	changed.
	- Text is NFC-normalized and whitespace-collapsed before all metrics.
	CER/WER are computed with jiwer.
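The over-correction and fix rates described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: it assumes character alignment via the standard library's `difflib.SequenceMatcher` (the README does not say which aligner was used), treats an input character as "changed" when it falls in a `replace` or `delete` span, and ignores pure insertions because they have no corresponding input character. The names `normalize`, `flagged_positions`, and `correction_rates` are hypothetical.

```python
from __future__ import annotations

import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """NFC-normalize, then collapse whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text)).strip()


def flagged_positions(source: str, target: str) -> set[int]:
    """Indices of `source` characters not carried over unchanged into `target`.

    Assumption: difflib's alignment stands in for whatever character
    aligner the benchmark used.
    """
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    flagged: set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            flagged.update(range(i1, i2))
    return flagged


def correction_rates(ocr: str, gold: str, output: str) -> dict[str, float]:
    """Over-correction and fix rates over the characters of the OCR input."""
    ocr, gold, output = normalize(ocr), normalize(gold), normalize(output)
    wrong = flagged_positions(ocr, gold)      # input chars that differ from gold
    changed = flagged_positions(ocr, output)  # input chars the model altered
    correct = set(range(len(ocr))) - wrong
    over = len(changed & correct) / len(correct) if correct else 0.0
    fix = len(changed & wrong) / len(wrong) if wrong else 0.0
    return {"over_correction_rate": over, "fix_rate": fix}


if __name__ == "__main__":
    rates = correction_rates(
        ocr="Tbe  quick brown fox",
        gold="The quick brown fox",
        output="The quick brown dog",
    )
    print(rates)
```

In this toy case the model changes the one wrong character ("b" in "Tbe") and also rewrites two of the 18 already-correct characters ("f" and "x" in "fox"). As defined, the fix rate counts any change to a wrong character, whether or not the new character matches the gold text; correctness of the result is what CER/WER capture.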

	## Limitations

	- n=75, a single prompt, and one run (no seeds or significance testing).
	- The 256-token block caps passage length.
	- Tokens/sec for DiffusionGemma is computed over denoising the whole block.
	- DiffusionGemma is experimental and was one day old at benchmark time.
	- Live demo examples are from ICDAR2019 post-OCR (CC-BY-4.0) because
	BLN600's CC-BY-NC license doesn't permit redistribution here; benchmark
	passage texts are likewise not republished, only per-passage metrics.