Bundled sample documents

Six synthetic, public-domain images covering the document layouts that real-world OCR pipelines most often encounter:

  • receipt.png: printed grocery receipt, line items + totals
  • invoice.png: vendor invoice, multi-column form layout
  • business-card.png: tight contact card, mixed text sizes
  • table.png: dense numerical table with totals row
  • handwritten.png: jittered text that simulates informal handwriting
  • multi-column.png: two-column newspaper-style layout where reading order matters

index.json carries metadata for each sample: the GLiNER labels we ask for, plus a short description shown in the UI.
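
As a rough illustration of how that metadata might be consumed, here is a minimal sketch. The layout of index.json is not documented here, so the structure below is an assumption: a JSON object keyed by image filename, where each entry has a "labels" list and a "description" string. The function name load_sample_index and both key names are hypothetical; adjust them to match the real file.

```python
import json
from pathlib import Path


def load_sample_index(index_path):
    """Return {filename: (labels, description)} from an index.json-style file.

    Assumed (hypothetical) layout: a JSON object keyed by image filename,
    each value holding a "labels" list and a "description" string.
    """
    raw = json.loads(Path(index_path).read_text(encoding="utf-8"))
    samples = {}
    for filename, entry in raw.items():
        # Missing keys fall back to empty values instead of raising.
        labels = list(entry.get("labels", []))
        description = entry.get("description", "")
        samples[filename] = (labels, description)
    return samples
```

Calling load_sample_index("index.json") from this directory would then give one (labels, description) pair per sample image, provided the file follows the assumed layout.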

Regenerate with python scripts/generate_samples.py. Pillow is the only dependency; no real customer data is involved.
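
To give a feel for what "jittered text" can mean for a sample like handwritten.png, here is a minimal sketch using Pillow. It is not the code in scripts/generate_samples.py; the function names, default sizes, and jitter range are all assumptions made for illustration. Each character is nudged by a small random offset so the baseline looks uneven.

```python
import random


def jittered_positions(text, origin=(20, 20), advance=12, max_jitter=2, seed=0):
    """Return (char, x, y) tuples with a small random offset per character.

    Hypothetical helper: a fixed horizontal advance plus a seeded random
    nudge in x and y, so the output is reproducible for a given seed.
    """
    rng = random.Random(seed)
    x, y = origin
    placed = []
    for ch in text:
        dx = rng.randint(-max_jitter, max_jitter)
        dy = rng.randint(-max_jitter, max_jitter)
        placed.append((ch, x + dx, y + dy))
        x += advance
    return placed


def render_jittered(text, out_path, size=(400, 60)):
    """Draw the jittered characters onto a white canvas and save the image."""
    # Imported here so jittered_positions stays usable without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont

    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    for ch, x, y in jittered_positions(text):
        draw.text((x, y), ch, fill="black", font=font)
    image.save(out_path)
```

For example, render_jittered("pick up milk", "example.png") would write a small image with an uneven line of text. Seeding the random generator keeps the result identical across runs, which is convenient for files that are checked in.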