document-ocr / data /samples /README.md
fm1320's picture
Initial: document-ocr demo for HF Spaces
ffe59ba
# Bundled sample documents
Six synthetic, public-domain images covering the document shapes most
real-world OCR pipelines hit:
- `receipt.png`: printed grocery receipt, line items + totals
- `invoice.png`: vendor invoice, multi-column form layout
- `business-card.png`: tight contact card, mixed text sizes
- `table.png`: dense numerical table with totals row
- `handwritten.png`: jittered text that simulates informal handwriting
- `multi-column.png`: two-column newspaper-style layout where reading order
matters
`index.json` carries metadata for each: the GLiNER labels we ask for, plus a
short description shown in the UI.
Regenerate with `python scripts/generate_samples.py`. Pillow is the only
dep; no real customer data is involved.