metadata
title: Dataset Maker
emoji: 🧩
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
python_version: '3.10'
app_file: app.py
pinned: false
license: mit

🧩 Dataset-Maker

Turn any PDF into a non-overlapping torn-fragment dataset for image stitching / fragment-reassembly research. Each page is rendered to A4, torn into irregular organic pieces on a black background, and exported as a ZIP with stitching ground truth.

Built to run on the HuggingFace Spaces free tier (Gradio SDK).

The guarantee: zero overlap

The pieces form a strict partition of the page - every pixel belongs to exactly one fragment, so pieces never overlap and together cover the page exactly. That makes the (x, y) offset of each piece an exact stitching label.

How: instead of drawing blobs (which overlap), we assign each pixel to its nearest seed point (argmin). That is a Voronoi tessellation - already a perfect partition. To make edges look torn instead of straight, we domain-warp the pixel grid with value noise before the nearest-seed test. Warping the query points (not the rule) keeps the result a partition while boundaries become jagged and organic.
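A minimal sketch of the idea, assuming NumPy only. The helper names (`value_noise`, `warped_voronoi`), the parameters (`amplitude`, `cell`) and the brute-force distance table are illustrative choices, not the code in src/tearing.py; the real pipeline may differ. Because every pixel still receives exactly one argmin label, the result stays a partition however strong the warp is.

```python
import numpy as np

def value_noise(h, w, cell, rng):
    """Smoothly interpolated random lattice ("value noise"), values in [0, 1]."""
    lattice = rng.random((h // cell + 2, w // cell + 2))
    ys, xs = np.arange(h) / cell, np.arange(w) / cell
    y0, x0 = ys.astype(int), xs.astype(int)
    fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]
    # Smoothstep the fractional parts so the noise has no visible grid creases.
    fy, fx = fy * fy * (3 - 2 * fy), fx * fx * (3 - 2 * fx)
    top = lattice[y0][:, x0] * (1 - fx) + lattice[y0][:, x0 + 1] * fx
    bot = lattice[y0 + 1][:, x0] * (1 - fx) + lattice[y0 + 1][:, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def warped_voronoi(h, w, seeds, amplitude=6.0, cell=16, seed=0):
    """Label map (H, W): index of the nearest seed for each *warped* pixel."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    # Warp the query coordinates; the seeds and the nearest-seed rule are untouched.
    qx = xx + amplitude * (2 * value_noise(h, w, cell, rng) - 1)
    qy = yy + amplitude * (2 * value_noise(h, w, cell, rng) - 1)
    # Brute-force squared distances, shape (H, W, S). Fine for a toy example;
    # a k-d tree avoids this table on full-size pages.
    d2 = (qx[..., None] - seeds[:, 0]) ** 2 + (qy[..., None] - seeds[:, 1]) ** 2
    return d2.argmin(axis=-1)

seeds = np.array([[10.0, 12.0], [50.0, 20.0], [30.0, 50.0], [55.0, 55.0]])  # (x, y)
labels = warped_voronoi(64, 64, seeds)
```

Setting `amplitude=0.0` gives the plain straight-edged Voronoi diagram, which is a convenient baseline when tuning how torn the boundaries look.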

Inspired by Voronoi-jigsaw / eroded-boundary fragment literature: Eroded Boundaries, Pairwise Irregular Fragments, Deepzzle.

Pipeline

PDF ──▶ render @DPI ──▶ fit/slice to A4 ──▶ priority queue (cheap pages first)
                                                   │
                            per-page seed ─────────▼
                          domain-warped Voronoi partition (no overlap)
                                                   │
                            crop to bbox, black bg │
                                                   ▼
                       PNG-optimize ──▶ ZIP + manifest.json (x,y labels)
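The "crop to bbox, black bg" step can be sketched as follows. This is an illustration under assumptions: `extract_pieces` is a hypothetical helper, the page is an (H, W, 3) uint8 array, and the partition is an integer label map of the same height and width. The entry keys mirror the manifest fields {file, x, y, w, h}.

```python
import numpy as np

def extract_pieces(page, labels):
    """Return [(piece_array, entry_dict), ...], one per label in the partition."""
    pieces = []
    for k in np.unique(labels):
        mask = labels == k
        ys, xs = np.nonzero(mask)
        y0, y1 = int(ys.min()), int(ys.max()) + 1
        x0, x1 = int(xs.min()), int(xs.max()) + 1
        sub = mask[y0:y1, x0:x1]
        # Start from black and copy only the pixels that belong to this piece.
        piece = np.zeros((y1 - y0, x1 - x0, 3), dtype=page.dtype)
        piece[sub] = page[y0:y1, x0:x1][sub]
        entry = {"file": f"piece_{int(k):03d}.png",
                 "x": x0, "y": y0, "w": x1 - x0, "h": y1 - y0}
        pieces.append((piece, entry))
    return pieces

page = np.full((4, 6, 3), 255, dtype=np.uint8)                        # toy all-white page
labels = (np.add.outer(np.arange(4), np.arange(6)) > 4).astype(int)   # diagonal tear
pieces = extract_pieces(page, labels)
```

The bounding boxes of neighboring pieces usually overlap even though the pieces themselves do not, which is why each crop is masked rather than copied wholesale.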

Performance notes

  • Gradio queue - demo.queue(max_size, default_concurrency_limit) caps load for the 2-vCPU free tier; the heavy event has its own concurrency_limit.
  • Priority queue (src/queue_manager.py) - binary min-heap: push/pop O(log n), peek Θ(1), space Θ(n). Orders page jobs cheap-first so the user sees early progress and the average wait per page drops. Best case is Θ(1) per op (no sift needed); no comparison-based heap can make both push and pop o(log n) amortized, because that would sort n keys in fewer than Ω(n log n) comparisons.
  • Vectorized partition - SciPy cKDTree nearest-seed query is O(H·W·log S) on average for S seeds; mask extraction is Θ(H·W).
  • Export - tight bbox crop + PNG optimize/compress_level=9; optional median-cut palette for smaller archives.
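As an illustration of the cheap-first ordering, here is a small sketch built on the standard-library `heapq` module. The class name `JobQueue`, the numeric cost estimates and the tie-breaking counter are assumptions made for the example; src/queue_manager.py may be structured differently.

```python
import heapq
import itertools

class JobQueue:
    """Min-heap of (cost, insertion_index, job); the cheapest job pops first."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # keeps equal-cost jobs in FIFO order

    def push(self, cost, job):          # O(log n)
        heapq.heappush(self._heap, (cost, next(self._tie), job))

    def pop(self):                      # O(log n)
        return heapq.heappop(self._heap)[2]

    def peek(self):                     # Θ(1): the root is always the minimum
        return self._heap[0][2]

    def __len__(self):
        return len(self._heap)

q = JobQueue()
for page_no, cost in [(1, 9.5), (2, 1.2), (3, 4.0), (4, 1.2)]:
    q.push(cost, page_no)
order = [q.pop() for _ in range(len(q))]   # [2, 4, 3, 1]
```

The insertion counter also means the job payloads themselves never get compared, so they do not need to define an ordering.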

Output layout

pieces/page_0001/piece_000.png   # fragment on black bg
manifest.json                    # per-piece {file, x, y, w, h} = stitching GT
                                 # + per-page adjacency [[i, j], ...] neighbor pairs
README.txt                       # reassembly snippet
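Reassembly from the ground truth amounts to pasting each piece back at its (x, y) offset. The sketch below works on in-memory arrays and assumes each manifest entry exposes x, y, w, h as integers; the `reassemble` helper and the toy fragments are hypothetical, and loading the PNGs and parsing manifest.json are left out because the exact nesting of the manifest is not shown here.

```python
import numpy as np

def reassemble(pieces, entries, page_h, page_w):
    """Paste (h, w, 3) uint8 pieces onto a black canvas at their manifest offsets."""
    canvas = np.zeros((page_h, page_w, 3), dtype=np.uint8)
    for piece, e in zip(pieces, entries):
        region = canvas[e["y"]:e["y"] + e["h"], e["x"]:e["x"] + e["w"]]  # a view
        # Bounding boxes may overlap, so copy only non-black pixels. Truly black
        # page pixels are skipped too, but the canvas is already black there.
        keep = piece.any(axis=-1)
        region[keep] = piece[keep]
    return canvas

# Two toy fragments of a 2x3 page whose bounding boxes overlap in column 1.
a = np.zeros((2, 2, 3), np.uint8)
a[0, 0] = a[1, 0] = a[0, 1] = 200
b = np.zeros((2, 2, 3), np.uint8)
b[1, 0] = b[0, 1] = b[1, 1] = 90
entries = [{"x": 0, "y": 0, "w": 2, "h": 2}, {"x": 1, "y": 0, "w": 2, "h": 2}]
page = reassemble([a, b], entries, page_h=2, page_w=3)
```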

Each page also lists adjacency - undirected [i, j] piece-index pairs that share a torn border (4-connectivity). Use as positive pairs for pairwise / graph-based stitching models (Deepzzle / PairingNet style); any unlisted pair is a negative. Computed in one vectorized Θ(H·W) pass from the partition map - no measurable pipeline overhead.
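One way to derive such pairs from a label map, shown as a sketch rather than the repository's implementation: compare every pixel with its right and lower neighbor and keep the label pairs that differ. The function name `adjacency` is an assumption. The neighbor comparison is a single Θ(H·W) pass; the final `np.unique` call sorts only the boundary pairs.

```python
import numpy as np

def adjacency(labels):
    """Undirected [i, j] pairs (i < j) of labels that touch under 4-connectivity."""
    # Horizontal neighbors (left, right) and vertical neighbors (upper, lower).
    h_pairs = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v_pairs = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.concatenate([h_pairs, v_pairs])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]     # keep boundary crossings only
    if pairs.size == 0:                           # single-piece page
        return []
    pairs = np.sort(pairs, axis=1)                # make each pair (min, max)
    return np.unique(pairs, axis=0).tolist()

# Four quadrants: diagonal neighbors (0, 3) and (1, 2) only touch at a corner,
# so 4-connectivity does not list them.
quadrants = np.array([[0, 1],
                      [2, 3]])
pairs = adjacency(quadrants)   # [[0, 1], [0, 2], [1, 3], [2, 3]]
```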

Run locally

Use Python 3.10–3.12. Python 3.13/3.14 have no PyMuPDF wheel yet and fall back to a source build.

python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python app.py            # http://127.0.0.1:7860
pytest -q                # invariant + queue tests

Pinned web stack (don't loosen)

gradio==4.44.1 needs a matching server stack. Newer auto-resolved versions break it, so these are pinned in requirements.txt:

| pin | why |
|---|---|
| fastapi==0.112.4 / starlette==0.38.6 | starlette ≥0.29 reordered TemplateResponse args → gradio passes a dict as template name → TypeError: unhashable type: 'dict' on every page load |
| huggingface_hub==0.25.2 | hub ≥1.0 removed HfFolder that gradio 4.44 imports |
| pydantic==2.10.6 | pydantic ≥2.11 emits bool additionalProperties → gradio_client 1.3.0 get_api_info() crashes |

Layout

app.py                 Gradio UI, queue config, theme
src/pdf_loader.py      PDF -> A4 RGB pages (PyMuPDF), tall-page slicing
src/sampling.py        Bridson Poisson-disk seed sampling
src/noise.py           Vectorized value noise (domain warp)
src/tearing.py         Voronoi partition + warp + piece extraction  ← core
src/optimizer.py       PNG optimization / quantization
src/queue_manager.py   Priority job queue (min-heap)
src/packager.py        ZIP + manifest builder
src/pipeline.py        PDF -> torn pages orchestration
tests/                 partition no-overlap + queue ordering

License

MIT