title: Dataset Maker
emoji: 🧩
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
python_version: '3.10'
app_file: app.py
pinned: false
license: mit
🧩 Dataset-Maker
Turn any PDF into a non-overlapping torn-fragment dataset for image stitching / fragment-reassembly research. Each page is rendered to A4, torn into irregular organic pieces on a black background, and exported as a ZIP with stitching ground truth.
Built to run on the HuggingFace Spaces free tier (Gradio SDK).
The guarantee: zero overlap
The pieces form a strict partition of the page - every pixel belongs to
exactly one fragment, so pieces never overlap and together cover the page
exactly. That makes the (x, y) offset of each piece an exact stitching label.
How: instead of drawing blobs (which overlap), we assign each pixel to its
nearest seed point (argmin). That is a Voronoi tessellation - already a
perfect partition. To make edges look torn instead of straight, we
domain-warp the pixel grid with value noise before the nearest-seed test.
Warping the query points (not the rule) keeps the result a partition while
boundaries become jagged and organic.
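The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in src/tearing.py: the function names (`value_noise`, `torn_partition`), the bilinear value noise, the `amplitude` and `cell` parameters, and the brute-force argmin over seeds are all assumptions made for the example. For many seeds, a SciPy cKDTree query would replace the brute-force distance array.

```python
import numpy as np

def value_noise(h, w, cell, rng):
    """Bilinearly interpolated random lattice values in [-1, 1], shape (h, w)."""
    lattice = rng.uniform(-1.0, 1.0, size=(h // cell + 2, w // cell + 2))
    ys, xs = np.arange(h) / cell, np.arange(w) / cell
    y0, x0 = ys.astype(int), xs.astype(int)
    ty, tx = (ys - y0)[:, None], (xs - x0)[None, :]
    yi, xi = y0[:, None], x0[None, :]
    top = lattice[yi, xi] + (lattice[yi, xi + 1] - lattice[yi, xi]) * tx
    bottom = lattice[yi + 1, xi] + (lattice[yi + 1, xi + 1] - lattice[yi + 1, xi]) * tx
    return top + (bottom - top) * ty

def torn_partition(h, w, seeds, amplitude=6.0, cell=8, rng_seed=0):
    """Label every pixel with the index of its nearest seed after warping the pixel grid."""
    rng = np.random.default_rng(rng_seed)
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    # Warp the query points, not the nearest-seed rule: each pixel still gets exactly one label.
    qx = xx + amplitude * value_noise(h, w, cell, rng)
    qy = yy + amplitude * value_noise(h, w, cell, rng)
    # Squared distance from every warped pixel to every seed, shape (h, w, S).
    d2 = (qx[..., None] - seeds[:, 0]) ** 2 + (qy[..., None] - seeds[:, 1]) ** 2
    return d2.argmin(axis=-1)

# Seeds are (x, y) points inside a 48-row by 64-column page (hypothetical values).
seeds = np.array([[10, 10], [50, 12], [30, 40], [55, 45]], dtype=float)
labels = torn_partition(48, 64, seeds)

# Partition check: summing the per-piece masks must give exactly 1 at every pixel.
masks = [labels == i for i in range(len(seeds))]
coverage = np.sum(masks, axis=0)
```

Because `argmin` returns a single index per pixel, the masks cannot overlap no matter how strong the warp is; the warp only changes where the boundaries fall.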
Inspired by Voronoi-jigsaw / eroded-boundary fragment literature: Eroded Boundaries, Pairwise Irregular Fragments, Deepzzle.
Pipeline
PDF ──▶ render @DPI ──▶ fit/slice to A4 ──▶ priority queue (cheap pages first)
                                                │
                         per-page seed ─────────▼
                  domain-warped Voronoi partition (no overlap)
                                                │
                         crop to bbox, black bg │
                                                ▼
                  PNG-optimize ──▶ ZIP + manifest.json (x,y labels)
Performance notes
- Gradio queue - demo.queue(max_size, default_concurrency_limit) caps load for the 2-vCPU free tier; the heavy event has its own concurrency_limit.
- Priority queue (src/queue_manager.py) - binary min-heap: push/pop O(log n), peek Θ(1), space Θ(n). Orders page jobs cheap-first so the user sees early progress and tail latency drops. Best case Ω(1) per op; no comparison-based heap can make both push and pop cheaper than Ω(log n) amortized under interleaved ops.
- Vectorized partition - SciPy cKDTree nearest-seed query is O(H·W·log S); mask extraction is Θ(H·W).
- Export - tight bbox crop + PNG optimize / compress_level=9; optional median-cut palette for smaller archives.
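The cheap-first ordering can be illustrated with Python's standard heapq module. This is a sketch under assumptions, not the contents of src/queue_manager.py: the class name `PageQueue`, the numeric `cost` values, and the insertion-order tie-breaker are hypothetical.

```python
import heapq
import itertools

class PageQueue:
    """Binary min-heap of (cost, tie_breaker, page) entries; cheapest page pops first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter so equal-cost pages pop in insertion order.
        self._counter = itertools.count()

    def push(self, cost, page):
        # O(log n): sift the new entry up the heap.
        heapq.heappush(self._heap, (cost, next(self._counter), page))

    def pop(self):
        # O(log n): remove the root and restore the heap property.
        cost, _, page = heapq.heappop(self._heap)
        return cost, page

    def peek(self):
        # O(1): the minimum always sits at index 0.
        cost, _, page = self._heap[0]
        return cost, page

    def __len__(self):
        return len(self._heap)

queue = PageQueue()
# Hypothetical cost estimates, e.g. something proportional to page size.
for page, cost in [("page_0003", 9.0), ("page_0001", 2.5), ("page_0002", 2.5)]:
    queue.push(cost, page)

first = queue.peek()
order = [queue.pop()[1] for _ in range(len(queue))]
# order == ["page_0001", "page_0002", "page_0003"]
```

The counter in the middle of each tuple keeps ties stable and ensures the page payload itself is never compared.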
Output layout
pieces/page_0001/piece_000.png # fragment on black bg
manifest.json # per-piece {file, x, y, w, h} = stitching GT
# + per-page adjacency [[i, j], ...] neighbor pairs
README.txt # reassembly snippet
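
Reassembly from the per-piece {file, x, y, w, h} entries can be sketched as follows. This is an illustration, not the snippet shipped in README.txt: the helper name `reassemble`, the in-memory `pieces` dictionary (standing in for PNGs loaded from the ZIP), and the synthetic two-piece page are assumptions made so the example runs without any files.

```python
import numpy as np

def reassemble(pieces, entries, page_h, page_w):
    """Paste each piece at its (x, y) offset onto a black canvas."""
    canvas = np.zeros((page_h, page_w, 3), dtype=np.uint8)
    for entry in entries:
        img = pieces[entry["file"]]
        x, y, w, h = entry["x"], entry["y"], entry["w"], entry["h"]
        region = canvas[y:y + h, x:x + w]
        # Max-composite: a piece's black background never erases a neighbor's pixels.
        np.maximum(region, img, out=region)
    return canvas

# Synthetic page split into two pieces by a diagonal (stand-in for a real partition).
rng = np.random.default_rng(0)
page = rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8)
yy, xx = np.mgrid[0:40, 0:60]
labels = (xx + yy > 50).astype(int)

pieces, entries = {}, []
for i in range(2):
    mask = labels == i
    rows, cols = np.where(mask)
    y0, y1, x0, x1 = rows.min(), rows.max() + 1, cols.min(), cols.max() + 1
    # Tight bbox crop; pixels outside the piece are set to black.
    crop = np.where(mask[y0:y1, x0:x1, None], page[y0:y1, x0:x1], 0).astype(np.uint8)
    name = f"piece_{i:03d}.png"
    pieces[name] = crop
    entries.append({"file": name, "x": int(x0), "y": int(y0),
                    "w": int(x1 - x0), "h": int(y1 - y0)})

restored = reassemble(pieces, entries, 40, 60)
```

Because the pieces partition the page, each canvas pixel receives a non-background value from at most one piece, so the max-composite reproduces the page exactly.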
Each page also lists adjacency - undirected [i, j] piece-index pairs that
share a torn border (4-connectivity). Use as positive pairs for pairwise /
graph-based stitching models (Deepzzle / PairingNet style); any unlisted pair is
a negative. Computed in one vectorized Θ(H·W) pass from the partition map - no
measurable pipeline overhead.
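
A vectorized adjacency pass of this kind might look like the sketch below. It assumes the partition is available as an integer label map; the function name `adjacency_pairs` and the toy 3×4 map are hypothetical, and the project's own implementation may differ.

```python
import numpy as np

def adjacency_pairs(labels):
    """Undirected [i, j] pairs (i < j) of pieces that touch under 4-connectivity."""
    # Compare each pixel with its right neighbor and its lower neighbor.
    horiz = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    vert = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.concatenate([horiz, vert])
    # Keep only pixel pairs that straddle a border between two different pieces.
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    # Sort within each pair so [j, i] and [i, j] collapse to one undirected edge.
    pairs = np.sort(pairs, axis=1)
    return np.unique(pairs, axis=0).tolist()

labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 3],
])
pairs = adjacency_pairs(labels)
# pairs == [[0, 1], [0, 2], [1, 2], [1, 3], [2, 3]]
```

Pieces 0 and 3 do not appear as a pair because they share no horizontal or vertical pixel border, so they would count as a negative.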
Run locally
Use Python 3.10–3.12. Python 3.13/3.14 have no PyMuPDF wheel yet and fall back to a source build.
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python app.py # http://127.0.0.1:7860
pytest -q # invariant + queue tests
Pinned web stack (don't loosen)
gradio==4.44.1 needs a matching server stack. Newer auto-resolved versions
break it, so these are pinned in requirements.txt:
| pin | why |
|---|---|
| fastapi==0.112.4 / starlette==0.38.6 | starlette ≥0.29 reordered TemplateResponse args → gradio passes a dict as template name → TypeError: unhashable type: 'dict' on every page load |
| huggingface_hub==0.25.2 | hub ≥1.0 removed HfFolder that gradio 4.44 imports |
| pydantic==2.10.6 | pydantic ≥2.11 emits bool additionalProperties → gradio_client 1.3.0 get_api_info() crashes |
Layout
app.py Gradio UI, queue config, theme
src/pdf_loader.py PDF -> A4 RGB pages (PyMuPDF), tall-page slicing
src/sampling.py Bridson Poisson-disk seed sampling
src/noise.py Vectorized value noise (domain warp)
src/tearing.py Voronoi partition + warp + piece extraction ← core
src/optimizer.py PNG optimization / quantization
src/queue_manager.py Priority job queue (min-heap)
src/packager.py ZIP + manifest builder
src/pipeline.py PDF -> torn pages orchestration
tests/ partition no-overlap + queue ordering
License
MIT