Spaces:

arjun10g
/

zeroshotGPU

Running on Zero

App Files Files Community

zeroshotGPU / README.md

Arjunvir Singh

README: fill in production benchmark numbers from live Space validation

83059c2 12 days ago

preview code

raw

history blame contribute delete

14.9 kB

	---
	title: zeroshotGPU
	sdk: gradio
	app_file: app.py
	python_version: 3.11
	suggested_hardware: l4x1
	short_description: Agentic zero-shot document parser with GT metrics + repair.
	---

	# Zero-Shot GPU Document Parser

	A self-hosted parsing control plane that profiles documents, routes pages to
	parser experts, normalizes outputs, verifies quality with GT-comparison
	metrics, repairs weak regions through a bounded verify/repair loop (with
	optional GPU escalation), and emits auditable parsed-document artifacts plus
	strategy-aware chunks. Implements the project described in
	[`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md).

	The codebase is intentionally dependency-light by default. Text and Markdown
	work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR,
	PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair
	(Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated
	behind explicit config flags so a fresh clone never silently downloads
	multi-gigabyte weights.

	---

	## Install

	For the local MVP (text + PyMuPDF + Docling):

	```bash
	python -m pip install -e ".[pdf,yaml,docling,dev]"
	```

	Optional extras:

	\| Extra \| Adds \| Required for \|
	\|---------------\|--------------------------------------------------\|-----------------------------------------------\|
	\| `embedding` \| `sentence-transformers`, `transformers` \| `benchmarks.retriever.backend=embedding` \|
	\| `gpu_repair` \| `transformers` \| `repair.execute_gpu_escalations=true` \|
	\| `spaces` \| mirrors `requirements.txt` for HF Spaces parity \| running `app.py` locally as a Space simulant \|

	External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately;
	configure each via `parsers.<name>.command`, `output_args`, and `extra_args`
	in your YAML config.

	Secrets:

	```bash
	cp .env.example .env
	# Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos).
	```

	`.env` is gitignored. The CLI and `app.py` load it on startup; pre-set
	environment variables (e.g. Space-side secrets) always win.

	---

	## Quick start

	### Parse one document or a folder

	```bash
	python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample
	python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4
	python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml
	```

	Each parse writes a full artifact bundle. `parsed_document.json` is the
	canonical record; `chunks.jsonl` is the retrieval-ready output;
	`quality_report.json` carries every metric the verifier computed.

	### Run a benchmark

	```bash
	# Custom corpus, no GT — runs every metric that doesn't need labels:
	python -m zsgdp.cli benchmark --input ./docs --output ./bench

	# Labelled datasets — adds layout F1 / table structure / formula CER:
	python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni
	python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay
	```

	### Compare parsers (ablation)

	```bash
	python -m zsgdp.cli benchmark-ablate \
	--input ./docs --output ./bench/ablation \
	--parser docling --parser pymupdf --parser text
	```

	Runs the benchmark once per parser plus a merged arm; emits
	`ablation_comparison.csv`.

	### Compare across datasets

	```bash
	python -m zsgdp.cli combine-benchmarks \
	--input ./bench/omni --label omnidocbench \
	--input ./bench/doclay --label doclaynet \
	--output ./bench/cross
	```

	Emits `dataset_summary.csv` and `parser_matrix.csv` (parser × dataset).

	### Before pushing to a Space — preflight

	```bash
	make preflight # unit + regression + space-check + parsers (~10s)
	make preflight-full # ...plus an end-to-end benchmark smoke
	```

	A green preflight is the local signal that the branch is ready for the
	Space. Pre-commit and pre-push hooks (see [CONTRIBUTING.md](CONTRIBUTING.md))
	make this automatic on every `git push`.

	### On the Space — smoke validation

	Once deployed, exercise the deferred GPU/model paths:

	```bash
	make space-smoke # runs whichever of 5 smokes have their deps
	python -m scripts.run_space_smoke --strict --output ./space_smoke.json
	```

	See [docs/space_smoke.md](docs/space_smoke.md) for the manual fallback
	procedure (real PDF uploads, full Marker parses) and per-smoke
	acceptance criteria.

	---

	## Opt-ins

	### Embedding retriever

	Default retriever is lexical TF-IDF (zero deps). To use a real embedder:

	```yaml
	# configs/myrun.yaml
	benchmarks:
	retriever:
	backend: embedding
	model_id: jinaai/jina-embeddings-v3 # or any sentence-transformers model
	task: retrieval.passage
	```

	```bash
	python -m pip install -e ".[embedding]"
	python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml
	```

	The first call lazy-loads the model; subsequent calls reuse it in-process.
	Set `HF_TOKEN` in `.env` for gated models.

	### Live GPU repair

	The repair controller plans GPU tasks for verification failures (invalid
	tables, OCR coverage gaps, reading-order issues, missing figure captions).
	By default these are dry-run only. To execute:

	```yaml
	# configs/myrun.yaml
	repair:
	gpu_escalation: true
	execute_gpu_escalations: true # invokes the configured backend
	gpu:
	backend: transformers # or "vllm" for OpenAI-compat
	models:
	table:
	model_id: Qwen/Qwen2.5-VL-3B-Instruct
	```

	Each executed task writes its output back into the merged document with a
	`gpu_repair_task_id` provenance field.

	---

	## Outputs

	Every parse writes:

	- `parsed_document.json` — canonical record (carries `schema_version`).
	- `document.md` — human-readable Markdown reconstruction.
	- `elements.jsonl` / `tables.jsonl` / `figures.jsonl` / `chunks.jsonl` — JSONL streams.
	- `chunking_plan.json` — strategy ladder + per-strategy metadata.
	- `parser_metrics.json` — per-parser candidate-level stats.
	- `quality_report.json` — every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable).
	- `routing_report.json` — page → parser routing decisions.
	- `profile.json` — document profiler output.
	- `gpu_runtime.json` — detected GPU/device state at parse time.
	- `gpu_tasks.jsonl` (when model-backed work is planned) and `gpu_task_report.json` (preflight validation).
	- `conflict_report.json` (when multiple parsers ran).
	- `artifact_manifest.json` with SHA-256 checksums and the parsed-document schema version.
	- `assets/pages/.png`, `assets/tables/.png`, `assets/figures/*.png` — rendered PDF page and region crops.

	Benchmark runs additionally write:

	- `results.json` — full structured summary including aggregate means.
	- `leaderboard.csv` and `per_parser_gt_leaderboard.csv` — parser leaderboards (without and with GT comparison).
	- `per_parser_metrics.csv` — per-document, per-parser GT-comparison breakdown.
	- `layout_runs.csv`, `table_structure_runs.csv`, `formula_runs.csv`, `retrieval_runs.csv`, `repair_runs.csv` — per-document detail per metric family.
	- `parser_runs.csv`, `chunk_runs.csv`, `structure_runs.csv`, `chunk_quality.csv`, `throughput_runs.csv`, `ablations.json` — additional detail.

	`benchmark-ablate` adds `ablation_comparison.csv`. `combine-benchmarks`
	adds `dataset_summary.csv`, `parser_matrix.csv`, and
	`cross_dataset_comparison.json`.

	---

	## Architecture map

	\| Module \| Responsibility \|
	\|-------------------------\|-------------------------------------------------------------------------\|
	\| `zsgdp/profiling/` \| Cheap per-page features (scanned-score, table density, columns, etc.) \|
	\| `zsgdp/routing/` \| Deterministic page → parser-expert decisions with budget \|
	\| `zsgdp/parsers/` \| Adapters; one canonical schema regardless of source \|
	\| `zsgdp/normalize/` \| Convert each parser's output into the schema \|
	\| `zsgdp/merge/` \| Align candidates, dedupe, detect conflicts \|
	\| `zsgdp/verify/` \| Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success \|
	\| `zsgdp/repair/` \| Deterministic header/table fixes plus GPU escalation through `gpu/worker.py` \|
	\| `zsgdp/chunking/` \| Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs \|
	\| `zsgdp/gpu/` \| Task planning, batching, dry-run worker, transformers + vLLM clients \|
	\| `zsgdp/benchmarks/` \| Dataset loaders, metric runners, ablation, cross-dataset, retrieval \|
	\| `zsgdp/cli.py` \| All entry points \|
	\| `app.py` \| Gradio Space UI \|

	The full spec is in [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md). §10 (schema) and §17 (chunking ladder) are the most useful sections to skim before touching those modules.

	---

	## Production benchmark numbers

	Spec §29 success criteria for reference:

	- MVP: full agentic loop improves table QA by ≥20% over best single parser; agentic chunking improves citation accuracy by ≥10% over recursive baseline.
	- Production-style (HR / financial reports / etc.): retrieval recall@5 ≥ 90%, citation accuracy ≥ 90%, table QA exactness ≥ 85%, manual review rate ≤ 10%, parser blocking failure rate ≤ 5%.

	### Smoke validation — `arjun10g/zeroshotGPU` ZeroGPU Space, `2026-05-07`

	Ran via `/gradio_api/call/run_smokes_in_space` against commit `de03f34`.

	\| Smoke \| Status \| Detail \|
	\|---------------\|--------\|-----------------------------------------------------------------------------------\|
	\| `lexical` \| pass \| `mean_quality_score=0.972`, `mean_retrieval_recall_at_1=1.0`, n=1 doc \|
	\| `ablation` \| pass \| 3 arms (text / pymupdf / merged), comparison CSV emitted \|
	\| `embedding` \| pass \| `mean_retrieval_recall_at_5=1.0`, `mean_retrieval_recall_at_1=1.0`, 17.06s on A10g \|
	\| `gpu_repair` \| pass \| wiring verified: `dry_run_task_count=1`, `repair_iterations=1` \|
	\| `marker` \| skip \| `marker_single` not installed (expected; opt-in via `requirements.txt`) \|

	### Benchmark — regression fixtures, `2026-05-07`

	Ran via `/gradio_api/call/run_benchmark_in_space` against commit `abadf36`.

	\| Metric \| Value \| Notes \|
	\|---------------------------------\|--------\|--------------------------------------------------------\|
	\| `document_count` \| 1 \| `markdown_basic.input.md` only \|
	\| `mean_quality_score` \| 0.964 \| passes the §29 quality bar trivially on this fixture \|
	\| `mean_layout_f1` \| 0.000 \| no GT for `custom_folder` dataset \|
	\| `mean_table_structure_score` \| 0.000 \| no GT \|
	\| `mean_formula_cer` \| 0.000 \| no GT \|
	\| `mean_retrieval_recall_at_1` \| 0.000 \| first-rank miss; truth at rank 3 (single-doc corpus) \|
	\| `mean_retrieval_recall_at_5` \| 1.000 \| found within top-5 \|
	\| `mean_retrieval_mrr` \| 0.333 \| implies truth rank ≈ 3 \|
	\| `mean_parser_disagreement_rate` \| 0.000 \| only one parser (`text`) ran on `.md` \|
	\| `mean_repair_resolution_rate` \| 1.000 \| every blocking issue pre-repair was resolved \|
	\| `mean_repair_regression_rate` \| 0.000 \| repair introduced no new blocking issues \|

	Reading the numbers: this fixture-only benchmark validates the metric pipeline end-to-end on production hardware. The `mean_layout_f1`/`table_structure_score`/`formula_cer` zeros are because `custom_folder` has no GT — those metrics activate when running against `omnidocbench` or `doclaynet`. To benchmark a real corpus, drop documents into a Space-side directory and call `run_benchmark_in_space` (or run `zsgdp benchmark --input ./corpus --output ./bench` from a Pro-tier Dev Mode terminal).

	### Live GPU repair invocation — verified, deferred end-to-end

	`Live GPU repair` pipeline mode loads `configs/live_gpu_repair.yaml`, runs the iterative verify/repair loop, detects `cuda` runtime, and dispatches GPU tasks when needed. On committed test fixtures, the deterministic markdown-table normalizer resolves invalid-table issues without needing the GPU — `repair_resolution_rate=1.0` after one iteration, zero GPU tasks executed.

	Genuinely firing the live `Qwen2.5-VL-3B` invocation requires either a PDF with bbox-aware table extraction that pymupdf can't deterministically normalize, or a config that explicitly skips `repair.table_repair` so the GPU path is the only resolution route. Both are tractable follow-ups when a real malformed PDF is available; the wiring on production is confirmed up to that point.

	---

	## Deployment

	Targeted: Hugging Face Spaces, hardware `l4x1`, GPU/model target
	`zeroshotGPU`.

	Pre-deploy gate:

	1. `make preflight` (local).
	2. `make preflight-full` (local with end-to-end benchmark smoke).
	3. Duplicate the Space, set `HF_TOKEN` and any other secrets in Variables and secrets.
	4. Push.
	5. `make space-smoke` from the Space's JupyterLab terminal.
	6. Inspect [docs/space_smoke.md](docs/space_smoke.md) Smoke 3 (live GPU repair) manually if the runner-level wiring smoke passed but you want full model-invocation validation.
	7. Run `python -m zsgdp.cli benchmark` against your representative corpus and update the table above.

	The Space defaults to `configs/docling.yaml` (Docling + PyMuPDF
	co-enabled so the parser disagreement rate has signal). Override via
	`ZSGDP_CONFIG_PATH` in Space variables for custom configs.

	---

	## Contributing

	See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, hooks, test layout,
	fixture format, parser/metric/schema-bump procedures, and the PR checklist.

	For changes touching the on-disk schema, bump `zsgdp.schema.SCHEMA_VERSION`
	and add an entry under `### Schema` in [CHANGELOG.md](CHANGELOG.md). The
	artifact manifest surfaces the version under
	`parsed_document_schema_version` so downstream consumers can gate.