title: zeroshotGPU
sdk: gradio
app_file: app.py
python_version: 3.11
suggested_hardware: l4x1
short_description: Agentic zero-shot document parser with GT metrics + repair.
Zero-Shot GPU Document Parser
A self-hosted parsing control plane that profiles documents, routes pages to
parser experts, normalizes outputs, verifies quality with GT-comparison
metrics, repairs weak regions through a bounded verify/repair loop (with
optional GPU escalation), and emits auditable parsed-document artifacts plus
strategy-aware chunks. Implements the project described in
zero_shot_gpu_document_parser_project_spec.md.
The codebase is intentionally dependency-light by default. Text and Markdown work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR, PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair (Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated behind explicit config flags so a fresh clone never silently downloads multi-gigabyte weights.
Install
For the local MVP (text + PyMuPDF + Docling):
python -m pip install -e ".[pdf,yaml,docling,dev]"
Optional extras:
| Extra | Adds | Required for |
|---|---|---|
| embedding | sentence-transformers, transformers | benchmarks.retriever.backend=embedding |
| gpu_repair | transformers | repair.execute_gpu_escalations=true |
| spaces | mirrors requirements.txt for HF Spaces parity | running app.py locally as a Space simulant |
External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately;
configure each via parsers.<name>.command, output_args, and extra_args
in your YAML config.
Secrets:
cp .env.example .env
# Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos).
.env is gitignored. The CLI and app.py load it on startup; pre-set
environment variables (e.g. Space-side secrets) always win.
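The precedence rule above (values already in the environment win over `.env`) can be sketched as follows. This is a minimal illustration, not the project's actual loader: `load_env_file` is a hypothetical helper name, and the parsing only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path, environ=None):
    """Load KEY=VALUE pairs from a .env file without overriding existing variables.

    Hypothetical helper for illustration; the project's real loader may differ.
    """
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    if not env_path.exists():
        return environ
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # setdefault keeps any value that was already set (e.g. a Space secret).
        environ.setdefault(key, value)
    return environ
```

Passing a plain dict as `environ` makes the behavior easy to check without touching the real process environment.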
Quick start
Parse one document or a folder
python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample
python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4
python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml
Each parse writes a full artifact bundle. parsed_document.json is the
canonical record; chunks.jsonl is the retrieval-ready output;
quality_report.json carries every metric the verifier computed.
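As a rough sketch of consuming a bundle downstream, the snippet below loads the canonical record and the chunk stream. `read_jsonl` and `load_bundle` are hypothetical helper names; the only assumptions are that `parsed_document.json` is a single JSON document and that each non-empty line of `chunks.jsonl` is one JSON object. No particular chunk fields are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def load_bundle(output_dir):
    """Load the canonical record and the chunk stream from a parse output directory."""
    out = Path(output_dir)
    document = json.loads((out / "parsed_document.json").read_text(encoding="utf-8"))
    chunks = list(read_jsonl(out / "chunks.jsonl"))
    return document, chunks
```

For example, `document, chunks = load_bundle("./out/sample")` would read the output of the first `parse` command above.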
Run a benchmark
# Custom corpus, no GT: runs every metric that doesn't need labels
python -m zsgdp.cli benchmark --input ./docs --output ./bench
# Labelled datasets: adds layout F1 / table structure / formula CER
python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni
python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay
Compare parsers (ablation)
python -m zsgdp.cli benchmark-ablate \
--input ./docs --output ./bench/ablation \
--parser docling --parser pymupdf --parser text
Runs the benchmark once per parser plus a merged arm; emits
ablation_comparison.csv.
Compare across datasets
python -m zsgdp.cli combine-benchmarks \
--input ./bench/omni --label omnidocbench \
--input ./bench/doclay --label doclaynet \
--output ./bench/cross
Emits dataset_summary.csv and parser_matrix.csv (parser × dataset).
Before pushing to a Space: preflight
make preflight # unit + regression + space-check + parsers (~10s)
make preflight-full # ...plus an end-to-end benchmark smoke
A green preflight is the local signal that the branch is ready for the
Space. Pre-commit and pre-push hooks (see CONTRIBUTING.md)
make this automatic on every git push.
On the Space: smoke validation
Once deployed, exercise the deferred GPU/model paths:
make space-smoke # runs whichever of 5 smokes have their deps
python -m scripts.run_space_smoke --strict --output ./space_smoke.json
See docs/space_smoke.md for the manual fallback procedure (real PDF uploads, full Marker parses) and per-smoke acceptance criteria.
Opt-ins
Embedding retriever
Default retriever is lexical TF-IDF (zero deps). To use a real embedder:
# configs/myrun.yaml
benchmarks:
  retriever:
    backend: embedding
    model_id: jinaai/jina-embeddings-v3  # or any sentence-transformers model
    task: retrieval.passage
python -m pip install -e ".[embedding]"
python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml
The first call lazy-loads the model; subsequent calls reuse it in-process.
Set HF_TOKEN in .env for gated models.
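The lazy-load-then-reuse behavior can be sketched with `functools.lru_cache`. This is an illustration of the pattern only, not the project's retriever code: `get_embedder` and `_load_sentence_transformer` are hypothetical names, and the default loader assumes `sentence-transformers` is installed via the `embedding` extra.

```python
from functools import lru_cache


def _load_sentence_transformer(model_id):
    # Imported lazily so the heavy dependency is only needed when this path runs.
    # Some models require extra arguments (for example trust_remote_code=True);
    # check the model card before relying on this default.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id)


@lru_cache(maxsize=None)
def get_embedder(model_id, loader=_load_sentence_transformer):
    """Return a cached model; the loader runs once per (model_id, loader) pair."""
    return loader(model_id)
```

Making the loader injectable keeps the caching logic testable without downloading any weights.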
Live GPU repair
The repair controller plans GPU tasks for verification failures (invalid tables, OCR coverage gaps, reading-order issues, missing figure captions). By default these are dry-run only. To execute:
# configs/myrun.yaml
repair:
  gpu_escalation: true
  execute_gpu_escalations: true  # invokes the configured backend
gpu:
  backend: transformers  # or "vllm" for OpenAI-compat
  models:
    table:
      model_id: Qwen/Qwen2.5-VL-3B-Instruct
Each executed task writes its output back into the merged document with a
gpu_repair_task_id provenance field.
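The dry-run default can be pictured as a gate in front of the worker. The sketch below assumes that both `repair` flags must be true before anything executes and that the config is a plain dict shaped like the YAML above. The function names, the `task_id` key, and the result fields are all hypothetical and are not taken from the project's code.

```python
def should_execute_gpu_tasks(config):
    """Return True only when both repair flags are explicitly enabled.

    Assumes a plain-dict config; the project's real config object may differ.
    """
    repair = config.get("repair", {})
    return bool(repair.get("gpu_escalation")) and bool(
        repair.get("execute_gpu_escalations")
    )


def dispatch(tasks, config, run_task):
    """Run each planned task, or record it as dry-run when execution is gated off."""
    execute = should_execute_gpu_tasks(config)
    results = []
    for task in tasks:
        if execute:
            output = run_task(task)  # only called when both flags are on
            results.append(
                {"task_id": task["task_id"], "status": "executed", "output": output}
            )
        else:
            results.append(
                {"task_id": task["task_id"], "status": "dry_run", "output": None}
            )
    return results
```

With an empty config, `dispatch` never calls `run_task`, which mirrors the guarantee that a fresh clone does not invoke a model.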
Outputs
Every parse writes:
- parsed_document.json: canonical record (carries schema_version).
- document.md: human-readable Markdown reconstruction.
- elements.jsonl / tables.jsonl / figures.jsonl / chunks.jsonl: JSONL streams.
- chunking_plan.json: strategy ladder + per-strategy metadata.
- parser_metrics.json: per-parser candidate-level stats.
- quality_report.json: every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable).
- routing_report.json: page → parser routing decisions.
- profile.json: document profiler output.
- gpu_runtime.json: detected GPU/device state at parse time.
- gpu_tasks.jsonl (when model-backed work is planned) and gpu_task_report.json (preflight validation).
- conflict_report.json (when multiple parsers ran).
- artifact_manifest.json with SHA-256 checksums and the parsed-document schema version.
- assets/pages/*.png, assets/tables/*.png, assets/figures/*.png: rendered PDF page and region crops.
Benchmark runs additionally write:
- results.json: full structured summary including aggregate means.
- leaderboard.csv and per_parser_gt_leaderboard.csv: parser leaderboards (without and with GT comparison).
- per_parser_metrics.csv: per-document, per-parser GT-comparison breakdown.
- layout_runs.csv, table_structure_runs.csv, formula_runs.csv, retrieval_runs.csv, repair_runs.csv: per-document detail per metric family.
- parser_runs.csv, chunk_runs.csv, structure_runs.csv, chunk_quality.csv, throughput_runs.csv, ablations.json: additional detail.
benchmark-ablate adds ablation_comparison.csv. combine-benchmarks
adds dataset_summary.csv, parser_matrix.csv, and
cross_dataset_comparison.json.
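Because the manifest carries SHA-256 checksums, a consumer can check a bundle for corruption before trusting it. The sketch below deliberately takes the expected digests as a `{relative_path: hex_digest}` dict rather than parsing `artifact_manifest.json`, since the manifest's internal layout is not described here. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(output_dir, expected):
    """Compare files under output_dir against {relative_path: hex_digest}.

    Returns the sorted relative paths that are missing or whose digest differs.
    """
    root = Path(output_dir)
    bad = []
    for rel_path, want in expected.items():
        target = root / rel_path
        if not target.is_file() or sha256_of_file(target) != want.lower():
            bad.append(rel_path)
    return sorted(bad)
```

An empty return value means every listed file is present and matches its recorded digest.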
Architecture map
| Module | Responsibility |
|---|---|
| zsgdp/profiling/ | Cheap per-page features (scanned-score, table density, columns, etc.) |
| zsgdp/routing/ | Deterministic page → parser-expert decisions with budget |
| zsgdp/parsers/ | Adapters; one canonical schema regardless of source |
| zsgdp/normalize/ | Convert each parser's output into the schema |
| zsgdp/merge/ | Align candidates, dedupe, detect conflicts |
| zsgdp/verify/ | Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success |
| zsgdp/repair/ | Deterministic header/table fixes plus GPU escalation through gpu/worker.py |
| zsgdp/chunking/ | Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs |
| zsgdp/gpu/ | Task planning, batching, dry-run worker, transformers + vLLM clients |
| zsgdp/benchmarks/ | Dataset loaders, metric runners, ablation, cross-dataset, retrieval |
| zsgdp/cli.py | All entry points |
| app.py | Gradio Space UI |
The full spec is in zero_shot_gpu_document_parser_project_spec.md. §10 (schema) and §17 (chunking ladder) are the most useful sections to skim before touching those modules.
Production benchmark numbers
Spec §29 success criteria for reference:
- MVP: full agentic loop improves table QA by ≥20% over best single parser; agentic chunking improves citation accuracy by ≥10% over recursive baseline.
- Production-style (HR / financial reports / etc.): retrieval recall@5 ≥ 90%, citation accuracy ≥ 90%, table QA exactness ≥ 85%, manual review rate ≤ 10%, parser blocking failure rate ≤ 5%.
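The production-style thresholds can be expressed as a small table of comparisons. This is a sketch for checking a metrics dict against the criteria above; the metric key names are hypothetical and do not necessarily match the columns in `results.json`.

```python
import operator

# Thresholds copied from the production-style criteria above, as fractions.
# The metric key names are hypothetical, chosen for this sketch.
PRODUCTION_CRITERIA = {
    "retrieval_recall_at_5": (operator.ge, 0.90),
    "citation_accuracy": (operator.ge, 0.90),
    "table_qa_exactness": (operator.ge, 0.85),
    "manual_review_rate": (operator.le, 0.10),
    "parser_blocking_failure_rate": (operator.le, 0.05),
}


def failed_criteria(metrics, criteria=PRODUCTION_CRITERIA):
    """Return the names of criteria that are missing or not satisfied."""
    failures = []
    for name, (compare, threshold) in criteria.items():
        value = metrics.get(name)
        if value is None or not compare(value, threshold):
            failures.append(name)
    return failures
```

Treating a missing metric as a failure keeps an incomplete benchmark run from passing by omission.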
Smoke validation: arjun10g/zeroshotGPU ZeroGPU Space, 2026-05-07
Ran via /gradio_api/call/run_smokes_in_space against commit de03f34.
| Smoke | Status | Detail |
|---|---|---|
| lexical | pass | mean_quality_score=0.972, mean_retrieval_recall_at_1=1.0, n=1 doc |
| ablation | pass | 3 arms (text / pymupdf / merged), comparison CSV emitted |
| embedding | pass | mean_retrieval_recall_at_5=1.0, mean_retrieval_recall_at_1=1.0, 17.06s on A10g |
| gpu_repair | pass | wiring verified: dry_run_task_count=1, repair_iterations=1 |
| marker | skip | marker_single not installed (expected; opt-in via requirements.txt) |
Benchmark: regression fixtures, 2026-05-07
Ran via /gradio_api/call/run_benchmark_in_space against commit abadf36.
| Metric | Value | Notes |
|---|---|---|
| document_count | 1 | markdown_basic.input.md only |
| mean_quality_score | 0.964 | passes the §29 quality bar trivially on this fixture |
| mean_layout_f1 | 0.000 | no GT for custom_folder dataset |
| mean_table_structure_score | 0.000 | no GT |
| mean_formula_cer | 0.000 | no GT |
| mean_retrieval_recall_at_1 | 0.000 | first-rank miss; truth at rank 3 (single-doc corpus) |
| mean_retrieval_recall_at_5 | 1.000 | found within top-5 |
| mean_retrieval_mrr | 0.333 | implies truth rank ≈ 3 |
| mean_parser_disagreement_rate | 0.000 | only one parser (text) ran on .md |
| mean_repair_resolution_rate | 1.000 | every blocking issue pre-repair was resolved |
| mean_repair_regression_rate | 0.000 | repair introduced no new blocking issues |
Reading the numbers: this fixture-only benchmark validates the metric pipeline end-to-end on production hardware. The mean_layout_f1, mean_table_structure_score, and mean_formula_cer values are zero because custom_folder has no GT; those metrics activate when running against omnidocbench or doclaynet. To benchmark a real corpus, drop documents into a Space-side directory and call run_benchmark_in_space (or run zsgdp benchmark --input ./corpus --output ./bench from a Pro-tier Dev Mode terminal).
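The three retrieval rows in the table are consistent with a single query whose correct chunk landed at rank 3. The sketch below uses the standard definitions of recall@k and mean reciprocal rank to reproduce them. The project's metric runner may handle ties or multiple relevant items differently, so treat this as a reading aid, not as its implementation.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears at rank <= k (1-based ranks)."""
    return sum(1 for rank in ranks if rank is not None and rank <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over queries; a query with no hit (None) contributes 0."""
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)


ranks = [3]  # one query, truth found at rank 3, as in the table above
print(recall_at_k(ranks, 1))                  # 0.0
print(recall_at_k(ranks, 5))                  # 1.0
print(round(mean_reciprocal_rank(ranks), 3))  # 0.333
```

With only one query, every aggregate is a single observation, so these values say little about retrieval quality on a real corpus.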
Live GPU repair invocation: verified, deferred end-to-end
Live GPU repair pipeline mode loads configs/live_gpu_repair.yaml, runs the iterative verify/repair loop, detects the CUDA runtime, and dispatches GPU tasks when needed. On committed test fixtures, the deterministic markdown-table normalizer resolves invalid-table issues without needing the GPU: repair_resolution_rate=1.0 after one iteration, with zero GPU tasks executed.
Genuinely firing the live Qwen2.5-VL-3B invocation requires either a PDF with bbox-aware table extraction that pymupdf can't deterministically normalize, or a config that explicitly skips repair.table_repair so the GPU path is the only resolution route. Both are tractable follow-ups when a real malformed PDF is available; the wiring on production is confirmed up to that point.
Deployment
Targeted: Hugging Face Spaces, hardware l4x1, GPU/model target
zeroshotGPU.
Pre-deploy gate:
- make preflight (local).
- make preflight-full (local with end-to-end benchmark smoke).
- Duplicate the Space, set HF_TOKEN and any other secrets in Variables and secrets.
- Push.
- make space-smoke from the Space's JupyterLab terminal.
- Inspect docs/space_smoke.md Smoke 3 (live GPU repair) manually if the runner-level wiring smoke passed but you want full model-invocation validation.
- Run python -m zsgdp.cli benchmark against your representative corpus and update the table above.
The Space defaults to configs/docling.yaml (Docling + PyMuPDF
co-enabled so the parser disagreement rate has signal). Override via
ZSGDP_CONFIG_PATH in Space variables for custom configs.
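The override behavior amounts to an environment lookup with a default. A minimal sketch, assuming an empty `ZSGDP_CONFIG_PATH` should fall back to the default as well; `resolve_config_path` is a hypothetical name, not a function from `app.py`.

```python
import os

DEFAULT_CONFIG_PATH = "configs/docling.yaml"


def resolve_config_path(environ=None):
    """Return ZSGDP_CONFIG_PATH when set and non-empty, else the default path."""
    environ = os.environ if environ is None else environ
    return environ.get("ZSGDP_CONFIG_PATH") or DEFAULT_CONFIG_PATH
```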
Contributing
See CONTRIBUTING.md for setup, hooks, test layout, fixture format, parser/metric/schema-bump procedures, and the PR checklist.
For changes touching the on-disk schema, bump zsgdp.schema.SCHEMA_VERSION
and add an entry under ### Schema in CHANGELOG.md. The
artifact manifest surfaces the version under
parsed_document_schema_version so downstream consumers can gate.
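A downstream consumer might gate on the schema version roughly as follows. This sketch assumes `parsed_document_schema_version` is a top-level key of `artifact_manifest.json`. The supported-version set is a placeholder, and `check_schema_version` is a hypothetical helper.

```python
import json
from pathlib import Path

# Versions this hypothetical consumer knows how to read; the values are placeholders.
SUPPORTED_SCHEMA_VERSIONS = frozenset({"1"})


def check_schema_version(output_dir, supported=SUPPORTED_SCHEMA_VERSIONS):
    """Read the manifest and raise ValueError if the schema version is unsupported."""
    manifest_path = Path(output_dir) / "artifact_manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Compare as strings so integer and string versions are handled alike.
    version = str(manifest.get("parsed_document_schema_version"))
    if version not in supported:
        raise ValueError(f"Unsupported parsed-document schema version: {version!r}")
    return version
```

Failing loudly on an unknown version is a conservative choice; a consumer that tolerates additive changes could relax the check.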