| title: zeroshotGPU | |
| sdk: gradio | |
| app_file: app.py | |
| python_version: 3.11 | |
| suggested_hardware: l4x1 | |
| short_description: Agentic zero-shot document parser with GT metrics + repair. | |
| # Zero-Shot GPU Document Parser | |
| A self-hosted parsing control plane that profiles documents, routes pages to | |
| parser experts, normalizes outputs, verifies quality with GT-comparison | |
| metrics, repairs weak regions through a bounded verify/repair loop (with | |
| optional GPU escalation), and emits auditable parsed-document artifacts plus | |
| strategy-aware chunks. Implements the project described in | |
| [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md). | |
| The codebase is intentionally dependency-light by default. Text and Markdown | |
| work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR, | |
| PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair | |
| (Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated | |
| behind explicit config flags so a fresh clone never silently downloads | |
| multi-gigabyte weights. | |
| --- | |
| ## Install | |
| For the local MVP (text + PyMuPDF + Docling): | |
| ```bash | |
| python -m pip install -e ".[pdf,yaml,docling,dev]" | |
| ``` | |
| Optional extras: | |
| | Extra | Adds | Required for | | |
| |---------------|--------------------------------------------------|-----------------------------------------------| | |
| | `embedding` | `sentence-transformers`, `transformers` | `benchmarks.retriever.backend=embedding` | | |
| | `gpu_repair` | `transformers` | `repair.execute_gpu_escalations=true` | | |
| | `spaces` | mirrors `requirements.txt` for HF Spaces parity | running `app.py` locally as a Space simulant | | |
| External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately; | |
| configure each via `parsers.<name>.command`, `output_args`, and `extra_args` | |
| in your YAML config. | |
| Secrets: | |
| ```bash | |
| cp .env.example .env | |
| # Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos). | |
| ``` | |
| `.env` is gitignored. The CLI and `app.py` load it on startup; pre-set | |
| environment variables (e.g. Space-side secrets) always win. | |
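The precedence rule described above (variables already set in the environment win over `.env`) can be sketched as follows. This is a minimal illustration, not the project's actual loader: `load_env_file` is a hypothetical helper, and it only handles simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path, environ=None):
    """Load KEY=VALUE pairs from a .env file without overriding existing variables.

    Hypothetical helper for illustration; not the project's actual loader.
    """
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    if not env_path.is_file():
        return environ
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault keeps any value that was set before the file was read.
        environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
    return environ
```

Using `setdefault` rather than plain assignment is what lets a Space-side secret survive a stale local `.env`.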
| --- | |
| ## Quick start | |
| ### Parse one document or a folder | |
| ```bash | |
| python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample | |
| python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4 | |
| python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml | |
| ``` | |
| Each parse writes a full artifact bundle. `parsed_document.json` is the | |
| canonical record; `chunks.jsonl` is the retrieval-ready output; | |
| `quality_report.json` carries every metric the verifier computed. | |
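Because `chunks.jsonl` is a JSON Lines file, downstream code can stream it one record at a time. The sketch below makes no assumption about the fields inside each chunk; `iter_jsonl` and `count_chunks` are hypothetical helper names, not part of `zsgdp`.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_chunks(output_dir):
    """Count the records in <output_dir>/chunks.jsonl."""
    return sum(1 for _ in iter_jsonl(Path(output_dir) / "chunks.jsonl"))
```

For example, `count_chunks("./out/sample")` would report how many chunks the first command above produced.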
| ### Run a benchmark | |
| ```bash | |
| # Custom corpus, no GT (runs every metric that doesn't need labels): | |
| python -m zsgdp.cli benchmark --input ./docs --output ./bench | |
| # Labelled datasets (adds layout F1 / table structure / formula CER): | |
| python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni | |
| python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay | |
| ``` | |
| ### Compare parsers (ablation) | |
| ```bash | |
| python -m zsgdp.cli benchmark-ablate \ | |
| --input ./docs --output ./bench/ablation \ | |
| --parser docling --parser pymupdf --parser text | |
| ``` | |
| Runs the benchmark once per parser plus a merged arm; emits | |
| `ablation_comparison.csv`. | |
| ### Compare across datasets | |
| ```bash | |
| python -m zsgdp.cli combine-benchmarks \ | |
| --input ./bench/omni --label omnidocbench \ | |
| --input ./bench/doclay --label doclaynet \ | |
| --output ./bench/cross | |
| ``` | |
| Emits `dataset_summary.csv` and `parser_matrix.csv` (parser × dataset). | |
| ### Before pushing to a Space: preflight | |
| ```bash | |
| make preflight # unit + regression + space-check + parsers (~10s) | |
| make preflight-full # ...plus an end-to-end benchmark smoke | |
| ``` | |
| A green preflight is the local signal that the branch is ready for the | |
| Space. Pre-commit and pre-push hooks (see [CONTRIBUTING.md](CONTRIBUTING.md)) | |
| make this automatic on every `git push`. | |
| ### On the Space: smoke validation | |
| Once deployed, exercise the deferred GPU/model paths: | |
| ```bash | |
| make space-smoke # runs whichever of 5 smokes have their deps | |
| python -m scripts.run_space_smoke --strict --output ./space_smoke.json | |
| ``` | |
| See [docs/space_smoke.md](docs/space_smoke.md) for the manual fallback | |
| procedure (real PDF uploads, full Marker parses) and per-smoke | |
| acceptance criteria. | |
| --- | |
| ## Opt-ins | |
| ### Embedding retriever | |
| The default retriever is lexical TF-IDF (no extra dependencies). To use a real embedder: | |
| ```yaml | |
| # configs/myrun.yaml | |
| benchmarks: | |
| retriever: | |
| backend: embedding | |
| model_id: jinaai/jina-embeddings-v3 # or any sentence-transformers model | |
| task: retrieval.passage | |
| ``` | |
| ```bash | |
| python -m pip install -e ".[embedding]" | |
| python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml | |
| ``` | |
| The first call lazy-loads the model; subsequent calls reuse it in-process. | |
| Set `HF_TOKEN` in `.env` for gated models. | |
| ### Live GPU repair | |
| The repair controller plans GPU tasks for verification failures (invalid | |
| tables, OCR coverage gaps, reading-order issues, missing figure captions). | |
| By default these are dry-run only. To execute: | |
| ```yaml | |
| # configs/myrun.yaml | |
| repair: | |
| gpu_escalation: true | |
| execute_gpu_escalations: true # invokes the configured backend | |
| gpu: | |
| backend: transformers # or "vllm" for OpenAI-compat | |
| models: | |
| table: | |
| model_id: Qwen/Qwen2.5-VL-3B-Instruct | |
| ``` | |
| Each executed task writes its output back into the merged document with a | |
| `gpu_repair_task_id` provenance field. | |
| --- | |
| ## Outputs | |
| Every parse writes: | |
| - `parsed_document.json`: canonical record (carries `schema_version`). | |
| - `document.md`: human-readable Markdown reconstruction. | |
| - `elements.jsonl` / `tables.jsonl` / `figures.jsonl` / `chunks.jsonl`: JSONL streams. | |
| - `chunking_plan.json`: strategy ladder + per-strategy metadata. | |
| - `parser_metrics.json`: per-parser candidate-level stats. | |
| - `quality_report.json`: every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable). | |
| - `routing_report.json`: page → parser routing decisions. | |
| - `profile.json`: document profiler output. | |
| - `gpu_runtime.json`: detected GPU/device state at parse time. | |
| - `gpu_tasks.jsonl` (when model-backed work is planned) and `gpu_task_report.json` (preflight validation). | |
| - `conflict_report.json` (when multiple parsers ran). | |
| - `artifact_manifest.json` with SHA-256 checksums and the parsed-document schema version. | |
| - `assets/pages/*.png`, `assets/tables/*.png`, `assets/figures/*.png`: rendered PDF page and region crops. | |
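Since the manifest records SHA-256 checksums, a consumer can confirm that a bundle was not truncated or modified in transit. The sketch below assumes the checksums have already been read into a flat `{relative_name: hex_digest}` mapping. That shape is an assumption for illustration (the real layout of `artifact_manifest.json` may differ), and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(root, expected):
    """Return the names whose file is missing or whose digest differs.

    `expected` maps relative file names to hex digests; this flat mapping
    is an assumed shape, not the documented manifest format.
    """
    root = Path(root)
    mismatches = []
    for name, digest in sorted(expected.items()):
        target = root / name
        if not target.is_file() or sha256_of_file(target) != digest.lower():
            mismatches.append(name)
    return mismatches
```

An empty list from `find_mismatches` means every listed artifact is present and matches its recorded digest.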
| Benchmark runs additionally write: | |
| - `results.json`: full structured summary including aggregate means. | |
| - `leaderboard.csv` and `per_parser_gt_leaderboard.csv`: parser leaderboards (without and with GT comparison, respectively). | |
| - `per_parser_metrics.csv`: per-document, per-parser GT-comparison breakdown. | |
| - `layout_runs.csv`, `table_structure_runs.csv`, `formula_runs.csv`, `retrieval_runs.csv`, `repair_runs.csv`: per-document detail per metric family. | |
| - `parser_runs.csv`, `chunk_runs.csv`, `structure_runs.csv`, `chunk_quality.csv`, `throughput_runs.csv`, `ablations.json`: additional detail. | |
| `benchmark-ablate` adds `ablation_comparison.csv`. `combine-benchmarks` | |
| adds `dataset_summary.csv`, `parser_matrix.csv`, and | |
| `cross_dataset_comparison.json`. | |
| --- | |
| ## Architecture map | |
| | Module | Responsibility | | |
| |-------------------------|-------------------------------------------------------------------------| | |
| | `zsgdp/profiling/` | Cheap per-page features (scanned-score, table density, columns, etc.) | | |
| | `zsgdp/routing/` | Deterministic page → parser-expert decisions with budget | | |
| | `zsgdp/parsers/` | Adapters; one canonical schema regardless of source | | |
| | `zsgdp/normalize/` | Convert each parser's output into the schema | | |
| | `zsgdp/merge/` | Align candidates, dedupe, detect conflicts | | |
| | `zsgdp/verify/` | Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success | | |
| | `zsgdp/repair/` | Deterministic header/table fixes plus GPU escalation through `gpu/worker.py` | | |
| | `zsgdp/chunking/` | Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs | | |
| | `zsgdp/gpu/` | Task planning, batching, dry-run worker, transformers + vLLM clients | | |
| | `zsgdp/benchmarks/` | Dataset loaders, metric runners, ablation, cross-dataset, retrieval | | |
| | `zsgdp/cli.py` | All entry points | | |
| | `app.py` | Gradio Space UI | | |
| The full spec is in [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md). §10 (schema) and §17 (chunking ladder) are the most useful sections to skim before touching those modules. | |
| --- | |
| ## Production benchmark numbers | |
| Spec §29 success criteria for reference: | |
| - **MVP:** full agentic loop improves table QA by ≥ 20% over best single parser; agentic chunking improves citation accuracy by ≥ 10% over recursive baseline. | |
| - **Production-style (HR / financial reports / etc.):** retrieval recall@5 ≥ 90%, citation accuracy ≥ 90%, table QA exactness ≥ 85%, manual review rate ≤ 10%, parser blocking failure rate ≤ 5%. | |
| ### Smoke validation: `arjun10g/zeroshotGPU` ZeroGPU Space, `2026-05-07` | |
| Ran via `/gradio_api/call/run_smokes_in_space` against commit `de03f34`. | |
| | Smoke | Status | Detail | | |
| |---------------|--------|-----------------------------------------------------------------------------------| | |
| | `lexical` | pass | `mean_quality_score=0.972`, `mean_retrieval_recall_at_1=1.0`, n=1 doc | | |
| | `ablation` | pass | 3 arms (text / pymupdf / merged), comparison CSV emitted | | |
| | `embedding` | pass | `mean_retrieval_recall_at_5=1.0`, `mean_retrieval_recall_at_1=1.0`, 17.06s on A10g | | |
| | `gpu_repair` | pass | wiring verified: `dry_run_task_count=1`, `repair_iterations=1` | | |
| | `marker` | skip | `marker_single` not installed (expected; opt-in via `requirements.txt`) | | |
| ### Benchmark: regression fixtures, `2026-05-07` | |
| Ran via `/gradio_api/call/run_benchmark_in_space` against commit `abadf36`. | |
| | Metric | Value | Notes | | |
| |---------------------------------|--------|--------------------------------------------------------| | |
| | `document_count` | 1 | `markdown_basic.input.md` only | | |
| | `mean_quality_score` | 0.964 | passes the §29 quality bar trivially on this fixture | | |
| | `mean_layout_f1` | 0.000 | no GT for `custom_folder` dataset | | |
| | `mean_table_structure_score` | 0.000 | no GT | | |
| | `mean_formula_cer` | 0.000 | no GT | | |
| | `mean_retrieval_recall_at_1` | 0.000 | first-rank miss; truth at rank 3 (single-doc corpus) | | |
| | `mean_retrieval_recall_at_5` | 1.000 | found within top-5 | | |
| | `mean_retrieval_mrr` | 0.333 | implies truth rank ≈ 3 | | |
| | `mean_parser_disagreement_rate` | 0.000 | only one parser (`text`) ran on `.md` | | |
| | `mean_repair_resolution_rate` | 1.000 | every blocking issue pre-repair was resolved | | |
| | `mean_repair_regression_rate` | 0.000 | repair introduced no new blocking issues | | |
| **Reading the numbers:** this fixture-only benchmark validates the metric pipeline end-to-end on production hardware. The zeros for `mean_layout_f1`, `mean_table_structure_score`, and `mean_formula_cer` arise because the `custom_folder` dataset has no GT; those metrics activate when running against `omnidocbench` or `doclaynet`. To benchmark a real corpus, drop documents into a Space-side directory and call `run_benchmark_in_space` (or run `zsgdp benchmark --input ./corpus --output ./bench` from a Pro-tier Dev Mode terminal). | |
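The three retrieval rows in the table are mutually consistent under the standard definitions of recall@k and reciprocal rank. The sketch below is a worked check, assuming a single query whose ground-truth chunk is ranked third; the function names are illustrative, and the project's own metric code may differ (for instance, if a query has several relevant chunks).

```python
def recall_at_k(rank, k):
    """Return 1.0 if the relevant item is within the top k, else 0.0.

    `rank` is 1-based; None means the item was not retrieved at all.
    """
    return 1.0 if rank is not None and rank <= k else 0.0


def reciprocal_rank(rank):
    """Return 1/rank for a retrieved item, or 0.0 if it was not retrieved."""
    return 0.0 if rank is None else 1.0 / rank


def mean(values):
    """Arithmetic mean that returns 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


ranks = [3]  # assumed: one query, ground truth at rank 3
print(mean(recall_at_k(r, 1) for r in ranks))              # 0.0
print(mean(recall_at_k(r, 5) for r in ranks))              # 1.0
print(round(mean(reciprocal_rank(r) for r in ranks), 3))   # 0.333
```

With more than one query, a mean reciprocal rank of 0.333 would no longer pin down a single rank, so the "rank 3" reading holds only for this one-query case.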
| ### Live GPU repair invocation: verified, deferred end-to-end | |
| The `Live GPU repair` pipeline mode loads `configs/live_gpu_repair.yaml`, runs the iterative verify/repair loop, detects the `cuda` runtime, and dispatches GPU tasks when needed. On committed test fixtures, the deterministic markdown-table normalizer resolves invalid-table issues without needing the GPU: `repair_resolution_rate=1.0` after one iteration, with zero GPU tasks executed. | |
| Genuinely firing the live `Qwen2.5-VL-3B` invocation requires either a PDF with bbox-aware table extraction that pymupdf can't deterministically normalize, or a config that explicitly skips `repair.table_repair` so the GPU path is the only resolution route. Both are tractable follow-ups when a real malformed PDF is available; the wiring on production is confirmed up to that point. | |
| --- | |
| ## Deployment | |
| Targeted: Hugging Face Spaces, hardware `l4x1`, GPU/model target | |
| `zeroshotGPU`. | |
| Pre-deploy gate: | |
| 1. `make preflight` (local). | |
| 2. `make preflight-full` (local with end-to-end benchmark smoke). | |
| 3. Duplicate the Space, set `HF_TOKEN` and any other secrets in **Variables and secrets**. | |
| 4. Push. | |
| 5. `make space-smoke` from the Space's JupyterLab terminal. | |
| 6. Inspect [docs/space_smoke.md](docs/space_smoke.md) Smoke 3 (live GPU repair) manually if the runner-level wiring smoke passed but you want full model-invocation validation. | |
| 7. Run `python -m zsgdp.cli benchmark` against your representative corpus and update the table above. | |
| The Space defaults to `configs/docling.yaml` (Docling + PyMuPDF | |
| co-enabled so the parser disagreement rate has signal). Override via | |
| `ZSGDP_CONFIG_PATH` in Space variables for custom configs. | |
| --- | |
| ## Contributing | |
| See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, hooks, test layout, | |
| fixture format, parser/metric/schema-bump procedures, and the PR checklist. | |
| For changes touching the on-disk schema, bump `zsgdp.schema.SCHEMA_VERSION` | |
| and add an entry under `### Schema` in [CHANGELOG.md](CHANGELOG.md). The | |
| artifact manifest surfaces the version under | |
| `parsed_document_schema_version` so downstream consumers can gate. | |
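A downstream consumer might gate on the manifest version along these lines. This is a sketch under stated assumptions: it treats `parsed_document_schema_version` as a top-level key of `artifact_manifest.json`, compares versions as strings, and uses hypothetical helper names; none of this is confirmed against the project's code.

```python
import json
from pathlib import Path


def read_schema_version(manifest_path):
    """Return the schema version recorded in a manifest file, or None if absent.

    Assumes the key sits at the top level of the manifest JSON object.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return manifest.get("parsed_document_schema_version")


def is_supported(version, supported_versions):
    """Return True when `version` matches one of the versions the consumer handles."""
    if version is None:
        return False
    return str(version) in {str(v) for v in supported_versions}
```

A consumer would call `is_supported(read_schema_version(path), {...})` with the versions it was written against and refuse to ingest the bundle on `False`.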