Spaces:
Running on Zero
Running on Zero
File size: 14,904 Bytes
db06ffa 6a4c21f db06ffa 83059c2 db06ffa 83059c2 db06ffa | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 | ---
title: zeroshotGPU
sdk: gradio
app_file: app.py
python_version: 3.11
suggested_hardware: l4x1
short_description: Agentic zero-shot document parser with GT metrics + repair.
---
# Zero-Shot GPU Document Parser
A self-hosted parsing control plane that profiles documents, routes pages to
parser experts, normalizes outputs, verifies quality with GT-comparison
metrics, repairs weak regions through a bounded verify/repair loop (with
optional GPU escalation), and emits auditable parsed-document artifacts plus
strategy-aware chunks. Implements the project described in
[`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md).
The codebase is intentionally dependency-light by default. Text and Markdown
work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR,
PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair
(Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated
behind explicit config flags so a fresh clone never silently downloads
multi-gigabyte weights.
---
## Install
For the local MVP (text + PyMuPDF + Docling):
```bash
python -m pip install -e ".[pdf,yaml,docling,dev]"
```
Optional extras:
| Extra | Adds | Required for |
|---------------|--------------------------------------------------|-----------------------------------------------|
| `embedding` | `sentence-transformers`, `transformers` | `benchmarks.retriever.backend=embedding` |
| `gpu_repair` | `transformers` | `repair.execute_gpu_escalations=true` |
| `spaces` | mirrors `requirements.txt` for HF Spaces parity | running `app.py` locally as a Space simulant |
External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately;
configure each via `parsers.<name>.command`, `output_args`, and `extra_args`
in your YAML config.
Secrets:
```bash
cp .env.example .env
# Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos).
```
`.env` is gitignored. The CLI and `app.py` load it on startup; pre-set
environment variables (e.g. Space-side secrets) always win.
---
## Quick start
### Parse one document or a folder
```bash
python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample
python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4
python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml
```
Each parse writes a full artifact bundle. `parsed_document.json` is the
canonical record; `chunks.jsonl` is the retrieval-ready output;
`quality_report.json` carries every metric the verifier computed.
### Run a benchmark
```bash
# Custom corpus, no GT β runs every metric that doesn't need labels:
python -m zsgdp.cli benchmark --input ./docs --output ./bench
# Labelled datasets β adds layout F1 / table structure / formula CER:
python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni
python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay
```
### Compare parsers (ablation)
```bash
python -m zsgdp.cli benchmark-ablate \
--input ./docs --output ./bench/ablation \
--parser docling --parser pymupdf --parser text
```
Runs the benchmark once per parser plus a merged arm; emits
`ablation_comparison.csv`.
### Compare across datasets
```bash
python -m zsgdp.cli combine-benchmarks \
--input ./bench/omni --label omnidocbench \
--input ./bench/doclay --label doclaynet \
--output ./bench/cross
```
Emits `dataset_summary.csv` and `parser_matrix.csv` (parser Γ dataset).
### Before pushing to a Space β preflight
```bash
make preflight # unit + regression + space-check + parsers (~10s)
make preflight-full # ...plus an end-to-end benchmark smoke
```
A green preflight is the local signal that the branch is ready for the
Space. Pre-commit and pre-push hooks (see [CONTRIBUTING.md](CONTRIBUTING.md))
make this automatic on every `git push`.
### On the Space β smoke validation
Once deployed, exercise the deferred GPU/model paths:
```bash
make space-smoke # runs whichever of 5 smokes have their deps
python -m scripts.run_space_smoke --strict --output ./space_smoke.json
```
See [docs/space_smoke.md](docs/space_smoke.md) for the manual fallback
procedure (real PDF uploads, full Marker parses) and per-smoke
acceptance criteria.
---
## Opt-ins
### Embedding retriever
Default retriever is lexical TF-IDF (zero deps). To use a real embedder:
```yaml
# configs/myrun.yaml
benchmarks:
retriever:
backend: embedding
model_id: jinaai/jina-embeddings-v3 # or any sentence-transformers model
task: retrieval.passage
```
```bash
python -m pip install -e ".[embedding]"
python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml
```
The first call lazy-loads the model; subsequent calls reuse it in-process.
Set `HF_TOKEN` in `.env` for gated models.
### Live GPU repair
The repair controller plans GPU tasks for verification failures (invalid
tables, OCR coverage gaps, reading-order issues, missing figure captions).
By default these are dry-run only. To execute:
```yaml
# configs/myrun.yaml
repair:
gpu_escalation: true
execute_gpu_escalations: true # invokes the configured backend
gpu:
backend: transformers # or "vllm" for OpenAI-compat
models:
table:
model_id: Qwen/Qwen2.5-VL-3B-Instruct
```
Each executed task writes its output back into the merged document with a
`gpu_repair_task_id` provenance field.
---
## Outputs
Every parse writes:
- `parsed_document.json` β canonical record (carries `schema_version`).
- `document.md` β human-readable Markdown reconstruction.
- `elements.jsonl` / `tables.jsonl` / `figures.jsonl` / `chunks.jsonl` β JSONL streams.
- `chunking_plan.json` β strategy ladder + per-strategy metadata.
- `parser_metrics.json` β per-parser candidate-level stats.
- `quality_report.json` β every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable).
- `routing_report.json` β page β parser routing decisions.
- `profile.json` β document profiler output.
- `gpu_runtime.json` β detected GPU/device state at parse time.
- `gpu_tasks.jsonl` (when model-backed work is planned) and `gpu_task_report.json` (preflight validation).
- `conflict_report.json` (when multiple parsers ran).
- `artifact_manifest.json` with SHA-256 checksums and the parsed-document schema version.
- `assets/pages/*.png`, `assets/tables/*.png`, `assets/figures/*.png` β rendered PDF page and region crops.
Benchmark runs additionally write:
- `results.json` β full structured summary including aggregate means.
- `leaderboard.csv` and `per_parser_gt_leaderboard.csv` β parser leaderboards (without and with GT comparison).
- `per_parser_metrics.csv` β per-document, per-parser GT-comparison breakdown.
- `layout_runs.csv`, `table_structure_runs.csv`, `formula_runs.csv`, `retrieval_runs.csv`, `repair_runs.csv` β per-document detail per metric family.
- `parser_runs.csv`, `chunk_runs.csv`, `structure_runs.csv`, `chunk_quality.csv`, `throughput_runs.csv`, `ablations.json` β additional detail.
`benchmark-ablate` adds `ablation_comparison.csv`. `combine-benchmarks`
adds `dataset_summary.csv`, `parser_matrix.csv`, and
`cross_dataset_comparison.json`.
---
## Architecture map
| Module | Responsibility |
|-------------------------|-------------------------------------------------------------------------|
| `zsgdp/profiling/` | Cheap per-page features (scanned-score, table density, columns, etc.) |
| `zsgdp/routing/` | Deterministic page β parser-expert decisions with budget |
| `zsgdp/parsers/` | Adapters; one canonical schema regardless of source |
| `zsgdp/normalize/` | Convert each parser's output into the schema |
| `zsgdp/merge/` | Align candidates, dedupe, detect conflicts |
| `zsgdp/verify/` | Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success |
| `zsgdp/repair/` | Deterministic header/table fixes plus GPU escalation through `gpu/worker.py` |
| `zsgdp/chunking/` | Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs |
| `zsgdp/gpu/` | Task planning, batching, dry-run worker, transformers + vLLM clients |
| `zsgdp/benchmarks/` | Dataset loaders, metric runners, ablation, cross-dataset, retrieval |
| `zsgdp/cli.py` | All entry points |
| `app.py` | Gradio Space UI |
The full spec is in [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md). Β§10 (schema) and Β§17 (chunking ladder) are the most useful sections to skim before touching those modules.
---
## Production benchmark numbers
Spec Β§29 success criteria for reference:
- **MVP:** full agentic loop improves table QA by β₯20% over best single parser; agentic chunking improves citation accuracy by β₯10% over recursive baseline.
- **Production-style (HR / financial reports / etc.):** retrieval recall@5 β₯ 90%, citation accuracy β₯ 90%, table QA exactness β₯ 85%, manual review rate β€ 10%, parser blocking failure rate β€ 5%.
### Smoke validation β `arjun10g/zeroshotGPU` ZeroGPU Space, `2026-05-07`
Ran via `/gradio_api/call/run_smokes_in_space` against commit `de03f34`.
| Smoke | Status | Detail |
|---------------|--------|-----------------------------------------------------------------------------------|
| `lexical` | pass | `mean_quality_score=0.972`, `mean_retrieval_recall_at_1=1.0`, n=1 doc |
| `ablation` | pass | 3 arms (text / pymupdf / merged), comparison CSV emitted |
| `embedding` | pass | `mean_retrieval_recall_at_5=1.0`, `mean_retrieval_recall_at_1=1.0`, 17.06s on A10g |
| `gpu_repair` | pass | wiring verified: `dry_run_task_count=1`, `repair_iterations=1` |
| `marker` | skip | `marker_single` not installed (expected; opt-in via `requirements.txt`) |
### Benchmark β regression fixtures, `2026-05-07`
Ran via `/gradio_api/call/run_benchmark_in_space` against commit `abadf36`.
| Metric | Value | Notes |
|---------------------------------|--------|--------------------------------------------------------|
| `document_count` | 1 | `markdown_basic.input.md` only |
| `mean_quality_score` | 0.964 | passes the Β§29 quality bar trivially on this fixture |
| `mean_layout_f1` | 0.000 | no GT for `custom_folder` dataset |
| `mean_table_structure_score` | 0.000 | no GT |
| `mean_formula_cer` | 0.000 | no GT |
| `mean_retrieval_recall_at_1` | 0.000 | first-rank miss; truth at rank 3 (single-doc corpus) |
| `mean_retrieval_recall_at_5` | 1.000 | found within top-5 |
| `mean_retrieval_mrr` | 0.333 | implies truth rank β 3 |
| `mean_parser_disagreement_rate` | 0.000 | only one parser (`text`) ran on `.md` |
| `mean_repair_resolution_rate` | 1.000 | every blocking issue pre-repair was resolved |
| `mean_repair_regression_rate` | 0.000 | repair introduced no new blocking issues |
**Reading the numbers:** this fixture-only benchmark validates the metric pipeline end-to-end on production hardware. The `mean_layout_f1`/`table_structure_score`/`formula_cer` zeros are because `custom_folder` has no GT β those metrics activate when running against `omnidocbench` or `doclaynet`. To benchmark a real corpus, drop documents into a Space-side directory and call `run_benchmark_in_space` (or run `zsgdp benchmark --input ./corpus --output ./bench` from a Pro-tier Dev Mode terminal).
### Live GPU repair invocation β verified, deferred end-to-end
`Live GPU repair` pipeline mode loads `configs/live_gpu_repair.yaml`, runs the iterative verify/repair loop, detects `cuda` runtime, and dispatches GPU tasks when needed. On committed test fixtures, the deterministic markdown-table normalizer resolves invalid-table issues without needing the GPU β `repair_resolution_rate=1.0` after one iteration, zero GPU tasks executed.
Genuinely firing the live `Qwen2.5-VL-3B` invocation requires either a PDF with bbox-aware table extraction that pymupdf can't deterministically normalize, or a config that explicitly skips `repair.table_repair` so the GPU path is the only resolution route. Both are tractable follow-ups when a real malformed PDF is available; the wiring on production is confirmed up to that point.
---
## Deployment
Targeted: Hugging Face Spaces, hardware `l4x1`, GPU/model target
`zeroshotGPU`.
Pre-deploy gate:
1. `make preflight` (local).
2. `make preflight-full` (local with end-to-end benchmark smoke).
3. Duplicate the Space, set `HF_TOKEN` and any other secrets in **Variables and secrets**.
4. Push.
5. `make space-smoke` from the Space's JupyterLab terminal.
6. Inspect [docs/space_smoke.md](docs/space_smoke.md) Smoke 3 (live GPU repair) manually if the runner-level wiring smoke passed but you want full model-invocation validation.
7. Run `python -m zsgdp.cli benchmark` against your representative corpus and update the table above.
The Space defaults to `configs/docling.yaml` (Docling + PyMuPDF
co-enabled so the parser disagreement rate has signal). Override via
`ZSGDP_CONFIG_PATH` in Space variables for custom configs.
---
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, hooks, test layout,
fixture format, parser/metric/schema-bump procedures, and the PR checklist.
For changes touching the on-disk schema, bump `zsgdp.schema.SCHEMA_VERSION`
and add an entry under `### Schema` in [CHANGELOG.md](CHANGELOG.md). The
artifact manifest surfaces the version under
`parsed_document_schema_version` so downstream consumers can gate.
|