File size: 14,904 Bytes
db06ffa
 
 
 
 
 
6a4c21f
db06ffa
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
83059c2
db06ffa
 
 
 
83059c2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
db06ffa
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
---
title: zeroshotGPU
sdk: gradio
app_file: app.py
python_version: 3.11
suggested_hardware: l4x1
short_description: Agentic zero-shot document parser with GT metrics + repair.
---

# Zero-Shot GPU Document Parser

A self-hosted parsing control plane that profiles documents, routes pages to
parser experts, normalizes outputs, verifies quality with GT-comparison
metrics, repairs weak regions through a bounded verify/repair loop (with
optional GPU escalation), and emits auditable parsed-document artifacts plus
strategy-aware chunks. Implements the project described in
[`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md).

The codebase is intentionally dependency-light by default. Text and Markdown
work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR,
PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair
(Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated
behind explicit config flags so a fresh clone never silently downloads
multi-gigabyte weights.

---

## Install

For the local MVP (text + PyMuPDF + Docling):

```bash
python -m pip install -e ".[pdf,yaml,docling,dev]"
```

Optional extras:

| Extra         | Adds                                            | Required for                                  |
|---------------|--------------------------------------------------|-----------------------------------------------|
| `embedding`   | `sentence-transformers`, `transformers`         | `benchmarks.retriever.backend=embedding`      |
| `gpu_repair`  | `transformers`                                   | `repair.execute_gpu_escalations=true`         |
| `spaces`      | mirrors `requirements.txt` for HF Spaces parity  | running `app.py` locally as a Space simulant  |

External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately;
configure each via `parsers.<name>.command`, `output_args`, and `extra_args`
in your YAML config.

Secrets:

```bash
cp .env.example .env
# Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos).
```

`.env` is gitignored. The CLI and `app.py` load it on startup; pre-set
environment variables (e.g. Space-side secrets) always win.

---

## Quick start

### Parse one document or a folder

```bash
python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample
python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4
python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml
```

Each parse writes a full artifact bundle. `parsed_document.json` is the
canonical record; `chunks.jsonl` is the retrieval-ready output;
`quality_report.json` carries every metric the verifier computed.

### Run a benchmark

```bash
# Custom corpus, no GT β€” runs every metric that doesn't need labels:
python -m zsgdp.cli benchmark --input ./docs --output ./bench

# Labelled datasets β€” adds layout F1 / table structure / formula CER:
python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni
python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay
```

### Compare parsers (ablation)

```bash
python -m zsgdp.cli benchmark-ablate \
  --input ./docs --output ./bench/ablation \
  --parser docling --parser pymupdf --parser text
```

Runs the benchmark once per parser plus a merged arm; emits
`ablation_comparison.csv`.

### Compare across datasets

```bash
python -m zsgdp.cli combine-benchmarks \
  --input ./bench/omni --label omnidocbench \
  --input ./bench/doclay --label doclaynet \
  --output ./bench/cross
```

Emits `dataset_summary.csv` and `parser_matrix.csv` (parser Γ— dataset).

### Before pushing to a Space β€” preflight

```bash
make preflight              # unit + regression + space-check + parsers (~10s)
make preflight-full         # ...plus an end-to-end benchmark smoke
```

A green preflight is the local signal that the branch is ready for the
Space. Pre-commit and pre-push hooks (see [CONTRIBUTING.md](CONTRIBUTING.md))
make this automatic on every `git push`.

### On the Space β€” smoke validation

Once deployed, exercise the deferred GPU/model paths:

```bash
make space-smoke            # runs whichever of 5 smokes have their deps
python -m scripts.run_space_smoke --strict --output ./space_smoke.json
```

See [docs/space_smoke.md](docs/space_smoke.md) for the manual fallback
procedure (real PDF uploads, full Marker parses) and per-smoke
acceptance criteria.

---

## Opt-ins

### Embedding retriever

Default retriever is lexical TF-IDF (zero deps). To use a real embedder:

```yaml
# configs/myrun.yaml
benchmarks:
  retriever:
    backend: embedding
    model_id: jinaai/jina-embeddings-v3   # or any sentence-transformers model
    task: retrieval.passage
```

```bash
python -m pip install -e ".[embedding]"
python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml
```

The first call lazy-loads the model; subsequent calls reuse it in-process.
Set `HF_TOKEN` in `.env` for gated models.

### Live GPU repair

The repair controller plans GPU tasks for verification failures (invalid
tables, OCR coverage gaps, reading-order issues, missing figure captions).
By default these are dry-run only. To execute:

```yaml
# configs/myrun.yaml
repair:
  gpu_escalation: true
  execute_gpu_escalations: true     # invokes the configured backend
gpu:
  backend: transformers              # or "vllm" for OpenAI-compat
  models:
    table:
      model_id: Qwen/Qwen2.5-VL-3B-Instruct
```

Each executed task writes its output back into the merged document with a
`gpu_repair_task_id` provenance field.

---

## Outputs

Every parse writes:

- `parsed_document.json` β€” canonical record (carries `schema_version`).
- `document.md` β€” human-readable Markdown reconstruction.
- `elements.jsonl` / `tables.jsonl` / `figures.jsonl` / `chunks.jsonl` β€” JSONL streams.
- `chunking_plan.json` β€” strategy ladder + per-strategy metadata.
- `parser_metrics.json` β€” per-parser candidate-level stats.
- `quality_report.json` β€” every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable).
- `routing_report.json` β€” page β†’ parser routing decisions.
- `profile.json` β€” document profiler output.
- `gpu_runtime.json` β€” detected GPU/device state at parse time.
- `gpu_tasks.jsonl` (when model-backed work is planned) and `gpu_task_report.json` (preflight validation).
- `conflict_report.json` (when multiple parsers ran).
- `artifact_manifest.json` with SHA-256 checksums and the parsed-document schema version.
- `assets/pages/*.png`, `assets/tables/*.png`, `assets/figures/*.png` β€” rendered PDF page and region crops.

Benchmark runs additionally write:

- `results.json` β€” full structured summary including aggregate means.
- `leaderboard.csv` and `per_parser_gt_leaderboard.csv` β€” parser leaderboards (without and with GT comparison).
- `per_parser_metrics.csv` β€” per-document, per-parser GT-comparison breakdown.
- `layout_runs.csv`, `table_structure_runs.csv`, `formula_runs.csv`, `retrieval_runs.csv`, `repair_runs.csv` β€” per-document detail per metric family.
- `parser_runs.csv`, `chunk_runs.csv`, `structure_runs.csv`, `chunk_quality.csv`, `throughput_runs.csv`, `ablations.json` β€” additional detail.

`benchmark-ablate` adds `ablation_comparison.csv`. `combine-benchmarks`
adds `dataset_summary.csv`, `parser_matrix.csv`, and
`cross_dataset_comparison.json`.

---

## Architecture map

| Module                  | Responsibility                                                          |
|-------------------------|-------------------------------------------------------------------------|
| `zsgdp/profiling/`      | Cheap per-page features (scanned-score, table density, columns, etc.)   |
| `zsgdp/routing/`        | Deterministic page β†’ parser-expert decisions with budget                |
| `zsgdp/parsers/`        | Adapters; one canonical schema regardless of source                     |
| `zsgdp/normalize/`      | Convert each parser's output into the schema                            |
| `zsgdp/merge/`          | Align candidates, dedupe, detect conflicts                              |
| `zsgdp/verify/`         | Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success |
| `zsgdp/repair/`         | Deterministic header/table fixes plus GPU escalation through `gpu/worker.py` |
| `zsgdp/chunking/`       | Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs |
| `zsgdp/gpu/`            | Task planning, batching, dry-run worker, transformers + vLLM clients    |
| `zsgdp/benchmarks/`     | Dataset loaders, metric runners, ablation, cross-dataset, retrieval     |
| `zsgdp/cli.py`          | All entry points                                                        |
| `app.py`                | Gradio Space UI                                                         |

The full spec is in [`zero_shot_gpu_document_parser_project_spec.md`](zero_shot_gpu_document_parser_project_spec.md). Β§10 (schema) and Β§17 (chunking ladder) are the most useful sections to skim before touching those modules.

---

## Production benchmark numbers

Spec Β§29 success criteria for reference:

- **MVP:** full agentic loop improves table QA by β‰₯20% over best single parser; agentic chunking improves citation accuracy by β‰₯10% over recursive baseline.
- **Production-style (HR / financial reports / etc.):** retrieval recall@5 β‰₯ 90%, citation accuracy β‰₯ 90%, table QA exactness β‰₯ 85%, manual review rate ≀ 10%, parser blocking failure rate ≀ 5%.

### Smoke validation β€” `arjun10g/zeroshotGPU` ZeroGPU Space, `2026-05-07`

Ran via `/gradio_api/call/run_smokes_in_space` against commit `de03f34`.

| Smoke         | Status | Detail                                                                            |
|---------------|--------|-----------------------------------------------------------------------------------|
| `lexical`     | pass   | `mean_quality_score=0.972`, `mean_retrieval_recall_at_1=1.0`, n=1 doc              |
| `ablation`    | pass   | 3 arms (text / pymupdf / merged), comparison CSV emitted                           |
| `embedding`   | pass   | `mean_retrieval_recall_at_5=1.0`, `mean_retrieval_recall_at_1=1.0`, 17.06s on A10g |
| `gpu_repair`  | pass   | wiring verified: `dry_run_task_count=1`, `repair_iterations=1`                     |
| `marker`      | skip   | `marker_single` not installed (expected; opt-in via `requirements.txt`)            |

### Benchmark β€” regression fixtures, `2026-05-07`

Ran via `/gradio_api/call/run_benchmark_in_space` against commit `abadf36`.

| Metric                          | Value  | Notes                                                  |
|---------------------------------|--------|--------------------------------------------------------|
| `document_count`                | 1      | `markdown_basic.input.md` only                         |
| `mean_quality_score`            | 0.964  | passes the Β§29 quality bar trivially on this fixture   |
| `mean_layout_f1`                | 0.000  | no GT for `custom_folder` dataset                      |
| `mean_table_structure_score`    | 0.000  | no GT                                                  |
| `mean_formula_cer`              | 0.000  | no GT                                                  |
| `mean_retrieval_recall_at_1`    | 0.000  | first-rank miss; truth at rank 3 (single-doc corpus)   |
| `mean_retrieval_recall_at_5`    | 1.000  | found within top-5                                     |
| `mean_retrieval_mrr`            | 0.333  | implies truth rank β‰ˆ 3                                 |
| `mean_parser_disagreement_rate` | 0.000  | only one parser (`text`) ran on `.md`                  |
| `mean_repair_resolution_rate`   | 1.000  | every blocking issue pre-repair was resolved           |
| `mean_repair_regression_rate`   | 0.000  | repair introduced no new blocking issues               |

**Reading the numbers:** this fixture-only benchmark validates the metric pipeline end-to-end on production hardware. The `mean_layout_f1`/`table_structure_score`/`formula_cer` zeros are because `custom_folder` has no GT β€” those metrics activate when running against `omnidocbench` or `doclaynet`. To benchmark a real corpus, drop documents into a Space-side directory and call `run_benchmark_in_space` (or run `zsgdp benchmark --input ./corpus --output ./bench` from a Pro-tier Dev Mode terminal).

### Live GPU repair invocation β€” verified, deferred end-to-end

`Live GPU repair` pipeline mode loads `configs/live_gpu_repair.yaml`, runs the iterative verify/repair loop, detects `cuda` runtime, and dispatches GPU tasks when needed. On committed test fixtures, the deterministic markdown-table normalizer resolves invalid-table issues without needing the GPU β€” `repair_resolution_rate=1.0` after one iteration, zero GPU tasks executed.

Genuinely firing the live `Qwen2.5-VL-3B` invocation requires either a PDF with bbox-aware table extraction that pymupdf can't deterministically normalize, or a config that explicitly skips `repair.table_repair` so the GPU path is the only resolution route. Both are tractable follow-ups when a real malformed PDF is available; the wiring on production is confirmed up to that point.

---

## Deployment

Targeted: Hugging Face Spaces, hardware `l4x1`, GPU/model target
`zeroshotGPU`.

Pre-deploy gate:

1. `make preflight` (local).
2. `make preflight-full` (local with end-to-end benchmark smoke).
3. Duplicate the Space, set `HF_TOKEN` and any other secrets in **Variables and secrets**.
4. Push.
5. `make space-smoke` from the Space's JupyterLab terminal.
6. Inspect [docs/space_smoke.md](docs/space_smoke.md) Smoke 3 (live GPU repair) manually if the runner-level wiring smoke passed but you want full model-invocation validation.
7. Run `python -m zsgdp.cli benchmark` against your representative corpus and update the table above.

The Space defaults to `configs/docling.yaml` (Docling + PyMuPDF
co-enabled so the parser disagreement rate has signal). Override via
`ZSGDP_CONFIG_PATH` in Space variables for custom configs.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, hooks, test layout,
fixture format, parser/metric/schema-bump procedures, and the PR checklist.

For changes touching the on-disk schema, bump `zsgdp.schema.SCHEMA_VERSION`
and add an entry under `### Schema` in [CHANGELOG.md](CHANGELOG.md). The
artifact manifest surfaces the version under
`parsed_document_schema_version` so downstream consumers can gate.