# Hugging Face Space smoke-test checklist

This is the deferred deployment-readiness work that can only be exercised on
real GPU hardware against real models / external CLIs. Run each smoke once
against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry
gives the exact env vars / config flips, the command to trigger, and the
structured log lines you should expect.

All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
`ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
the environment, so on a normal Space you do not need to set them yourself.
The HF Spaces logs page will surface the JSON records on stderr.

---

## Pre-flight

1. Duplicate the Space, give it `l4x1` hardware.
2. Make sure these are set in **Space settings → Variables and secrets**:
   - `ZSGDP_LOG_LEVEL=INFO`
   - `ZSGDP_LOG_JSON=1`
   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
3. In the Space's `requirements.txt`, uncomment the dependency block matching
   the smoke you are running. Do **one smoke per Space deploy**; combining
   them risks an OOM or a slow cold start on the L4.
4. Push and wait for the Space to build. First-build cold-start with a model
   download is ~5-10 minutes; subsequent restarts are seconds.

After deploy, watch the **Logs** tab for the `parse_start` event. If you do
not see structured JSON lines there, the logging config is not active;
double-check `ZSGDP_LOG_JSON=1` in the Space variables.
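
If you copy the Logs tab output into a file, a short script can confirm that the structured records are present. The following is a minimal sketch, not part of the project: it assumes each structured record is a single-line JSON object and that the event name lives under an `event` key (that key name is an assumption; adjust `key` to match the real records).

```python
import json
from typing import Iterable, Iterator


def iter_log_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text output from other loggers
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not
        if isinstance(record, dict):
            yield record


def has_event(lines: Iterable[str], name: str, key: str = "event") -> bool:
    # `key` is an assumption: change it to the field that carries the event name.
    return any(rec.get(key) == name for rec in iter_log_records(lines))


# Made-up sample lines, standing in for text copied from the Logs tab.
sample = [
    "Running on local URL:  http://0.0.0.0:7860",
    '{"event": "parse_start", "doc_id": "abc123", "file_type": "md", "device": "cuda"}',
]
print(has_event(sample, "parse_start"))  # True
```

If `has_event` returns `False` on real logs, check the key name before concluding that JSON logging is off.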

## Automated runner

Each smoke below has an automated counterpart in
`scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
shell with the project installed):

```bash
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json

# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation

# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
```

The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
elapsed seconds and a `detail` block with the metrics it gathered. The
manual procedure below is the fallback when you want to inspect the UI
directly or test something the runner doesn't cover (e.g. uploading a
specific real PDF rather than a synthetic fixture).
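
The difference between the default mode and `--strict` can be sketched as follows. This is an illustration of the semantics described above, not the runner's code: the `statuses` mapping (smoke name to status string) is a hypothetical shape, and the real report layout may differ.

```python
from collections import Counter

# Statuses that always count as a failed run.
FAILING = {"fail", "error"}


def summarise(statuses: dict, strict: bool = False):
    """Return (counts per status, overall_ok) for a name -> status mapping."""
    counts = Counter(statuses.values())
    # In strict mode a skipped smoke is treated like a failure.
    failing = FAILING | ({"skip"} if strict else set())
    overall_ok = not any(status in failing for status in statuses.values())
    return counts, overall_ok


# Hypothetical outcome: the embedding deps were not installed, so it was skipped.
statuses = {"lexical": "pass", "embedding": "skip", "ablation": "pass"}
print(summarise(statuses))               # skip is tolerated
print(summarise(statuses, strict=True))  # skip now fails the run
```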

---

## Smoke 1: Lexical retriever benchmark (model-free)

Confirms the Space's parsing + benchmark plumbing works end-to-end before
adding any model dependency.

**Setup:**
- Default `requirements.txt` (no uncommenting needed).
- Default config (no flips).

**Trigger:** upload a small markdown file via the Gradio UI.

**Expected log lines (in order):**
- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
- One `parser_candidate` per parser that ran (typically `text`, possibly
  `pymupdf` and `docling` if the file was a PDF).
- Possibly one or more `repair_iteration` records if quality < threshold.
- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.

**Pass criteria:**
- All log lines appear with `doc_id` populated.
- `parse_end.quality_score >= 0.85` for a clean markdown doc.
- No `parser_failed` or `gpu_task_blocked` records.
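
The ordering and pass criteria above can be checked mechanically once the log records are parsed into dictionaries. A minimal sketch, assuming the event name is stored under an `event` key and that the fields named in this section sit at the top level of each record (both are assumptions, as is the `parser` field in the sample):

```python
def check_lexical_smoke(records, min_quality=0.85, key="event"):
    """Return a list of problems; an empty list means the smoke passed."""
    problems = []
    events = [rec.get(key) for rec in records]

    if not events or events[0] != "parse_start":
        problems.append("first record is not parse_start")
    if not events or events[-1] != "parse_end":
        problems.append("last record is not parse_end")
    if "parser_candidate" not in events:
        problems.append("no parser_candidate record")

    for forbidden in ("parser_failed", "gpu_task_blocked"):
        if forbidden in events:
            problems.append(f"unexpected {forbidden} record")

    if any(not rec.get("doc_id") for rec in records):
        problems.append("a record is missing doc_id")

    for rec in records:
        if rec.get(key) == "parse_end" and rec.get("quality_score", 0.0) < min_quality:
            problems.append(f"quality_score below {min_quality}")
    return problems


# Made-up records for a clean markdown upload.
records = [
    {"event": "parse_start", "doc_id": "d1", "file_type": "md", "device": "cuda"},
    {"event": "parser_candidate", "doc_id": "d1", "parser": "text"},
    {"event": "parse_end", "doc_id": "d1", "quality_score": 0.93,
     "repair_iterations": 0, "chunk_count": 12},
]
print(check_lexical_smoke(records))  # []
```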

---

## Smoke 2: Embedding retriever (jina-embeddings-v3)

Confirms the `sentence-transformers` lazy-load path and that jina-v3
specifically runs on the L4 with `trust_remote_code=True`.

**Setup:**
- In `requirements.txt`, uncomment `transformers` and `sentence-transformers`
  lines.
- Add `configs/space_embedding.yaml` to the repo with:

  ```yaml
  benchmarks:
    retriever:
      backend: embedding
      model_id: jinaai/jina-embeddings-v3
      task: retrieval.passage
  ```

- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
  or set `ZSGDP_CONFIG_PATH` to the same path in the Space variables.

**Trigger:** the benchmark CLI is not reachable from the Gradio UI today, so
an upload alone does not run the benchmark. For the embedding-retriever
smoke, run `zsgdp benchmark --input ./fixtures --output ./out` from a Space
**JupyterLab** session against a small input directory.

**Expected log lines:**
- First call: a 30–90s pause while jina-v3 weights download (no log lines
  appear during this, because torch logs go to its own logger). Then `parse_start`.
- After the first parse, subsequent calls are fast (model is in memory).

**Pass criteria:**
- Benchmark completes without an exception.
- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
  corpus.
- No `gpu_task_blocked` records (those are repair-related, not retrieval).
- The `parse_end` record's `device` field reads `cuda`.

**Failure modes to watch:**
- `RuntimeError: EmbeddingRetriever requires sentence-transformers` →
  package not in `requirements.txt`.
- CUDA OOM → switch to a smaller embedding model
  (`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
  wiring before retrying jina-v3.
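
To read the recall threshold in the pass criteria, it helps to see what recall at 5 measures. The sketch below uses the standard definition (the fraction of a query's relevant chunks that appear in its top 5 results, averaged over queries). Whether `mean_retrieval_recall_at_5` is computed exactly this way is an assumption, and the chunk ids are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant ids that appear in the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(results, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in results]
    return sum(scores) / len(scores) if scores else 0.0


# Three hypothetical queries: a hit, a miss (relevant chunk ranked 6th), a hit.
results = [
    (["c1", "c7", "c3"], ["c1"]),
    (["c9", "c2", "c4", "c5", "c6", "c8"], ["c8"]),
    (["c2", "c5"], ["c2", "c5"]),
]
score = mean_recall_at_k(results)
print(f"{score:.3f}", "pass" if score >= 0.7 else "fail")
```

With only a handful of queries, a single miss can push the mean below 0.7, which is why the criterion asks for a corpus of clearly distinct texts.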

---

## Smoke 3: Live GPU repair on a malformed table

Confirms the repair loop's GPU escalation path actually invokes the
configured VLM and that the result is applied to the merged document.

**Setup:**
- In `requirements.txt`, uncomment `transformers` (sentence-transformers
  not needed for this smoke).
- Add `configs/space_gpu_repair.yaml`:

  ```yaml
  parsers:
    docling:
      enabled: true
    pymupdf:
      enabled: true
  repair:
    enabled: true
    gpu_escalation: true
    execute_gpu_escalations: true   # the bit that flips the live path on
  gpu:
    backend: transformers
    models:
      table:
        model_id: Qwen/Qwen2.5-VL-3B-Instruct
        task: table-repair
        device: auto
        dtype: bfloat16
  ```

- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.

**Trigger:** upload a PDF that contains a table the parsers will likely
mangle. A two-column financial statement page works well; if you don't
have one handy, take a Wikipedia article PDF that has a comparison table.

**Expected log lines (in order):**
- `parse_start`.
- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
  `gpu_dry_run=false`.
- One `gpu_task_executed` record per GPU task. `status` should be
  `executed` and `elapsed_seconds` 1-10s for a 3B-param VLM on L4.
- A second `repair_iteration` with `iteration=2` only if iteration 1
  changed something and quality is still below threshold; otherwise the
  loop terminates.
- `parse_end` with `repair_iterations >= 1`.

**Pass criteria:**
- At least one `gpu_task_executed` with `status=executed`.
- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
- No `gpu_task_blocked` records (would mean missing image_path or doc_id).

**Failure modes to watch:**
- All `gpu_task_executed` records show `status=execution_failed` →
  inspect `output.error` field; common causes are missing image_path
  (the PDF doesn't render page crops because `pdf.crop_tables=true` isn't
  set) or a CUDA OOM.
- No `repair_iteration` records → the verifier didn't flag any
  blocking issues; pick a different input PDF.
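
Both the log-side and artifact-side checks for this smoke can be scripted. A minimal sketch, assuming the log records carry the event name under an `event` key and that the loaded `parsed_document.json` exposes a top-level `tables` list whose items have a `provenance` mapping (both layouts are assumptions inferred from the field paths above):

```python
def failed_gpu_tasks(records, key="event"):
    """Return gpu_task_executed records whose status is not 'executed'."""
    return [
        rec for rec in records
        if rec.get(key) == "gpu_task_executed" and rec.get("status") != "executed"
    ]


def repaired_table_indexes(parsed):
    """Return indexes of tables whose provenance has a gpu_repair_task_id."""
    indexes = []
    for i, table in enumerate(parsed.get("tables", [])):
        provenance = table.get("provenance") or {}
        if provenance.get("gpu_repair_task_id"):
            indexes.append(i)
    return indexes


# Made-up data; in practice load the document with
# json.load(open("parsed_document.json")).
records = [
    {"event": "gpu_task_executed", "status": "executed", "elapsed_seconds": 4.2},
    {"event": "gpu_task_executed", "status": "execution_failed"},
]
parsed = {
    "tables": [
        {"provenance": {"gpu_repair_task_id": "task-1"}},
        {"provenance": {}},
    ]
}
print(len(failed_gpu_tasks(records)), repaired_table_indexes(parsed))
```

The smoke passes only if `repaired_table_indexes` is non-empty and at least one task has `status=executed`.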

---

## Smoke 4: Per-parser ablation across docling + pymupdf

Confirms the ablation runner produces a comparison CSV and that each arm's
artifacts are isolated. It has no GPU dependency and runs on default Space hardware.

**Setup:** default config, no requirements.txt changes.

**Trigger:** Space JupyterLab terminal:

```bash
zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf
```

**Expected log lines:** one parse cycle per arm (`parse_start` through
`parse_end`), three arms total (docling-only, pymupdf-only, merged).

**Pass criteria:**
- `out/ablation/ablation_comparison.csv` has 3 rows.
- Each arm's `mean_quality_score` is non-zero.
- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
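
The three criteria above translate directly into a check over the comparison CSV. A minimal sketch: the `mean_quality_score` column comes from the criteria, while the `arm` column name and the `merged` label are assumptions about the CSV layout.

```python
import csv
import io


def check_ablation(csv_text, arm_col="arm", merged_label="merged"):
    """Return a list of problems; an empty list means all criteria hold."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != 3:
        problems.append(f"expected 3 rows, found {len(rows)}")

    scores = {row[arm_col]: float(row["mean_quality_score"]) for row in rows}
    for arm, score in scores.items():
        if score == 0.0:
            problems.append(f"{arm} has a zero mean_quality_score")

    if merged_label not in scores:
        problems.append("no merged arm found")
    else:
        others = [s for arm, s in scores.items() if arm != merged_label]
        if others and scores[merged_label] < max(others):
            problems.append("merged arm scores below the best single parser")
    return problems


# Made-up CSV content; in practice read out/ablation/ablation_comparison.csv.
sample_csv = (
    "arm,mean_quality_score\n"
    "docling,0.88\n"
    "pymupdf,0.81\n"
    "merged,0.91\n"
)
print(check_ablation(sample_csv))  # []
```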

---

## Smoke 5: External parser CLI (Marker)

This is the riskiest of the four external adapters because Marker's argv
schema has changed several times. Give it its own Space deploy; do not
bundle it with other smokes.

**Setup:**
- Uncomment `marker-pdf` in `requirements.txt`.
- Add `configs/space_marker.yaml`:

  ```yaml
  parsers:
    text:
      enabled: false
    pymupdf:
      enabled: false
    marker:
      enabled: true
      timeout_seconds: 300
      output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
      extra_args: []
  ```

- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.

**Trigger:** upload a small PDF (1–3 pages) via the Gradio UI.

**Expected log lines:**
- `parse_start`.
- `parser_candidate` for `marker` with non-zero `element_count`.
- `parse_end` with `candidate_parsers=["marker"]`.

**Pass criteria:**
- No `parser_failed` record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).
- If `parser_failed` fires, look at `extra.error`; the most common cause is
  argv schema drift, so tweak `output_args` in the config and retry.
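
When adjusting `output_args`, it helps to preview the arguments the template would produce. The sketch below assumes the `{output_dir}` placeholder is filled with `str.format`-style substitution; that is a guess at how the adapter expands the template, not a description of its code, and `expand_args` is a hypothetical helper.

```python
import shlex


def expand_args(template, **values):
    """Substitute {placeholders} in each argument of an argv template."""
    return [part.format(**values) for part in template]


# Template copied from the config above; the output directory is made up.
output_args = ["--output_dir", "{output_dir}", "--output_format", "markdown"]
expanded = expand_args(output_args, output_dir="/tmp/marker out")
print(expanded)
# shlex.join shows how the arguments would look on a shell command line.
print(shlex.join(expanded))
```

Compare the expanded flags against the installed Marker version's `--help` output to spot a renamed or removed flag.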

---

## What "deployment ready" means after this checklist

If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely
deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
and 5 are nice-to-have: the per-parser ablation works locally too, and
external parsers stay flagged "experimental" until you actively need them.

Open the `parsed_document.json` from each smoke, copy the `quality_score`,
`mean_layout_f1` (where applicable), and any §29-relevant metric into
`README.md` under a new "Production benchmark numbers" section. That
publishes evidence that the success criteria are met against real data.
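
One way to keep the README section consistent is to render the collected numbers as a Markdown table. This is a minimal sketch with a hypothetical helper; the values shown are placeholders to illustrate the format, not measurements.

```python
def metrics_table(rows, columns=("quality_score", "mean_layout_f1")):
    """Render {smoke_name: {metric: value}} as a Markdown table."""
    header = "| smoke | " + " | ".join(columns) + " |"
    divider = "|---" * (len(columns) + 1) + "|"
    lines = [header, divider]
    for smoke, metrics in rows.items():
        # Metrics that do not apply to a smoke are shown as n/a.
        cells = [f"{metrics[c]:.3f}" if c in metrics else "n/a" for c in columns]
        lines.append(f"| {smoke} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder value only; replace it with the numbers from your own runs.
print(metrics_table({"lexical": {"quality_score": 0.9}}))
```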