# Hugging Face Space smoke-test checklist
This is the deferred deployment-readiness work that can only be exercised on
real GPU hardware against real models / external CLIs. Run each smoke once
against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry
gives the exact env vars / config flips, the command to trigger, and the
structured log lines you should expect.
All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
`ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
the environment, so on a normal Space you do not need to set them yourself.
The HF Spaces logs page will surface the JSON records on stderr.
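
The automatic defaulting described above could look roughly like the sketch below. It is an illustration only: the helper name `apply_space_logging_defaults` is hypothetical, and the real `app.py` may set the variables differently.

```python
import os


def apply_space_logging_defaults(env=None):
    """Default the logging env vars when running on a Space.

    Hypothetical helper; the real app.py may do this differently.
    """
    env = os.environ if env is None else env
    if "SPACE_ID" in env:
        # setdefault keeps any value set explicitly in the Space settings.
        env.setdefault("ZSGDP_LOG_LEVEL", "INFO")
        env.setdefault("ZSGDP_LOG_JSON", "1")
    return env
```

Using `setdefault` rather than plain assignment means a value you set by hand (for example `ZSGDP_LOG_LEVEL=DEBUG`) is not overwritten.
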
---
## Pre-flight
1. Duplicate the Space, give it `l4x1` hardware.
2. Make sure these are set in **Space settings → Variables and secrets**:
   - `ZSGDP_LOG_LEVEL=INFO`
   - `ZSGDP_LOG_JSON=1`
   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
3. In the Space's `requirements.txt`, uncomment the dependency block matching
   the smoke you are running. Do **one smoke per Space deploy**; combining
   them risks an OOM or slow cold-start on the L4.
4. Push and wait for the Space to build. First-build cold-start with a model
   download is ~5-10 minutes; subsequent restarts are seconds.
After deploy, watch the **Logs** tab for the `parse_start` event. If you do
not see structured JSON lines there, the logging config is not active;
double-check `ZSGDP_LOG_JSON=1` in the Space variables.
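
If you copy the log text out of the **Logs** tab, a small filter can confirm whether a given event is present. This is a minimal sketch that assumes each structured record is a single-line JSON object whose event name sits under an `event` key; that key name is an assumption, so pass a different `key` if the records use another field.

```python
import json


def iter_events(lines):
    """Yield parsed JSON objects, skipping lines that are not JSON records."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text log line, not a structured record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(record, dict):
            yield record


def has_event(lines, name, key="event"):
    """Return True if any record's event field (assumed key) equals name."""
    return any(record.get(key) == name for record in iter_events(lines))
```

For example, `has_event(open("space.log"), "parse_start")` returns `False` when the log contains only unstructured lines, which points at the logging config rather than the parser.
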
## Automated runner
Each smoke below has an automated counterpart in
`scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
shell with the project installed):
```bash
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json
# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation
# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
```
The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
elapsed seconds and a `detail` block with the metrics it gathered. The
manual procedure below is the fallback when you want to inspect the UI
directly or test something the runner doesn't cover (e.g. uploading a
specific real PDF rather than a synthetic fixture).
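
The effect of `--strict` on the four statuses can be pictured with the sketch below. It assumes that `fail` and `error` always count as failures and that the runner signals failure with a non-zero exit code; neither detail is confirmed here, and `exit_code` is a hypothetical name.

```python
def exit_code(statuses, strict=False):
    """Return 0 when the run counts as successful, else 1 (assumed convention)."""
    failing = {"fail", "error"}
    if strict:
        # In strict mode a skipped smoke is treated as a failure.
        failing.add("skip")
    return 1 if any(status in failing for status in statuses) else 0
```

Under these assumptions, a run reporting `pass` and `skip` succeeds by default but fails with `--strict`, which is why strict mode only makes sense once the dependencies for the selected smoke are installed.
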
---
## Smoke 1: Lexical retriever benchmark (model-free)
Confirms the Space's parsing + benchmark plumbing works end-to-end before
adding any model dependency.
**Setup:**
- Default `requirements.txt` (no uncommenting needed).
- Default config (no flips).
**Trigger:** upload a small markdown file via the Gradio UI.
**Expected log lines (in order):**
- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
- One `parser_candidate` per parser that ran (typically `text`, possibly
  `pymupdf` and `docling` if the file was a PDF).
- Possibly one or more `repair_iteration` records if quality < threshold.
- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.
**Pass criteria:**
- All log lines appear with `doc_id` populated.
- `parse_end.quality_score >= 0.85` for a clean markdown doc.
- No `parser_failed` or `gpu_task_blocked` records.
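
The pass criteria above can be checked mechanically once the log records are parsed into dictionaries. The sketch below assumes the event name is stored under an `event` key (an assumption) and uses the field names listed above; `check_smoke1` is a hypothetical helper, not part of the project.

```python
def check_smoke1(records, min_quality=0.85, key="event"):
    """Return a list of problems; an empty list means the smoke passed."""
    problems = []
    for record in records:
        if not record.get("doc_id"):
            problems.append(f"{record.get(key)}: missing doc_id")
        if record.get(key) in {"parser_failed", "gpu_task_blocked"}:
            problems.append(f"unexpected {record[key]} record")
    ends = [record for record in records if record.get(key) == "parse_end"]
    if not ends:
        problems.append("no parse_end record")
    elif ends[-1].get("quality_score", 0.0) < min_quality:
        problems.append("quality_score below threshold")
    return problems
```
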
---
## Smoke 2: Embedding retriever (jina-embeddings-v3)
Confirms `sentence-transformers` lazy-load path and that jina-v3 specifically
runs on the L4 with `trust_remote_code=True`.
**Setup:**
- In `requirements.txt`, uncomment the `transformers` and
  `sentence-transformers` lines.
- Add `configs/space_embedding.yaml` to the repo with:
  ```yaml
  benchmarks:
    retriever:
      backend: embedding
      model_id: jinaai/jina-embeddings-v3
      task: retrieval.passage
  ```
- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
  or set `ZSGDP_CONFIG_PATH` in the Space variables.
**Trigger:** the benchmark CLI is not reachable from the Gradio UI today, so
uploading a markdown file or PDF there is not enough for this smoke. Instead,
run `zsgdp benchmark --input ./fixtures --output ./out` from a Space
**JupyterLab** session against a small input dir.
**Expected log lines:**
- First call: a 30-90s pause while jina-v3 weights download (no log lines
  during this; torch logs go to its own logger). Then `parse_start`.
- After the first parse, subsequent calls are fast (model is in memory).
**Pass criteria:**
- Benchmark completes without an exception.
- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
  corpus.
- No `gpu_task_blocked` records (those are repair-related, not retrieval).
- The `parse_end` record's `device` field reads `cuda`.
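
For orientation on the recall threshold, the sketch below uses the common definition of recall@k: the fraction of a query's relevant items that appear in the top k results, averaged over queries. The project's `mean_retrieval_recall_at_5` may be computed differently, so treat this as an assumption rather than the benchmark's actual code.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant ids found in the top-k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(queries, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

With one relevant chunk per query, a mean of 0.7 means the right chunk landed in the top five for 70% of queries.
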
**Failure modes to watch:**
- `RuntimeError: EmbeddingRetriever requires sentence-transformers`:
  the package is not in `requirements.txt`.
- CUDA OOM: switch to a smaller embedding model
  (`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
  wiring before retrying jina-v3.
---
## Smoke 3: Live GPU repair on a malformed table
Confirms the repair loop's GPU escalation path actually invokes the
configured VLM and that the result is applied to the merged document.
**Setup:**
- In `requirements.txt`, uncomment `transformers` (`sentence-transformers` is
  not needed for this smoke).
- Add `configs/space_gpu_repair.yaml`:
  ```yaml
  parsers:
    docling:
      enabled: true
    pymupdf:
      enabled: true
  repair:
    enabled: true
    gpu_escalation: true
    execute_gpu_escalations: true  # the bit that flips the live path on
  gpu:
    backend: transformers
    models:
      table:
        model_id: Qwen/Qwen2.5-VL-3B-Instruct
        task: table-repair
        device: auto
        dtype: bfloat16
  ```
- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.
**Trigger:** upload a PDF that contains a table the parsers will likely
mangle. A two-column financial statement page works well; if you don't
have one handy, take a Wikipedia article PDF that has a comparison table.
**Expected log lines (in order):**
- `parse_start`.
- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
  `gpu_dry_run=false`.
- One `gpu_task_executed` record per GPU task. `status` should be
  `executed` and `elapsed_seconds` roughly 1-10s for a 3B-param VLM on L4.
- A second `repair_iteration` with `iteration=2` only if iteration 1
  changed something and quality is still below threshold; otherwise the
  loop terminates.
- `parse_end` with `repair_iterations >= 1`.
**Pass criteria:**
- At least one `gpu_task_executed` with `status=executed`.
- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
- No `gpu_task_blocked` records (would mean missing image_path or doc_id).
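
To check the provenance criterion without reading the JSON by eye, a sketch like the one below can list which tables carry a repair task id. It assumes you have already loaded `parsed_document.json` and located its list of table objects; where that list sits in the file is not specified here, and `repaired_table_indices` is a hypothetical helper.

```python
def repaired_table_indices(tables):
    """Return indices of tables whose provenance has a gpu_repair_task_id."""
    indices = []
    for index, table in enumerate(tables):
        provenance = table.get("provenance") or {}
        if provenance.get("gpu_repair_task_id"):
            indices.append(index)
    return indices
```

An empty result alongside `gpu_task_executed` records with `status=executed` would suggest the repair ran but was not applied to the merged document.
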
**Failure modes to watch:**
- All `gpu_task_executed` records show `status=execution_failed`:
  inspect the `output.error` field. Common causes are a missing `image_path`
  (the PDF doesn't render page crops because `pdf.crop_tables=true` isn't
  set) or a CUDA OOM.
- No `repair_iteration` records: the verifier didn't flag any
  blocking issues; pick a different input PDF.
---
## Smoke 4: Per-parser ablation across docling + pymupdf
Confirms the ablation runner produces a comparison CSV and that each arm's
artifacts are isolated. No GPU dependency, runs on default Space hardware.
**Setup:** default config, no requirements.txt changes.
**Trigger:** Space JupyterLab terminal:
```bash
zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf
```
**Expected log lines:** one parse cycle per arm (parse_start through
parse_end), three arms total (docling-only, pymupdf-only, merged).
**Pass criteria:**
- `out/ablation/ablation_comparison.csv` has 3 rows.
- Each arm's `mean_quality_score` is non-zero.
- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
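
The three criteria above can be checked against the CSV with a short script. In this sketch only `mean_quality_score` comes from the checklist; the `arm` column name and the `merged` label are assumptions about the CSV layout, so adjust them to match the real header.

```python
import csv
import io


def check_ablation(csv_text, arm_column="arm", merged_label="merged"):
    """Return a list of problems; an empty list means all criteria hold."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    scores = {row[arm_column]: float(row["mean_quality_score"]) for row in rows}
    problems = []
    if len(rows) != 3:
        problems.append(f"expected 3 rows, found {len(rows)}")
    if any(score == 0.0 for score in scores.values()):
        problems.append("an arm has a zero mean_quality_score")
    singles = [score for arm, score in scores.items() if arm != merged_label]
    if merged_label not in scores:
        problems.append("no merged arm")
    elif singles and scores[merged_label] < max(singles):
        problems.append("merged arm scores below a single-parser arm")
    return problems
```

Call it with the file contents, for example `check_ablation(open("out/ablation/ablation_comparison.csv").read())`.
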
---
## Smoke 5: External parser CLI (Marker)
The riskiest of the four external adapters because Marker's argv schema
has changed several times. Give it its own Space deploy; do not bundle it
with other smokes.
**Setup:**
- Uncomment `marker-pdf` in `requirements.txt`.
- Add `configs/space_marker.yaml`:
  ```yaml
  parsers:
    text:
      enabled: false
    pymupdf:
      enabled: false
    marker:
      enabled: true
      timeout_seconds: 300
      output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
      extra_args: []
  ```
- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.
**Trigger:** upload a small PDF (1-3 pages) via the Gradio UI.
**Expected log lines:**
- `parse_start`.
- `parser_candidate` for `marker` with non-zero `element_count`.
- `parse_end` with `candidate_parsers=["marker"]`.
**Pass criteria:**
- No `parser_failed` record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).
- If `parser_failed` fires, look at `extra.error`. The most common cause is
  argv schema drift; tweak `output_args` in the config and retry.
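
When tweaking `output_args`, it helps to picture how the `{output_dir}` placeholder might be expanded into the final command line. The sketch below is a guess at that step, not the adapter's actual code: `build_argv` is hypothetical, the executable name is supplied by the caller, and the position of the input path is an assumption.

```python
def build_argv(executable, input_path, output_args, extra_args, output_dir):
    """Expand the {output_dir} placeholder and assemble an argv list."""
    expanded = [arg.replace("{output_dir}", output_dir) for arg in output_args]
    # Assumed ordering: executable, input file, output flags, extra flags.
    return [executable, input_path, *expanded, *extra_args]
```

If Marker renames a flag, only the `output_args` list in the config needs to change; the placeholder expansion stays the same.
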
---
## What "deployment ready" means after this checklist
If smokes 1-3 pass on a fresh duplicated Space, the project is genuinely
deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
and 5 are nice-to-have: the per-parser ablation works locally too, and
external parsers stay flagged "experimental" until you actively need them.

Open the `parsed_document.json` from each smoke, copy the `quality_score`,
`mean_layout_f1` (where applicable), and any §29-relevant metric into
`README.md` under a new "Production benchmark numbers" section. That
publishes evidence that the success criteria are met against real data.