Arjunvir Singh
# Hugging Face Space smoke-test checklist
This is the deferred deployment-readiness work that can only be exercised on
real GPU hardware against real models / external CLIs. Run each smoke once
against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry
gives the exact env vars / config flips, the command to trigger, and the
structured log lines you should expect.
All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
`ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
the environment, so on a normal Space you do not need to set them yourself.
The HF Spaces logs page will surface the JSON records on stderr.
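
The defaulting behaviour described above could look roughly like the following. This is a minimal sketch, not the actual contents of `app.py`: the `configure_logging` and `JsonFormatter` names, the `zsgdp_sketch` logger name, and the `fields` attribute used to carry structured data are all assumptions made for illustration. It uses `setdefault` so that values set explicitly in the Space variables still win.

```python
import json
import logging
import os
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Structured fields attached to the record (hypothetical convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def configure_logging(env=os.environ) -> logging.Logger:
    # SPACE_ID is present on a Space: default to INFO + JSON unless overridden.
    if "SPACE_ID" in env:
        env.setdefault("ZSGDP_LOG_LEVEL", "INFO")
        env.setdefault("ZSGDP_LOG_JSON", "1")
    logger = logging.getLogger("zsgdp_sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stderr)  # Spaces surfaces stderr in Logs
    if env.get("ZSGDP_LOG_JSON") == "1":
        handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(env.get("ZSGDP_LOG_LEVEL", "WARNING"))
    return logger
```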
---
## Pre-flight
1. Duplicate the Space and give it `l4x1` hardware.
2. Make sure these are set in **Space settings → Variables and secrets**:
   - `ZSGDP_LOG_LEVEL=INFO`
   - `ZSGDP_LOG_JSON=1`
   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
3. In the Space's `requirements.txt`, uncomment the dependency block matching
   the smoke you are running. Do **one smoke per Space deploy**; combining
   them risks an OOM or a slow cold-start on the L4.
4. Push and wait for the Space to build. First-build cold-start with a model
   download is ~5-10 minutes; subsequent restarts take seconds.
After deploy, watch the **Logs** tab for the `parse_start` event. If you do
not see structured JSON lines there, the logging config is not active;
double-check `ZSGDP_LOG_JSON=1` in the Space variables.
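
If you copy the Logs tab into a file, a small filter makes it easier to confirm that structured records are present. The sketch below assumes each record is a single-line JSON object and that the event name lives under an `event` key; both are assumptions about the log schema, so adjust `key` to match what the Space actually emits.

```python
import json


def iter_events(log_text: str):
    """Yield every line of a log dump that parses as a JSON object."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # build output, tracebacks, plain-text logs
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or non-JSON line
        if isinstance(record, dict):
            yield record


def has_event(log_text: str, name: str, key: str = "event") -> bool:
    """True if any structured record carries the given event name."""
    return any(r.get(key) == name for r in iter_events(log_text))
```

For example, `has_event(open("space.log").read(), "parse_start")` returning `False` on a log that clearly contains parse activity suggests the JSON logging config is not active.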
## Automated runner
Each smoke below has an automated counterpart in
`scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
shell with the project installed):
```bash
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json
# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation
# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
```
The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
elapsed seconds and a `detail` block with the metrics it gathered. The
manual procedure below is the fallback when you want to inspect the UI
directly or test something the runner doesn't cover (e.g. uploading a
specific real PDF rather than a synthetic fixture).
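
To make the difference between the default and `--strict` modes concrete, here is a minimal sketch of how a report could be reduced to a single pass/fail verdict. The report layout (a list of entries, each with a `status` field) and the `summarize` helper are assumptions for illustration; the real schema written by `scripts/run_space_smoke.py` may differ.

```python
from collections import Counter

FAILING = {"fail", "error"}


def summarize(results, strict=False):
    """Return (counts per status, overall ok flag) for a list of smoke results."""
    counts = Counter(r["status"] for r in results)
    # In strict mode a skipped smoke counts as a failure.
    bad = FAILING | ({"skip"} if strict else set())
    ok = not any(counts[s] for s in bad)
    return dict(counts), ok
```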
---
## Smoke 1: Lexical retriever benchmark (model-free)
Confirms the Space's parsing + benchmark plumbing works end-to-end before
adding any model dependency.
**Setup:**
- Default `requirements.txt` (no uncommenting needed).
- Default config (no flips).
**Trigger:** upload a small markdown file via the Gradio UI.
**Expected log lines (in order):**
- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
- One `parser_candidate` per parser that ran (typically `text`, possibly
`pymupdf` and `docling` if the file was a PDF).
- Possibly one or more `repair_iteration` records if quality < threshold.
- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.
**Pass criteria:**
- All log lines appear with `doc_id` populated.
- `parse_end.quality_score >= 0.85` for a clean markdown doc.
- No `parser_failed` or `gpu_task_blocked` records.
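
The pass criteria above can be checked mechanically once the JSON records are loaded into a list of dicts. This is a sketch under assumptions: the event name is taken to live under an `event` key, and `check_smoke1` is a hypothetical helper, not part of the project.

```python
def check_smoke1(records, min_quality=0.85, key="event"):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for r in records:
        if not r.get("doc_id"):
            failures.append(f"{r.get(key)}: missing doc_id")
        if r.get(key) in {"parser_failed", "gpu_task_blocked"}:
            failures.append(f"unexpected {r[key]} record")
    ends = [r for r in records if r.get(key) == "parse_end"]
    if not ends:
        failures.append("no parse_end record")
    elif ends[-1].get("quality_score", 0.0) < min_quality:
        failures.append(
            f"quality_score {ends[-1].get('quality_score')} < {min_quality}"
        )
    return failures
```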
---
## Smoke 2: Embedding retriever (jina-embeddings-v3)
Confirms the `sentence-transformers` lazy-load path and that jina-v3 specifically
runs on the L4 with `trust_remote_code=True`.
**Setup:**
- In `requirements.txt`, uncomment `transformers` and `sentence-transformers`
lines.
- Add `configs/space_embedding.yaml` to the repo with:
```yaml
benchmarks:
  retriever:
    backend: embedding
    model_id: jinaai/jina-embeddings-v3
    task: retrieval.passage
```
- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
or set `ZSGDP_CONFIG_PATH` to the same path in the Space variables.
**Trigger:** the benchmark CLI is not reachable from the Gradio UI today, so
uploading a markdown file or PDF only exercises parsing. To exercise the
embedding retriever, run `zsgdp benchmark --input ./fixtures --output ./out`
from a Space **JupyterLab** session against a small input directory.
**Expected log lines:**
- First call: a 30–90s pause while jina-v3 weights download (no log lines
during this; torch logs go to its own logger). Then `parse_start`.
- After the first parse, subsequent calls are fast (model is in memory).
**Pass criteria:**
- Benchmark completes without an exception.
- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
corpus.
- No `gpu_task_blocked` records (those are repair-related, not retrieval).
- The parse_end record's `device` field reads `cuda`.
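
For intuition about the 0.7 threshold, recall@5 is commonly defined as the share of a query's relevant chunks that appear among the top five retrieved results, averaged over queries. The sketch below implements that common definition; whether `mean_retrieval_recall_at_5` in the benchmark summary is computed exactly this way is an assumption, and both function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant ids found in the top-k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(queries, k=5):
    """queries: iterable of (ranked_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```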
**Failure modes to watch:**
- `RuntimeError: EmbeddingRetriever requires sentence-transformers` →
package not in `requirements.txt`.
- CUDA OOM → switch to a smaller embedding model
(`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
wiring before retrying jina-v3.
---
## Smoke 3: Live GPU repair on a malformed table
Confirms the repair loop's GPU escalation path actually invokes the
configured VLM and that the result is applied to the merged document.
**Setup:**
- In `requirements.txt`, uncomment `transformers` (sentence-transformers
not needed for this smoke).
- Add `configs/space_gpu_repair.yaml`:
```yaml
parsers:
  docling:
    enabled: true
  pymupdf:
    enabled: true
repair:
  enabled: true
  gpu_escalation: true
  execute_gpu_escalations: true  # the bit that flips the live path on
gpu:
  backend: transformers
  models:
    table:
      model_id: Qwen/Qwen2.5-VL-3B-Instruct
      task: table-repair
      device: auto
      dtype: bfloat16
```
- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.
**Trigger:** upload a PDF that contains a table the parsers will likely
mangle. A two-column financial statement page works well; if you don't
have one handy, take a Wikipedia article PDF that has a comparison table.
**Expected log lines (in order):**
- `parse_start`.
- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
`gpu_dry_run=false`.
- One `gpu_task_executed` record per GPU task. `status` should be
`executed` and `elapsed_seconds` 1-10s for a 3B-param VLM on L4.
- A second `repair_iteration` with `iteration=2` only if iteration 1
changed something and quality is still below threshold; otherwise the
loop terminates.
- `parse_end` with `repair_iterations >= 1`.
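
The termination rule in the list above (iterate again only while the last pass changed something and quality is still below threshold) can be sketched as a small loop. This is an illustration of the described behaviour, not the project's repair loop: `run_repair_loop`, the `repair_step` callback, and the `max_iterations` cap are assumptions.

```python
def run_repair_loop(quality, repair_step, threshold, max_iterations=3):
    """Run repair passes until quality is good enough or nothing changes.

    repair_step(quality) must return (new_quality, changed).
    Returns (final_quality, iterations_run).
    """
    iterations = 0
    while quality < threshold and iterations < max_iterations:
        iterations += 1
        quality, changed = repair_step(quality)
        if not changed:
            break  # a no-op pass means another pass would not help either
    return quality, iterations
```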
**Pass criteria:**
- At least one `gpu_task_executed` with `status=executed`.
- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
- No `gpu_task_blocked` records (would mean missing image_path or doc_id).
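
These three criteria can be evaluated together from the log records and the loaded `parsed_document.json`. The sketch assumes the event name is under an `event` key and that tables sit in a top-level `tables` list of the JSON document; `check_smoke3` is a hypothetical helper.

```python
def check_smoke3(records, parsed_document, key="event"):
    """Evaluate the Smoke 3 pass criteria; returns one boolean per criterion."""
    executed = [
        r for r in records
        if r.get(key) == "gpu_task_executed" and r.get("status") == "executed"
    ]
    blocked = [r for r in records if r.get(key) == "gpu_task_blocked"]
    repaired = [
        t for t in parsed_document.get("tables", [])
        if t.get("provenance", {}).get("gpu_repair_task_id")
    ]
    return {
        "gpu_executed": bool(executed),
        "no_blocked": not blocked,
        "table_repaired": bool(repaired),
    }
```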
**Failure modes to watch:**
- All `gpu_task_executed` records show `status=execution_failed` →
inspect the `output.error` field; common causes are a missing image_path
(the PDF doesn't render page crops because `pdf.crop_tables=true` isn't
set) or a CUDA OOM.
- No `repair_iteration` records → the verifier didn't flag any
blocking issues; pick a different input PDF.
---
## Smoke 4: Per-parser ablation across docling + pymupdf
Confirms the ablation runner produces a comparison CSV and that each arm's
artifacts are isolated. No GPU dependency, runs on default Space hardware.
**Setup:** default config, no requirements.txt changes.
**Trigger:** Space JupyterLab terminal:
```bash
zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf
```
**Expected log lines:** one parse cycle per arm (parse_start through
parse_end), three arms total (docling-only, pymupdf-only, merged).
**Pass criteria:**
- `out/ablation/ablation_comparison.csv` has 3 rows.
- Each arm's `mean_quality_score` is non-zero.
- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
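
A quick way to check these criteria is to load the comparison CSV and compare the arms. The column names `arm` and `mean_quality_score`, the `merged` arm label, and the `check_ablation` helper are assumptions for this sketch; inspect the header of `ablation_comparison.csv` and adjust accordingly.

```python
import csv
import io


def check_ablation(csv_text, merged_arm="merged"):
    """Evaluate the Smoke 4 pass criteria against the comparison CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    scores = {row["arm"]: float(row["mean_quality_score"]) for row in rows}
    single = [s for arm, s in scores.items() if arm != merged_arm]
    return {
        "three_rows": len(rows) == 3,
        "all_nonzero": all(s > 0 for s in scores.values()),
        "merged_is_best": (
            merged_arm in scores
            and bool(single)
            and scores[merged_arm] >= max(single)
        ),
    }
```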
---
## Smoke 5: External parser CLI (Marker)
Marker is the riskiest of the four external adapters because its argv schema
has changed several times. Give it its own Space deploy; do not bundle it
with other smokes.
**Setup:**
- Uncomment `marker-pdf` in `requirements.txt`.
- Add `configs/space_marker.yaml`:
```yaml
parsers:
  text:
    enabled: false
  pymupdf:
    enabled: false
  marker:
    enabled: true
    timeout_seconds: 300
    output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
    extra_args: []
```
- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.
**Trigger:** upload a small PDF (1–3 pages) via the Gradio UI.
**Expected log lines:**
- `parse_start`.
- `parser_candidate` for `marker` with non-zero `element_count`.
- `parse_end` with `candidate_parsers=["marker"]`.
**Pass criteria:**
- No `parser_failed` record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).
- If `parser_failed` fires, look at `extra.error`; the most common cause is
argv schema drift, so tweak `output_args` in the config and retry.
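
When debugging argv drift it helps to see the exact command line that a given `output_args` value produces. The sketch below shows one plausible way the `{output_dir}` placeholder could be expanded into a subprocess argv list; `build_argv`, the argument order, and the executable name passed in are assumptions, not the adapter's actual implementation.

```python
def build_argv(executable, input_path, output_dir, output_args, extra_args=()):
    """Expand `{output_dir}` placeholders and assemble a subprocess argv list."""
    # str.replace rather than str.format, so stray braces in other
    # arguments do not raise KeyError or ValueError.
    expanded = [arg.replace("{output_dir}", str(output_dir)) for arg in output_args]
    return [executable, str(input_path), *expanded, *extra_args]
```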
---
## What "deployment ready" means after this checklist
If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely
deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
and 5 are nice-to-have: the per-parser ablation works locally too, and
external parsers stay flagged "experimental" until you actively need them.
Open the `parsed_document.json` from each smoke, copy the `quality_score`,
`mean_layout_f1` (where applicable), and any §29-relevant metric into
`README.md` under a new "Production benchmark numbers" section. That
publishes evidence that the success criteria are met against real data.