# Hugging Face Space smoke-test checklist

This checklist covers the deferred deployment-readiness work that can only be
exercised on real GPU hardware against real models and external CLIs. Run each
smoke once against a duplicated `zeroshotGPU` Space (or your own dev Space).
Each entry gives the exact env vars and config flips, the command to trigger,
and the structured log lines you should expect.

All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
`ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
the environment, so on a normal Space you do not need to set them yourself.
The JSON records are written to stderr and show up on the HF Spaces logs page.
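
When reading a pasted log dump, it can help to separate the structured records from other stderr output. The following is a minimal sketch, assuming each record is a single-line JSON object and that the event name is stored under an `event` key (the key name is an assumption; check it against a real record). `parse_log_records` is a hypothetical helper, not part of the project.

```python
import json

def parse_log_records(raw_text):
    """Return the JSON-object lines from a log dump, skipping everything else."""
    records = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text output from other loggers
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise non-JSON line
        if isinstance(record, dict):
            records.append(record)
    return records

sample = "\n".join([
    "Loading model...",
    '{"event": "parse_start", "doc_id": "abc", "device": "cuda"}',
    '{"event": "parse_end", "doc_id": "abc", "quality_score": 0.91}',
])
events = [r.get("event") for r in parse_log_records(sample)]  # "event" key is assumed
print(events)
```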

---

## Pre-flight

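1. Duplicate the Space and assign it `l4x1` hardware.
2. Confirm these in **Space settings → Variables and secrets** (`app.py` sets
   the two logging variables automatically when `SPACE_ID` is present, so
   setting them explicitly is only a safeguard):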
   - `ZSGDP_LOG_LEVEL=INFO`
   - `ZSGDP_LOG_JSON=1`
   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
3. In the Space's `requirements.txt`, uncomment the dependency block matching
   the smoke you are running. Do **one smoke per Space deploy** — combining
   them risks an OOM or slow cold-start on the L4.
4. Push and wait for the Space to build. The first cold start, including the
   model download, takes roughly 5–10 minutes; subsequent restarts take seconds.

After deploy, watch the **Logs** tab for the `parse_start` event. If you do
not see structured JSON lines there, the logging config is not active —
double-check `ZSGDP_LOG_JSON=1` in the Space variables.

## Automated runner

Each smoke below has an automated counterpart in
`scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
shell with the project installed):

```bash
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json

# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation

# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
```

The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
elapsed seconds and a `detail` block with the metrics it gathered. The
manual procedure below is the fallback when you want to inspect the UI
directly or test something the runner doesn't cover (e.g. uploading a
specific real PDF rather than a synthetic fixture).
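
The report schema is not spelled out here, so the following is only a sketch of how the four statuses and `--strict` could combine into a single green/red verdict. The record shape (a list of dicts with `name`, `status`, `elapsed_seconds`) and the `exit_code` helper are assumptions for illustration; inspect a real `space_smoke_report.json` before relying on them.

```python
from collections import Counter

def exit_code(results, strict=False):
    """Return 0 if the run counts as green, 1 otherwise."""
    counts = Counter(r["status"] for r in results)
    failing = counts["fail"] + counts["error"]
    if strict:
        failing += counts["skip"]  # strict mode: a skipped smoke counts as a failure
    return 0 if failing == 0 else 1

# Hypothetical records; the real report may use different keys.
results = [
    {"name": "lexical", "status": "pass", "elapsed_seconds": 2.1},
    {"name": "embedding", "status": "skip", "elapsed_seconds": 0.0},
]
print(exit_code(results), exit_code(results, strict=True))
```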

---

## Smoke 1 — Lexical retriever benchmark (model-free)

Confirms the Space's parsing + benchmark plumbing works end-to-end before
adding any model dependency.

**Setup:**
- Default `requirements.txt` (no uncommenting needed).
- Default config (no flips).

**Trigger:** upload a small markdown file via the Gradio UI.

**Expected log lines (in order):**
- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
- One `parser_candidate` per parser that ran (typically `text`, possibly
  `pymupdf` and `docling` if the file was a PDF).
- Possibly one or more `repair_iteration` records if quality < threshold.
- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.

**Pass criteria:**
- All log lines appear with `doc_id` populated.
- `parse_end.quality_score >= 0.85` for a clean markdown doc.
- No `parser_failed` or `gpu_task_blocked` records.
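
The pass criteria above can be checked mechanically once the log records are loaded as dicts. This is a hedged sketch: the `event` key and the `check_smoke1` helper are assumptions, while the field names `doc_id` and `quality_score` and the 0.85 threshold come from this checklist.

```python
def check_smoke1(records, min_quality=0.85):
    """Return a list of problems; an empty list means the smoke passed."""
    problems = []
    forbidden = {"parser_failed", "gpu_task_blocked"}
    for rec in records:
        if not rec.get("doc_id"):
            problems.append(f"{rec.get('event')}: doc_id missing")
        if rec.get("event") in forbidden:
            problems.append(f"unexpected {rec['event']} record")
    ends = [r for r in records if r.get("event") == "parse_end"]
    if not ends:
        problems.append("no parse_end record")
    elif ends[-1].get("quality_score", 0.0) < min_quality:
        problems.append("quality_score below threshold")
    return problems

# Illustrative records; values are invented.
records = [
    {"event": "parse_start", "doc_id": "d1", "file_type": "markdown", "device": "cuda"},
    {"event": "parser_candidate", "doc_id": "d1"},
    {"event": "parse_end", "doc_id": "d1", "quality_score": 0.92,
     "repair_iterations": 0, "chunk_count": 4},
]
print(check_smoke1(records))
```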

---

## Smoke 2 — Embedding retriever (jina-embeddings-v3)

Confirms that the `sentence-transformers` lazy-load path works and that jina-v3
specifically runs on the L4 with `trust_remote_code=True`.

**Setup:**
- In `requirements.txt`, uncomment `transformers` and `sentence-transformers`
  lines.
- Add `configs/space_embedding.yaml` to the repo with:

  ```yaml
  benchmarks:
    retriever:
      backend: embedding
      model_id: jinaai/jina-embeddings-v3
      task: retrieval.passage
  ```

- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
  or set `ZSGDP_CONFIG_PATH` as a Space variable instead.
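
If both routes might be in play, one option is to let the Space variable win and use the in-code value only as a fallback. This is a sketch of that design choice, not the project's actual `app.py`; where the line belongs in the file is an assumption.

```python
import os

# Hypothetical placement: near the top of app.py, before the config is loaded.
# setdefault() keeps any value already supplied through Space variables.
os.environ.setdefault("ZSGDP_CONFIG_PATH", "configs/space_embedding.yaml")

config_path = os.environ["ZSGDP_CONFIG_PATH"]
print(config_path)
```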

**Trigger:** the benchmark CLI is not reachable from the Gradio UI today, so a
UI upload alone does not exercise the embedding retriever. Instead, run
`zsgdp benchmark --input ./fixtures --output ./out` from a Space
**JupyterLab** session against a small input directory.

**Expected log lines:**
- First call: a 30–90 s pause while the jina-v3 weights download. No
  structured log lines appear during the download because torch logs go to
  their own logger. Then `parse_start`.
- After the first parse, subsequent calls are fast (model is in memory).

**Pass criteria:**
- Benchmark completes without an exception.
- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
  corpus.
- No `gpu_task_blocked` records (those are repair-related, not retrieval).
- The `parse_end` record's `device` field reads `cuda`.
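
For reference when judging the 0.7 threshold, here is a sketch of the conventional recall@k definition averaged over queries. The project's own computation of `mean_retrieval_recall_at_5` may differ in detail; `recall_at_k` and the chunk IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant items that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# (ranked retrieval output, ground-truth relevant chunks) per query; invented data.
queries = [
    (["c1", "c7", "c3", "c9", "c2", "c4"], ["c3"]),        # hit at rank 3
    (["c5", "c6", "c8", "c1", "c2", "c0"], ["c0", "c5"]),  # one of two in the top 5
]
scores = [recall_at_k(ranked, relevant) for ranked, relevant in queries]
mean_recall = sum(scores) / len(scores)
print(mean_recall)  # 0.75
```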

**Failure modes to watch:**
- `RuntimeError: EmbeddingRetriever requires sentence-transformers` →
  package not in `requirements.txt`.
- CUDA OOM → switch to a smaller embedding model
  (`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
  wiring before retrying jina-v3.

---

## Smoke 3 — Live GPU repair on a malformed table

Confirms the repair loop's GPU escalation path actually invokes the
configured VLM and that the result is applied to the merged document.

**Setup:**
- In `requirements.txt`, uncomment `transformers` (`sentence-transformers` is
  not needed for this smoke).
- Add `configs/space_gpu_repair.yaml`:

  ```yaml
  parsers:
    docling:
      enabled: true
    pymupdf:
      enabled: true
  repair:
    enabled: true
    gpu_escalation: true
    execute_gpu_escalations: true   # the bit that flips the live path on
  gpu:
    backend: transformers
    models:
      table:
        model_id: Qwen/Qwen2.5-VL-3B-Instruct
        task: table-repair
        device: auto
        dtype: bfloat16
  ```

- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.

**Trigger:** upload a PDF that contains a table the parsers will likely
mangle. A two-column financial statement page works well; if you don't
have one handy, take a Wikipedia article PDF that has a comparison table.

**Expected log lines (in order):**
- `parse_start`.
- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
  `gpu_dry_run=false`.
- One `gpu_task_executed` record per GPU task. `status` should be
  `executed`, and `elapsed_seconds` should fall in the 1–10 s range for a
  3B-parameter VLM on the L4.
- A second `repair_iteration` with `iteration=2` only if iteration 1
  changed something and quality is still below threshold; otherwise the
  loop terminates.
- `parse_end` with `repair_iterations >= 1`.

**Pass criteria:**
- At least one `gpu_task_executed` with `status=executed`.
- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
- No `gpu_task_blocked` records (these would indicate a missing `image_path` or `doc_id`).
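
The provenance check can be scripted rather than eyeballed. This is a sketch that assumes `parsed_document.json` has a top-level `tables` list whose entries carry a `provenance` object, as the path above suggests; `repaired_table_indices` is a hypothetical helper.

```python
import json

def repaired_table_indices(parsed):
    """Indices of tables whose provenance carries a gpu_repair_task_id."""
    hits = []
    for i, table in enumerate(parsed.get("tables", [])):
        provenance = table.get("provenance") or {}
        if provenance.get("gpu_repair_task_id"):
            hits.append(i)
    return hits

# In practice, load the artifact instead:
#   with open("parsed_document.json") as fh:
#       parsed = json.load(fh)
parsed = json.loads(
    '{"tables": [{"provenance": {}},'
    ' {"provenance": {"gpu_repair_task_id": "task-1"}}]}'
)
print(repaired_table_indices(parsed))  # [1]
```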

**Failure modes to watch:**
- All `gpu_task_executed` records show `status=execution_failed` →
  inspect the `output.error` field; common causes are a missing `image_path`
  (page crops are not rendered because `pdf.crop_tables=true` isn't set) or
  a CUDA OOM.
- No `repair_iteration` records → the verifier didn't flag any
  blocking issues; pick a different input PDF.
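
To triage the first failure mode quickly, the `gpu_task_executed` records can be grouped by status with their error messages pulled out. This is a sketch under assumptions: the `event` key, and `output.error` being a nested object on the record, are inferred from the field names above and should be verified against real logs.

```python
from collections import Counter

def triage_gpu_tasks(records):
    """Count gpu_task_executed records by status and collect failure messages."""
    tasks = [r for r in records if r.get("event") == "gpu_task_executed"]
    statuses = Counter(t.get("status") for t in tasks)
    errors = [
        (t.get("output") or {}).get("error", "<no error field>")
        for t in tasks
        if t.get("status") == "execution_failed"
    ]
    return statuses, errors

# Illustrative records; values are invented.
records = [
    {"event": "repair_iteration", "iteration": 1},
    {"event": "gpu_task_executed", "status": "executed", "elapsed_seconds": 3.2},
    {"event": "gpu_task_executed", "status": "execution_failed",
     "output": {"error": "missing image_path"}},
]
statuses, errors = triage_gpu_tasks(records)
print(dict(statuses), errors)
```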

---

## Smoke 4 — Per-parser ablation across docling + pymupdf

Confirms the ablation runner produces a comparison CSV and that each arm's
artifacts are isolated. There is no GPU dependency, so it runs on default
Space hardware.

**Setup:** default config, no requirements.txt changes.

**Trigger:** Space JupyterLab terminal:

```bash
zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf
```

**Expected log lines:** one parse cycle per arm (`parse_start` through
`parse_end`), three arms total (docling-only, pymupdf-only, merged).

**Pass criteria:**
- `out/ablation/ablation_comparison.csv` has 3 rows.
- Each arm's `mean_quality_score` is non-zero.
- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
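
These three checks are easy to automate against the CSV. The sketch below assumes the arm name is in a column called `arm` and that the merged arm is labelled `merged`; only `mean_quality_score` is confirmed by this checklist, so adjust the other names to match the real header. The sample scores are invented.

```python
import csv
import io

def check_ablation(csv_text, merged_label="merged"):
    """Return a list of problems; an empty list means all three criteria hold."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    scores = {row["arm"]: float(row["mean_quality_score"]) for row in rows}
    problems = []
    if len(rows) != 3:
        problems.append(f"expected 3 rows, found {len(rows)}")
    if any(score == 0.0 for score in scores.values()):
        problems.append("an arm has a zero mean_quality_score")
    single = [s for arm, s in scores.items() if arm != merged_label]
    if merged_label not in scores:
        problems.append("no merged arm")
    elif single and scores[merged_label] < max(single):
        problems.append("merged arm scores below a single-parser arm")
    return problems

sample_csv = "arm,mean_quality_score\ndocling,0.81\npymupdf,0.77\nmerged,0.86\n"
print(check_ablation(sample_csv))
```

With a real run, read the file first, for example `check_ablation(open("out/ablation/ablation_comparison.csv").read())`.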

---

## Smoke 5 — External parser CLI (Marker)

Marker is the riskiest of the four external adapters because its argv schema
has changed several times. Give this smoke its own Space deploy; do not bundle
it with other smokes.

**Setup:**
- Uncomment `marker-pdf` in `requirements.txt`.
- Add `configs/space_marker.yaml`:

  ```yaml
  parsers:
    text:
      enabled: false
    pymupdf:
      enabled: false
    marker:
      enabled: true
      timeout_seconds: 300
      output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
      extra_args: []
  ```

- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.

**Trigger:** upload a small PDF (1–3 pages) via the Gradio UI.

**Expected log lines:**
- `parse_start`.
- `parser_candidate` for `marker` with non-zero `element_count`.
- `parse_end` with `candidate_parsers=["marker"]`.

**Pass criteria:**
- No `parser_failed` record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).

**Failure modes to watch:**
- `parser_failed` fires → look at `extra.error`; the most common cause is
  argv schema drift. Tweak `output_args` in the config and retry.
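
When tweaking `output_args`, it helps to preview the argv fragment the config would produce and compare it against Marker's current `--help` output. The sketch below assumes the adapter fills `{output_dir}` with `str.format`-style substitution; that mechanism and the `expand_args` helper are assumptions, not the adapter's confirmed behavior.

```python
def expand_args(template_args, **values):
    """Substitute {placeholders} in each argv element, leaving other text untouched."""
    return [arg.format(**values) for arg in template_args]

# Same list as in configs/space_marker.yaml; the output path is a made-up example.
output_args = ["--output_dir", "{output_dir}", "--output_format", "markdown"]
expanded = expand_args(output_args, output_dir="/tmp/marker_out")
print(expanded)
```

An argument containing literal braces would need them doubled (`{{` and `}}`) under this scheme.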

---

## What "deployment ready" means after this checklist

If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely
deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
and 5 are nice-to-have — the per-parser ablation works locally too, and
external parsers stay flagged "experimental" until you actively need them.

Open the `parsed_document.json` from each smoke, copy the `quality_score`,
`mean_layout_f1` (where applicable), and any §29-relevant metric into
`README.md` under a new "Production benchmark numbers" section. That
publishes evidence that the success criteria are met against real data.