# Hugging Face Space smoke-test checklist

This is the deferred deployment-readiness work that can only be exercised on real GPU hardware against real models / external CLIs. Run each smoke once against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry gives the exact env vars / config flips, the command to trigger, and the structured log lines you should expect.

All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in the environment, so on a normal Space you do not need to set them yourself. The HF Spaces logs page will surface the JSON records on stderr.

---

## Pre-flight

1. Duplicate the Space and give it `l4x1` hardware.
2. Make sure these are set in **Space settings → Variables and secrets** (the two logging variables only need setting by hand if the automatic defaults from `app.py` do not take effect):
   - `ZSGDP_LOG_LEVEL=INFO`
   - `ZSGDP_LOG_JSON=1`
   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
3. In the Space's `requirements.txt`, uncomment the dependency block matching the smoke you are running. Do **one smoke per Space deploy**: combining them risks an OOM or a slow cold-start on the L4.
4. Push and wait for the Space to build. First-build cold-start with a model download is ~5-10 minutes; subsequent restarts take seconds.

After deploy, watch the **Logs** tab for the `parse_start` event. If you do not see structured JSON lines there, the logging config is not active; double-check `ZSGDP_LOG_JSON=1` in the Space variables.

## Automated runner

Each smoke below has an automated counterpart in `scripts/run_space_smoke.py`.
From a Space JupyterLab terminal (or any shell with the project installed):

```bash
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json

# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation

# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
```

The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus elapsed seconds and a `detail` block with the metrics it gathered. The manual procedure below is the fallback when you want to inspect the UI directly or test something the runner doesn't cover (e.g. uploading a specific real PDF rather than a synthetic fixture).

---

## Smoke 1: Lexical retriever benchmark (model-free)

Confirms the Space's parsing + benchmark plumbing works end-to-end before adding any model dependency.

**Setup:**
- Default `requirements.txt` (no uncommenting needed).
- Default config (no flips).

**Trigger:** upload a small markdown file via the Gradio UI.

**Expected log lines (in order):**
- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
- One `parser_candidate` per parser that ran (typically `text`, possibly `pymupdf` and `docling` if the file was a PDF).
- Possibly one or more `repair_iteration` records if quality is below the threshold.
- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.

**Pass criteria:**
- All log lines appear with `doc_id` populated.
- `parse_end.quality_score >= 0.85` for a clean markdown doc.
- No `parser_failed` or `gpu_task_blocked` records.

---

## Smoke 2: Embedding retriever (jina-embeddings-v3)

Confirms the `sentence-transformers` lazy-load path and that jina-v3 specifically runs on the L4 with `trust_remote_code=True`.

**Setup:**
- In `requirements.txt`, uncomment the `transformers` and `sentence-transformers` lines.
- Add `configs/space_embedding.yaml` to the repo with:
  ```yaml
  benchmarks:
    retriever:
      backend: embedding
      model_id: jinaai/jina-embeddings-v3
      task: retrieval.passage
  ```
- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`, or set `ZSGDP_CONFIG_PATH` in the Space variables.

**Trigger:** uploading a markdown / PDF file via the Gradio UI exercises parsing only, because the benchmark CLI is not reachable from the UI today. For the embedding-retriever smoke itself, run `zsgdp benchmark --input ./fixtures --output ./out` from a Space **JupyterLab** session against a small input dir.

**Expected log lines:**
- First call: a 30–90s pause while the jina-v3 weights download (no log lines appear during this; torch logs go to its own logger). Then `parse_start`.
- After the first parse, subsequent calls are fast (the model is in memory).

**Pass criteria:**
- Benchmark completes without an exception.
- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text corpus.
- No `gpu_task_blocked` records (those are repair-related, not retrieval).
- The `parse_end` record's `device` field reads `cuda`.

**Failure modes to watch:**
- `RuntimeError: EmbeddingRetriever requires sentence-transformers` → package not in `requirements.txt`.
- CUDA OOM → switch to a smaller embedding model (`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the wiring before retrying jina-v3.

---

## Smoke 3: Live GPU repair on a malformed table

Confirms that the repair loop's GPU escalation path actually invokes the configured VLM and that the result is applied to the merged document.

**Setup:**
- In `requirements.txt`, uncomment `transformers` (`sentence-transformers` is not needed for this smoke).
- Add `configs/space_gpu_repair.yaml`:
  ```yaml
  parsers:
    docling:
      enabled: true
    pymupdf:
      enabled: true
  repair:
    enabled: true
    gpu_escalation: true
    execute_gpu_escalations: true  # the bit that flips the live path on
  gpu:
    backend: transformers
    models:
      table:
        model_id: Qwen/Qwen2.5-VL-3B-Instruct
        task: table-repair
        device: auto
        dtype: bfloat16
  ```
- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.

**Trigger:** upload a PDF that contains a table the parsers will likely mangle. A two-column financial statement page works well; if you don't have one handy, take a Wikipedia article PDF that has a comparison table.

**Expected log lines (in order):**
- `parse_start`.
- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`, `gpu_dry_run=false`.
- One `gpu_task_executed` record per GPU task. `status` should be `executed` and `elapsed_seconds` should be 1-10s for a 3B-param VLM on L4.
- A second `repair_iteration` with `iteration=2` only if iteration 1 changed something and quality is still below threshold; otherwise the loop terminates.
- `parse_end` with `repair_iterations >= 1`.

**Pass criteria:**
- At least one `gpu_task_executed` with `status=executed`.
- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
- No `gpu_task_blocked` records (these would mean a missing `image_path` or `doc_id`).

**Failure modes to watch:**
- All `gpu_task_executed` records show `status=execution_failed` → inspect the `output.error` field; common causes are a missing `image_path` (the PDF doesn't render page crops because `pdf.crop_tables=true` isn't set) or a CUDA OOM.
- No `repair_iteration` records → the verifier didn't flag any blocking issues; pick a different input PDF.

---

## Smoke 4: Per-parser ablation across docling + pymupdf

Confirms the ablation runner produces a comparison CSV and that each arm's artifacts are isolated. It has no GPU dependency and runs on default Space hardware.

**Setup:** default config, no `requirements.txt` changes.

**Trigger:** Space JupyterLab terminal:

```bash
zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf
```

**Expected log lines:** one parse cycle per arm (`parse_start` through `parse_end`), three arms total (docling-only, pymupdf-only, merged).

**Pass criteria:**
- `out/ablation/ablation_comparison.csv` has 3 rows.
- Each arm's `mean_quality_score` is non-zero.
- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.

---

## Smoke 5: External parser CLI (Marker)

This is the riskiest of the four external adapters because Marker's argv schema has changed several times. Give it its own Space deploy; do not bundle it with other smokes.

**Setup:**
- Uncomment `marker-pdf` in `requirements.txt`.
- Add `configs/space_marker.yaml`:
  ```yaml
  parsers:
    text:
      enabled: false
    pymupdf:
      enabled: false
    marker:
      enabled: true
      timeout_seconds: 300
      output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
      extra_args: []
  ```
- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.

**Trigger:** upload a small PDF (1–3 pages) via the Gradio UI.

**Expected log lines:**
- `parse_start`.
- `parser_candidate` for `marker` with non-zero `element_count`.
- `parse_end` with `candidate_parsers=["marker"]`.

**Pass criteria:**
- No `parser_failed` record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).
- If `parser_failed` fires, look at `extra.error`. The most common cause is argv schema drift; tweak `output_args` in the config and retry.

---

## What "deployment ready" means after this checklist

If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4 and 5 are nice-to-have: the per-parser ablation works locally too, and external parsers stay flagged "experimental" until you actively need them.

Open the `parsed_document.json` from each smoke and copy the `quality_score`, `mean_layout_f1` (where applicable), and any §29-relevant metric into `README.md` under a new "Production benchmark numbers" section. That publishes evidence that the success criteria are met against real data.
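
The "Expected log lines" and "Pass criteria" lists for Smoke 1 can also be checked mechanically once the Space logs are saved to a file. The following is a minimal Python sketch, not part of the project. It assumes each JSON record carries its event name under an `event` key (this checklist does not specify the actual key name), and the helper names `load_events` and `check_smoke1` are hypothetical. The sample records at the bottom are made up for illustration.

```python
import json

# Assumption: the event name lives under "event" in each JSON log record.
# Change EVENT_KEY if the real records use a different key.
EVENT_KEY = "event"


def load_events(lines):
    """Parse JSON log lines, skipping anything that is not a JSON object."""
    records = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # non-JSON output, e.g. library warnings
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and EVENT_KEY in record:
            records.append(record)
    return records


def check_smoke1(records, min_quality=0.85):
    """Return a list of problems found; an empty list means the smoke passed."""
    problems = []
    names = [r[EVENT_KEY] for r in records]

    if "parse_start" not in names or "parse_end" not in names:
        problems.append("missing parse_start or parse_end")
    elif names.index("parse_start") > names.index("parse_end"):
        problems.append("parse_end logged before parse_start")

    if "parser_candidate" not in names:
        problems.append("no parser_candidate record")

    for forbidden in ("parser_failed", "gpu_task_blocked"):
        if forbidden in names:
            problems.append(f"unexpected {forbidden} record")

    if any(not r.get("doc_id") for r in records):
        problems.append("record without doc_id")

    for r in records:
        if r[EVENT_KEY] == "parse_end":
            score = r.get("quality_score", 0.0)
            if score < min_quality:
                problems.append(f"quality_score {score} below {min_quality}")
    return problems


if __name__ == "__main__":
    # Made-up sample records; replace with lines read from a saved log file.
    sample = [
        '{"event": "parse_start", "doc_id": "d1", "file_type": "md", "device": "cuda"}',
        '{"event": "parser_candidate", "doc_id": "d1"}',
        '{"event": "parse_end", "doc_id": "d1", "quality_score": 0.91}',
    ]
    print(check_smoke1(load_events(sample)) or "pass")
```

To use it against a real run, pass the lines of a saved log file to `load_events`. The same pattern extends to the other smokes by swapping in their expected and forbidden event names.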