# Hugging Face Space smoke-test checklist

This is the deferred deployment-readiness work that can only be exercised on
real GPU hardware against real models / external CLIs. Run each smoke once
against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry
gives the exact env vars / config flips, the command to trigger, and the
structured log lines you should expect.

All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
`ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
the environment, so on a normal Space you do not need to set them yourself.
The HF Spaces logs page will surface the JSON records on stderr.
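
The automatic defaults described above could be applied along these lines. This is a minimal sketch, assuming `app.py` uses `os.environ.setdefault` so that values set explicitly in the Space variables still win; the function name `apply_space_logging_defaults` is hypothetical and not taken from the project.

```python
import os


def apply_space_logging_defaults(env=None):
    """Set logging defaults when running on a Space (SPACE_ID present).

    Hypothetical helper; the real app.py may do this differently.
    """
    env = os.environ if env is None else env
    if "SPACE_ID" in env:
        # setdefault leaves any value already configured on the Space alone.
        env.setdefault("ZSGDP_LOG_LEVEL", "INFO")
        env.setdefault("ZSGDP_LOG_JSON", "1")
    return env
```

Passing a plain dict instead of `os.environ` makes the behaviour easy to check locally without touching the real environment.
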

---

## Pre-flight

1. Duplicate the Space and give it `l4x1` hardware.
2. Make sure these are set in **Space settings → Variables and secrets**:
   - `ZSGDP_LOG_LEVEL=INFO`
   - `ZSGDP_LOG_JSON=1`
   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
3. In the Space's `requirements.txt`, uncomment the dependency block matching
   the smoke you are running. Do **one smoke per Space deploy**: combining
   them risks an OOM or a slow cold start on the L4.
4. Push and wait for the Space to build. A first-build cold start with a model
   download takes roughly 5–10 minutes; subsequent restarts take seconds.

After deploy, watch the **Logs** tab for the `parse_start` event. If you do
not see structured JSON lines there, the logging config is not active;
double-check `ZSGDP_LOG_JSON=1` in the Space variables.
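
If you copy the log output to a file, a small filter can separate the structured records from other stderr noise. The sketch below assumes one JSON object per line and that the event name lives under an `event` key; that key name is an assumption, so adjust it to the real record schema.

```python
import json


def iter_json_records(lines):
    """Yield dicts for lines that parse as JSON objects; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text log line, not a structured record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(record, dict):
            yield record


def has_event(lines, name, key="event"):
    """True if any structured record carries the given event name.

    The default key "event" is an assumption about the log schema.
    """
    return any(record.get(key) == name for record in iter_json_records(lines))
```

For example, `has_event(open("space.log"), "parse_start")` would confirm that the logging config is active, where `space.log` is a hypothetical file holding the copied log text.
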

## Automated runner

Each smoke below has an automated counterpart in
`scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
shell with the project installed):

```bash
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json

# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation

# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
```

The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
elapsed seconds and a `detail` block with the metrics it gathered. The
manual procedure below is the fallback when you want to inspect the UI
directly or test something the runner doesn't cover (e.g. uploading a
specific real PDF rather than a synthetic fixture).
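
A report written with `--output` can be summarized with a few lines of Python. The layout of the report file is not documented here, so this sketch assumes a top-level `smokes` list whose entries each carry a `status` key; treat both names as placeholders for whatever the runner actually emits.

```python
from collections import Counter


def summarize_report(report, strict=False):
    """Return (counts, ok) for a runner report.

    Assumes report["smokes"] is a list of dicts with a "status" key; that
    layout is a guess, not the documented schema.
    """
    counts = Counter(entry["status"] for entry in report["smokes"])
    failing = {"fail", "error"}
    if strict:
        failing.add("skip")  # mirrors --strict: skipped smokes count as failures
    ok = not any(counts[status] for status in failing)
    return dict(counts), ok
```

Load the file with `json.load` and pass the resulting dict in; `ok` is `False` as soon as any smoke failed or errored, or was skipped when `strict=True`.
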

---

## Smoke 1: Lexical retriever benchmark (model-free)

Confirms the Space's parsing + benchmark plumbing works end-to-end before
adding any model dependency.

**Setup:**

- Default `requirements.txt` (no uncommenting needed).
- Default config (no flips).

**Trigger:** upload a small markdown file via the Gradio UI.

**Expected log lines (in order):**

- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
- One `parser_candidate` per parser that ran (typically `text`, possibly
  `pymupdf` and `docling` if the file was a PDF).
- Possibly one or more `repair_iteration` records if quality < threshold.
- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.

**Pass criteria:**

- All log lines appear with `doc_id` populated.
- `parse_end.quality_score >= 0.85` for a clean markdown doc.
- No `parser_failed` or `gpu_task_blocked` records.
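
The pass criteria above can be checked mechanically once the JSON records are loaded as dicts. This is a sketch, assuming the event name is stored under an `event` key and that `quality_score` sits at the top level of the `parse_end` record; neither detail is confirmed by this checklist.

```python
def check_lexical_smoke(records, min_quality=0.85):
    """Return a list of problems; an empty list means the smoke passed.

    Assumes each record is a dict with an "event" key (an assumption about
    the log schema) alongside the fields named in the checklist.
    """
    problems = []
    events = [record.get("event") for record in records]
    for required in ("parse_start", "parser_candidate", "parse_end"):
        if required not in events:
            problems.append(f"missing {required} record")
    for record in records:
        event = record.get("event")
        if not record.get("doc_id"):
            problems.append(f"{event} record has no doc_id")
        if event in ("parser_failed", "gpu_task_blocked"):
            problems.append(f"unexpected {event} record")
        if event == "parse_end" and record.get("quality_score", 0.0) < min_quality:
            problems.append("parse_end quality_score below threshold")
    return problems
```

Returning a list of problems rather than a boolean keeps the failure reason visible when the check runs in a notebook or CI job.
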

---

## Smoke 2: Embedding retriever (jina-embeddings-v3)

Confirms the `sentence-transformers` lazy-load path and that jina-v3 specifically
runs on the L4 with `trust_remote_code=True`.

**Setup:**

- In `requirements.txt`, uncomment the `transformers` and `sentence-transformers`
  lines.
- Add `configs/space_embedding.yaml` to the repo with:

  ```yaml
  benchmarks:
    retriever:
      backend: embedding
      model_id: jinaai/jina-embeddings-v3
      task: retrieval.passage
  ```

- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
  or set `ZSGDP_CONFIG_PATH` in the Space variables instead.

**Trigger:** the benchmark CLI is not reachable from the Gradio UI today, so
uploading a markdown file or PDF there is not enough for this smoke. Run
`zsgdp benchmark --input ./fixtures --output ./out` from a Space
**JupyterLab** session against a small input directory.

**Expected log lines:**

- First call: a 30–90 s pause while the jina-v3 weights download (no log lines
  appear during this, because torch logs go to its own logger). Then `parse_start`.
- After the first parse, subsequent calls are fast (the model is in memory).

**Pass criteria:**

- Benchmark completes without an exception.
- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
  corpus.
- No `gpu_task_blocked` records (those are repair-related, not retrieval).
- The `parse_end` record's `device` field reads `cuda`.
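
For reference when reading the recall threshold, the sketch below shows the standard definition of recall at k, averaged over queries. How the project itself computes `mean_retrieval_recall_at_5` is not shown in this checklist, so treat this as an illustration of the metric rather than the benchmark's implementation.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant ids that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no ground truth for this query
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(queries, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

A query with two relevant chunks, only one of which is ranked in the top five, contributes 0.5 to the mean.
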

**Failure modes to watch:**

- `RuntimeError: EmbeddingRetriever requires sentence-transformers`: the
  package is not in `requirements.txt`.
- CUDA OOM: switch to a smaller embedding model
  (`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
  wiring before retrying jina-v3.

---

## Smoke 3: Live GPU repair on a malformed table

Confirms the repair loop's GPU escalation path actually invokes the
configured VLM and that the result is applied to the merged document.

**Setup:**

- In `requirements.txt`, uncomment `transformers` (`sentence-transformers` is
  not needed for this smoke).
- Add `configs/space_gpu_repair.yaml`:

  ```yaml
  parsers:
    docling:
      enabled: true
    pymupdf:
      enabled: true
  repair:
    enabled: true
    gpu_escalation: true
    execute_gpu_escalations: true  # the bit that flips the live path on
  gpu:
    backend: transformers
    models:
      table:
        model_id: Qwen/Qwen2.5-VL-3B-Instruct
        task: table-repair
        device: auto
        dtype: bfloat16
  ```

- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.

**Trigger:** upload a PDF that contains a table the parsers will likely
mangle. A two-column financial statement page works well; if you don't
have one handy, take a Wikipedia article PDF that has a comparison table.

**Expected log lines (in order):**

- `parse_start`.
- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
  `gpu_dry_run=false`.
- One `gpu_task_executed` record per GPU task. `status` should be
  `executed` and `elapsed_seconds` should be 1–10 s for a 3B-param VLM on L4.
- A second `repair_iteration` with `iteration=2` only if iteration 1
  changed something and quality is still below threshold; otherwise the
  loop terminates.
- `parse_end` with `repair_iterations >= 1`.

**Pass criteria:**

- At least one `gpu_task_executed` with `status=executed`.
- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
- No `gpu_task_blocked` records (these would indicate a missing `image_path` or `doc_id`).

**Failure modes to watch:**

- All `gpu_task_executed` records show `status=execution_failed`:
  inspect the `output.error` field. Common causes are a missing `image_path`
  (the PDF doesn't render page crops because `pdf.crop_tables=true` isn't
  set) or a CUDA OOM.
- No `repair_iteration` records: the verifier didn't flag any
  blocking issues, so pick a different input PDF.
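
Both GPU-repair pass criteria can be checked from the artifacts. The sketch below assumes `parsed_document.json` mirrors the attribute path quoted above as nested keys (`tables` → `provenance` → `gpu_repair_task_id`) and that log records carry an `event` key; both are assumptions about the on-disk layout.

```python
def repaired_table_ids(parsed):
    """Collect gpu_repair_task_id values from a parsed-document dict.

    Assumes parsed["tables"][i]["provenance"]["gpu_repair_task_id"] exists
    for repaired tables (layout inferred, not confirmed).
    """
    ids = []
    for table in parsed.get("tables", []):
        task_id = (table.get("provenance") or {}).get("gpu_repair_task_id")
        if task_id:
            ids.append(task_id)
    return ids


def executed_gpu_tasks(records):
    """Return gpu_task_executed records whose status is 'executed'."""
    return [
        record
        for record in records
        if record.get("event") == "gpu_task_executed"
        and record.get("status") == "executed"
    ]
```

The smoke passes this part of the checklist when both functions return non-empty lists for the same run.
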

---

## Smoke 4: Per-parser ablation across docling + pymupdf

Confirms the ablation runner produces a comparison CSV and that each arm's
artifacts are isolated. It has no GPU dependency and runs on default Space hardware.

**Setup:** default config, no `requirements.txt` changes.

**Trigger:** Space JupyterLab terminal:

```bash
zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf
```

**Expected log lines:** one parse cycle per arm (`parse_start` through
`parse_end`), three arms total (docling-only, pymupdf-only, merged).

**Pass criteria:**

- `out/ablation/ablation_comparison.csv` has 3 rows.
- Each arm's `mean_quality_score` is non-zero.
- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
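
The three ablation criteria can be verified directly against the CSV text. Only the `mean_quality_score` column is named in this checklist; the `arm` column and the `merged` label used below are assumptions, exposed as parameters so they can be matched to the real header.

```python
import csv
import io


def check_ablation_csv(text, merged_arm="merged", arm_column="arm"):
    """Return a list of problems found in the ablation comparison CSV.

    arm_column and merged_arm are assumed names; adjust them to the file.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    problems = []
    if len(rows) != 3:
        problems.append(f"expected 3 rows, found {len(rows)}")
    scores = {row[arm_column]: float(row["mean_quality_score"]) for row in rows}
    for arm, score in scores.items():
        if score == 0.0:
            problems.append(f"{arm} has a zero mean_quality_score")
    if merged_arm in scores:
        others = [s for arm, s in scores.items() if arm != merged_arm]
        if others and scores[merged_arm] < max(others):
            problems.append("merged arm scores below the best single parser")
    else:
        problems.append(f"no row for arm {merged_arm!r}")
    return problems
```

Read the file with `Path("out/ablation/ablation_comparison.csv").read_text()` and pass the string in; an empty list means all three criteria hold.
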

---

## Smoke 5: External parser CLI (Marker)

The riskiest of the four external adapters because Marker's argv schema
has changed several times. Run it on its own Space deploy; do not bundle it
with other smokes.

**Setup:**

- Uncomment `marker-pdf` in `requirements.txt`.
- Add `configs/space_marker.yaml`:

  ```yaml
  parsers:
    text:
      enabled: false
    pymupdf:
      enabled: false
    marker:
      enabled: true
      timeout_seconds: 300
      output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
      extra_args: []
  ```

- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.

**Trigger:** upload a small PDF (1–3 pages) via the Gradio UI.

**Expected log lines:**

- `parse_start`.
- `parser_candidate` for `marker` with non-zero `element_count`.
- `parse_end` with `candidate_parsers=["marker"]`.

**Pass criteria:**

- No `parser_failed` record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).
- If `parser_failed` fires, look at `extra.error`. The most common cause is
  argv schema drift; tweak `output_args` in the config and retry.
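
When debugging argv drift, it helps to see how a templated `output_args` list might expand into the final command line. The sketch below is an assumption about the adapter's behaviour: the function name, the argument order (input path first, then output args, then extras), and the literal `{output_dir}` substitution are all hypothetical.

```python
def build_parser_argv(executable, input_path, output_args, extra_args, output_dir):
    """Expand {output_dir} placeholders and assemble an argv list.

    The ordering used here is an assumption; the project's adapter may
    arrange the arguments differently.
    """
    # str.replace avoids str.format errors if an argument contains other braces.
    expanded = [arg.replace("{output_dir}", output_dir) for arg in output_args]
    return [executable, input_path, *expanded, *extra_args]
```

Printing the resulting list and comparing it with the help output of the installed Marker version is a quick way to spot which flag has been renamed.
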

---

## What "deployment ready" means after this checklist

If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely
deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
and 5 are nice-to-have: the per-parser ablation works locally too, and
external parsers stay flagged "experimental" until you actively need them.

Open the `parsed_document.json` from each smoke and copy the `quality_score`,
`mean_layout_f1` (where applicable), and any §29-relevant metric into
`README.md` under a new "Production benchmark numbers" section. That
publishes evidence that the success criteria are met against real data.