zeroshotGPU / docs /space_smoke.md
Arjunvir Singh
Initial commit: zeroshotGPU MVP with full eval surface
db06ffa

Hugging Face Space smoke-test checklist

This is the deferred deployment-readiness work that can only be exercised on real GPU hardware against real models / external CLIs. Run each smoke once against a duplicated zeroshotGPU Space (or your own dev Space). Each entry gives the exact env vars / config flips, the command to trigger, and the structured log lines you should expect.

All log lines below assume the Space is run with ZSGDP_LOG_LEVEL=INFO and ZSGDP_LOG_JSON=1. app.py sets these automatically when SPACE_ID is in the environment, so on a normal Space you do not need to set them yourself. The HF Spaces logs page will surface the JSON records on stderr.


Pre-flight

  1. Duplicate the Space and assign it l4x1 hardware.
  2. Make sure these are set in Space settings → Variables and secrets:
    • ZSGDP_LOG_LEVEL=INFO
    • ZSGDP_LOG_JSON=1
    • (Optional, only for parser smokes that hit a private repo) HF_TOKEN.
  3. In the Space's requirements.txt, uncomment the dependency block matching the smoke you are running. Do one smoke per Space deploy; combining them risks an OOM or a slow cold start on the L4.
  4. Push and wait for the Space to build. A first-build cold start with a model download takes ~5-10 minutes; subsequent restarts take seconds.

After deploy, watch the Logs tab for the parse_start event. If you do not see structured JSON lines there, the logging config is not active; double-check ZSGDP_LOG_JSON=1 in the Space variables.
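If you would rather check a saved copy of the Logs tab than eyeball it, a minimal sketch like the following can pick out the structured records. It assumes each JSON record names its event under an "event" key, which this checklist does not confirm; the helper names are hypothetical and not part of zsgdp.

```python
import json
from typing import Iterable, Iterator


def iter_json_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text output from other loggers
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(record, dict):
            yield record


def has_event(lines: Iterable[str], name: str, key: str = "event") -> bool:
    """True if any structured record carries the given event name.

    `key` is an assumption; change it to the field the JSON formatter emits.
    """
    return any(rec.get(key) == name for rec in iter_json_records(lines))
```

For example, has_event(open("space.log"), "parse_start") returning False on a log that clearly contains parse activity would suggest that JSON logging is off or that the event field has a different name.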

Automated runner

Each smoke below has an automated counterpart in scripts/run_space_smoke.py. From a Space JupyterLab terminal (or any shell with the project installed):

# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json

# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation

# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict

The runner reports pass / fail / skip / error per smoke, plus elapsed seconds and a detail block with the metrics it gathered. The manual procedure below is the fallback when you want to inspect the UI directly or test something the runner doesn't cover (e.g. uploading a specific real PDF rather than a synthetic fixture).


Smoke 1: Lexical retriever benchmark (model-free)

Confirms the Space's parsing + benchmark plumbing works end-to-end before adding any model dependency.

Setup:

  • Default requirements.txt (no uncommenting needed).
  • Default config (no flips).

Trigger: upload a small markdown file via the Gradio UI.

Expected log lines (in order):

  • parse_start with doc_id, file_type, device (likely cuda).
  • One parser_candidate per parser that ran (typically text, possibly pymupdf and docling if the file was a PDF).
  • Possibly one or more repair_iteration records if quality < threshold.
  • parse_end with quality_score, repair_iterations, chunk_count.

Pass criteria:

  • All log lines appear with doc_id populated.
  • parse_end.quality_score >= 0.85 for a clean markdown doc.
  • No parser_failed or gpu_task_blocked records.
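The pass criteria above can be sketched as a small check over already-parsed log records. This is an illustration only: it assumes each record is a flat dict with the event name under "event" and the other fields (doc_id, quality_score) at the top level, and check_lexical_smoke is a hypothetical helper, not something the project ships.

```python
from typing import Dict, List

# Events that must not appear in a passing run (from the pass criteria).
FORBIDDEN_EVENTS = {"parser_failed", "gpu_task_blocked"}


def check_lexical_smoke(
    records: List[Dict],
    min_quality: float = 0.85,
    event_key: str = "event",  # assumed field name
) -> List[str]:
    """Return failure messages; an empty list means the smoke passed."""
    if not records:
        return ["no structured records found"]
    failures = []
    for rec in records:
        event = rec.get(event_key)
        if not rec.get("doc_id"):
            failures.append(f"{event}: doc_id missing")
        if event in FORBIDDEN_EVENTS:
            failures.append(f"unexpected {event} record")
    ends = [r for r in records if r.get(event_key) == "parse_end"]
    if not ends:
        failures.append("no parse_end record")
    for rec in ends:
        score = rec.get("quality_score")
        if score is None or score < min_quality:
            failures.append(f"parse_end quality_score {score} < {min_quality}")
    return failures
```

Returning a list of messages rather than a bare boolean keeps the reason for a failure visible when you paste the result into an issue or a CI log.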

Smoke 2: Embedding retriever (jina-embeddings-v3)

Confirms that the sentence-transformers lazy-load path works and that jina-v3 specifically runs on the L4 with trust_remote_code=True.

Setup:

  • In requirements.txt, uncomment the transformers and sentence-transformers lines.

  • Add configs/space_embedding.yaml to the repo with:

    benchmarks:
      retriever:
        backend: embedding
        model_id: jinaai/jina-embeddings-v3
        task: retrieval.passage
    
  • In app.py, set os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml", or set ZSGDP_CONFIG_PATH in the Space variables instead.
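One way to support both options in the last setup step is to have app.py supply a default only when the variable is not already set, so a value configured in the Space variables still wins. A small sketch; resolve_config_path is a hypothetical helper, not a function that exists in app.py as described here.

```python
import os
from typing import MutableMapping, Optional


def resolve_config_path(
    default: str = "configs/space_embedding.yaml",
    environ: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Return ZSGDP_CONFIG_PATH, setting it to `default` only if it is unset."""
    env = os.environ if environ is None else environ
    # setdefault leaves an existing value (e.g. from Space variables) untouched.
    return env.setdefault("ZSGDP_CONFIG_PATH", default)
```

The call has to happen before the code that reads ZSGDP_CONFIG_PATH runs; where exactly that is in app.py is not covered by this checklist.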

Trigger: upload any markdown file or PDF. The benchmark CLI is not reachable from the Gradio UI today, so to exercise the embedding retriever itself, run zsgdp benchmark --input ./fixtures --output ./out from a Space JupyterLab session against a small input directory.

Expected log lines:

  • First call: a 30–90s pause while jina-v3 weights download (no log lines appear during this; torch logs go to its own logger). Then parse_start.
  • After the first parse, subsequent calls are fast (model is in memory).

Pass criteria:

  • Benchmark completes without an exception.
  • summary["mean_retrieval_recall_at_5"] >= 0.7 on a small distinct-text corpus.
  • No gpu_task_blocked records (those are repair-related, not retrieval).
  • The parse_end record's device field reads cuda.
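For orientation when reading the 0.7 threshold, recall@5 for one query is the fraction of its relevant chunks that appear in the top five retrieved results, and the mean is taken over queries. The sketch below is the textbook definition, not necessarily how zsgdp computes mean_retrieval_recall_at_5; the function names are illustrative.

```python
from typing import Iterable, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 5) -> float:
    """Fraction of relevant ids that appear in the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("a query needs at least one relevant id")
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


def mean_recall_at_k(queries: List[Tuple[Sequence[str], Iterable[str]]], k: int = 5) -> float:
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("no queries to score")
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores)
```

With one relevant chunk per query, which is common for a small distinct-text corpus, the mean is simply the share of queries whose source chunk lands in the top five.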

Failure modes to watch:

  • RuntimeError: EmbeddingRetriever requires sentence-transformers → the package is not in requirements.txt.
  • CUDA OOM → switch to a smaller embedding model (sentence-transformers/all-MiniLM-L6-v2) for the smoke and confirm the wiring before retrying jina-v3.

Smoke 3: Live GPU repair on a malformed table

Confirms the repair loop's GPU escalation path actually invokes the configured VLM and that the result is applied to the merged document.

Setup:

  • In requirements.txt, uncomment transformers (sentence-transformers not needed for this smoke).

  • Add configs/space_gpu_repair.yaml:

    parsers:
      docling:
        enabled: true
      pymupdf:
        enabled: true
    repair:
      enabled: true
      gpu_escalation: true
      execute_gpu_escalations: true   # the bit that flips the live path on
    gpu:
      backend: transformers
      models:
        table:
          model_id: Qwen/Qwen2.5-VL-3B-Instruct
          task: table-repair
          device: auto
          dtype: bfloat16
    
  • Set ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml on the Space.

Trigger: upload a PDF that contains a table the parsers will likely mangle. A two-column financial statement page works well; if you don't have one handy, use a PDF of a Wikipedia article that has a comparison table.

Expected log lines (in order):

  • parse_start.
  • parser_candidate for docling and pymupdf (both should fire on a PDF).
  • repair_iteration with iteration=1, gpu_task_count >= 1, gpu_dry_run=false.
  • One gpu_task_executed record per GPU task. status should be executed, and elapsed_seconds should be in the 1-10 s range for a 3B-parameter VLM on the L4.
  • A second repair_iteration with iteration=2 only if iteration 1 changed something and quality is still below threshold; otherwise the loop terminates.
  • parse_end with repair_iterations >= 1.

Pass criteria:

  • At least one gpu_task_executed with status=executed.
  • The output parsed_document.json shows parsed.tables[i].provenance.gpu_repair_task_id set.
  • No gpu_task_blocked records (would mean missing image_path or doc_id).
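The provenance check in the pass criteria can be sketched as below. It assumes parsed_document.json deserializes to a dict with a top-level "tables" list whose entries carry a "provenance" dict; that layout is inferred from the path parsed.tables[i].provenance.gpu_repair_task_id and may not match the real schema. Both helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def load_parsed_document(path: str) -> dict:
    """Read a parsed_document.json artifact from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def repaired_table_indices(parsed: dict) -> List[int]:
    """Indices of tables whose provenance records a GPU repair task id."""
    tables = parsed.get("tables") or []  # assumed top-level key
    return [
        i
        for i, table in enumerate(tables)
        if (table.get("provenance") or {}).get("gpu_repair_task_id")
    ]
```

An empty result from repaired_table_indices(load_parsed_document("parsed_document.json")) alongside gpu_task_executed records with status=executed would suggest the repair ran but was not applied to the merged document.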

Failure modes to watch:

  • All gpu_task_executed records show status=execution_failed → inspect the output.error field; common causes are a missing image_path (the PDF doesn't render page crops because pdf.crop_tables=true isn't set) or a CUDA OOM.
  • No repair_iteration records → the verifier didn't flag any blocking issues; pick a different input PDF.

Smoke 4: Per-parser ablation across docling + pymupdf

Confirms the ablation runner produces a comparison CSV and that each arm's artifacts are isolated. It has no GPU dependency and runs on default Space hardware.

Setup: default config, no requirements.txt changes.

Trigger: from a Space JupyterLab terminal, run:

zsgdp benchmark-ablate \
  --input ./fixtures/pdfs \
  --output ./out/ablation \
  --parser docling --parser pymupdf

Expected log lines: one parse cycle per arm (parse_start through parse_end), three arms total (docling-only, pymupdf-only, merged).

Pass criteria:

  • out/ablation/ablation_comparison.csv has 3 rows.
  • Each arm's mean_quality_score is non-zero.
  • The merged arm's mean_quality_score is >= max(per-parser arms).
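The three criteria above can be checked mechanically once the CSV exists. The sketch below assumes the file has an "arm" column naming each arm (with the merged arm called "merged") next to mean_quality_score; only the mean_quality_score name comes from this checklist, so adjust the rest to the real header row. check_ablation is a hypothetical helper.

```python
import csv
import io
from typing import List


def check_ablation(
    csv_text: str,
    arm_col: str = "arm",  # assumed column name
    score_col: str = "mean_quality_score",
    merged_name: str = "merged",  # assumed label of the merged arm
) -> List[str]:
    """Return failure messages; an empty list means all criteria hold."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    failures = []
    if len(rows) != 3:
        failures.append(f"expected 3 rows, found {len(rows)}")
    scores = {row[arm_col]: float(row[score_col]) for row in rows}
    for arm, score in scores.items():
        if score == 0.0:
            failures.append(f"{arm}: mean_quality_score is zero")
    if merged_name not in scores:
        failures.append(f"no '{merged_name}' arm found")
    else:
        others = [s for arm, s in scores.items() if arm != merged_name]
        if others and scores[merged_name] < max(others):
            failures.append("merged arm scores below the best single parser")
    return failures
```

Passing the file contents as text, for example check_ablation(open("out/ablation/ablation_comparison.csv").read()), keeps the function easy to try against a pasted sample.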

Smoke 5: External parser CLI (Marker)

This is the riskiest of the four external adapters because Marker's argv schema has changed several times. Run it on its own Space deploy; do not bundle it with other smokes.

Setup:

  • Uncomment marker-pdf in requirements.txt.

  • Add configs/space_marker.yaml:

    parsers:
      text:
        enabled: false
      pymupdf:
        enabled: false
      marker:
        enabled: true
        timeout_seconds: 300
        output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
        extra_args: []
    
  • Set ZSGDP_CONFIG_PATH=configs/space_marker.yaml.

Trigger: upload a small PDF (1–3 pages) via the Gradio UI.

Expected log lines:

  • parse_start.
  • parser_candidate for marker with non-zero element_count.
  • parse_end with candidate_parsers=["marker"].

Pass criteria:

  • No parser_failed record for marker.
  • Output Markdown has reasonable content (open the artifact zip and check).
  • If parser_failed fires, look at extra.error; the most common cause is argv schema drift. Tweak output_args in the config and retry.

What "deployment ready" means after this checklist

If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4 and 5 are nice-to-have: the per-parser ablation works locally too, and external parsers stay flagged "experimental" until you actively need them.

Open the parsed_document.json from each smoke, copy the quality_score, mean_layout_f1 (where applicable), and any §29-relevant metric into README.md under a new "Production benchmark numbers" section. That publishes evidence that the success criteria are met against real data.