Hugging Face Space smoke-test checklist
This is the deferred deployment-readiness work that can only be exercised on
real GPU hardware against real models / external CLIs. Run each smoke once
against a duplicated zeroshotGPU Space (or your own dev Space). Each entry
gives the exact env vars / config flips, the command to trigger, and the
structured log lines you should expect.
All log lines below assume the Space is run with ZSGDP_LOG_LEVEL=INFO and
ZSGDP_LOG_JSON=1. app.py sets these automatically when SPACE_ID is in
the environment, so on a normal Space you do not need to set them yourself.
The HF Spaces logs page will surface the JSON records on stderr.
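The defaulting behaviour described above could look roughly like the sketch below. It is not the project's actual app.py: the helper name apply_space_logging_defaults is hypothetical, and using setdefault (so values set explicitly in Space settings win) is an assumption.

```python
from typing import MutableMapping


def apply_space_logging_defaults(env: MutableMapping[str, str]) -> bool:
    """Set the logging variables when running inside a Space.

    Returns True if SPACE_ID was present. Hypothetical helper; the real
    app.py may implement this differently.
    """
    if "SPACE_ID" not in env:
        return False
    # setdefault keeps any value already configured in Space settings.
    env.setdefault("ZSGDP_LOG_LEVEL", "INFO")
    env.setdefault("ZSGDP_LOG_JSON", "1")
    return True
```

Calling apply_space_logging_defaults(os.environ) before logging is configured would reproduce the behaviour on a Space and leave a local shell untouched.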
Pre-flight
- Duplicate the Space, give it l4x1 hardware.
- Make sure these are set in Space settings → Variables and secrets:
  - ZSGDP_LOG_LEVEL=INFO
  - ZSGDP_LOG_JSON=1
  - (Optional, only for parser smokes that hit a private repo) HF_TOKEN.
- In the Space's requirements.txt, uncomment the dependency block matching the smoke you are running. Do one smoke per Space deploy; combining them risks an OOM or slow cold-start on the L4.
- Push and wait for the Space to build. First-build cold-start with a model download is ~5-10 minutes; subsequent restarts are seconds.
After deploy, watch the Logs tab for the parse_start event. If you do
not see structured JSON lines there, the logging config is not active;
double-check ZSGDP_LOG_JSON=1 in the Space variables.
Automated runner
Each smoke below has an automated counterpart in
scripts/run_space_smoke.py. From a Space JupyterLab terminal (or any
shell with the project installed):
# Run all smokes whose deps are installed; skip the rest with hints:
python -m scripts.run_space_smoke --output ./space_smoke_report.json
# Run only specific smokes:
python -m scripts.run_space_smoke --smoke lexical --smoke ablation
# CI-strict mode: treat skipped smokes as failures (use after you've
# uncommented the deps for the smoke you intend to run):
python -m scripts.run_space_smoke --smoke embedding --strict
The runner reports pass / fail / skip / error per smoke, plus
elapsed seconds and a detail block with the metrics it gathered. The
manual procedure below is the fallback when you want to inspect the UI
directly or test something the runner doesn't cover (e.g. uploading a
specific real PDF rather than a synthetic fixture).
Smoke 1: Lexical retriever benchmark (model-free)
Confirms the Space's parsing + benchmark plumbing works end-to-end before adding any model dependency.
Setup:
- Default requirements.txt (no uncommenting needed).
- Default config (no flips).
Trigger: upload a small markdown file via the Gradio UI.
Expected log lines (in order):
- parse_start with doc_id, file_type, device (likely cuda).
- One parser_candidate per parser that ran (typically text, possibly pymupdf and docling if the file was a PDF).
- Possibly one or more repair_iteration records if quality < threshold.
- parse_end with quality_score, repair_iterations, chunk_count.
Pass criteria:
- All log lines appear with doc_id populated.
- parse_end.quality_score >= 0.85 for a clean markdown doc.
- No parser_failed or gpu_task_blocked records.
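If you copy the Logs tab output into a file, the pass criteria above can be checked mechanically. The sketch below is one way to do that. It assumes each JSON record names its event under an "event" key, which this checklist does not confirm; adjust the key to whatever the records actually use. The function name check_smoke1 is hypothetical.

```python
import json
from typing import Iterable

# Events this smoke expects to carry a doc_id.
PARSE_EVENTS = {"parse_start", "parser_candidate", "repair_iteration", "parse_end"}


def check_smoke1(lines: Iterable[str], min_quality: float = 0.85) -> list[str]:
    """Return a list of problems; an empty list means the smoke passed."""
    records = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON output interleaved in the log
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # truncated or malformed line

    problems = []
    events = [r.get("event") for r in records]  # "event" key is an assumption
    for required in ("parse_start", "parser_candidate", "parse_end"):
        if required not in events:
            problems.append(f"missing {required} record")
    for r in records:
        event = r.get("event")
        if event in ("parser_failed", "gpu_task_blocked"):
            problems.append(f"unexpected {event} record")
        if event in PARSE_EVENTS and not r.get("doc_id"):
            problems.append(f"{event} record has no doc_id")
        if event == "parse_end" and r.get("quality_score", 0.0) < min_quality:
            problems.append(f"quality_score {r.get('quality_score')} < {min_quality}")
    return problems
```

For example, check_smoke1(open("space.log")) returns an empty list on a clean run and one message per violated criterion otherwise.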
Smoke 2: Embedding retriever (jina-embeddings-v3)
Confirms sentence-transformers lazy-load path and that jina-v3 specifically
runs on the L4 with trust_remote_code=True.
Setup:
- In requirements.txt, uncomment the transformers and sentence-transformers lines.
- Add configs/space_embedding.yaml to the repo with:

      benchmarks:
        retriever:
          backend: embedding
          model_id: jinaai/jina-embeddings-v3
          task: retrieval.passage

- In app.py set os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml", or set the same variable in the Space variables.
Trigger: upload any markdown / PDF; the benchmark CLI is not reachable
from the Gradio UI today, so for the embedding-retriever smoke you'd need
to run zsgdp benchmark --input ./fixtures --output ./out from a Space
JupyterLab session against a small input dir.
Expected log lines:
- First call: a 30-90s pause while jina-v3 weights download (no log lines during this; torch logs go to its own logger). Then parse_start.
- After the first parse, subsequent calls are fast (model is in memory).
Pass criteria:
- Benchmark completes without an exception.
- summary["mean_retrieval_recall_at_5"] >= 0.7 on a small distinct-text corpus.
- No gpu_task_blocked records (those are repair-related, not retrieval).
- The parse_end record's device field reads cuda.
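For context on the 0.7 threshold, recall at 5 is conventionally the fraction of a query's relevant documents that appear in its top five results, averaged over queries. The sketch below shows that conventional definition. Whether the benchmark computes mean_retrieval_recall_at_5 exactly this way is an assumption, and both function names are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the relevant ids found in the top-k ranked ids."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(
    runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5
) -> float:
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

With one relevant chunk per query, as on a small distinct-text corpus, a score of 0.7 means the right chunk was in the top five for 70% of queries.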
Failure modes to watch:
- RuntimeError: EmbeddingRetriever requires sentence-transformers (package not in requirements.txt).
- CUDA OOM: switch to a smaller embedding model (sentence-transformers/all-MiniLM-L6-v2) for the smoke and confirm the wiring before retrying jina-v3.
Smoke 3: Live GPU repair on a malformed table
Confirms the repair loop's GPU escalation path actually invokes the configured VLM and that the result is applied to the merged document.
Setup:
- In requirements.txt, uncomment transformers (sentence-transformers not needed for this smoke).
- Add configs/space_gpu_repair.yaml:

      parsers:
        docling:
          enabled: true
        pymupdf:
          enabled: true
      repair:
        enabled: true
        gpu_escalation: true
        execute_gpu_escalations: true  # the bit that flips the live path on
      gpu:
        backend: transformers
        models:
          table:
            model_id: Qwen/Qwen2.5-VL-3B-Instruct
            task: table-repair
            device: auto
            dtype: bfloat16

- Set ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml on the Space.
Trigger: upload a PDF that contains a table the parsers will likely mangle. A two-column financial statement page works well; if you don't have one handy, take a Wikipedia article PDF that has a comparison table.
Expected log lines (in order):
- parse_start.
- parser_candidate for docling and pymupdf (both should fire on a PDF).
- repair_iteration with iteration=1, gpu_task_count >= 1, gpu_dry_run=false.
- One gpu_task_executed record per GPU task. status should be executed and elapsed_seconds 1-10s for a 3B-param VLM on L4.
- A second repair_iteration with iteration=2 only if iteration 1 changed something and quality is still below threshold; otherwise the loop terminates.
- parse_end with repair_iterations >= 1.
Pass criteria:
- At least one gpu_task_executed with status=executed.
- The output parsed_document.json shows parsed.tables[i].provenance.gpu_repair_task_id set.
- No gpu_task_blocked records (would mean missing image_path or doc_id).
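The provenance criterion above can be checked without opening the JSON by hand. The sketch below assumes parsed_document.json holds a top-level "tables" list whose entries may carry a "provenance" object, which is how the parsed.tables[i].provenance.gpu_repair_task_id path reads. The actual layout may differ, and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def repaired_table_indices(document: dict[str, Any]) -> list[int]:
    """Indices of tables whose provenance carries a gpu_repair_task_id."""
    indices = []
    # Assumed layout: {"tables": [{"provenance": {"gpu_repair_task_id": ...}}]}
    for i, table in enumerate(document.get("tables", [])):
        provenance = table.get("provenance") or {}
        if provenance.get("gpu_repair_task_id"):
            indices.append(i)
    return indices


def load_document(path: str) -> dict[str, Any]:
    """Read a parsed_document.json artifact from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A non-empty result from repaired_table_indices(load_document("parsed_document.json")) indicates that at least one GPU repair was applied to the merged document.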
Failure modes to watch:
- All gpu_task_executed records show status=execution_failed: inspect the output.error field; common causes are missing image_path (the PDF doesn't render page crops because pdf.crop_tables=true isn't set) or a CUDA OOM.
- No repair_iteration records: the verifier didn't flag any blocking issues; pick a different input PDF.
Smoke 4: Per-parser ablation across docling + pymupdf
Confirms the ablation runner produces a comparison CSV and that each arm's artifacts are isolated. No GPU dependency, runs on default Space hardware.
Setup: default config, no requirements.txt changes.
Trigger: Space JupyterLab terminal:
zsgdp benchmark-ablate \
--input ./fixtures/pdfs \
--output ./out/ablation \
--parser docling --parser pymupdf
Expected log lines: one parse cycle per arm (parse_start through parse_end), three arms total (docling-only, pymupdf-only, merged).
Pass criteria:
- out/ablation/ablation_comparison.csv has 3 rows.
- Each arm's mean_quality_score is non-zero.
- The merged arm's mean_quality_score is >= max(per-parser arms).
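These three criteria are easy to check in a script. The sketch below assumes the CSV identifies each arm in a column named "arm" and labels the combined arm "merged". Only mean_quality_score is named in this checklist, so adjust the other names to match the real header. The function name check_ablation_csv is hypothetical.

```python
import csv
import io


def check_ablation_csv(csv_text: str, merged_arm: str = "merged") -> list[str]:
    """Return a list of problems; an empty list means the smoke passed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != 3:
        problems.append(f"expected 3 rows, found {len(rows)}")
    # "arm" column name is an assumption about the CSV header.
    scores = {row["arm"]: float(row["mean_quality_score"]) for row in rows}
    for arm, score in scores.items():
        if score == 0.0:
            problems.append(f"{arm}: mean_quality_score is zero")
    if merged_arm in scores:
        others = [s for a, s in scores.items() if a != merged_arm]
        if others and scores[merged_arm] < max(others):
            problems.append("merged arm scores below the best single-parser arm")
    else:
        problems.append(f"no row for arm {merged_arm!r}")
    return problems
```

Pass it the file contents, for example check_ablation_csv(open("out/ablation/ablation_comparison.csv").read()).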
Smoke 5: External parser CLI (Marker)
The riskiest of the four external adapters because Marker's argv schema has changed several times. Run it on its own Space deploy; do not bundle it with other smokes.
Setup:
- Uncomment marker-pdf in requirements.txt.
- Add configs/space_marker.yaml:

      parsers:
        text:
          enabled: false
        pymupdf:
          enabled: false
        marker:
          enabled: true
          timeout_seconds: 300
          output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
          extra_args: []

- Set ZSGDP_CONFIG_PATH=configs/space_marker.yaml.
Trigger: upload a small PDF (1-3 pages) via the Gradio UI.
Expected log lines:
- parse_start.
- parser_candidate for marker with non-zero element_count.
- parse_end with candidate_parsers=["marker"].
Pass criteria:
- No parser_failed record for marker.
- Output Markdown has reasonable content (open the artifact zip and check).
- If parser_failed fires, look at extra.error; the most common cause is argv schema drift. Tweak output_args in the config and retry.
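When tweaking output_args, it helps to see the argv a given template would produce. The sketch below shows one plausible expansion of the {output_dir} placeholder. How the adapter actually substitutes it is an assumption, and expand_args is a hypothetical helper rather than part of the project or of Marker.

```python
from typing import Sequence


def expand_args(templates: Sequence[str], output_dir: str) -> list[str]:
    """Substitute the {output_dir} placeholder in each argument template."""
    # str.replace (rather than str.format) leaves unrelated braces untouched.
    return [t.replace("{output_dir}", output_dir) for t in templates]
```

Printing expand_args(output_args, "/tmp/out") next to the argv shown in extra.error makes schema drift easier to spot.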
What "deployment ready" means after this checklist
If smokes 1-3 pass on a fresh duplicated Space, the project is genuinely deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4 and 5 are nice-to-have: the per-parser ablation works locally too, and external parsers stay flagged "experimental" until you actively need them.
Open the parsed_document.json from each smoke, copy the quality_score,
mean_layout_f1 (where applicable), and any §29-relevant metric into
README.md under a new "Production benchmark numbers" section. That
publishes evidence that the success criteria are met against real data.