zeroshotGPU / docs /space_smoke.md

Arjunvir Singh

Initial commit: zeroshotGPU MVP with full eval surface

db06ffa 20 days ago

	# Hugging Face Space smoke-test checklist

	This is the deferred deployment-readiness work that can only be exercised on
	real GPU hardware against real models / external CLIs. Run each smoke once
	against a duplicated `zeroshotGPU` Space (or your own dev Space). Each entry
	gives the exact env vars / config flips, the command to trigger, and the
	structured log lines you should expect.

	All log lines below assume the Space is run with `ZSGDP_LOG_LEVEL=INFO` and
	`ZSGDP_LOG_JSON=1`. `app.py` sets these automatically when `SPACE_ID` is in
	the environment, so on a normal Space you do not need to set them yourself.
	The JSON records are written to stderr and show up in the Space's Logs tab.
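
The automatic defaulting described above could look roughly like the sketch below. The helper name `apply_space_logging_defaults` is hypothetical, and whether the real `app.py` overrides existing values or only fills in missing ones is an assumption; this version only fills in missing ones.

```python
import os


def apply_space_logging_defaults(env):
    """Fill in structured-logging defaults when running inside a Space.

    `env` is any mutable mapping (os.environ in practice). Values that are
    already present are kept, so a variable set by hand in the Space
    settings is not overwritten. That precedence is an assumption.
    """
    if "SPACE_ID" in env:
        env.setdefault("ZSGDP_LOG_LEVEL", "INFO")
        env.setdefault("ZSGDP_LOG_JSON", "1")
    return env


# Typical call near the top of an entry point:
apply_space_logging_defaults(os.environ)
```

Using `setdefault` keeps a manually chosen `ZSGDP_LOG_LEVEL=DEBUG` intact while still enabling JSON output on a fresh Space.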

	---

	## Pre-flight

	1. Duplicate the Space and give it `l4x1` hardware.
	2. Make sure these are set in Space settings → Variables and secrets:
	   - `ZSGDP_LOG_LEVEL=INFO`
	   - `ZSGDP_LOG_JSON=1`
	   - (Optional, only for parser smokes that hit a private repo) `HF_TOKEN`.
	3. In the Space's `requirements.txt`, uncomment the dependency block matching
	   the smoke you are running. Do one smoke per Space deploy; combining
	   them risks an OOM or a slow cold start on the L4.
	4. Push and wait for the Space to build. A first-build cold start with a model
	   download takes ~5-10 minutes; subsequent restarts take seconds.

	After deploy, watch the Logs tab for the `parse_start` event. If you do
	not see structured JSON lines there, the logging config is not active —
	double-check `ZSGDP_LOG_JSON=1` in the Space variables.
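
To check a pasted chunk of the Logs tab for structured records, a small filter like the following may help. It is a sketch under one assumption: each record is a single-line JSON object whose event name sits under an `event` key. The key name is not confirmed by this document, so adjust it to the actual schema.

```python
import json


def parse_json_log_lines(text):
    """Return the JSON objects found in log text, skipping every other line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text output such as pip or download progress
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed line
        if isinstance(record, dict):
            records.append(record)
    return records


def has_event(records, name, key="event"):
    # "event" is an assumed field name for the record type.
    return any(record.get(key) == name for record in records)
```

If `has_event(records, "parse_start")` is false after an upload, the logging configuration is the first thing to re-check.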

	## Automated runner

	Each smoke below has an automated counterpart in
	`scripts/run_space_smoke.py`. From a Space JupyterLab terminal (or any
	shell with the project installed):

	```bash
	# Run all smokes whose deps are installed; skip the rest with hints:
	python -m scripts.run_space_smoke --output ./space_smoke_report.json

	# Run only specific smokes:
	python -m scripts.run_space_smoke --smoke lexical --smoke ablation

	# CI-strict mode: treat skipped smokes as failures (use after you've
	# uncommented the deps for the smoke you intend to run):
	python -m scripts.run_space_smoke --smoke embedding --strict
	```

	The runner reports `pass` / `fail` / `skip` / `error` per smoke, plus
	elapsed seconds and a `detail` block with the metrics it gathered. The
	manual procedure below is the fallback when you want to inspect the UI
	directly or test something the runner doesn't cover (e.g. uploading a
	specific real PDF rather than a synthetic fixture).
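
For a quick read of the runner's output, something like this sketch can tally the per-smoke statuses. The report layout is an assumption: a list of objects with `name` and `status` keys. The real `space_smoke_report.json` may be shaped differently, so treat the field names as hypothetical.

```python
from collections import Counter

FAILING_STATUSES = {"fail", "error"}


def summarize_report(results, strict=False):
    """Count statuses and list the smokes that should block a release.

    `results` is assumed to be a list of dicts with "name" and "status".
    With strict=True, skipped smokes count as failures, mirroring --strict.
    """
    counts = Counter(result["status"] for result in results)
    failing = FAILING_STATUSES | ({"skip"} if strict else set())
    blocked = [r["name"] for r in results if r["status"] in failing]
    return dict(counts), blocked
```

Load the file with `json.load` first and pass the list of per-smoke entries to `summarize_report`.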

	---

	## Smoke 1 — Lexical retriever benchmark (model-free)

	Confirms the Space's parsing + benchmark plumbing works end-to-end before
	adding any model dependency.

	Setup:
	- Default `requirements.txt` (no uncommenting needed).
	- Default config (no flips).

	Trigger: upload a small markdown file via the Gradio UI.

	Expected log lines (in order):
	- `parse_start` with `doc_id`, `file_type`, `device` (likely `cuda`).
	- One `parser_candidate` per parser that ran (typically `text`, possibly
	`pymupdf` and `docling` if the file was a PDF).
	- Possibly one or more `repair_iteration` records if quality < threshold.
	- `parse_end` with `quality_score`, `repair_iterations`, `chunk_count`.

	Pass criteria:
	- All log lines appear with `doc_id` populated.
	- `parse_end.quality_score >= 0.85` for a clean markdown doc.
	- No `parser_failed` or `gpu_task_blocked` records.
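
The pass criteria above can be checked mechanically once the log records are parsed into dicts. This is a sketch only; the `event` key is an assumed field name, while `doc_id` and `quality_score` come from the log lines described in this section.

```python
def check_smoke1(records, min_quality=0.85, key="event"):
    """Return a list of problems; an empty list means the smoke passed."""
    problems = []
    events = [record.get(key) for record in records]

    for required in ("parse_start", "parser_candidate", "parse_end"):
        if required not in events:
            problems.append(f"missing {required} record")

    # parse_start should be logged before parse_end.
    if "parse_start" in events and "parse_end" in events:
        if events.index("parse_start") > events.index("parse_end"):
            problems.append("parse_end appears before parse_start")

    for record in records:
        name = record.get(key)
        if not record.get("doc_id"):
            problems.append(f"{name} record has no doc_id")
        if name in ("parser_failed", "gpu_task_blocked"):
            problems.append(f"unexpected {name} record")
        if name == "parse_end":
            score = record.get("quality_score")
            if score is None or score < min_quality:
                problems.append(f"quality_score {score} is below {min_quality}")
    return problems
```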

	---

	## Smoke 2 — Embedding retriever (jina-embeddings-v3)

	Confirms `sentence-transformers` lazy-load path and that jina-v3 specifically
	runs on the L4 with `trust_remote_code=True`.

	Setup:
	- In `requirements.txt`, uncomment the `transformers` and
	`sentence-transformers` lines.
	- Add `configs/space_embedding.yaml` to the repo with:

	```yaml
	benchmarks:
	  retriever:
	    backend: embedding
	    model_id: jinaai/jina-embeddings-v3
	    task: retrieval.passage
	```

	- In `app.py` set `os.environ["ZSGDP_CONFIG_PATH"] = "configs/space_embedding.yaml"`,
	or pass via the env var configured in Space variables.

	Trigger: the benchmark CLI is not reachable from the Gradio UI today, so
	uploading a markdown file or PDF alone will not run the retriever benchmark.
	Instead, run `zsgdp benchmark --input ./fixtures --output ./out` from a
	Space JupyterLab session against a small input directory.

	Expected log lines:
	- First call: a 30–90s pause while jina-v3 weights download (no log lines
	during this — torch logs go to its own logger). Then `parse_start`.
	- After the first parse, subsequent calls are fast (model is in memory).

	Pass criteria:
	- Benchmark completes without an exception.
	- `summary["mean_retrieval_recall_at_5"] >= 0.7` on a small distinct-text
	corpus.
	- No `gpu_task_blocked` records (those are repair-related, not retrieval).
	- The `parse_end` record's `device` field reads `cuda`.
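
For reference, recall at 5 is conventionally the fraction of a query's relevant chunks that appear in the top 5 retrieved results, averaged over queries. The sketch below uses that standard definition; whether `mean_retrieval_recall_at_5` in this project is computed exactly this way is an assumption, and the function names are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant ids found in the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant id")
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(queries, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores)
```

With one relevant chunk per query, recall at 5 reduces to a hit rate: a mean of 0.7 means the right chunk was in the top 5 for 70% of queries.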

	Failure modes to watch:
	- `RuntimeError: EmbeddingRetriever requires sentence-transformers` →
	package not in `requirements.txt`.
	- CUDA OOM → switch to a smaller embedding model
	(`sentence-transformers/all-MiniLM-L6-v2`) for the smoke and confirm the
	wiring before retrying jina-v3.

	---

	## Smoke 3 — Live GPU repair on a malformed table

	Confirms the repair loop's GPU escalation path actually invokes the
	configured VLM and that the result is applied to the merged document.

	Setup:
	- In `requirements.txt`, uncomment `transformers` (`sentence-transformers`
	is not needed for this smoke).
	- Add `configs/space_gpu_repair.yaml`:

	```yaml
	parsers:
	  docling:
	    enabled: true
	  pymupdf:
	    enabled: true
	repair:
	  enabled: true
	  gpu_escalation: true
	  execute_gpu_escalations: true # the bit that flips the live path on
	gpu:
	  backend: transformers
	  models:
	    table:
	      model_id: Qwen/Qwen2.5-VL-3B-Instruct
	      task: table-repair
	      device: auto
	      dtype: bfloat16
	```

	- Set `ZSGDP_CONFIG_PATH=configs/space_gpu_repair.yaml` on the Space.

	Trigger: upload a PDF that contains a table the parsers will likely
	mangle. A two-column financial statement page works well; if you don't
	have one handy, take a Wikipedia article PDF that has a comparison table.

	Expected log lines (in order):
	- `parse_start`.
	- `parser_candidate` for docling and pymupdf (both should fire on a PDF).
	- `repair_iteration` with `iteration=1`, `gpu_task_count >= 1`,
	`gpu_dry_run=false`.
	- One `gpu_task_executed` record per GPU task. `status` should be
	`executed`, and `elapsed_seconds` should fall around 1-10 for a 3B-parameter
	VLM on the L4.
	- A second `repair_iteration` with `iteration=2` only if iteration 1
	changed something and quality is still below threshold; otherwise the
	loop terminates.
	- `parse_end` with `repair_iterations >= 1`.

	Pass criteria:
	- At least one `gpu_task_executed` with `status=executed`.
	- The output `parsed_document.json` shows `parsed.tables[i].provenance.gpu_repair_task_id` set.
	- No `gpu_task_blocked` records (would mean a missing `image_path` or `doc_id`).
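
To confirm the provenance criterion without scrolling through the JSON by hand, a helper along these lines could list which tables carry a repair task id. It is a sketch: the function name is made up, and it assumes `tables` is a list of objects sitting either at the top level of `parsed_document.json` or under a `parsed` key.

```python
import json


def repaired_table_indices(parsed_document):
    """Return indices of tables whose provenance has a gpu_repair_task_id."""
    # Accept both {"tables": [...]} and {"parsed": {"tables": [...]}}.
    root = parsed_document.get("parsed", parsed_document)
    tables = root.get("tables", [])
    return [
        index
        for index, table in enumerate(tables)
        if (table.get("provenance") or {}).get("gpu_repair_task_id")
    ]


def repaired_table_indices_from_file(path):
    with open(path, encoding="utf-8") as handle:
        return repaired_table_indices(json.load(handle))
```

An empty result after a run that logged `gpu_task_executed` suggests the repair output was produced but not merged back into the document.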

	Failure modes to watch:
	- All `gpu_task_executed` records show `status=execution_failed` →
	inspect the `output.error` field. Common causes are a missing `image_path`
	(page crops are not rendered because `pdf.crop_tables=true` isn't set) or
	a CUDA OOM.
	- No `repair_iteration` records → the verifier didn't flag any
	blocking issues; pick a different input PDF.
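
When triaging these failure modes, it can help to pull the error strings out of the log records in one pass. The sketch below assumes the records are already parsed into dicts, that the event name lives under an `event` key, and that `output` is a nested object with an `error` field; the last two are assumptions about the log schema.

```python
def triage_gpu_tasks(records, key="event"):
    """Return (number of executed tasks, list of error strings for failures)."""
    executed = 0
    errors = []
    for record in records:
        if record.get(key) != "gpu_task_executed":
            continue
        status = record.get("status")
        if status == "executed":
            executed += 1
        elif status == "execution_failed":
            output = record.get("output") or {}
            errors.append(output.get("error", "<no error field>"))
    return executed, errors
```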

	---

	## Smoke 4 — Per-parser ablation across docling + pymupdf

	Confirms the ablation runner produces a comparison CSV and that each arm's
	artifacts are isolated. It has no GPU dependency and runs on default Space
	hardware.

	Setup: default config, no `requirements.txt` changes.

	Trigger: Space JupyterLab terminal:

	```bash
	zsgdp benchmark-ablate \
	  --input ./fixtures/pdfs \
	  --output ./out/ablation \
	  --parser docling --parser pymupdf
	```

	Expected log lines: one parse cycle per arm (`parse_start` through
	`parse_end`), three arms total (docling-only, pymupdf-only, merged).

	Pass criteria:
	- `out/ablation/ablation_comparison.csv` has 3 rows.
	- Each arm's `mean_quality_score` is non-zero.
	- The merged arm's `mean_quality_score` is `>= max(per-parser arms)`.
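
These three checks are easy to script against the comparison CSV. The sketch below assumes a column named `arm` identifies each row and that the merged row is labelled `merged`; only `mean_quality_score` is named in this document, so the other two are hypothetical. The sample values are made up for illustration.

```python
import csv
import io


def check_ablation_csv(handle, arm_column="arm", merged_arm="merged"):
    """Return a list of problems found in an ablation comparison CSV."""
    rows = list(csv.DictReader(handle))
    problems = []
    if len(rows) != 3:
        problems.append(f"expected 3 rows, found {len(rows)}")
    scores = {row[arm_column]: float(row["mean_quality_score"]) for row in rows}
    for arm, score in scores.items():
        if score == 0.0:
            problems.append(f"{arm} has a zero mean_quality_score")
    if merged_arm not in scores:
        problems.append(f"no {merged_arm!r} arm found")
    else:
        singles = [s for arm, s in scores.items() if arm != merged_arm]
        if singles and scores[merged_arm] < max(singles):
            problems.append("merged arm scores below the best single-parser arm")
    return problems


# Illustrative data only, not real benchmark output.
SAMPLE = "arm,mean_quality_score\ndocling,0.81\npymupdf,0.77\nmerged,0.86\n"
print(check_ablation_csv(io.StringIO(SAMPLE)))  # prints []
```

For the real file, pass `open("out/ablation/ablation_comparison.csv", newline="")` instead of the `StringIO` sample.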

	---

	## Smoke 5 — External parser CLI (Marker)

	Marker is the riskiest of the four external adapters because its argv schema
	has changed several times. Give this smoke its own Space deploy; do not
	bundle it with other smokes.

	Setup:
	- Uncomment `marker-pdf` in `requirements.txt`.
	- Add `configs/space_marker.yaml`:

	```yaml
	parsers:
	  text:
	    enabled: false
	  pymupdf:
	    enabled: false
	  marker:
	    enabled: true
	    timeout_seconds: 300
	    output_args: ["--output_dir", "{output_dir}", "--output_format", "markdown"]
	    extra_args: []
	```

	- Set `ZSGDP_CONFIG_PATH=configs/space_marker.yaml`.

	Trigger: upload a small PDF (1–3 pages) via the Gradio UI.

	Expected log lines:
	- `parse_start`.
	- `parser_candidate` for `marker` with non-zero `element_count`.
	- `parse_end` with `candidate_parsers=["marker"]`.

	Pass criteria:
	- No `parser_failed` record for marker.
	- Output Markdown has reasonable content (open the artifact zip and check).

	Failure modes to watch:
	- `parser_failed` fires → look at `extra.error`. The most common cause is
	argv schema drift; tweak `output_args` in the config and retry.
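
When adjusting `output_args`, it helps to see what argv the template expands to. The sketch below assumes the `{output_dir}` placeholder is filled with plain `str.format` substitution; how the actual Marker adapter does this is not shown in this document, and `expand_args` is a hypothetical helper.

```python
def expand_args(template_args, **values):
    """Substitute {placeholders} in each argument of an argv template."""
    # Note: str.format treats any literal brace as a placeholder, so an
    # argument containing "{" or "}" would need doubling ("{{", "}}").
    return [arg.format(**values) for arg in template_args]


OUTPUT_ARGS = ["--output_dir", "{output_dir}", "--output_format", "markdown"]
print(expand_args(OUTPUT_ARGS, output_dir="/tmp/marker_out"))
```

Comparing the expanded list against the flags accepted by the installed Marker version is a quick way to spot schema drift before redeploying.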

	---

	## What "deployment ready" means after this checklist

	If smokes 1–3 pass on a fresh duplicated Space, the project is genuinely
	deployable for the Docling + PyMuPDF + Qwen2.5-VL-3B repair stack. Smokes 4
	and 5 are nice-to-have — the per-parser ablation works locally too, and
	external parsers stay flagged "experimental" until you actively need them.

	Open the `parsed_document.json` from each smoke, copy the `quality_score`,
	`mean_layout_f1` (where applicable), and any §29-relevant metric into
	`README.md` under a new "Production benchmark numbers" section. That
	publishes evidence that the success criteria are met against real data.