Arjunvir Singh

README: fill in production benchmark numbers from live Space validation

83059c2 11 days ago


title: zeroshotGPU
sdk: gradio
app_file: app.py
python_version: 3.11
suggested_hardware: l4x1
short_description: Agentic zero-shot document parser with GT metrics + repair.

Zero-Shot GPU Document Parser

A self-hosted parsing control plane that profiles documents, routes pages to parser experts, normalizes outputs, verifies quality with GT-comparison metrics, repairs weak regions through a bounded verify/repair loop (with optional GPU escalation), and emits auditable parsed-document artifacts plus strategy-aware chunks. Implements the project described in zero_shot_gpu_document_parser_project_spec.md.

The codebase is intentionally dependency-light by default. Text and Markdown work with the standard library; PyMuPDF, Docling, Marker, MinerU, olmOCR, PaddleOCR, and Unstructured plug in via optional extras. Live GPU repair (Qwen2.5-VL-3B) and the embedding retriever (jina-embeddings-v3) are gated behind explicit config flags so a fresh clone never silently downloads multi-gigabyte weights.
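
The bounded verify/repair loop mentioned above can be pictured roughly as follows. This is a minimal sketch, not the project's controller: `bounded_repair`, the `Verify`/`Repair` signatures, the `max_iterations` default, and the toy table document are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures for illustration only; the real verifier and
# repairer live in zsgdp/verify/ and zsgdp/repair/ and may differ.
Verify = Callable[[dict], List[str]]        # returns blocking issue codes
Repair = Callable[[dict, List[str]], dict]  # returns a repaired copy


def bounded_repair(doc: dict, verify: Verify, repair: Repair,
                   max_iterations: int = 3) -> Tuple[dict, List[str], int]:
    """Alternate verify and repair until clean or the budget is spent."""
    issues = verify(doc)
    iterations = 0
    while issues and iterations < max_iterations:
        doc = repair(doc, issues)
        iterations += 1
        issues = verify(doc)
    return doc, issues, iterations


# Toy stand-ins: a "document" whose only possible issue is a ragged table.
def toy_verify(doc: dict) -> List[str]:
    widths = {len(row) for row in doc["table"]}
    return ["invalid_table"] if len(widths) > 1 else []


def toy_repair(doc: dict, issues: List[str]) -> dict:
    width = max(len(row) for row in doc["table"])
    padded = [row + [""] * (width - len(row)) for row in doc["table"]]
    return {**doc, "table": padded}


fixed, remaining, n = bounded_repair(
    {"table": [["a", "b"], ["c"]]}, toy_verify, toy_repair
)
```

The iteration cap is what makes the loop bounded: a repair step that never converges still terminates, and whatever issues remain are returned to the caller instead of being retried indefinitely.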

Install

For the local MVP (text + PyMuPDF + Docling):

python -m pip install -e ".[pdf,yaml,docling,dev]"

Optional extras:

| Extra | Adds | Required for |
| --- | --- | --- |
| `embedding` | `sentence-transformers`, `transformers` | `benchmarks.retriever.backend=embedding` |
| `gpu_repair` | `transformers` | `repair.execute_gpu_escalations=true` |
| `spaces` | mirrors `requirements.txt` for HF Spaces parity | running `app.py` locally as a Space simulant |

External parser CLIs (Marker, MinerU, olmOCR, PaddleOCR) install separately; configure each via parsers.<name>.command, output_args, and extra_args in your YAML config.

Secrets:

cp .env.example .env
# Set HF_TOKEN if you'll use gated models (jina-embeddings-v3, private repos).

.env is gitignored. The CLI and app.py load it on startup; pre-set environment variables (e.g. Space-side secrets) always win.
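
The precedence rule above (values already in the environment beat `.env`) can be sketched with the standard library alone. This is an illustrative stand-in, not the loader the CLI actually uses; `load_dotenv_file` is a hypothetical name and the parsing is deliberately simplistic (no multi-line values, no `export` prefix).

```python
import os
from pathlib import Path


def load_dotenv_file(path: str = ".env") -> dict:
    """Load KEY=VALUE pairs from a file without overriding existing variables."""
    applied = {}
    env_path = Path(path)
    if not env_path.is_file():
        return applied
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Pre-set variables (e.g. Space-side secrets) always win.
        if key and key not in os.environ:
            os.environ[key] = value
            applied[key] = value
    return applied
```

The function returns only the keys it actually applied, which makes it easy to log what came from `.env` versus what was already set.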

Quick start

Parse one document or a folder

python -m zsgdp.cli parse --input ./docs/sample.md --output ./out/sample
python -m zsgdp.cli parse-folder --input ./docs --output ./parsed --workers 4
python -m zsgdp.cli parse --input ./docs/report.pdf --output ./out/report --config configs/docling.yaml

Each parse writes a full artifact bundle. parsed_document.json is the canonical record; chunks.jsonl is the retrieval-ready output; quality_report.json carries every metric the verifier computed.
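
A downstream consumer might load the three files named above along these lines. This is a sketch that assumes only what the paragraph states (two JSON files and one JSONL file in the output directory); `read_jsonl` and `load_bundle` are hypothetical helpers, and no field names inside the files are assumed.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_bundle(output_dir: str) -> dict:
    """Read the canonical record, the quality report, and the chunk stream."""
    root = Path(output_dir)
    return {
        "document": json.loads(
            (root / "parsed_document.json").read_text(encoding="utf-8")
        ),
        "quality": json.loads(
            (root / "quality_report.json").read_text(encoding="utf-8")
        ),
        "chunks": list(read_jsonl(root / "chunks.jsonl")),
    }
```

For example, `load_bundle("./out/sample")["chunks"]` would give the retrieval-ready chunks from the first parse command above as a list of dictionaries.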

Run a benchmark

# Custom corpus, no GT — runs every metric that doesn't need labels:
python -m zsgdp.cli benchmark --input ./docs --output ./bench

# Labelled datasets — adds layout F1 / table structure / formula CER:
python -m zsgdp.cli benchmark --input ./omnidocbench --dataset omnidocbench --output ./bench/omni
python -m zsgdp.cli benchmark --input ./doclaynet --dataset doclaynet --output ./bench/doclay
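
For readers unfamiliar with the formula CER metric named above, character error rate is conventionally the Levenshtein edit distance between the predicted and ground-truth strings, divided by the ground-truth length. The sketch below shows that textbook definition; whether the project normalizes whitespace or LaTeX tokens before comparing is not stated here, so treat it as an approximation of what the benchmark reports.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length (lower is better)."""
    if not reference:
        # Assumption: an empty reference scores 0 only for an empty hypothesis.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

Because lower is better for CER, a value of 0 on a labelled dataset means a perfect match, which is a different situation from a 0 reported when no ground truth was available.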

Compare parsers (ablation)

python -m zsgdp.cli benchmark-ablate \
  --input ./docs --output ./bench/ablation \
  --parser docling --parser pymupdf --parser text

Runs the benchmark once per parser plus a merged arm; emits ablation_comparison.csv.

Compare across datasets

python -m zsgdp.cli combine-benchmarks \
  --input ./bench/omni --label omnidocbench \
  --input ./bench/doclay --label doclaynet \
  --output ./bench/cross

Emits dataset_summary.csv and parser_matrix.csv (parser × dataset).

Before pushing to a Space — preflight

make preflight              # unit + regression + space-check + parsers (~10s)
make preflight-full         # ...plus an end-to-end benchmark smoke

A green preflight is the local signal that the branch is ready for the Space. Pre-commit and pre-push hooks (see CONTRIBUTING.md) make this automatic on every git push.

On the Space — smoke validation

Once deployed, exercise the deferred GPU/model paths:

make space-smoke            # runs whichever of the 5 smokes have their dependencies installed
python -m scripts.run_space_smoke --strict --output ./space_smoke.json

See docs/space_smoke.md for the manual fallback procedure (real PDF uploads, full Marker parses) and per-smoke acceptance criteria.

Opt-ins

Embedding retriever

Default retriever is lexical TF-IDF (zero deps). To use a real embedder:

# configs/myrun.yaml
benchmarks:
  retriever:
    backend: embedding
    model_id: jinaai/jina-embeddings-v3   # or any sentence-transformers model
    task: retrieval.passage

python -m pip install -e ".[embedding]"
python -m zsgdp.cli benchmark --input ./docs --output ./bench --config configs/myrun.yaml

The first call lazy-loads the model; subsequent calls reuse it in-process. Set HF_TOKEN in .env for gated models.
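
The lazy-load-then-reuse behaviour described above is commonly implemented as a process-level cache keyed by model ID. The following is a sketch of that pattern, not the project's code: `get_model` and `load_sentence_transformer` are hypothetical names, and passing `trust_remote_code=True` is an assumption about what the chosen model needs.

```python
import threading
from typing import Any, Callable, Dict

_CACHE: Dict[str, Any] = {}
_LOCK = threading.Lock()


def get_model(model_id: str, loader: Callable[[str], Any]) -> Any:
    """Load a model on first use, then return the cached instance."""
    with _LOCK:
        if model_id not in _CACHE:
            _CACHE[model_id] = loader(model_id)
        return _CACHE[model_id]


def load_sentence_transformer(model_id: str) -> Any:
    # Imported inside the function so a default install never needs the
    # optional "embedding" extra unless this loader is actually called.
    from sentence_transformers import SentenceTransformer

    # Assumption: the model ships custom code and needs trust_remote_code.
    return SentenceTransformer(model_id, trust_remote_code=True)
```

A caller would use `get_model("jinaai/jina-embeddings-v3", load_sentence_transformer)`; only the first call pays the download and load cost, and the lock keeps concurrent first calls from loading the weights twice.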

Live GPU repair

The repair controller plans GPU tasks for verification failures (invalid tables, OCR coverage gaps, reading-order issues, missing figure captions). By default these are dry-run only. To execute:

# configs/myrun.yaml
repair:
  gpu_escalation: true
  execute_gpu_escalations: true     # invokes the configured backend
gpu:
  backend: transformers              # or "vllm" for OpenAI-compat
  models:
    table:
      model_id: Qwen/Qwen2.5-VL-3B-Instruct

Each executed task writes its output back into the merged document with a gpu_repair_task_id provenance field.
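
One way to read the two flags in the config above is as a three-state switch. The sketch below encodes that reading; the function name `gpu_repair_mode`, the mode strings, and the assumption that both flags default to off are illustrative guesses, not documented behaviour.

```python
def gpu_repair_mode(config: dict) -> str:
    """Map the repair flags onto 'disabled', 'dry_run', or 'execute'."""
    repair = config.get("repair", {})
    # Assumption: without gpu_escalation, no GPU tasks are planned at all.
    if not repair.get("gpu_escalation", False):
        return "disabled"
    # Assumption: planning without execute_gpu_escalations is a dry run.
    if not repair.get("execute_gpu_escalations", False):
        return "dry_run"
    return "execute"
```

Keeping execution behind a second, explicit flag matches the stated goal that a fresh clone never downloads large model weights by accident.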

Outputs

Every parse writes:

parsed_document.json — canonical record (carries schema_version).
document.md — human-readable Markdown reconstruction.
elements.jsonl / tables.jsonl / figures.jsonl / chunks.jsonl — JSONL streams.
chunking_plan.json — strategy ladder + per-strategy metadata.
parser_metrics.json — per-parser candidate-level stats.
quality_report.json — every verifier metric (text coverage, reading order, table validity, parser disagreement, repair resolution/regression rates, GT-comparison metrics when applicable).
routing_report.json — page → parser routing decisions.
profile.json — document profiler output.
gpu_runtime.json — detected GPU/device state at parse time.
gpu_tasks.jsonl (when model-backed work is planned) and gpu_task_report.json (preflight validation).
conflict_report.json (when multiple parsers ran).
artifact_manifest.json with SHA-256 checksums and the parsed-document schema version.
assets/pages/*.png, assets/tables/*.png, assets/figures/*.png — rendered PDF page and region crops.
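
Since the manifest carries SHA-256 checksums, a consumer can verify a bundle before trusting it. The internal layout of artifact_manifest.json is not documented here, so this sketch takes a plain mapping of relative path to hex digest that the caller extracts from the manifest; `sha256_of` and `find_mismatches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large assets do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(root: str, expected: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose digest differs."""
    base = Path(root)
    bad = []
    for rel, want in sorted(expected.items()):
        target = base / rel
        if not target.is_file() or sha256_of(target) != want.lower():
            bad.append(rel)
    return bad
```

An empty list from `find_mismatches` means every listed artifact is present and unmodified.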

Benchmark runs additionally write:

results.json — full structured summary including aggregate means.
leaderboard.csv and per_parser_gt_leaderboard.csv — parser leaderboards (without and with GT comparison).
per_parser_metrics.csv — per-document, per-parser GT-comparison breakdown.
layout_runs.csv, table_structure_runs.csv, formula_runs.csv, retrieval_runs.csv, repair_runs.csv — per-document detail per metric family.
parser_runs.csv, chunk_runs.csv, structure_runs.csv, chunk_quality.csv, throughput_runs.csv, ablations.json — additional detail.

benchmark-ablate adds ablation_comparison.csv. combine-benchmarks adds dataset_summary.csv, parser_matrix.csv, and cross_dataset_comparison.json.

Architecture map

| Module | Responsibility |
| --- | --- |
| `zsgdp/profiling/` | Cheap per-page features (scanned-score, table density, columns, etc.) |
| `zsgdp/routing/` | Deterministic page → parser-expert decisions with budget |
| `zsgdp/parsers/` | Adapters; one canonical schema regardless of source |
| `zsgdp/normalize/` | Convert each parser's output into the schema |
| `zsgdp/merge/` | Align candidates, dedupe, detect conflicts |
| `zsgdp/verify/` | Coverage / reading order / table / figure / formula / chunk readiness, plus GT-comparison: layout F1, table structure, formula CER, retrieval recall, parser disagreement, repair success |
| `zsgdp/repair/` | Deterministic header/table fixes plus GPU escalation through `gpu/worker.py` |
| `zsgdp/chunking/` | Agentic planner + structure / parent-child / table / figure / page chunkers, with semantic / late / vision / proposition deterministic stubs |
| `zsgdp/gpu/` | Task planning, batching, dry-run worker, transformers + vLLM clients |
| `zsgdp/benchmarks/` | Dataset loaders, metric runners, ablation, cross-dataset, retrieval |
| `zsgdp/cli.py` | All entry points |
| `app.py` | Gradio Space UI |

The full spec is in zero_shot_gpu_document_parser_project_spec.md. §10 (schema) and §17 (chunking ladder) are the most useful sections to skim before touching those modules.

Production benchmark numbers

Spec §29 success criteria for reference:

MVP: full agentic loop improves table QA by ≥20% over best single parser; agentic chunking improves citation accuracy by ≥10% over recursive baseline.
Production-style (HR / financial reports / etc.): retrieval recall@5 ≥ 90%, citation accuracy ≥ 90%, table QA exactness ≥ 85%, manual review rate ≤ 10%, parser blocking failure rate ≤ 5%.

Smoke validation — `arjun10g/zeroshotGPU` ZeroGPU Space, `2026-05-07`

Ran via /gradio_api/call/run_smokes_in_space against commit de03f34.

| Smoke | Status | Detail |
| --- | --- | --- |
| `lexical` | pass | `mean_quality_score=0.972`, `mean_retrieval_recall_at_1=1.0`, n=1 doc |
| `ablation` | pass | 3 arms (text / pymupdf / merged), comparison CSV emitted |
| `embedding` | pass | `mean_retrieval_recall_at_5=1.0`, `mean_retrieval_recall_at_1=1.0`, 17.06s on A10G |
| `gpu_repair` | pass | wiring verified: `dry_run_task_count=1`, `repair_iterations=1` |
| `marker` | skip | `marker_single` not installed (expected; opt-in via `requirements.txt`) |

Benchmark — regression fixtures, `2026-05-07`

Ran via /gradio_api/call/run_benchmark_in_space against commit abadf36.

| Metric | Value | Notes |
| --- | --- | --- |
| `document_count` | 1 | `markdown_basic.input.md` only |
| `mean_quality_score` | 0.964 | passes the §29 quality bar trivially on this fixture |
| `mean_layout_f1` | 0.000 | no GT for `custom_folder` dataset |
| `mean_table_structure_score` | 0.000 | no GT |
| `mean_formula_cer` | 0.000 | no GT |
| `mean_retrieval_recall_at_1` | 0.000 | first-rank miss; truth at rank 3 (single-doc corpus) |
| `mean_retrieval_recall_at_5` | 1.000 | found within top-5 |
| `mean_retrieval_mrr` | 0.333 | implies truth rank ≈ 3 |
| `mean_parser_disagreement_rate` | 0.000 | only one parser (`text`) ran on `.md` |
| `mean_repair_resolution_rate` | 1.000 | every blocking issue pre-repair was resolved |
| `mean_repair_regression_rate` | 0.000 | repair introduced no new blocking issues |

Reading the numbers: this fixture-only benchmark validates the metric pipeline end-to-end on production hardware. The zeros for mean_layout_f1, mean_table_structure_score, and mean_formula_cer are placeholders, not measurements: the custom_folder dataset has no GT, and those metrics only activate when running against omnidocbench or doclaynet. To benchmark a real corpus, drop documents into a Space-side directory and call run_benchmark_in_space (or run python -m zsgdp.cli benchmark --input ./corpus --output ./bench from a Pro-tier Dev Mode terminal).
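
The three retrieval rows in the table are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses the standard definitions of recall@k and reciprocal rank for a single query; the chunk IDs are hypothetical placeholders, and only the rank of 3 is taken from the table.

```python
from typing import Optional, Sequence


def rank_of(truth: str, ranked: Sequence[str]) -> Optional[int]:
    """1-based position of the relevant item, or None if it was not retrieved."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == truth:
            return position
    return None


def recall_at_k(rank: Optional[int], k: int) -> float:
    """1.0 if the relevant item appears in the top k, else 0.0 (single query)."""
    return 1.0 if rank is not None and rank <= k else 0.0


def reciprocal_rank(rank: Optional[int]) -> float:
    """1/rank, or 0.0 when the relevant item was never retrieved."""
    return 0.0 if rank is None else 1.0 / rank


# Hypothetical chunk IDs; the relevant chunk sits at rank 3 as in the table.
ranked = ["chunk-a", "chunk-b", "chunk-c", "chunk-d"]
rank = rank_of("chunk-c", ranked)
```

With the truth at rank 3, `recall_at_k(rank, 1)` is 0.0, `recall_at_k(rank, 5)` is 1.0, and `reciprocal_rank(rank)` is 1/3, which rounds to the 0.333 reported above. With a single query, the means equal these per-query values.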

Live GPU repair invocation — verified, deferred end-to-end

In live GPU repair mode, the pipeline loads configs/live_gpu_repair.yaml, runs the iterative verify/repair loop, detects the CUDA runtime, and dispatches GPU tasks when needed. On the committed test fixtures, the deterministic markdown-table normalizer resolves invalid-table issues without needing the GPU: repair_resolution_rate=1.0 after one iteration, with zero GPU tasks executed.
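
To make the repair rates concrete, here is one plausible way to compute them from the sets of blocking issues seen before and after repair. The denominators and the empty-set conventions are assumptions made for this sketch; the project's verifier may define the rates differently. The issue ID format is also hypothetical.

```python
from typing import Set, Tuple


def repair_rates(before: Set[str], after: Set[str]) -> Tuple[float, float]:
    """Return (resolution_rate, regression_rate) for one document."""
    resolved = before - after     # blocking issues that repair removed
    introduced = after - before   # blocking issues that repair created
    # Assumption: with nothing to fix, resolution counts as complete.
    resolution = len(resolved) / len(before) if before else 1.0
    # Assumption: regression is the share of post-repair issues that are new.
    regression = len(introduced) / len(after) if after else 0.0
    return resolution, regression


# One invalid-table issue before repair, none afterwards.
rates = repair_rates({"invalid_table:page1"}, set())
```

In this example `rates` is `(1.0, 0.0)`, the same shape as the fixture result: the single blocking issue was resolved and nothing new was introduced.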

Triggering the live Qwen2.5-VL-3B invocation for real requires either a PDF whose bbox-aware table extraction PyMuPDF can't deterministically normalize, or a config that explicitly skips repair.table_repair so the GPU path is the only resolution route. Both are tractable follow-ups once a real malformed PDF is available; the production wiring is confirmed up to that point.

Deployment

Targeted: Hugging Face Spaces, hardware l4x1, GPU/model target zeroshotGPU.

Pre-deploy gate:

1. make preflight (local).
2. make preflight-full (local, with the end-to-end benchmark smoke).
3. Duplicate the Space, then set HF_TOKEN and any other secrets under Variables and secrets.
4. Push.
5. Run make space-smoke from the Space's JupyterLab terminal.
6. If the runner-level wiring smoke passed but you want full model-invocation validation, follow Smoke 3 (live GPU repair) in docs/space_smoke.md manually.
7. Run python -m zsgdp.cli benchmark against your representative corpus and update the table above.

The Space defaults to configs/docling.yaml (Docling + PyMuPDF co-enabled so the parser disagreement rate has signal). Override via ZSGDP_CONFIG_PATH in Space variables for custom configs.
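
The override described above amounts to an environment lookup with a fallback. The sketch below assumes nothing beyond the variable name and default path given in the text; `resolve_config_path` is a hypothetical helper, and treating a blank value as unset is a choice made for this example.

```python
import os
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = "configs/docling.yaml"


def resolve_config_path(environ: Optional[Mapping[str, str]] = None) -> str:
    """Prefer ZSGDP_CONFIG_PATH when set and non-blank, else the default."""
    env = os.environ if environ is None else environ
    override = env.get("ZSGDP_CONFIG_PATH", "").strip()
    return override or DEFAULT_CONFIG_PATH
```

Accepting the environment as a parameter keeps the function easy to test without mutating the process environment.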

Contributing

See CONTRIBUTING.md for setup, hooks, test layout, fixture format, parser/metric/schema-bump procedures, and the PR checklist.

For changes touching the on-disk schema, bump zsgdp.schema.SCHEMA_VERSION and add an entry under ### Schema in CHANGELOG.md. The artifact manifest surfaces the version under parsed_document_schema_version so downstream consumers can gate on it.
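
A downstream consumer could gate on the schema version roughly as follows. The key name parsed_document_schema_version comes from the text above; everything else is an assumption for this sketch, including the helper name `check_schema_version`, comparing versions as strings, and supplying the supported set from the caller.

```python
import json
from pathlib import Path
from typing import Iterable


def check_schema_version(manifest_path: str, supported: Iterable[str]) -> str:
    """Return the manifest's schema version, or raise if it is unsupported."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    # Compared as strings so integer and dotted versions are handled alike.
    version = str(manifest.get("parsed_document_schema_version", ""))
    allowed = {str(item) for item in supported}
    if version not in allowed:
        raise ValueError(
            f"unsupported parsed-document schema version {version!r}; "
            f"expected one of {sorted(allowed)}"
        )
    return version
```

Failing fast on an unknown version lets a consumer refuse a bundle before it reads parsed_document.json with the wrong expectations.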