Spaces: arjun10g/zeroshotGPU

Arjunvir Singh committed 26 days ago

Commit e9fcd5d (1 parent: 9065d0f)

Add Help tab to Gradio UI + source link

- New 'Help' tab with project overview, pipeline-mode comparison
table, per-tab explanations, API surface examples (curl), config
knob list, known-limits section, and a source link.
- Top-level instructions point at the Help tab.
- Source-on-HF badge link added near the title for visibility.
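
The curl examples this commit adds to the Help tab follow Gradio's two-step call pattern: a POST to `/gradio_api/call/<api_name>` returns an event ID, and a GET on `/gradio_api/call/<api_name>/<event_id>` streams the result as server-sent events. Below is a minimal Python sketch of that pattern using only the standard library. The helper names (`build_call_request`, `parse_sse_data`) are illustrative and not part of the project, and the sample stream is made up to show the parsing step.

```python
import json
from urllib import request

BASE_URL = "https://arjun10g-zeroshotgpu.hf.space"  # host used in the Help text


def build_call_request(api_name: str, data: list) -> request.Request:
    """Build the POST that starts a call; mirrors the curl examples."""
    return request.Request(
        url=f"{BASE_URL}/gradio_api/call/{api_name}",
        data=json.dumps({"data": data}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_data(stream_text: str) -> list:
    """Collect JSON payloads from `data:` lines that follow `event: complete`."""
    results, event = [], None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:") and event == "complete":
            results.append(json.loads(line.split(":", 1)[1].strip()))
    return results


req = build_call_request("run_smokes_in_space", [])
# Sending the request needs network access, so it is left commented out:
# event_id = json.load(request.urlopen(req))["event_id"]
# Results would then stream from f"{req.full_url}/{event_id}".

sample = 'event: complete\ndata: [{"passed": 3}]\n'  # hypothetical stream
print(req.full_url)
print(parse_sse_data(sample))
```

Keeping request construction separate from the network call makes the payload shape easy to check offline before pointing it at the live Space.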

Files changed (1)

app.py +109 -1
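
The `_HELP_MARKDOWN` block added to `app.py` is an f-string, which is why its curl examples double their braces and backslashes: `{{` and `\\` render as literal `{` and `\`, while expressions such as `{MAX_UPLOAD_BYTES // (1024 * 1024)}` are evaluated. A small sketch of how that rendering behaves follows. The cap values and the `example.invalid` URL are placeholders for illustration, not the project's actual settings.

```python
# Placeholder values for illustration only; the real constants live in app.py.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024
MAX_PAGE_COUNT = 200

snippet = f"""
Per-file cap: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / {MAX_PAGE_COUNT} pages.
curl -X POST https://example.invalid/call \\
  -d '{{"data": [{{file_data}}, "Default lightweight"]}}'
"""

lines = snippet.strip().splitlines()
print(lines[0])  # interpolated caps
print(lines[1])  # ends with a single backslash (shell line continuation)
print(lines[2])  # braces are literal, so {file_data} stays a placeholder
```

The rendered text therefore still contains `{file_data}` verbatim, so a reader copying the curl command has to substitute a real file payload themselves.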

@@ -673,6 +673,110 @@ def run_benchmark_in_space() -> dict:
     return headline
 with gr.Blocks(title="zeroshotGPU") as demo:
     gr.Markdown(
         "# zeroshotGPU\n\n"
@@ -682,7 +786,9 @@ with gr.Blocks(title="zeroshotGPU") as demo:
         "chunks (multi-strategy), a quality report with GT-comparison metrics "
         "where applicable, and a SHA-256-checksummed artifact manifest. "
         f"Per-file caps: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / "
-        f"{MAX_PAGE_COUNT} pages. Batch cap: {MAX_BATCH_DOCS} docs per request."
     )
     with gr.Row():
         upload = gr.File(
@@ -700,6 +806,8 @@ with gr.Blocks(title="zeroshotGPU") as demo:
             parse_button = gr.Button("Parse", variant="primary")
             archive = gr.File(label="Artifacts (zip)")
     with gr.Tabs():
         with gr.Tab("Markdown"):
             gr.Markdown(
                 "_Canonical markdown reconstruction of the parsed document. "

     return headline
+_HELP_MARKDOWN = f"""
+## What this is
+**zeroshotGPU** is an agentic document-parsing control plane. It does not rely
+on a single extraction engine — it profiles each document, routes pages to the
+best parser expert (Docling, PyMuPDF, optionally Marker / MinerU / olmOCR /
+PaddleOCR / Unstructured), normalizes outputs into a canonical schema, verifies
+quality, repairs weak regions through a bounded verify/repair loop (with
+optional GPU escalation), and emits retrieval-ready chunks with provenance.
+## How to use this Space
+**1. Pick a pipeline mode.**
+| Mode | What it does |
+|---|---|
+| `Docling + PyMuPDF` | Default. Runs both parsers so the parser-disagreement metric has a comparison surface. Good for general-purpose parsing. |
+| `Default lightweight` | Text + PyMuPDF only. Fastest. Use when you just need clean text extraction. |
+| `Live GPU repair` | Enables `repair.execute_gpu_escalations=true`. Verification failures (invalid tables, OCR coverage gaps, reading-order issues, missing figure captions) are dispatched to Qwen2.5-VL-3B on the GPU. Slower, and the GPU path runs only for issues that deterministic repair cannot resolve (markdown tables are normally fixed deterministically before it fires). |
+**2. Upload one or more documents.** Accepts `.pdf`, `.md`, `.txt`, `.html`,
+or a `.zip` of any of those. Multi-file selection works. Per-file cap:
+{MAX_UPLOAD_BYTES // (1024 * 1024)} MB / {MAX_PAGE_COUNT} pages. Batch cap:
+{MAX_BATCH_DOCS} docs per request.
+**3. Click Parse.** Watch the progress bar; the first call may take longer if a
+model has to download.
+## What each tab shows
+- **Markdown** — canonical reconstruction of the parsed document. For batch
+  uploads, this shows the first document; the full set is in the artifacts zip.
+- **Run** — summary, quality report, parser metrics, and artifact manifest
+  validation. For batch uploads, `Summary.batch` lists every document parsed
+  in the request with its headline metrics + an aggregate block.
+- **Chunks** — per-strategy chunk breakdown: total / parent / child / table-linked
+  / figure-linked / visual-context counts, plus per-strategy blocks with token
+  count distribution (min/median/max) and 3 sample chunks per strategy with
+  240-char previews.
+- **Artifacts** — each top-level artifact (`parsed_document.json`, `chunks.jsonl`,
+  `quality_report.json`, etc.) downloadable individually. Nested asset crops
+  (page renders, table images) stay bundled in the zip above.
+- **Runtime** — detected GPU runtime, planned GPU tasks, preflight report.
+- **Smokes** — runs the project's smoke validation suite in-Space; reports
+  per-smoke pass/fail/skip + detail. API: `/gradio_api/call/run_smokes_in_space`.
+- **Benchmark** — two modes: against committed regression fixtures, OR against
+  an uploaded corpus you supply. Returns headline metrics (quality score,
+  retrieval recall, repair resolution rate, etc.) plus a per-doc breakdown.
+  API: `/gradio_api/call/run_benchmark_in_space` and `/gradio_api/call/run_benchmark_on_upload`.
+## API surface
+Every button is also a Gradio API endpoint, so AI agents and downstream tooling
+can invoke them programmatically. Discovery: `agents.md` at the Space root
+returns the calling instructions; `/gradio_api/info` returns the full schema.
+```bash
+# Parse a doc:
+curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/parse_uploaded_document \\
+  -H "Content-Type: application/json" \\
+  -d '{{"data": [{{file_data}}, "Default lightweight"]}}'
+# Run smokes:
+curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/run_smokes_in_space \\
+  -H "Content-Type: application/json" -d '{{"data": []}}'
+# Benchmark:
+curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/run_benchmark_in_space \\
+  -H "Content-Type: application/json" -d '{{"data": []}}'
+```
+## Configuration
+Defaults work out of the box. To change behavior, set Space variables:
+- `ZSGDP_CONFIG_PATH` — point at one of `configs/default.yaml`, `configs/docling.yaml`, `configs/live_gpu_repair.yaml`, or your own committed YAML.
+- `ZSGDP_LOG_LEVEL` — `INFO` (default on Spaces), `DEBUG`, `WARNING`, etc.
+- `ZSGDP_LOG_JSON` — `1` (default on Spaces) for one-line JSON log records.
+- `ZSGDP_MAX_UPLOAD_BYTES` / `ZSGDP_MAX_PAGE_COUNT` / `ZSGDP_MAX_BATCH_DOCS` — abuse guards.
+- `HF_TOKEN` — required for gated models (jina-embeddings-v3 may need it).
+## Known limits
+- **ZeroGPU duration cap.** Each `@spaces.GPU`-decorated call runs in a 60s
+  GPU slot. First-call cold-start for big models (Qwen2.5-VL-3B is ~6 GB)
+  exceeds this on a clean cache. Subsequent calls reuse the cached weights
+  and fit comfortably.
+- **Live GPU repair** only fires when the deterministic repair path can't
+  resolve an issue. For markdown tables, the deterministic normalizer
+  handles most malformations before GPU dispatch is needed.
+- **GT-comparison metrics** (layout F1, table structure score, formula CER)
+  require labelled datasets (`omnidocbench`, `doclaynet`). Uploaded
+  custom corpora produce every GT-free metric but not those three.
+## Source
+[![View source on Hugging Face](https://img.shields.io/badge/HF%20Space-arjun10g%2FzeroshotGPU-blue)](https://huggingface.co/spaces/arjun10g/zeroshotGPU)
+The full project source — including the multi-step spec, contributor docs,
+and 250+ unit tests — is at the link above. The `Files` tab on the Space
+page shows the live deploy.
+"""
 with gr.Blocks(title="zeroshotGPU") as demo:
     gr.Markdown(
         "# zeroshotGPU\n\n"
         "chunks (multi-strategy), a quality report with GT-comparison metrics "
         "where applicable, and a SHA-256-checksummed artifact manifest. "
         f"Per-file caps: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / "
+        f"{MAX_PAGE_COUNT} pages. Batch cap: {MAX_BATCH_DOCS} docs per request. "
+        "**See the [Help] tab for full instructions.**\n\n"
+        "[Source on Hugging Face](https://huggingface.co/spaces/arjun10g/zeroshotGPU)"
     )
     with gr.Row():
         upload = gr.File(
             parse_button = gr.Button("Parse", variant="primary")
             archive = gr.File(label="Artifacts (zip)")
     with gr.Tabs():
+        with gr.Tab("Help"):
+            gr.Markdown(_HELP_MARKDOWN)
         with gr.Tab("Markdown"):
             gr.Markdown(
                 "_Canonical markdown reconstruction of the parsed document. "