Arjunvir Singh committed on
Commit
e9fcd5d
·
1 Parent(s): 9065d0f

Add Help tab to Gradio UI + source link

- New 'Help' tab with project overview, pipeline-mode comparison
table, per-tab explanations, API surface examples (curl), config
knob list, known-limits section, and a source link.
- Top-level instructions point at the Help tab.
- Source-on-HF badge link added near the title for visibility.
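
The curl examples mentioned above map onto plain HTTP POSTs. A minimal Python sketch of the same request shape follows, assuming the endpoint paths and the `{"data": [...]}` payload shown in the Help text. `build_call_request` is a hypothetical helper for illustration, not part of `app.py`, and nothing is sent over the network until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Base URL taken from the curl examples in the Help text.
SPACE_URL = "https://arjun10g-zeroshotgpu.hf.space"


def build_call_request(api_name: str, data: list) -> urllib.request.Request:
    """Build (but do not send) a POST to /gradio_api/call/<api_name>.

    Hypothetical helper; it mirrors the curl examples in the Help tab.
    """
    body = json.dumps({"data": data}).encode("utf-8")
    return urllib.request.Request(
        url=f"{SPACE_URL}/gradio_api/call/{api_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The smokes endpoint takes no inputs, so the payload is an empty list.
smokes_request = build_call_request("run_smokes_in_space", [])
print(smokes_request.get_method(), smokes_request.full_url)
```

The exact input list each endpoint expects is best confirmed against `/gradio_api/info`, which the Help text names as the source of the full schema.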

Files changed (1)
  1. app.py +109 -1
app.py CHANGED
@@ -673,6 +673,110 @@ def run_benchmark_in_space() -> dict:
673
  return headline
674
 
675
 
676
  with gr.Blocks(title="zeroshotGPU") as demo:
677
  gr.Markdown(
678
  "# zeroshotGPU\n\n"
@@ -682,7 +786,9 @@ with gr.Blocks(title="zeroshotGPU") as demo:
682
  "chunks (multi-strategy), a quality report with GT-comparison metrics "
683
  "where applicable, and a SHA-256-checksummed artifact manifest. "
684
  f"Per-file caps: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / "
685
- f"{MAX_PAGE_COUNT} pages. Batch cap: {MAX_BATCH_DOCS} docs per request."
686
  )
687
  with gr.Row():
688
  upload = gr.File(
@@ -700,6 +806,8 @@ with gr.Blocks(title="zeroshotGPU") as demo:
700
  parse_button = gr.Button("Parse", variant="primary")
701
  archive = gr.File(label="Artifacts (zip)")
702
  with gr.Tabs():
703
  with gr.Tab("Markdown"):
704
  gr.Markdown(
705
  "_Canonical markdown reconstruction of the parsed document. "
 
673
  return headline
674
 
675
 
676
+ _HELP_MARKDOWN = f"""
677
+ ## What this is
678
+
679
+ **zeroshotGPU** is an agentic document-parsing control plane. It does not rely
680
+ on a single extraction engine — it profiles each document, routes pages to the
681
+ best parser expert (Docling, PyMuPDF, optionally Marker / MinerU / olmOCR /
682
+ PaddleOCR / Unstructured), normalizes outputs into a canonical schema, verifies
683
+ quality, repairs weak regions through a bounded verify/repair loop (with
684
+ optional GPU escalation), and emits retrieval-ready chunks with provenance.
685
+
686
+ ## How to use this Space
687
+
688
+ **1. Pick a pipeline mode.**
689
+
690
+ | Mode | What it does |
691
+ |---|---|
692
+ | `Docling + PyMuPDF` | Default. Runs both parsers so the parser-disagreement metric has a comparison surface. Good for general-purpose parsing. |
693
+ | `Default lightweight` | Text + PyMuPDF only. Fastest. Use when you just need clean text extraction. |
694
+ | `Live GPU repair` | Enables `repair.execute_gpu_escalations=true`. Verification failures (invalid tables, OCR coverage gaps, reading-order issues, missing figure captions) are dispatched to Qwen2.5-VL-3B on the GPU. Slower; requires the GPU path to actually be hit (deterministic repair handles markdown tables before this fires). |
695
+
696
+ **2. Upload one or more documents.** Accepts `.pdf`, `.md`, `.txt`, `.html`,
697
+ or a `.zip` of any of those. Multi-file selection works. Per-file cap:
698
+ {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / {MAX_PAGE_COUNT} pages. Batch cap:
699
+ {MAX_BATCH_DOCS} docs per request.
700
+
701
+ **3. Click Parse.** Watch the progress bar; first call may take longer if a
702
+ model has to download.
703
+
704
+ ## What each tab shows
705
+
706
+ - **Markdown** — canonical reconstruction of the parsed document. For batch
707
+ uploads, this shows the first document; the full set is in the artifacts zip.
708
+ - **Run** — summary, quality report, parser metrics, and artifact manifest
709
+ validation. For batch uploads, `Summary.batch` lists every document parsed
710
+ in the request with its headline metrics + an aggregate block.
711
+ - **Chunks** — per-strategy chunk breakdown: total / parent / child / table-linked
712
+ / figure-linked / visual-context counts, plus per-strategy blocks with token
713
+ count distribution (min/median/max) and 3 sample chunks per strategy with
714
+ 240-char previews.
715
+ - **Artifacts** — each top-level artifact (`parsed_document.json`, `chunks.jsonl`,
716
+ `quality_report.json`, etc.) downloadable individually. Nested asset crops
717
+ (page renders, table images) stay bundled in the zip above.
718
+ - **Runtime** — detected GPU runtime, planned GPU tasks, preflight report.
719
+ - **Smokes** — runs the project's smoke validation suite in-Space; reports
720
+ per-smoke pass/fail/skip + detail. API: `/gradio_api/call/run_smokes_in_space`.
721
+ - **Benchmark** — two modes: against committed regression fixtures, OR against
722
+ an uploaded corpus you supply. Returns headline metrics (quality score,
723
+ retrieval recall, repair resolution rate, etc.) plus a per-doc breakdown.
724
+ API: `/gradio_api/call/run_benchmark_in_space` and `/gradio_api/call/run_benchmark_on_upload`.
725
+
726
+ ## API surface
727
+
728
+ Every button is also a Gradio API endpoint, so AI agents and downstream tooling
729
+ can invoke them programmatically. Discovery: `agents.md` at the Space root
730
+ returns the calling instructions; `/gradio_api/info` returns the full schema.
731
+
732
+ ```bash
733
+ # Parse a doc:
734
+ curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/parse_uploaded_document \\
735
+ -H "Content-Type: application/json" \\
736
+ -d '{{"data": [{{file_data}}, "Default lightweight"]}}'
737
+
738
+ # Run smokes:
739
+ curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/run_smokes_in_space \\
740
+ -H "Content-Type: application/json" -d '{{"data": []}}'
741
+
742
+ # Benchmark:
743
+ curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/run_benchmark_in_space \\
744
+ -H "Content-Type: application/json" -d '{{"data": []}}'
745
+ ```
746
+
747
+ ## Configuration
748
+
749
+ Defaults work out of the box. To change behavior, set Space variables:
750
+
751
+ - `ZSGDP_CONFIG_PATH` — point at one of `configs/default.yaml`, `configs/docling.yaml`, `configs/live_gpu_repair.yaml`, or your own committed YAML.
752
+ - `ZSGDP_LOG_LEVEL` — `INFO` (default on Spaces), `DEBUG`, `WARNING`, etc.
753
+ - `ZSGDP_LOG_JSON` — `1` (default on Spaces) for one-line JSON log records.
754
+ - `ZSGDP_MAX_UPLOAD_BYTES` / `ZSGDP_MAX_PAGE_COUNT` / `ZSGDP_MAX_BATCH_DOCS` — abuse guards.
755
+ - `HF_TOKEN` — required for gated models (jina-embeddings-v3 may need it).
756
+
757
+ ## Known limits
758
+
759
+ - **ZeroGPU duration cap.** Each `@spaces.GPU`-decorated call runs in a 60s
760
+ GPU slot. First-call cold-start for big models (Qwen2.5-VL-3B is ~6 GB)
761
+ exceeds this on a clean cache. Subsequent calls reuse the cached weights
762
+ and fit comfortably.
763
+ - **Live GPU repair** only fires when the deterministic repair path can't
764
+ resolve an issue. For markdown tables, the deterministic normalizer
765
+ handles most malformations before GPU dispatch is needed.
766
+ - **GT-comparison metrics** (layout F1, table structure score, formula CER)
767
+ require labelled datasets (`omnidocbench`, `doclaynet`). Uploaded
768
+ custom corpora produce all the GT-free metrics but those three.
769
+
770
+ ## Source
771
+
772
+ [![View source on Hugging Face](https://img.shields.io/badge/HF%20Space-arjun10g%2FzeroshotGPU-blue)](https://huggingface.co/spaces/arjun10g/zeroshotGPU)
773
+
774
+ The full project source — including the multi-step spec, contributor docs,
775
+ and 250+ unit tests — is at the link above. The `Files` tab on the Space
776
+ page shows the live deploy.
777
+ """
778
+
779
+
780
  with gr.Blocks(title="zeroshotGPU") as demo:
781
  gr.Markdown(
782
  "# zeroshotGPU\n\n"
 
786
  "chunks (multi-strategy), a quality report with GT-comparison metrics "
787
  "where applicable, and a SHA-256-checksummed artifact manifest. "
788
  f"Per-file caps: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / "
789
+ f"{MAX_PAGE_COUNT} pages. Batch cap: {MAX_BATCH_DOCS} docs per request. "
790
+ "**See the [Help] tab for full instructions.**\n\n"
791
+ "[Source on Hugging Face](https://huggingface.co/spaces/arjun10g/zeroshotGPU)"
792
  )
793
  with gr.Row():
794
  upload = gr.File(
 
806
  parse_button = gr.Button("Parse", variant="primary")
807
  archive = gr.File(label="Artifacts (zip)")
808
  with gr.Tabs():
809
+ with gr.Tab("Help"):
810
+ gr.Markdown(_HELP_MARKDOWN)
811
  with gr.Tab("Markdown"):
812
  gr.Markdown(
813
  "_Canonical markdown reconstruction of the parsed document. "
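
One detail of the `_HELP_MARKDOWN` block added in this commit: it is a non-raw f-string, so literal braces in the curl payloads are doubled (`{{` and `}}`) and the shell line-continuation backslash is written as `\\`. The sketch below shows how such a string renders. The value of `MAX_UPLOAD_BYTES` and the `example.invalid` URL are stand-ins chosen for illustration; the real constant is defined elsewhere in `app.py`.

```python
# Assumed value for illustration only; app.py defines the real constant.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024

# Same escaping pattern as the Help text: {{ }} for literal braces,
# \\ for a single backslash, and { } for interpolated expressions.
snippet = f"""Per-file cap: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB
curl -X POST https://example.invalid/gradio_api/call/demo \\
  -d '{{"data": [{{file_data}}, "Default lightweight"]}}'
"""

print(snippet)
```

In the rendered text, `{file_data}` comes out as a literal placeholder rather than an interpolated value, so a caller still has to substitute a real file payload before running the command.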