Spaces:
Running on Zero
Arjunvir Singh committed on
Commit · e9fcd5d
1 Parent(s): 9065d0f
Add Help tab to Gradio UI + source link
- New 'Help' tab with project overview, pipeline-mode comparison
table, per-tab explanations, API surface examples (curl), config
knob list, known-limits section, and a source link.
- Top-level instructions point at the Help tab.
- Source-on-HF badge link added near the title for visibility.
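
The Help text added by this commit is a single f-string, which is why the curl examples in the diff double their braces and backslashes. A minimal sketch of that escaping, assuming a hypothetical 50 MB cap (the real `MAX_UPLOAD_BYTES` value is not shown in this diff) and a placeholder `<space-url>`:

```python
# Hypothetical value for illustration; the real constant is defined elsewhere in app.py.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024

# Inside an f-string, {{ and }} render as literal braces, and \\ renders as a
# single backslash (the shell line continuation in the curl example).
help_snippet = f"""Per-file cap: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB
curl -X POST <space-url>/gradio_api/call/run_smokes_in_space \\
  -H "Content-Type: application/json" -d '{{"data": []}}'
"""

print(help_snippet)
```

Keeping the caps interpolated rather than hard-coded means the Help tab and the top-level instructions cannot drift apart when a cap changes.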
app.py
CHANGED
|
@@ -673,6 +673,110 @@ def run_benchmark_in_space() -> dict:
|
|
| 673 |
return headline
|
| 674 |
|
| 675 |
|
| 676 |
with gr.Blocks(title="zeroshotGPU") as demo:
|
| 677 |
gr.Markdown(
|
| 678 |
"# zeroshotGPU\n\n"
|
|
@@ -682,7 +786,9 @@ with gr.Blocks(title="zeroshotGPU") as demo:
|
|
| 682 |
"chunks (multi-strategy), a quality report with GT-comparison metrics "
|
| 683 |
"where applicable, and a SHA-256-checksummed artifact manifest. "
|
| 684 |
f"Per-file caps: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / "
|
| 685 |
-
f"{MAX_PAGE_COUNT} pages. Batch cap: {MAX_BATCH_DOCS} docs per request."
|
| 686 |
)
|
| 687 |
with gr.Row():
|
| 688 |
upload = gr.File(
|
|
@@ -700,6 +806,8 @@ with gr.Blocks(title="zeroshotGPU") as demo:
|
|
| 700 |
parse_button = gr.Button("Parse", variant="primary")
|
| 701 |
archive = gr.File(label="Artifacts (zip)")
|
| 702 |
with gr.Tabs():
|
| 703 |
with gr.Tab("Markdown"):
|
| 704 |
gr.Markdown(
|
| 705 |
"_Canonical markdown reconstruction of the parsed document. "
|
|
|
|
| 673 |
return headline
|
| 674 |
|
| 675 |
|
| 676 |
+
_HELP_MARKDOWN = f"""
|
| 677 |
+
## What this is
|
| 678 |
+
|
| 679 |
+
**zeroshotGPU** is an agentic document-parsing control plane. It does not rely
|
| 680 |
+
on a single extraction engine; it profiles each document, routes pages to the
|
| 681 |
+
best parser expert (Docling, PyMuPDF, optionally Marker / MinerU / olmOCR /
|
| 682 |
+
PaddleOCR / Unstructured), normalizes outputs into a canonical schema, verifies
|
| 683 |
+
quality, repairs weak regions through a bounded verify/repair loop (with
|
| 684 |
+
optional GPU escalation), and emits retrieval-ready chunks with provenance.
|
| 685 |
+
|
| 686 |
+
## How to use this Space
|
| 687 |
+
|
| 688 |
+
**1. Pick a pipeline mode.**
|
| 689 |
+
|
| 690 |
+
| Mode | What it does |
|
| 691 |
+
|---|---|
|
| 692 |
+
| `Docling + PyMuPDF` | Default. Runs both parsers so the parser-disagreement metric has a comparison surface. Good for general-purpose parsing. |
|
| 693 |
+
| `Default lightweight` | Text + PyMuPDF only. Fastest. Use when you just need clean text extraction. |
|
| 694 |
+
| `Live GPU repair` | Enables `repair.execute_gpu_escalations=true`. Verification failures (invalid tables, OCR coverage gaps, reading-order issues, missing figure captions) are dispatched to Qwen2.5-VL-3B on the GPU. Slower; requires the GPU path to actually be hit (deterministic repair handles markdown tables before this fires). |
|
| 695 |
+
|
| 696 |
+
**2. Upload one or more documents.** Accepts `.pdf`, `.md`, `.txt`, `.html`,
|
| 697 |
+
or a `.zip` of any of those. Multi-file selection works. Per-file cap:
|
| 698 |
+
{MAX_UPLOAD_BYTES // (1024 * 1024)} MB / {MAX_PAGE_COUNT} pages. Batch cap:
|
| 699 |
+
{MAX_BATCH_DOCS} docs per request.
|
| 700 |
+
|
| 701 |
+
**3. Click Parse.** Watch the progress bar; first call may take longer if a
|
| 702 |
+
model has to download.
|
| 703 |
+
|
| 704 |
+
## What each tab shows
|
| 705 |
+
|
| 706 |
+
- **Markdown**: canonical reconstruction of the parsed document. For batch
|
| 707 |
+
uploads, this shows the first document; the full set is in the artifacts zip.
|
| 708 |
+
- **Run**: summary, quality report, parser metrics, and artifact manifest
|
| 709 |
+
validation. For batch uploads, `Summary.batch` lists every document parsed
|
| 710 |
+
in the request with its headline metrics + an aggregate block.
|
| 711 |
+
- **Chunks**: per-strategy chunk breakdown with total / parent / child / table-linked
|
| 712 |
+
/ figure-linked / visual-context counts, plus per-strategy blocks with token
|
| 713 |
+
count distribution (min/median/max) and 3 sample chunks per strategy with
|
| 714 |
+
240-char previews.
|
| 715 |
+
- **Artifacts**: each top-level artifact (`parsed_document.json`, `chunks.jsonl`,
|
| 716 |
+
`quality_report.json`, etc.) downloadable individually. Nested asset crops
|
| 717 |
+
(page renders, table images) stay bundled in the zip above.
|
| 718 |
+
- **Runtime**: detected GPU runtime, planned GPU tasks, preflight report.
|
| 719 |
+
- **Smokes**: runs the project's smoke validation suite in-Space; reports
|
| 720 |
+
per-smoke pass/fail/skip + detail. API: `/gradio_api/call/run_smokes_in_space`.
|
| 721 |
+
- **Benchmark**: two modes, either against committed regression fixtures or against
|
| 722 |
+
an uploaded corpus you supply. Returns headline metrics (quality score,
|
| 723 |
+
retrieval recall, repair resolution rate, etc.) plus a per-doc breakdown.
|
| 724 |
+
API: `/gradio_api/call/run_benchmark_in_space` and `/gradio_api/call/run_benchmark_on_upload`.
|
| 725 |
+
|
| 726 |
+
## API surface
|
| 727 |
+
|
| 728 |
+
Every button is also a Gradio API endpoint, so AI agents and downstream tooling
|
| 729 |
+
can invoke them programmatically. Discovery: `agents.md` at the Space root
|
| 730 |
+
returns the calling instructions; `/gradio_api/info` returns the full schema.
|
| 731 |
+
|
| 732 |
+
```bash
|
| 733 |
+
# Parse a doc:
|
| 734 |
+
curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/parse_uploaded_document \\
|
| 735 |
+
-H "Content-Type: application/json" \\
|
| 736 |
+
-d '{{"data": [{{file_data}}, "Default lightweight"]}}'
|
| 737 |
+
|
| 738 |
+
# Run smokes:
|
| 739 |
+
curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/run_smokes_in_space \\
|
| 740 |
+
-H "Content-Type: application/json" -d '{{"data": []}}'
|
| 741 |
+
|
| 742 |
+
# Benchmark:
|
| 743 |
+
curl -X POST https://arjun10g-zeroshotgpu.hf.space/gradio_api/call/run_benchmark_in_space \\
|
| 744 |
+
-H "Content-Type: application/json" -d '{{"data": []}}'
|
| 745 |
+
```
|
| 746 |
+
|
| 747 |
+
## Configuration
|
| 748 |
+
|
| 749 |
+
Defaults work out of the box. To change behavior, set Space variables:
|
| 750 |
+
|
| 751 |
+
- `ZSGDP_CONFIG_PATH`: point at one of `configs/default.yaml`, `configs/docling.yaml`, `configs/live_gpu_repair.yaml`, or your own committed YAML.
|
| 752 |
+
- `ZSGDP_LOG_LEVEL`: `INFO` (default on Spaces), `DEBUG`, `WARNING`, etc.
|
| 753 |
+
- `ZSGDP_LOG_JSON`: `1` (default on Spaces) for one-line JSON log records.
|
| 754 |
+
- `ZSGDP_MAX_UPLOAD_BYTES` / `ZSGDP_MAX_PAGE_COUNT` / `ZSGDP_MAX_BATCH_DOCS`: abuse guards.
|
| 755 |
+
- `HF_TOKEN`: required for gated models (jina-embeddings-v3 may need it).
|
| 756 |
+
|
| 757 |
+
## Known limits
|
| 758 |
+
|
| 759 |
+
- **ZeroGPU duration cap.** Each `@spaces.GPU`-decorated call runs in a 60s
|
| 760 |
+
GPU slot. First-call cold-start for big models (Qwen2.5-VL-3B is ~6 GB)
|
| 761 |
+
exceeds this on a clean cache. Subsequent calls reuse the cached weights
|
| 762 |
+
and fit comfortably.
|
| 763 |
+
- **Live GPU repair** only fires when the deterministic repair path can't
|
| 764 |
+
resolve an issue. For markdown tables, the deterministic normalizer
|
| 765 |
+
handles most malformations before GPU dispatch is needed.
|
| 766 |
+
- **GT-comparison metrics** (layout F1, table structure score, formula CER)
|
| 767 |
+
require labelled datasets (`omnidocbench`, `doclaynet`). Uploaded
|
| 768 |
+
custom corpora produce all the GT-free metrics but not those three.
|
| 769 |
+
|
| 770 |
+
## Source
|
| 771 |
+
|
| 772 |
+
[](https://huggingface.co/spaces/arjun10g/zeroshotGPU)
|
| 773 |
+
|
| 774 |
+
The full project source (including the multi-step spec, contributor docs,
|
| 775 |
+
and 250+ unit tests) is at the link above. The `Files` tab on the Space
|
| 776 |
+
page shows the live deploy.
|
| 777 |
+
"""
|
| 778 |
+
|
| 779 |
+
|
| 780 |
with gr.Blocks(title="zeroshotGPU") as demo:
|
| 781 |
gr.Markdown(
|
| 782 |
"# zeroshotGPU\n\n"
|
|
|
|
| 786 |
"chunks (multi-strategy), a quality report with GT-comparison metrics "
|
| 787 |
"where applicable, and a SHA-256-checksummed artifact manifest. "
|
| 788 |
f"Per-file caps: {MAX_UPLOAD_BYTES // (1024 * 1024)} MB / "
|
| 789 |
+
f"{MAX_PAGE_COUNT} pages. Batch cap: {MAX_BATCH_DOCS} docs per request. "
|
| 790 |
+
"**See the [Help] tab for full instructions.**\n\n"
|
| 791 |
+
"[Source on Hugging Face](https://huggingface.co/spaces/arjun10g/zeroshotGPU)"
|
| 792 |
)
|
| 793 |
with gr.Row():
|
| 794 |
upload = gr.File(
|
|
|
|
| 806 |
parse_button = gr.Button("Parse", variant="primary")
|
| 807 |
archive = gr.File(label="Artifacts (zip)")
|
| 808 |
with gr.Tabs():
|
| 809 |
+
with gr.Tab("Help"):
|
| 810 |
+
gr.Markdown(_HELP_MARKDOWN)
|
| 811 |
with gr.Tab("Markdown"):
|
| 812 |
gr.Markdown(
|
| 813 |
"_Canonical markdown reconstruction of the parsed document. "
|
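
The curl examples in the Help text show only the POST step. In recent Gradio versions that POST returns an `event_id`, and the result is then read from a server-sent-event stream at `/call/<api_name>/<event_id>`. A minimal sketch of that second step, assuming the Space follows this protocol; `call_url`, `parse_sse`, and `final_result` are hypothetical helper names, not part of app.py, and no network request is made here:

```python
import json
from typing import Any, List, Optional, Tuple

# Base URL taken from the curl examples in the Help text.
BASE_URL = "https://arjun10g-zeroshotgpu.hf.space"


def call_url(api_name: str, event_id: Optional[str] = None) -> str:
    """URL for the POST step (no event_id) or the GET step (with event_id)."""
    url = f"{BASE_URL}/gradio_api/call/{api_name}"
    return url if event_id is None else f"{url}/{event_id}"


def parse_sse(stream_text: str) -> List[Tuple[str, Any]]:
    """Pair each 'event:' line with the JSON payload of the following 'data:' line."""
    events: List[Tuple[str, Any]] = []
    event: Optional[str] = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event is not None:
            events.append((event, json.loads(line[len("data:"):].strip())))
            event = None
    return events


def final_result(stream_text: str) -> Any:
    """Return the payload of the first 'complete' event; raise on an 'error' event."""
    for event, data in parse_sse(stream_text):
        if event == "complete":
            return data
        if event == "error":
            raise RuntimeError(f"Gradio call failed: {data!r}")
    return None  # stream ended without a terminal event


# Assumed shape of a stream, for illustration only.
sample = 'event: heartbeat\ndata: null\n\nevent: complete\ndata: [{"ok": true}]\n\n'
print(final_result(sample))
```

In practice the stream text would come from a GET request to `call_url(name, event_id)`; `/gradio_api/info` remains the authoritative description of each endpoint's inputs and outputs.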