Arjunvir Singh committed
Commit 9065d0f · 1 Parent(s): 83059c2
Upload-driven benchmark + CHANGELOG catch-up
- run_benchmark_on_upload: takes a .zip, a list of files, or a single file,
  runs the parser benchmark on the uploaded corpus, and returns headline
  metrics plus a per-doc breakdown. GT-comparison metrics are intentionally
  absent (they require labelled datasets). Surfaced in the Benchmark tab as a
  second button alongside the regression-fixture button.
- Both benchmark buttons are also reachable via /gradio_api/call/
  for remote validation: run_benchmark_in_space (fixtures) and
  run_benchmark_on_upload (user-supplied corpus).
- CHANGELOG: catches up on the past N pushes' worth of work: ZeroGPU
  integration, frontend overhaul, smoke/benchmark surfaces, and the
  production-numbers section in the README.
- CHANGELOG.md +62 -0
- app.py +126 -5
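
The remote-validation path named in the commit message can be exercised from another machine. The following is a minimal sketch, not the author's tooling: it assumes the `gradio_client` package is installed and that the Space ID passed in (for example `arjun10g/zeroshotGPU`, the Space named in the CHANGELOG) is reachable. The helper name `call_upload_benchmark` is hypothetical.

```python
from pathlib import Path
from typing import Any, Sequence


def call_upload_benchmark(space_id: str, paths: Sequence[str]) -> Any:
    """Hypothetical helper: call run_benchmark_on_upload on a remote Space."""
    files = [Path(p) for p in paths]
    if not files:
        raise ValueError("pass at least one document or .zip path")
    missing = [str(p) for p in files if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"not found: {missing}")

    # Imported lazily so the argument checks above work without the package.
    from gradio_client import Client, handle_file

    client = Client(space_id)
    # api_name mirrors api_name="run_benchmark_on_upload" in the click wiring.
    return client.predict(
        [handle_file(str(p)) for p in files],
        api_name="/run_benchmark_on_upload",
    )
```

A call would look like `call_upload_benchmark("arjun10g/zeroshotGPU", ["corpus.zip"])`. The fixture endpoint takes no inputs, so the equivalent call there is presumably `client.predict(api_name="/run_benchmark_in_space")`.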
CHANGELOG.md
CHANGED
@@ -7,6 +7,68 @@ pre-1.0 so minor bumps may include breaking changes.
 
 ## [Unreleased]
 
+### Added — Live ZeroGPU integration + Space deploy
+
+- Deployed to https://huggingface.co/spaces/arjun10g/zeroshotGPU on
+  `zero-a10g` hardware (ZeroGPU free tier).
+- ZeroGPU integration helper (`zsgdp/gpu/zero_gpu.py`): no-op decorator
+  off-Space, `@spaces.GPU(duration=N)` on-Space.
+- Two stateless GPU helpers — `_gpu_encode_batch` (embedding) and
+  `_gpu_run_pipeline` (transformers VLM/LLM) — that take only picklable
+  inputs and return only picklable outputs. State stays in the calling
+  process; the GPU worker is invoked just for compute.
+- `EmbeddingRetriever.index/query` and `TransformersClient.execute_task`
+  refactored to use the stateless helpers. Fixes a bug where the
+  in-method `@spaces.GPU` decoration silently dropped `self._chunk_ids`
+  / `self._vectors` / cached pipelines because they were set inside the
+  ephemeral worker process — yielding `recall@1=0` even with a
+  successful model run.
+- `_gpu_encode_batch` and `_gpu_run_pipeline` use `duration=60` to
+  stay under the ZeroGPU free-tier cap (180s rejected).
+- `runtime.py` now reports `zero_gpu_available` and adds an explicit
+  log note when the ZeroGPU SDK is detected vs. when hardware reports
+  ZeroGPU but the SDK is missing.
+
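
The two ideas in the ZeroGPU entries (a decorator that is a no-op off-Space, and stateless helpers that keep state in the calling process) can be sketched together. This is an illustration under assumptions, not the contents of `zsgdp/gpu/zero_gpu.py`: the names `zero_gpu`, `gpu_encode_batch`, and `ToyRetriever` are hypothetical, and a SHA-256-based toy vector stands in for a real embedding model.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

try:  # On a ZeroGPU Space the `spaces` package is expected to be importable.
    import spaces
    _HAS_SPACES = True
except ImportError:
    _HAS_SPACES = False


def zero_gpu(duration: int = 60) -> Callable:
    """Return spaces.GPU(duration=...) on-Space, a pass-through elsewhere."""
    if _HAS_SPACES:
        return spaces.GPU(duration=duration)

    def passthrough(fn: Callable) -> Callable:
        return fn

    return passthrough


@zero_gpu(duration=60)
def gpu_encode_batch(texts: Sequence[str], dim: int = 8) -> List[List[float]]:
    """Stateless stand-in for an embedding call: picklable in, picklable out."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([b / 255.0 for b in digest[:dim]])
    return vectors


class ToyRetriever:
    """Keeps its index in the calling process; only compute goes to the helper."""

    def __init__(self) -> None:
        self._chunk_ids: List[str] = []
        self._vectors: List[List[float]] = []

    def index(self, chunks: Dict[str, str]) -> None:
        self._chunk_ids = list(chunks)
        # The assignment happens in the caller, so it survives the GPU call.
        self._vectors = gpu_encode_batch([chunks[cid] for cid in self._chunk_ids])

    def query(self, text: str) -> str:
        (q,) = gpu_encode_batch([text])
        # Squared Euclidean distance; the nearest stored vector wins.
        dists = [sum((a - b) ** 2 for a, b in zip(q, v)) for v in self._vectors]
        return self._chunk_ids[dists.index(min(dists))]
```

Had `index` itself been the decorated function, the assignments to `self._chunk_ids` and `self._vectors` would have happened in the worker process, which matches the failure mode the entry describes.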
+### Added — Frontend (Gradio)
+
+- ZIP and multi-file upload (`gr.File(file_count="multiple")`,
+  accepts `.zip`); server-side extraction with path-traversal guard,
+  capped at `MAX_BATCH_DOCS` (default 20).
+- `gr.Progress` callback wired through `parse_uploaded_document` with
+  stage labels: "Validating uploads...", "Parsing N/M: <name>",
+  "Bundling artifacts...", "Done".
+- Three pipeline modes in the dropdown: `Docling + PyMuPDF`,
+  `Default lightweight`, `Live GPU repair` (loads
+  `configs/live_gpu_repair.yaml`, sets
+  `repair.execute_gpu_escalations=true`).
+- New tabs: `Chunks` (per-strategy detail with token-count distribution
+  + 3 sample chunks per strategy), `Artifacts` (individual file
+  downloads), `Smokes` (button bound to `run_smokes_in_space`),
+  `Benchmark` (regression-fixture + uploaded-corpus modes).
+- `run_smokes_in_space`, `run_benchmark_in_space`, and
+  `run_benchmark_on_upload` exposed at `/gradio_api/call/<name>` for
+  remote validation.
+- `parse_uploaded_document` returns a `summary.batch` block with
+  per-doc results and aggregate metrics when more than one doc parsed.
+- Agent-friendly docstring added to `parse_uploaded_document` so HF's
+  `agents.md` surface gives AI agents proper "when to use" guidance.
+- `chunking_payload` now includes a `detail` block with strategy
+  counts, token-count statistics, and 3 sample chunks per strategy
+  with 240-char previews.
+
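
The first Frontend entry mentions server-side ZIP extraction with a path-traversal guard and a batch cap. A minimal sketch of that kind of guard follows, using only the standard library. It is not the project's `_extract_uploads_to_parse`: the function name `safe_extract` and the `ALLOWED_SUFFIXES` set are assumptions, and only the `MAX_BATCH_DOCS` default of 20 comes from the entry itself.

```python
import shutil
import zipfile
from pathlib import Path
from typing import List

MAX_BATCH_DOCS = 20  # default named in the CHANGELOG entry
ALLOWED_SUFFIXES = {".pdf", ".md", ".txt", ".html", ".htm"}  # assumed set


def safe_extract(zip_path: Path, dest: Path, limit: int = MAX_BATCH_DOCS) -> List[Path]:
    """Extract supported documents, skipping members that would escape dest."""
    dest = dest.resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted: List[Path] = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() not in ALLOWED_SUFFIXES:
                continue
            target = (dest / info.filename).resolve()
            # Path-traversal guard: the resolved target must stay under dest.
            if not target.is_relative_to(dest):  # Python 3.9+
                continue
            if len(extracted) >= limit:
                break
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Resolving the target before comparing it with `dest` is what catches both `../` members and absolute member names. This sketch skips offending members silently; a real service might prefer to reject the whole upload.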
+### Documentation — Production benchmark numbers landed
+
+- README's "Production benchmark numbers" section filled in with real
+  values from the live Space (commit `de03f34`):
+  - Smokes: 4/5 pass on `arjun10g/zeroshotGPU` (lexical, ablation,
+    embedding `recall@1=1.0` `recall@5=1.0`, gpu_repair-wiring).
+  - Benchmark: `mean_quality_score=0.964`, `mean_retrieval_recall_at_5=1.0`,
+    `mean_repair_resolution_rate=1.0`, `mean_repair_regression_rate=0.0`.
+  - Live GPU repair invocation documented as wiring-verified but
+    end-to-end deferred (deterministic markdown table normalizer fixes
+    malformations before GPU escalation kicks in; needs a real PDF with
+    bbox-aware tables to genuinely fire Qwen).
+
 ### Documentation — README restructured
 
 - Reorganised into Install → Quick start → Opt-ins → Outputs →
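
For readers unfamiliar with the `recall@1` and `recall@5` figures quoted in the CHANGELOG, the sketch below shows the conventional definition, assuming exactly one relevant chunk per query. The diff does not show how `zsgdp` computes these values, so the function name and input shapes here are illustrative assumptions only.

```python
from typing import Dict, Sequence


def recall_at_k(ranked: Dict[str, Sequence[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of queries whose relevant chunk id appears in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query_id, gold in relevant.items()
        if gold in list(ranked.get(query_id, []))[:k]
    )
    return hits / len(relevant)
```

Under this definition, a retriever that comes back with an empty index returns no ranked ids for any query and scores 0.0. That is consistent with the `recall@1=0` symptom the ZeroGPU entry attributes to state being lost in the worker process.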
app.py
CHANGED
@@ -524,6 +524,107 @@ def run_smokes_in_space() -> dict:
     return payload
 
 
+def run_benchmark_on_upload(file_obj: Any) -> dict:
+    """Run the parser benchmark against a user-supplied corpus.
+
+    Accepts the same upload shapes as `parse_uploaded_document`: a single
+    document, a list, or a `.zip` of documents. Per-file caps and batch
+    cap apply identically. Returns the benchmark headline metrics plus a
+    `documents` list with per-doc records.
+
+    For real §29 numbers against labelled datasets, use the
+    `omnidocbench` or `doclaynet` loader from a Pro-tier Dev Mode
+    terminal — those add layout F1 / table structure / formula CER which
+    require ground-truth annotations not available from a raw upload.
+    """
+
+    if file_obj is None:
+        return {"error": "Upload at least one document to benchmark."}
+
+    import tempfile
+    from zsgdp.benchmarks.parser_quality import run_parser_benchmark
+
+    if isinstance(file_obj, list):
+        upload_paths = [Path(item.name if hasattr(item, "name") else item) for item in file_obj if item is not None]
+    elif hasattr(file_obj, "name"):
+        upload_paths = [Path(file_obj.name)]
+    else:
+        upload_paths = [Path(str(file_obj))]
+    if not upload_paths:
+        return {"error": "Upload at least one document to benchmark."}
+
+    work_dir = Path(tempfile.mkdtemp(prefix="zsgdp_bench_upload_"))
+    docs = _extract_uploads_to_parse(upload_paths, work_dir)
+    if not docs:
+        return {
+            "error": "No supported documents found in the upload (accepted: pdf/md/txt/html, optionally inside a zip).",
+            "input_count": len(upload_paths),
+        }
+
+    # Per-file abuse guards.
+    for doc in docs:
+        try:
+            _validate_upload(doc)
+        except UploadRejected as exc:
+            return {"error": str(exc), "rejected": True, "source_path": str(doc)}
+
+    bench_input = work_dir / "input"
+    bench_input.mkdir()
+    for doc in docs:
+        target = bench_input / doc.name
+        # Avoid name collisions (different paths, same filename inside zips).
+        suffix = 2
+        while target.exists():
+            target = bench_input / f"{doc.stem}_{suffix}{doc.suffix}"
+            suffix += 1
+        shutil.copy2(doc, target)
+
+    out = work_dir / "out"
+    _logger.info(
+        "space_benchmark_upload_requested",
+        extra={"input_count": len(upload_paths), "docs_found": len(docs)},
+    )
+    summary = run_parser_benchmark(bench_input, out, dataset_name="custom_folder")
+
+    headline = {
+        "dataset_name": summary.get("dataset_name"),
+        "document_count": summary.get("document_count"),
+        "mean_quality_score": summary.get("mean_quality_score"),
+        "mean_retrieval_recall_at_1": summary.get("mean_retrieval_recall_at_1"),
+        "mean_retrieval_recall_at_5": summary.get("mean_retrieval_recall_at_5"),
+        "mean_retrieval_mrr": summary.get("mean_retrieval_mrr"),
+        "mean_parser_disagreement_rate": summary.get("mean_parser_disagreement_rate"),
+        "mean_repair_resolution_rate": summary.get("mean_repair_resolution_rate"),
+        "mean_repair_regression_rate": summary.get("mean_repair_regression_rate"),
+        "retrieval_evaluated_count": summary.get("retrieval_evaluated_count"),
+        "documents": [
+            {
+                "doc_id": doc.get("doc_id"),
+                "file_type": doc.get("file_type"),
+                "quality_score": doc.get("quality_score"),
+                "elements": doc.get("element_count"),
+                "tables": doc.get("table_count"),
+                "figures": doc.get("figure_count"),
+                "chunks": doc.get("chunk_count"),
+                "parser_disagreement_rate": doc.get("parser_disagreement_rate"),
+                "repair_resolution_rate": doc.get("repair_resolution_rate"),
+                "elapsed_seconds": doc.get("elapsed_seconds"),
+            }
+            for doc in summary.get("documents") or []
+        ],
+        "note": (
+            "GT-comparison metrics (layout F1, table structure, formula CER) "
+            "are unavailable for arbitrary uploads — they need labelled datasets "
+            "(omnidocbench / doclaynet)."
+        ),
+    }
+    _logger.info(
+        "space_benchmark_upload_complete",
+        extra={k: v for k, v in headline.items() if k != "documents" and not isinstance(v, list)},
+    )
+    return headline
+
+
 def run_benchmark_in_space() -> dict:
     """Run a benchmark against tests/regression/fixtures and return the headline numbers.
 
@@ -651,12 +752,26 @@ with gr.Blocks(title="zeroshotGPU") as demo:
         smoke_output = gr.JSON(label="Smoke report")
     with gr.Tab("Benchmark"):
         gr.Markdown(
-            "
-            "
-            "
-            "`/gradio_api/call/run_benchmark_in_space`."
+            "**Two benchmark modes:**\n"
+            "- **Run on regression fixtures** — uses the committed seed "
+            "documents (`tests/regression/fixtures/`); reproducible without "
+            "any upload. API: `/gradio_api/call/run_benchmark_in_space`.\n"
+            "- **Run on uploaded corpus** — accepts a `.zip` of documents "
+            "(or a list of files). Returns headline metrics plus a per-doc "
+            "breakdown. GT-comparison metrics (layout F1, table structure, "
+            "formula CER) are NOT computed — those require labelled "
+            "datasets (`omnidocbench` / `doclaynet`) which can be loaded "
+            "via the CLI from a Pro-tier Dev Mode terminal. API: "
+            "`/gradio_api/call/run_benchmark_on_upload`."
+        )
+        with gr.Row():
+            benchmark_button = gr.Button("Run on regression fixtures", variant="primary")
+            benchmark_upload_button = gr.Button("Run on uploaded corpus")
+        benchmark_corpus = gr.File(
+            label="Optional upload — used only when 'Run on uploaded corpus' is clicked",
+            file_types=[".pdf", ".md", ".txt", ".html", ".htm", ".zip"],
+            file_count="multiple",
         )
-        benchmark_button = gr.Button("Run benchmark", variant="primary")
         benchmark_output = gr.JSON(label="Benchmark headline metrics")
     parse_button.click(
         parse_uploaded_document,
@@ -677,6 +792,12 @@ with gr.Blocks(title="zeroshotGPU") as demo:
     )
     smoke_button.click(run_smokes_in_space, inputs=[], outputs=smoke_output, api_name="run_smokes_in_space")
     benchmark_button.click(run_benchmark_in_space, inputs=[], outputs=benchmark_output, api_name="run_benchmark_in_space")
+    benchmark_upload_button.click(
+        run_benchmark_on_upload,
+        inputs=[benchmark_corpus],
+        outputs=benchmark_output,
+        api_name="run_benchmark_on_upload",
+    )
 
 
 if __name__ == "__main__":
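
Two steps of `run_benchmark_on_upload` are easy to try in isolation: normalising the upload shapes and staging files without filename collisions. The sketch below is a standalone approximation, not the code in `app.py`. The helper names `normalize_uploads` and `stage_without_collisions` are hypothetical. The `_2`, `_3`, ... suffix scheme follows the diff, but the handling of `str` and `Path` items is this sketch's choice rather than the `hasattr(item, "name")` check the diff uses.

```python
import shutil
from pathlib import Path
from typing import Any, Iterable, List


def normalize_uploads(file_obj: Any) -> List[Path]:
    """Accept None, a path, an object with a .name path, or a list of those."""
    if file_obj is None:
        return []
    items = file_obj if isinstance(file_obj, list) else [file_obj]
    paths: List[Path] = []
    for item in items:
        if item is None:
            continue
        if isinstance(item, (str, Path)):
            # Path objects also have a .name (the final component only),
            # so plain paths are handled before the attribute lookup.
            paths.append(Path(item))
        else:
            paths.append(Path(item.name))
    return paths


def stage_without_collisions(docs: Iterable[Path], dest: Path) -> List[Path]:
    """Copy docs into dest, suffixing _2, _3, ... when filenames repeat."""
    dest.mkdir(parents=True, exist_ok=True)
    staged: List[Path] = []
    for doc in docs:
        target = dest / doc.name
        suffix = 2
        while target.exists():
            target = dest / f"{doc.stem}_{suffix}{doc.suffix}"
            suffix += 1
        shutil.copy2(doc, target)
        staged.append(target)
    return staged
```

Staging into a flat directory keeps the benchmark input simple, at the cost of losing the original folder structure. The suffix loop ensures that two `report.md` files from different folders inside a zip are both benchmarked rather than one overwriting the other.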