Arjunvir Singh committed on
Commit
9065d0f
·
1 Parent(s): 83059c2

Upload-driven benchmark + CHANGELOG catch-up


- run_benchmark_on_upload: takes a .zip / list / single file, runs the
parser benchmark on the uploaded corpus, returns headline metrics +
per-doc breakdown. GT-comparison metrics intentionally absent
(require labelled datasets). Surfaced in the Benchmark tab as a
second button alongside the regression-fixture button.
- Both benchmark buttons are also reachable via /gradio_api/call/
for remote validation: run_benchmark_in_space (fixtures) and
run_benchmark_on_upload (user-supplied corpus).
- CHANGELOG: catches up the past N pushes worth of work — ZeroGPU
integration, frontend overhaul, smoke/benchmark surfaces,
production-numbers section in the README.
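
As a rough illustration of the remote-validation path mentioned above, the sketch below builds (but does not send) the HTTP requests for a `/gradio_api/call/<api_name>` endpoint. It assumes the two-step flow described in Gradio's curl guide for recent versions: a POST that returns an `event_id`, then a GET on `/<api_name>/<event_id>` that streams the result. The helper names and the placeholder base URL are hypothetical, and file inputs for `run_benchmark_on_upload` would first need to be uploaded to the Space, which is not shown here.

```python
import json
import urllib.request


def build_call_request(base_url: str, api_name: str, data: list) -> urllib.request.Request:
    """Build the POST that starts a call (hypothetical helper, not part of the repo)."""
    url = f"{base_url.rstrip('/')}/gradio_api/call/{api_name}"
    body = json.dumps({"data": data}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_result_url(base_url: str, api_name: str, event_id: str) -> str:
    """URL to GET afterwards; assumed to stream the result as server-sent events."""
    return f"{base_url.rstrip('/')}/gradio_api/call/{api_name}/{event_id}"


# Placeholder host: substitute the Space's real direct URL.
request = build_call_request("https://example.invalid", "run_benchmark_in_space", [])
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` and reading `event_id` from the JSON response would complete the first step; the fixture benchmark takes no inputs, so its `data` list is empty.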

Files changed (2)
  1. CHANGELOG.md +62 -0
  2. app.py +126 -5
CHANGELOG.md CHANGED
@@ -7,6 +7,68 @@ pre-1.0 so minor bumps may include breaking changes.
 
 ## [Unreleased]
 
+### Added — Live ZeroGPU integration + Space deploy
+
+- Deployed to https://huggingface.co/spaces/arjun10g/zeroshotGPU on
+  `zero-a10g` hardware (ZeroGPU free tier).
+- ZeroGPU integration helper (`zsgdp/gpu/zero_gpu.py`): no-op decorator
+  off-Space, `@spaces.GPU(duration=N)` on-Space.
+- Two stateless GPU helpers — `_gpu_encode_batch` (embedding) and
+  `_gpu_run_pipeline` (transformers VLM/LLM) — that take only picklable
+  inputs and return only picklable outputs. State stays in the calling
+  process; the GPU worker is invoked just for compute.
+- `EmbeddingRetriever.index/query` and `TransformersClient.execute_task`
+  refactored to use the stateless helpers. Fixes a bug where the
+  in-method `@spaces.GPU` decoration silently dropped `self._chunk_ids`
+  / `self._vectors` / cached pipelines because they were set inside the
+  ephemeral worker process — yielding `recall@1=0` even with a
+  successful model run.
+- `_gpu_encode_batch` and `_gpu_run_pipeline` use `duration=60` to
+  stay under the ZeroGPU free-tier cap (180s rejected).
+- `runtime.py` now reports `zero_gpu_available` and adds an explicit
+  log note when the ZeroGPU SDK is detected vs. when hardware reports
+  ZeroGPU but the SDK is missing.
+
+### Added — Frontend (Gradio)
+
+- ZIP and multi-file upload (`gr.File(file_count="multiple")`,
+  accepts `.zip`); server-side extraction with path-traversal guard,
+  capped at `MAX_BATCH_DOCS` (default 20).
+- `gr.Progress` callback wired through `parse_uploaded_document` with
+  stage labels: "Validating uploads...", "Parsing N/M: <name>",
+  "Bundling artifacts...", "Done".
+- Three pipeline modes in the dropdown: `Docling + PyMuPDF`,
+  `Default lightweight`, `Live GPU repair` (loads
+  `configs/live_gpu_repair.yaml`, sets
+  `repair.execute_gpu_escalations=true`).
+- New tabs: `Chunks` (per-strategy detail with token-count distribution
+  + 3 sample chunks per strategy), `Artifacts` (individual file
+  downloads), `Smokes` (button bound to `run_smokes_in_space`),
+  `Benchmark` (regression-fixture + uploaded-corpus modes).
+- `run_smokes_in_space`, `run_benchmark_in_space`, and
+  `run_benchmark_on_upload` exposed at `/gradio_api/call/<name>` for
+  remote validation.
+- `parse_uploaded_document` returns a `summary.batch` block with
+  per-doc results and aggregate metrics when more than one doc parsed.
+- Agent-friendly docstring added to `parse_uploaded_document` so HF's
+  `agents.md` surface gives AI agents proper "when to use" guidance.
+- `chunking_payload` now includes a `detail` block with strategy
+  counts, token-count statistics, and 3 sample chunks per strategy
+  with 240-char previews.
+
+### Documentation — Production benchmark numbers landed
+
+- README's "Production benchmark numbers" section filled in with real
+  values from the live Space (commit `de03f34`):
+  - Smokes: 4/5 pass on `arjun10g/zeroshotGPU` (lexical, ablation,
+    embedding `recall@1=1.0` `recall@5=1.0`, gpu_repair-wiring).
+  - Benchmark: `mean_quality_score=0.964`, `mean_retrieval_recall_at_5=1.0`,
+    `mean_repair_resolution_rate=1.0`, `mean_repair_regression_rate=0.0`.
+  - Live GPU repair invocation documented as wiring-verified but
+    end-to-end deferred (deterministic markdown table normalizer fixes
+    malformations before GPU escalation kicks in; needs a real PDF with
+    bbox-aware tables to genuinely fire Qwen).
+
 ### Documentation — README restructured
 
 - Reorganised into Install → Quick start → Opt-ins → Outputs →
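
The ZeroGPU fix recorded in the CHANGELOG hunk above can be pictured with a small sketch. This is an illustration under stated assumptions, not the contents of `zsgdp/gpu/zero_gpu.py`: the `gpu` wrapper, `encode_batch`, and `ToyRetriever` are hypothetical names, and the "embedding" is a toy function rather than a model. Only `spaces.GPU(duration=...)` is taken from the source. The point is that the decorated function is stateless (picklable in, picklable out) while instance state is assigned in the calling process, so nothing is lost when the GPU worker exits.

```python
from typing import Callable, Dict, List, Sequence

try:
    import spaces  # ZeroGPU SDK; normally only importable on a Space

    def gpu(duration: int = 60) -> Callable:
        return spaces.GPU(duration=duration)
except ImportError:
    def gpu(duration: int = 60) -> Callable:
        """Off-Space fallback: a decorator that returns the function unchanged."""
        def decorate(fn: Callable) -> Callable:
            return fn
        return decorate


@gpu(duration=60)
def encode_batch(texts: Sequence[str]) -> List[List[float]]:
    """Stateless compute step. Toy 2-d 'embedding': length and space count."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


class ToyRetriever:
    """Hypothetical retriever that keeps its state in the calling process."""

    def __init__(self) -> None:
        self.chunk_ids: List[str] = []
        self.vectors: List[List[float]] = []

    def index(self, chunks: Dict[str, str]) -> None:
        # Assigned here, outside the decorated function, so the values
        # survive even if encode_batch ran in a separate worker process.
        self.chunk_ids = list(chunks)
        self.vectors = encode_batch([chunks[cid] for cid in self.chunk_ids])

    def query(self, text: str) -> str:
        (q,) = encode_batch([text])
        distances = [
            sum((a - b) ** 2 for a, b in zip(q, vec)) for vec in self.vectors
        ]
        return self.chunk_ids[distances.index(min(distances))]


retriever = ToyRetriever()
retriever.index({"a": "short", "b": "a much longer chunk of text"})
print(retriever.query("tiny!"))
```

Had `index` itself been decorated, the assignments to `self.chunk_ids` and `self.vectors` would have happened on a copy of the object inside the worker, which matches the `recall@1=0` symptom the changelog describes.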
app.py CHANGED
@@ -524,6 +524,107 @@ def run_smokes_in_space() -> dict:
     return payload
 
 
+def run_benchmark_on_upload(file_obj: Any) -> dict:
+    """Run the parser benchmark against a user-supplied corpus.
+
+    Accepts the same upload shapes as `parse_uploaded_document`: a single
+    document, a list, or a `.zip` of documents. Per-file caps and batch
+    cap apply identically. Returns the benchmark headline metrics plus a
+    `documents` list with per-doc records.
+
+    For real §29 numbers against labelled datasets, use the
+    `omnidocbench` or `doclaynet` loader from a Pro-tier Dev Mode
+    terminal — those add layout F1 / table structure / formula CER which
+    require ground-truth annotations not available from a raw upload.
+    """
+
+    if file_obj is None:
+        return {"error": "Upload at least one document to benchmark."}
+
+    import tempfile
+    from zsgdp.benchmarks.parser_quality import run_parser_benchmark
+
+    if isinstance(file_obj, list):
+        upload_paths = [Path(item.name if hasattr(item, "name") else item) for item in file_obj if item is not None]
+    elif hasattr(file_obj, "name"):
+        upload_paths = [Path(file_obj.name)]
+    else:
+        upload_paths = [Path(str(file_obj))]
+    if not upload_paths:
+        return {"error": "Upload at least one document to benchmark."}
+
+    work_dir = Path(tempfile.mkdtemp(prefix="zsgdp_bench_upload_"))
+    docs = _extract_uploads_to_parse(upload_paths, work_dir)
+    if not docs:
+        return {
+            "error": "No supported documents found in the upload (accepted: pdf/md/txt/html, optionally inside a zip).",
+            "input_count": len(upload_paths),
+        }
+
+    # Per-file abuse guards.
+    for doc in docs:
+        try:
+            _validate_upload(doc)
+        except UploadRejected as exc:
+            return {"error": str(exc), "rejected": True, "source_path": str(doc)}
+
+    bench_input = work_dir / "input"
+    bench_input.mkdir()
+    for doc in docs:
+        target = bench_input / doc.name
+        # Avoid name collisions (different paths, same filename inside zips).
+        suffix = 2
+        while target.exists():
+            target = bench_input / f"{doc.stem}_{suffix}{doc.suffix}"
+            suffix += 1
+        shutil.copy2(doc, target)
+
+    out = work_dir / "out"
+    _logger.info(
+        "space_benchmark_upload_requested",
+        extra={"input_count": len(upload_paths), "docs_found": len(docs)},
+    )
+    summary = run_parser_benchmark(bench_input, out, dataset_name="custom_folder")
+
+    headline = {
+        "dataset_name": summary.get("dataset_name"),
+        "document_count": summary.get("document_count"),
+        "mean_quality_score": summary.get("mean_quality_score"),
+        "mean_retrieval_recall_at_1": summary.get("mean_retrieval_recall_at_1"),
+        "mean_retrieval_recall_at_5": summary.get("mean_retrieval_recall_at_5"),
+        "mean_retrieval_mrr": summary.get("mean_retrieval_mrr"),
+        "mean_parser_disagreement_rate": summary.get("mean_parser_disagreement_rate"),
+        "mean_repair_resolution_rate": summary.get("mean_repair_resolution_rate"),
+        "mean_repair_regression_rate": summary.get("mean_repair_regression_rate"),
+        "retrieval_evaluated_count": summary.get("retrieval_evaluated_count"),
+        "documents": [
+            {
+                "doc_id": doc.get("doc_id"),
+                "file_type": doc.get("file_type"),
+                "quality_score": doc.get("quality_score"),
+                "elements": doc.get("element_count"),
+                "tables": doc.get("table_count"),
+                "figures": doc.get("figure_count"),
+                "chunks": doc.get("chunk_count"),
+                "parser_disagreement_rate": doc.get("parser_disagreement_rate"),
+                "repair_resolution_rate": doc.get("repair_resolution_rate"),
+                "elapsed_seconds": doc.get("elapsed_seconds"),
+            }
+            for doc in summary.get("documents") or []
+        ],
+        "note": (
+            "GT-comparison metrics (layout F1, table structure, formula CER) "
+            "are unavailable for arbitrary uploads — they need labelled datasets "
+            "(omnidocbench / doclaynet)."
+        ),
+    }
+    _logger.info(
+        "space_benchmark_upload_complete",
+        extra={k: v for k, v in headline.items() if k != "documents" and not isinstance(v, list)},
+    )
+    return headline
+
+
 def run_benchmark_in_space() -> dict:
     """Run a benchmark against tests/regression/fixtures and return the headline numbers.
 
@@ -651,12 +752,26 @@ with gr.Blocks(title="zeroshotGPU") as demo:
         smoke_output = gr.JSON(label="Smoke report")
     with gr.Tab("Benchmark"):
         gr.Markdown(
-            "Runs the parser benchmark against the committed regression "
-            "fixtures (`tests/regression/fixtures/`). For real-corpus runs, "
-            "use `zsgdp benchmark` from a Dev Mode terminal. API endpoint: "
-            "`/gradio_api/call/run_benchmark_in_space`."
+            "**Two benchmark modes:**\n"
+            "- **Run on regression fixtures** uses the committed seed "
+            "documents (`tests/regression/fixtures/`); reproducible without "
+            "any upload. API: `/gradio_api/call/run_benchmark_in_space`.\n"
+            "- **Run on uploaded corpus** — accepts a `.zip` of documents "
+            "(or a list of files). Returns headline metrics plus a per-doc "
+            "breakdown. GT-comparison metrics (layout F1, table structure, "
+            "formula CER) are NOT computed — those require labelled "
+            "datasets (`omnidocbench` / `doclaynet`) which can be loaded "
+            "via the CLI from a Pro-tier Dev Mode terminal. API: "
+            "`/gradio_api/call/run_benchmark_on_upload`."
+        )
+        with gr.Row():
+            benchmark_button = gr.Button("Run on regression fixtures", variant="primary")
+            benchmark_upload_button = gr.Button("Run on uploaded corpus")
+        benchmark_corpus = gr.File(
+            label="Optional upload — used only when 'Run on uploaded corpus' is clicked",
+            file_types=[".pdf", ".md", ".txt", ".html", ".htm", ".zip"],
+            file_count="multiple",
         )
-        benchmark_button = gr.Button("Run benchmark", variant="primary")
         benchmark_output = gr.JSON(label="Benchmark headline metrics")
     parse_button.click(
         parse_uploaded_document,
@@ -677,6 +792,12 @@ with gr.Blocks(title="zeroshotGPU") as demo:
     )
     smoke_button.click(run_smokes_in_space, inputs=[], outputs=smoke_output, api_name="run_smokes_in_space")
     benchmark_button.click(run_benchmark_in_space, inputs=[], outputs=benchmark_output, api_name="run_benchmark_in_space")
+    benchmark_upload_button.click(
+        run_benchmark_on_upload,
+        inputs=[benchmark_corpus],
+        outputs=benchmark_output,
+        api_name="run_benchmark_on_upload",
+    )
 
 
 if __name__ == "__main__":
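
`run_benchmark_on_upload` delegates archive handling to `_extract_uploads_to_parse`, whose body is not part of this diff. The sketch below shows one way a server-side extraction with a path-traversal guard and a document cap could look. It is a hypothetical stand-in: `safe_extract_docs` and `SUPPORTED_SUFFIXES` are illustrative names, the suffix set mirrors the `gr.File` filter above, and the cap of 20 is the `MAX_BATCH_DOCS` default quoted in the CHANGELOG. It needs Python 3.9+ for `Path.is_relative_to`.

```python
import shutil
import zipfile
from pathlib import Path
from typing import List

SUPPORTED_SUFFIXES = {".pdf", ".md", ".txt", ".html", ".htm"}
MAX_BATCH_DOCS = 20


def safe_extract_docs(zip_path: Path, dest: Path, max_docs: int = MAX_BATCH_DOCS) -> List[Path]:
    """Extract supported documents from a zip, skipping entries that escape `dest`."""
    dest_root = dest.resolve()
    extracted: List[Path] = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() not in SUPPORTED_SUFFIXES:
                continue
            target = (dest_root / info.filename).resolve()
            # Path-traversal guard: entries such as "../x.txt" or absolute
            # paths resolve outside dest_root and are skipped.
            if not target.is_relative_to(dest_root):
                continue
            if len(extracted) >= max_docs:
                break  # batch cap reached
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

Resolving the target before comparing it with the destination root is what catches `..` segments; checking the raw entry name for a prefix would not.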