Arjunvir Singh
Initial commit: zeroshotGPU MVP with full eval surface
db06ffa


Regression fixtures

Each fixture is a (<name>.input.<ext>, <name>.expected.json) pair under fixtures/. The runner in test_regression.py parses every input through parse_document and compares the resulting ParsedDocument against the snapshot in <name>.expected.json with explicit tolerances.

Fixture file shape

<name>.expected.json has these keys (all optional except name):

{
  "name": "human-readable identifier",
  "config": "configs/docling.yaml",
  "selected_parsers": ["text"],
  "tolerances": {
    "quality_score_min": 0.85,
    "element_count_range": [3, 6],
    "table_count": 1,
    "figure_count": 0,
    "chunk_count_min": 1,
    "blocking_failures": false,
    "must_contain_markdown": ["# Report", "Apples grow"],
    "must_not_contain_markdown": ["TODO", "FIXME"]
  }
}

Tolerance keys (all optional):

  • quality_score_min (float): assert parsed.quality_report.score >= value.
  • quality_score_max (float): assert parsed.quality_report.score <= value.
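  • element_count (int) or element_count_range ([min, max]): assert the exact element count, or that it falls within the range.
  • table_count (int) or table_count_range ([min, max]): same, for tables.
  • figure_count (int) or figure_count_range ([min, max]): same, for figures.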
  • chunk_count_min (int): assert at least N chunks.
  • chunk_count_max (int): assert at most N chunks.
  • blocking_failures (bool): assert quality_report.has_blocking_failures matches.
  • must_contain_markdown (list[str]): each string must appear in parsed.to_markdown().
  • must_not_contain_markdown (list[str]): each string must NOT appear.
  • must_contain_quality_metrics (list[str]): each metric key must appear in quality_report.metrics.
  • parser_disagreement_rate_max (float): assert disagreement <= value.
  • repair_resolution_rate_min (float): assert resolution >= value.

Missing keys are not asserted (no false failures from over-specification).
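
To illustrate the "missing keys are not asserted" rule, here is a minimal sketch of how a subset of these tolerances could be checked. It is not the code in test_regression.py. The check_tolerances function and the flat observed dict (quality_score, element_count, table_count, figure_count, markdown) are hypothetical stand-ins for values read off a ParsedDocument, and treating ranges as inclusive is an assumption.

```python
from typing import Any


def check_tolerances(observed: dict[str, Any], tolerances: dict[str, Any]) -> list[str]:
    """Return one message per violated tolerance. Keys absent from
    `tolerances` are skipped, so under-specified fixtures never fail."""
    failures: list[str] = []

    # Lower and upper bounds on the quality score.
    if "quality_score_min" in tolerances:
        if observed["quality_score"] < tolerances["quality_score_min"]:
            failures.append(f"quality_score_min: got {observed['quality_score']}")
    if "quality_score_max" in tolerances:
        if observed["quality_score"] > tolerances["quality_score_max"]:
            failures.append(f"quality_score_max: got {observed['quality_score']}")

    # Exact counts or [min, max] ranges (inclusive here, which is an assumption).
    for kind in ("element", "table", "figure"):
        exact_key, range_key = f"{kind}_count", f"{kind}_count_range"
        if exact_key in tolerances and observed[exact_key] != tolerances[exact_key]:
            failures.append(
                f"{exact_key}: got {observed[exact_key]}, expected {tolerances[exact_key]}"
            )
        if range_key in tolerances:
            low, high = tolerances[range_key]
            if not low <= observed[exact_key] <= high:
                failures.append(
                    f"{range_key}: got {observed[exact_key]}, expected {low}..{high}"
                )

    # Required and forbidden substrings in the rendered markdown.
    for needle in tolerances.get("must_contain_markdown", []):
        if needle not in observed["markdown"]:
            failures.append(f"must_contain_markdown: missing {needle!r}")
    for needle in tolerances.get("must_not_contain_markdown", []):
        if needle in observed["markdown"]:
            failures.append(f"must_not_contain_markdown: found {needle!r}")

    return failures


# Hypothetical observed values checked against the example tolerances above.
example_observed = {
    "quality_score": 0.9,
    "element_count": 5,
    "table_count": 2,
    "figure_count": 0,
    "markdown": "# Report\nApples grow on trees. TODO",
}
example_tolerances = {
    "quality_score_min": 0.85,
    "element_count_range": [3, 6],
    "table_count": 1,
    "figure_count": 0,
    "must_contain_markdown": ["# Report", "Apples grow"],
    "must_not_contain_markdown": ["TODO", "FIXME"],
}
for message in check_tolerances(example_observed, example_tolerances):
    print(message)
```

With these example values the sketch reports two failures: the table count (2 instead of 1) and the forbidden "TODO" string. Each message names the tolerance key that broke.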

Adding a fixture

  1. Drop the input document under fixtures/ as <name>.input.<ext>. PDF, Markdown, HTML, and plain-text files all work via the standard pipeline.
  2. Run a one-off parse_document against it locally and inspect the output.
  3. Hand-write <name>.expected.json with the constraints you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
  4. Run python3.11 -m unittest tests.test_regression. The runner auto-discovers the new fixture pair.
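
As a rough picture of what auto-discovery involves, the sketch below pairs every <name>.expected.json with its <name>.input.<ext> sibling. The discover_fixtures helper is hypothetical and is not taken from test_regression.py. Raising an error when an input is missing or ambiguous is also an assumption about how the runner should behave.

```python
from pathlib import Path

EXPECTED_SUFFIX = ".expected.json"


def discover_fixtures(fixtures_dir: Path) -> list[tuple[Path, Path]]:
    """Return (input_path, expected_path) pairs found in fixtures_dir."""
    pairs: list[tuple[Path, Path]] = []
    for expected in sorted(fixtures_dir.glob(f"*{EXPECTED_SUFFIX}")):
        # Strip the suffix to recover <name>, then look for <name>.input.<ext>.
        name = expected.name[: -len(EXPECTED_SUFFIX)]
        inputs = sorted(fixtures_dir.glob(f"{name}.input.*"))
        if len(inputs) != 1:
            raise ValueError(
                f"{name}: expected exactly one input file, found {len(inputs)}"
            )
        pairs.append((inputs[0], expected))
    return pairs
```

A runner could then loop over the returned pairs and generate one test case per fixture, which is why adding a fixture needs no manual registration.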

Performance baselines (opt-in)

A fixture may include a performance block with throughput floors:

{
  "performance": {
    "repeats": 2,
    "max_elapsed_seconds": 2.0,
    "min_pages_per_second": 0.5,
    "always_enforce": false
  }
}

Keys:

  • repeats (int, default 2): number of warm parses to time. The median elapsed time is compared against the limits below, so a single cold-import outlier does not trigger a failure.
  • max_elapsed_seconds: the median elapsed time must stay under this value.
  • min_pages_per_second: median pages/sec must meet or beat this.
  • always_enforce (bool, default false): when true, perf is always checked.

When always_enforce is false, perf checks are gated on ZSGDP_REGRESSION_PERF=1 so that slow CI runners don't produce noisy failures. Limits should be catastrophic-regression guards, not tight perf bars: set them ~50–100x slacker than your local median. The point is to catch "parsing a tiny markdown doc now takes 30 seconds," not to track 5% perf shifts.
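
The gating and median logic described above could look roughly like the following sketch. The function names (perf_enabled, time_parses, check_performance) are hypothetical and are not the runner's API. The untimed warm-up call is one reading of "warm parses", and treating a median exactly equal to the limit as passing is likewise an assumption.

```python
import os
import statistics
import time
from typing import Callable, Mapping


def perf_enabled(performance: Mapping, environ: Mapping = os.environ) -> bool:
    """Perf runs when always_enforce is true or ZSGDP_REGRESSION_PERF=1."""
    if performance.get("always_enforce", False):
        return True
    return environ.get("ZSGDP_REGRESSION_PERF") == "1"


def time_parses(parse: Callable[[], object], repeats: int = 2) -> list[float]:
    """Time `repeats` calls after one untimed warm-up call (an assumption)."""
    parse()  # warm-up absorbs cold-import cost
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse()
        samples.append(time.perf_counter() - start)
    return samples


def check_performance(
    performance: Mapping, samples: list[float], page_count: int
) -> list[str]:
    """Compare median timings against the configured limits."""
    failures = []
    median_seconds = statistics.median(samples)
    median_pps = statistics.median([page_count / s for s in samples])

    limit = performance.get("max_elapsed_seconds")
    if limit is not None and median_seconds > limit:
        failures.append(f"max_elapsed_seconds: median {median_seconds:.3f}s > {limit}s")

    floor = performance.get("min_pages_per_second")
    if floor is not None and median_pps < floor:
        failures.append(f"min_pages_per_second: median {median_pps:.3f} < {floor}")

    return failures
```

Using the median rather than the mean means one slow sample among several cannot fail the check on its own, which matches the intent of guarding only against catastrophic regressions.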

To set a baseline for a new fixture: parse it 5 times locally, take the median, and multiply by ~50–100x to get the max_elapsed_seconds limit.
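
The arithmetic can be sketched as below. The suggest_max_elapsed helper, the sample timings, and the slack factor of 50 are all illustrative assumptions rather than measured values or project defaults.

```python
import statistics


def suggest_max_elapsed(samples: list[float], slack: float = 50.0) -> float:
    """Median of local timings multiplied by a generous slack factor."""
    return round(statistics.median(samples) * slack, 2)


# Five hypothetical local timings, in seconds.
local_timings = [0.021, 0.019, 0.020, 0.024, 0.018]
suggested_limit = suggest_max_elapsed(local_timings)
print(suggested_limit)
```

Here the median is 0.020 s, so a slack factor of 50 suggests a max_elapsed_seconds of about 1.0.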

When a regression fires

The failure message points at the specific tolerance that broke. Don't blindly loosen the tolerance. First investigate whether the regression is real (parser-version bump, repair-loop drift, chunk planner change). If the new behavior is intentional and better, update <name>.expected.json to match it.