Arjunvir Singh

Initial commit: zeroshotGPU MVP with full eval surface

db06ffa 24 days ago

	# Regression fixtures

	Each fixture is a `(<name>.input.<ext>, <name>.expected.json)` pair under
	`fixtures/`. The runner in `test_regression.py` parses every input through
	`parse_document` and compares the resulting `ParsedDocument` against the
	snapshot in `<name>.expected.json` with explicit tolerances.
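
The pairing convention above can be sketched as follows. This is a minimal illustration, not the runner's actual code: `discover_fixtures` is a hypothetical helper, and the rule that each name has exactly one input file is an assumption.

```python
from pathlib import Path
from typing import Iterator, Tuple

EXPECTED_SUFFIX = ".expected.json"


def discover_fixtures(fixtures_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (input_path, expected_path) pairs found in fixtures_dir."""
    for expected in sorted(fixtures_dir.glob(f"*{EXPECTED_SUFFIX}")):
        # Strip the suffix to recover <name>.
        name = expected.name[: -len(EXPECTED_SUFFIX)]
        # Match <name>.input.<ext> for any extension.
        inputs = sorted(fixtures_dir.glob(f"{name}.input.*"))
        if len(inputs) != 1:
            # Assumption: a fixture name maps to exactly one input document.
            raise ValueError(
                f"expected exactly one input for {name!r}, found {len(inputs)}"
            )
        yield inputs[0], expected
```

Raising on a missing or duplicated input keeps a half-added fixture from being silently skipped.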

	## Fixture file shape

	`<name>.expected.json` has these keys (all optional except `name`):

	```json
	{
	  "name": "human-readable identifier",
	  "config": "configs/docling.yaml",
	  "selected_parsers": ["text"],
	  "tolerances": {
	    "quality_score_min": 0.85,
	    "element_count_range": [3, 6],
	    "table_count": 1,
	    "figure_count": 0,
	    "chunk_count_min": 1,
	    "blocking_failures": false,
	    "must_contain_markdown": ["# Report", "Apples grow"],
	    "must_not_contain_markdown": ["TODO", "FIXME"]
	  }
	}
	```

	Tolerance keys (all optional):

	- `quality_score_min` (float): assert `parsed.quality_report.score >= value`.
	- `quality_score_max` (float): assert `parsed.quality_report.score <= value`.
	- `element_count` (int) for an exact match, or `element_count_range` ([min, max]).
	- `table_count` (int) or `table_count_range` ([min, max]).
	- `figure_count` (int) or `figure_count_range` ([min, max]).
	- `chunk_count_min` (int): assert at least N chunks.
	- `chunk_count_max` (int): assert at most N chunks.
	- `blocking_failures` (bool): assert `quality_report.has_blocking_failures` matches.
	- `must_contain_markdown` (list[str]): each string must appear in
	`parsed.to_markdown()`.
	- `must_not_contain_markdown` (list[str]): each string must NOT appear.
	- `must_contain_quality_metrics` (list[str]): each metric key must appear in
	`quality_report.metrics`.
	- `parser_disagreement_rate_max` (float): assert disagreement <= value.
	- `repair_resolution_rate_min` (float): assert resolution >= value.

	Missing keys are not asserted, so a fixture constrains only what it explicitly
	lists and never fails on a property it did not specify.
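
A minimal sketch of the "only assert what is present" behavior, covering a subset of the keys. `check_tolerances` and the flat `observed` dict are hypothetical stand-ins for the runner and for `ParsedDocument`, and treating `[min, max]` as inclusive on both ends is an assumption.

```python
from typing import Any, Dict, List


def check_tolerances(observed: Dict[str, Any], tolerances: Dict[str, Any]) -> List[str]:
    """Return failure messages; an empty list means every listed check passed."""
    failures: List[str] = []

    # Score bounds: each is checked only if its key is present.
    if "quality_score_min" in tolerances:
        low = tolerances["quality_score_min"]
        if observed["quality_score"] < low:
            failures.append(f"quality_score {observed['quality_score']} < {low}")
    if "quality_score_max" in tolerances:
        high = tolerances["quality_score_max"]
        if observed["quality_score"] > high:
            failures.append(f"quality_score {observed['quality_score']} > {high}")

    # Exact counts and ranges for elements, tables, and figures.
    for kind in ("element", "table", "figure"):
        exact_key, range_key = f"{kind}_count", f"{kind}_count_range"
        if exact_key in tolerances and observed[exact_key] != tolerances[exact_key]:
            failures.append(
                f"{exact_key} {observed[exact_key]} != {tolerances[exact_key]}"
            )
        if range_key in tolerances:
            lo, hi = tolerances[range_key]
            # Assumption: both bounds are inclusive.
            if not lo <= observed[exact_key] <= hi:
                failures.append(f"{exact_key} {observed[exact_key]} not in [{lo}, {hi}]")

    # Substring checks against the rendered markdown.
    markdown = observed.get("markdown", "")
    for needle in tolerances.get("must_contain_markdown", []):
        if needle not in markdown:
            failures.append(f"markdown is missing {needle!r}")
    for needle in tolerances.get("must_not_contain_markdown", []):
        if needle in markdown:
            failures.append(f"markdown must not contain {needle!r}")

    return failures
```

Collecting every failure instead of stopping at the first one makes it easier to see which tolerance broke when several drift together.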

	## Adding a fixture

	1. Drop the input document under `fixtures/`. PDF, Markdown, HTML, and
	plain-text inputs all work via the standard pipeline.
	2. Run a one-off `parse_document` against it locally and inspect the output.
	3. Hand-write `<name>.expected.json` with the constraints you want to lock
	down. Prefer ranges over exact counts where reasonable variance exists.
	4. Run `python3.11 -m unittest tests.test_regression`. The runner
	auto-discovers the new fixture.
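
For step 3, one way to get a starting point is to turn the values you observed in step 2 into a loose skeleton and then edit it by hand. The sketch below is only an illustration: `draft_expected` is a hypothetical helper, and its default margins are arbitrary assumptions, not project policy.

```python
import json
from typing import Any, Dict


def draft_expected(
    name: str,
    element_count: int,
    quality_score: float,
    count_slack: int = 1,
    score_margin: float = 0.05,
) -> Dict[str, Any]:
    """Build a draft snapshot with ranges around the observed values."""
    return {
        "name": name,
        "tolerances": {
            # Leave headroom below the observed score.
            "quality_score_min": round(quality_score - score_margin, 2),
            # Prefer a range over an exact count.
            "element_count_range": [
                max(0, element_count - count_slack),
                element_count + count_slack,
            ],
        },
    }


# Example: values read off a local parse of a small report.
print(json.dumps(draft_expected("report", element_count=4, quality_score=0.91), indent=2))
```

Review the draft before committing it: tighten the ranges that should be stable and drop any key you do not want to lock down.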

	## Performance baselines (opt-in)

	A fixture may include a `performance` block with timing and throughput limits:

	```json
	{
	  "performance": {
	    "repeats": 2,
	    "max_elapsed_seconds": 2.0,
	    "min_pages_per_second": 0.5,
	    "always_enforce": false
	  }
	}
	```

	Keys:

	- `repeats` (int, default 2): number of warm parses to time. The median
	elapsed time is compared against the limits so that a single cold-import
	outlier does not trigger a failure.
	- `max_elapsed_seconds` (float): the median elapsed time must stay under this.
	- `min_pages_per_second` (float): median pages/sec must meet or beat this.
	- `always_enforce` (bool, default false): when true, perf is always checked.

	Otherwise perf checks run only when `ZSGDP_REGRESSION_PERF=1` is set, so slow
	CI runners don't produce noisy failures. Limits should be
	catastrophic-regression guards, not tight perf bars: set them ~50–100x
	slacker than your local median. The point is to catch "parsing a tiny
	markdown doc now takes 30 seconds," not to track 5% perf shifts.
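
The gating and median comparison described above could look roughly like this. All three function names are hypothetical. The sketch assumes a run exactly at `max_elapsed_seconds` still passes, and it derives pages/sec from the median elapsed time, which is a simplification.

```python
import os
import statistics
import time
from typing import Any, Callable, Dict, List


def perf_enabled(performance: Dict[str, Any]) -> bool:
    """Perf checks run if the fixture forces them or the env gate is set."""
    forced = bool(performance.get("always_enforce", False))
    return forced or os.environ.get("ZSGDP_REGRESSION_PERF") == "1"


def median_elapsed(parse: Callable[[], object], repeats: int = 2) -> float:
    """Time `parse` `repeats` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def check_performance(
    performance: Dict[str, Any], median_seconds: float, page_count: int
) -> List[str]:
    """Compare a measured median against the fixture's optional limits."""
    failures: List[str] = []
    max_elapsed = performance.get("max_elapsed_seconds")
    if max_elapsed is not None and median_seconds > max_elapsed:
        failures.append(f"median {median_seconds:.3f}s > {max_elapsed}s")
    min_pps = performance.get("min_pages_per_second")
    if min_pps is not None and median_seconds > 0:
        # Simplification: throughput computed from the median elapsed time.
        pages_per_second = page_count / median_seconds
        if pages_per_second < min_pps:
            failures.append(f"{pages_per_second:.3f} pages/s < {min_pps}")
    return failures
```

Keeping measurement separate from comparison means the comparison can be unit-tested with fixed numbers, without timing anything.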

	To set a baseline for a new fixture: parse it 5 times locally, take the
	median elapsed time, and multiply it by ~50–100x to get `max_elapsed_seconds`.
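
As a worked example of the baseline arithmetic, the sketch below derives limits from a handful of local timings. `suggest_limits` is a hypothetical helper and the slack factor is a parameter you choose. Dividing the observed throughput by the same factor to get `min_pages_per_second` is an assumption, since the text above only describes the elapsed-time limit.

```python
import statistics
from typing import Dict, Sequence


def suggest_limits(
    timings_seconds: Sequence[float], page_count: int, slack: float
) -> Dict[str, float]:
    """Turn local timings into deliberately loose regression limits."""
    median = statistics.median(timings_seconds)
    return {
        # Slower than median * slack counts as a catastrophic regression.
        "max_elapsed_seconds": median * slack,
        # Assumption: apply the same slack, inverted, to throughput.
        "min_pages_per_second": (page_count / median) / slack,
    }


# Five local runs of a one-page document, in seconds (illustrative values).
local_runs = [0.04, 0.05, 0.03, 0.06, 0.04]
print(suggest_limits(local_runs, page_count=1, slack=50))
```

Round the results to convenient values before writing them into the `performance` block.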

	## When a regression fires

	The failure message points at the specific tolerance that broke. Don't blindly
	loosen the tolerance. First investigate whether the regression is real
	(parser-version bump, repair-loop drift, chunk planner change). If the new
	behavior is intentional and better, update the snapshot in
	`<name>.expected.json`.