Arjunvir Singh
Initial commit: zeroshotGPU MVP with full eval surface
db06ffa
# Regression fixtures
Each fixture is a `(<name>.input.<ext>, <name>.expected.json)` pair under
`fixtures/`. The runner in `test_regression.py` parses every input through
`parse_document` and compares the resulting `ParsedDocument` against the
snapshot in `<name>.expected.json` with explicit tolerances.
## Fixture file shape
`<name>.expected.json` has these keys (all optional except `name`):
```json
{
"name": "human-readable identifier",
"config": "configs/docling.yaml",
"selected_parsers": ["text"],
"tolerances": {
"quality_score_min": 0.85,
"element_count_range": [3, 6],
"table_count": 1,
"figure_count": 0,
"chunk_count_min": 1,
"blocking_failures": false,
"must_contain_markdown": ["# Report", "Apples grow"],
"must_not_contain_markdown": ["TODO", "FIXME"]
}
}
```
Tolerance keys (all optional):
- `quality_score_min` (float): assert `parsed.quality_report.score >= value`.
- `quality_score_max` (float): assert `parsed.quality_report.score <= value`.
- `element_count` (int) or `element_count_range` ([min, max]).
- `table_count` (int) or `table_count_range`.
- `figure_count` (int) or `figure_count_range`.
- `chunk_count_min` (int): assert at least N chunks.
- `chunk_count_max` (int): assert at most N chunks.
- `blocking_failures` (bool): assert `quality_report.has_blocking_failures` matches.
- `must_contain_markdown` (list[str]): each string must appear in
`parsed.to_markdown()`.
- `must_not_contain_markdown` (list[str]): each string must NOT appear.
- `must_contain_quality_metrics` (list[str]): each metric key must appear in
`quality_report.metrics`.
- `parser_disagreement_rate_max` (float): assert disagreement <= value.
- `repair_resolution_rate_min` (float): assert resolution >= value.
Missing keys are not asserted (no false failures from over-specification).
## Adding a fixture
1. Drop the input document under `fixtures/`. PDFs, markdown, html, txt all
work via the standard pipeline.
2. Run a one-off `parse_document` against it locally and inspect the output.
3. Hand-write `<name>.expected.json` with the constraints you want to lock
down. Prefer ranges over exact counts where reasonable variance exists.
4. Run `python3.11 -m unittest tests.test_regression`. It auto-discovers.
## Performance baselines (opt-in)
A fixture may include a `performance` block with throughput floors:
```json
{
"performance": {
"repeats": 2,
"max_elapsed_seconds": 2.0,
"min_pages_per_second": 0.5,
"always_enforce": false
}
}
```
Keys:
- `repeats` (int, default 2): number of warm parses to time. The median
elapsed is compared against the floor so a single cold-import outlier
does not flag.
- `max_elapsed_seconds`: parse must finish under this in median.
- `min_pages_per_second`: median pages/sec must meet or beat this.
- `always_enforce` (bool, default false): when true, perf is always checked.
Otherwise perf is gated on `ZSGDP_REGRESSION_PERF=1` so slow CI runners
don't get noisy. Floors should be **catastrophic-regression guards** β€” set
them ~50–100x slacker than your local median, not tight perf bars. The
point is to catch "parsing a tiny markdown doc now takes 30 seconds,"
not to track 5 % perf shifts.
To set a baseline for a new fixture: parse it 5 times locally, take the
median, multiply by ~10–80x for the `max_elapsed_seconds` floor.
## When a regression fires
The failure message points at the specific tolerance that broke. Don't blindly
loosen the tolerance β€” investigate whether the regression is real first
(parser-version bump, repair-loop drift, chunk planner change). If the new
behavior is intentional and better, regenerate the snapshot.