# Regression fixtures

Each fixture is a `(.input., .expected.json)` pair under `fixtures/`. The runner in `test_regression.py` parses every input through `parse_document` and compares the resulting `ParsedDocument` against the snapshot in `.expected.json` with explicit tolerances.

## Fixture file shape

`.expected.json` has these keys (all optional except `name`):

```json
{
  "name": "human-readable identifier",
  "config": "configs/docling.yaml",
  "selected_parsers": ["text"],
  "tolerances": {
    "quality_score_min": 0.85,
    "element_count_range": [3, 6],
    "table_count": 1,
    "figure_count": 0,
    "chunk_count_min": 1,
    "blocking_failures": false,
    "must_contain_markdown": ["# Report", "Apples grow"],
    "must_not_contain_markdown": ["TODO", "FIXME"]
  }
}
```

Tolerance keys (all optional):

- `quality_score_min` (float): assert `parsed.quality_report.score >= value`.
- `quality_score_max` (float): assert `parsed.quality_report.score <= value`.
- `element_count` (int) or `element_count_range` ([min, max]).
- `table_count` (int) or `table_count_range`.
- `figure_count` (int) or `figure_count_range`.
- `chunk_count_min` (int): assert at least N chunks.
- `chunk_count_max` (int): assert at most N chunks.
- `blocking_failures` (bool): assert `quality_report.has_blocking_failures` matches.
- `must_contain_markdown` (list[str]): each string must appear in `parsed.to_markdown()`.
- `must_not_contain_markdown` (list[str]): each string must NOT appear.
- `must_contain_quality_metrics` (list[str]): each metric key must appear in `quality_report.metrics`.
- `parser_disagreement_rate_max` (float): assert disagreement <= value.
- `repair_resolution_rate_min` (float): assert resolution >= value.

Missing keys are not asserted (no false failures from over-specification).

## Adding a fixture

1. Drop the input document under `fixtures/`. PDF, Markdown, HTML, and plain-text files all work via the standard pipeline.
2. Run a one-off `parse_document` against it locally and inspect the output.
3. Hand-write `.expected.json` with the constraints you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
4. Run `python3.11 -m unittest tests.test_regression`. It auto-discovers the new fixture.

## Performance baselines (opt-in)

A fixture may include a `performance` block with throughput floors:

```json
{
  "performance": {
    "repeats": 2,
    "max_elapsed_seconds": 2.0,
    "min_pages_per_second": 0.5,
    "always_enforce": false
  }
}
```

Keys:

- `repeats` (int, default 2): number of warm parses to time. The median elapsed time is compared against the floor, so a single cold-import outlier does not trigger a failure.
- `max_elapsed_seconds`: the median parse time must stay under this value.
- `min_pages_per_second`: median pages/sec must meet or beat this.
- `always_enforce` (bool, default false): when true, perf is always checked. Otherwise perf is gated on `ZSGDP_REGRESSION_PERF=1` so slow CI runners don't get noisy.

Floors should be **catastrophic-regression guards**: set them ~50–100x slacker than your local median rather than treating them as tight perf bars. The point is to catch "parsing a tiny markdown doc now takes 30 seconds," not to track 5% perf shifts.

To set a baseline for a new fixture: parse it 5 times locally, take the median, and multiply by ~10–80x for the `max_elapsed_seconds` floor.

## When a regression fires

The failure message points at the specific tolerance that broke. Don't blindly loosen the tolerance; first investigate whether the regression is real (parser-version bump, repair-loop drift, chunk planner change). If the new behavior is intentional and better, regenerate the snapshot.
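
To make the "one message per broken tolerance" idea concrete, here is a minimal sketch of how the tolerance keys from "Fixture file shape" might be evaluated. It is not the runner's actual implementation. The `check_tolerances` helper and the flat `observed` dict (with keys such as `quality_score`, `element_count`, `markdown`) are hypothetical stand-ins for values read off a `ParsedDocument`, and `*_range` bounds are assumed to be inclusive. `must_contain_quality_metrics` is left out for brevity.

```python
from typing import Any


def check_tolerances(observed: dict[str, Any], tolerances: dict[str, Any]) -> list[str]:
    """Return one message per violated tolerance; an empty list means pass.

    `observed` is a hypothetical flat summary of a parse result, not the
    real ParsedDocument API. Keys absent from `tolerances` are never checked.
    """
    failures: list[str] = []

    for key, limit in tolerances.items():
        if key.startswith("must_"):
            continue  # substring checks are handled after this loop
        if key.endswith("_range"):
            value = observed[key[: -len("_range")]]
            low, high = limit
            if not low <= value <= high:  # assumed inclusive on both ends
                failures.append(f"{key}: observed {value}, expected within [{low}, {high}]")
        elif key.endswith("_min"):
            value = observed[key[: -len("_min")]]
            if value < limit:
                failures.append(f"{key}: observed {value}, expected >= {limit}")
        elif key.endswith("_max"):
            value = observed[key[: -len("_max")]]
            if value > limit:
                failures.append(f"{key}: observed {value}, expected <= {limit}")
        elif observed[key] != limit:
            # Exact-match keys such as table_count or blocking_failures.
            failures.append(f"{key}: observed {observed[key]}, expected {limit}")

    markdown = observed.get("markdown", "")
    for needle in tolerances.get("must_contain_markdown", []):
        if needle not in markdown:
            failures.append(f"must_contain_markdown: missing {needle!r}")
    for needle in tolerances.get("must_not_contain_markdown", []):
        if needle in markdown:
            failures.append(f"must_not_contain_markdown: found {needle!r}")

    return failures


# Illustrative values only; a real run would derive these from the parse result.
observed = {
    "quality_score": 0.91,
    "element_count": 7,
    "table_count": 1,
    "blocking_failures": False,
    "markdown": "# Report\n\nApples grow on trees. TODO",
}
tolerances = {
    "quality_score_min": 0.85,
    "element_count_range": [3, 6],
    "table_count": 1,
    "blocking_failures": False,
    "must_contain_markdown": ["# Report", "Apples grow"],
    "must_not_contain_markdown": ["TODO", "FIXME"],
}
failures = check_tolerances(observed, tolerances)
for message in failures:
    print(message)
```

With these sample values, two tolerances break: the element count falls outside its range and the markdown contains a forbidden string. Each message names the key that failed, which is the property that makes a failure easy to triage before touching the snapshot.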
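
The baseline procedure from "Performance baselines" (time several parses, take the median, multiply by a generous slack factor) can be sketched as follows. This is an illustration under stated assumptions, not code from the runner: `suggest_max_elapsed` and `perf_check_enabled` are hypothetical helper names, the default slack of 50 is just one value inside the suggested range, and the workload below is a stand-in for your own zero-argument wrapper around `parse_document`.

```python
import os
import statistics
import time
from typing import Callable


def suggest_max_elapsed(parse_once: Callable[[], object], runs: int = 5,
                        slack: float = 50.0) -> float:
    """Time `parse_once` `runs` times and return median seconds times `slack`."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        parse_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings) * slack


def perf_check_enabled(performance: dict) -> bool:
    """Mirror the documented gate: always_enforce, else ZSGDP_REGRESSION_PERF=1."""
    if performance.get("always_enforce", False):
        return True
    return os.environ.get("ZSGDP_REGRESSION_PERF") == "1"


# Stand-in workload; replace the lambda with a call that parses your fixture.
suggested = suggest_max_elapsed(lambda: sum(range(100_000)))
print(f"suggested max_elapsed_seconds: {suggested:.4f}")
```

Using the median rather than the mean keeps one slow first run from inflating the suggestion, which matches the reasoning given for `repeats`.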