Spaces:
Running on Zero
Running on Zero
| # Regression fixtures | |
| Each fixture is a `(<name>.input.<ext>, <name>.expected.json)` pair under | |
| `fixtures/`. The runner in `test_regression.py` parses every input through | |
| `parse_document` and compares the resulting `ParsedDocument` against the | |
| snapshot in `<name>.expected.json` with explicit tolerances. | |
| ## Fixture file shape | |
| `<name>.expected.json` has these keys (all optional except `name`): | |
| ```json | |
| { | |
| "name": "human-readable identifier", | |
| "config": "configs/docling.yaml", | |
| "selected_parsers": ["text"], | |
| "tolerances": { | |
| "quality_score_min": 0.85, | |
| "element_count_range": [3, 6], | |
| "table_count": 1, | |
| "figure_count": 0, | |
| "chunk_count_min": 1, | |
| "blocking_failures": false, | |
| "must_contain_markdown": ["# Report", "Apples grow"], | |
| "must_not_contain_markdown": ["TODO", "FIXME"] | |
| } | |
| } | |
| ``` | |
| Tolerance keys (all optional): | |
| - `quality_score_min` (float): assert `parsed.quality_report.score >= value`. | |
| - `quality_score_max` (float): assert `parsed.quality_report.score <= value`. | |
| - `element_count` (int) or `element_count_range` ([min, max]). | |
| - `table_count` (int) or `table_count_range`. | |
| - `figure_count` (int) or `figure_count_range`. | |
| - `chunk_count_min` (int): assert at least N chunks. | |
| - `chunk_count_max` (int): assert at most N chunks. | |
| - `blocking_failures` (bool): assert `quality_report.has_blocking_failures` matches. | |
| - `must_contain_markdown` (list[str]): each string must appear in | |
| `parsed.to_markdown()`. | |
| - `must_not_contain_markdown` (list[str]): each string must NOT appear. | |
| - `must_contain_quality_metrics` (list[str]): each metric key must appear in | |
| `quality_report.metrics`. | |
| - `parser_disagreement_rate_max` (float): assert disagreement <= value. | |
| - `repair_resolution_rate_min` (float): assert resolution >= value. | |
| Missing keys are not asserted (no false failures from over-specification). | |
| ## Adding a fixture | |
| 1. Drop the input document under `fixtures/`. PDFs, markdown, html, txt all | |
| work via the standard pipeline. | |
| 2. Run a one-off `parse_document` against it locally and inspect the output. | |
| 3. Hand-write `<name>.expected.json` with the constraints you want to lock | |
| down. Prefer ranges over exact counts where reasonable variance exists. | |
| 4. Run `python3.11 -m unittest tests.test_regression`. It auto-discovers. | |
| ## Performance baselines (opt-in) | |
| A fixture may include a `performance` block with throughput floors: | |
| ```json | |
| { | |
| "performance": { | |
| "repeats": 2, | |
| "max_elapsed_seconds": 2.0, | |
| "min_pages_per_second": 0.5, | |
| "always_enforce": false | |
| } | |
| } | |
| ``` | |
| Keys: | |
| - `repeats` (int, default 2): number of warm parses to time. The median | |
| elapsed is compared against the floor so a single cold-import outlier | |
| does not flag. | |
| - `max_elapsed_seconds`: parse must finish under this in median. | |
| - `min_pages_per_second`: median pages/sec must meet or beat this. | |
| - `always_enforce` (bool, default false): when true, perf is always checked. | |
| Otherwise perf is gated on `ZSGDP_REGRESSION_PERF=1` so slow CI runners | |
| don't get noisy. Floors should be **catastrophic-regression guards** β set | |
| them ~50β100x slacker than your local median, not tight perf bars. The | |
| point is to catch "parsing a tiny markdown doc now takes 30 seconds," | |
| not to track 5 % perf shifts. | |
| To set a baseline for a new fixture: parse it 5 times locally, take the | |
| median, multiply by ~10β80x for the `max_elapsed_seconds` floor. | |
| ## When a regression fires | |
| The failure message points at the specific tolerance that broke. Don't blindly | |
| loosen the tolerance β investigate whether the regression is real first | |
| (parser-version bump, repair-loop drift, chunk planner change). If the new | |
| behavior is intentional and better, regenerate the snapshot. | |