Regression fixtures
Each fixture is a (<name>.input.<ext>, <name>.expected.json) pair under
fixtures/. The runner in test_regression.py parses every input through
parse_document and compares the resulting ParsedDocument against the
snapshot in <name>.expected.json with explicit tolerances.
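The pairing rule above can be sketched as a small discovery helper. This is a minimal illustration rather than the actual code in test_regression.py; the function name discover_fixtures and its error handling are assumptions.

```python
from pathlib import Path
import tempfile

SUFFIX = ".expected.json"

def discover_fixtures(fixtures_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each <name>.input.<ext> with its <name>.expected.json (hypothetical helper)."""
    pairs = []
    for expected in sorted(fixtures_dir.glob(f"*{SUFFIX}")):
        name = expected.name[: -len(SUFFIX)]
        inputs = sorted(fixtures_dir.glob(f"{name}.input.*"))
        if len(inputs) != 1:
            # Assumption: exactly one input per snapshot; the real runner may differ.
            raise ValueError(f"expected one input for {name!r}, found {len(inputs)}")
        pairs.append((inputs[0], expected))
    return pairs

# Tiny demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.input.md").write_text("# Report\n")
    (root / "report.expected.json").write_text('{"name": "report"}')
    names = [(i.name, e.name) for i, e in discover_fixtures(root)]
    print(names)
```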
Fixture file shape
<name>.expected.json has these keys (all optional except name):
{
  "name": "human-readable identifier",
  "config": "configs/docling.yaml",
  "selected_parsers": ["text"],
  "tolerances": {
    "quality_score_min": 0.85,
    "element_count_range": [3, 6],
    "table_count": 1,
    "figure_count": 0,
    "chunk_count_min": 1,
    "blocking_failures": false,
    "must_contain_markdown": ["# Report", "Apples grow"],
    "must_not_contain_markdown": ["TODO", "FIXME"]
  }
}
Tolerance keys (all optional):
- quality_score_min (float): assert parsed.quality_report.score >= value.
- quality_score_max (float): assert parsed.quality_report.score <= value.
- element_count (int) or element_count_range ([min, max]).
- table_count (int) or table_count_range.
- figure_count (int) or figure_count_range.
- chunk_count_min (int): assert at least N chunks.
- chunk_count_max (int): assert at most N chunks.
- blocking_failures (bool): assert quality_report.has_blocking_failures matches.
- must_contain_markdown (list[str]): each string must appear in parsed.to_markdown().
- must_not_contain_markdown (list[str]): each string must NOT appear.
- must_contain_quality_metrics (list[str]): each metric key must appear in quality_report.metrics.
- parser_disagreement_rate_max (float): assert disagreement <= value.
- repair_resolution_rate_min (float): assert resolution >= value.
Missing keys are not asserted (no false failures from over-specification).
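To make the "missing keys are not asserted" rule concrete, here is a hedged sketch of a checker covering a subset of the keys. It works on a plain dict of observed values rather than a real ParsedDocument, and the function name, the observed-dict layout, and the inclusive treatment of ranges are all assumptions made for illustration.

```python
def check_tolerances(observed: dict, tol: dict) -> list[str]:
    """Return one message per violated tolerance; keys absent from tol are skipped."""
    failures: list[str] = []
    if "quality_score_min" in tol and observed["quality_score"] < tol["quality_score_min"]:
        failures.append("quality_score_min")
    if "quality_score_max" in tol and observed["quality_score"] > tol["quality_score_max"]:
        failures.append("quality_score_max")
    for kind in ("element", "table", "figure"):
        exact_key, range_key = f"{kind}_count", f"{kind}_count_range"
        if exact_key in tol and observed[exact_key] != tol[exact_key]:
            failures.append(exact_key)
        if range_key in tol:
            low, high = tol[range_key]
            # Assumption: both ends of the range are inclusive.
            if not low <= observed[exact_key] <= high:
                failures.append(range_key)
    if "chunk_count_min" in tol and observed["chunk_count"] < tol["chunk_count_min"]:
        failures.append("chunk_count_min")
    if "chunk_count_max" in tol and observed["chunk_count"] > tol["chunk_count_max"]:
        failures.append("chunk_count_max")
    if "blocking_failures" in tol and observed["blocking_failures"] != tol["blocking_failures"]:
        failures.append("blocking_failures")
    for needle in tol.get("must_contain_markdown", []):
        if needle not in observed["markdown"]:
            failures.append(f"must_contain_markdown: {needle!r}")
    for needle in tol.get("must_not_contain_markdown", []):
        if needle in observed["markdown"]:
            failures.append(f"must_not_contain_markdown: {needle!r}")
    return failures

# Made-up observed values checked against the tolerances from the example snapshot.
observed = {
    "quality_score": 0.9, "element_count": 7, "table_count": 1, "figure_count": 0,
    "chunk_count": 2, "blocking_failures": False,
    "markdown": "# Report\n\nApples grow on trees. TODO",
}
tolerances = {
    "quality_score_min": 0.85, "element_count_range": [3, 6], "table_count": 1,
    "figure_count": 0, "chunk_count_min": 1, "blocking_failures": False,
    "must_contain_markdown": ["# Report", "Apples grow"],
    "must_not_contain_markdown": ["TODO", "FIXME"],
}
failures = check_tolerances(observed, tolerances)
print(failures)
```

Collecting every violation into a list, instead of stopping at the first one, is a design choice of this sketch; it lets a single run report all broken tolerances.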
Adding a fixture
- Drop the input document under fixtures/. PDF, Markdown, HTML, and plain-text files all work via the standard pipeline.
- Run a one-off parse_document against it locally and inspect the output.
- Hand-write <name>.expected.json with the constraints you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
- Run python3.11 -m unittest tests.test_regression. The runner auto-discovers the new fixture.
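When hand-writing the snapshot, a draft built from the values seen in the one-off parse can be a convenient starting point. The helper below is purely hypothetical (nothing like it is described in this README) and the slack of one element is an arbitrary choice; the output is meant to be reviewed and edited before it is committed.

```python
import json

def draft_expected(name: str, element_count: int, chunk_count: int, count_slack: int = 1) -> dict:
    """Build a draft snapshot that prefers a range over an exact element count."""
    return {
        "name": name,
        "tolerances": {
            "element_count_range": [max(0, element_count - count_slack),
                                    element_count + count_slack],
            # Only require a chunk if the one-off parse actually produced one.
            "chunk_count_min": 1 if chunk_count else 0,
        },
    }

# Illustrative numbers, as if read off a local parse of a small document.
draft = draft_expected("small-report", element_count=5, chunk_count=2)
print(json.dumps(draft, indent=2))
```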
Performance baselines (opt-in)
A fixture may include a performance block with throughput floors:
{
  "performance": {
    "repeats": 2,
    "max_elapsed_seconds": 2.0,
    "min_pages_per_second": 0.5,
    "always_enforce": false
  }
}
Keys:
- repeats (int, default 2): number of warm parses to time. The median elapsed is compared against the floor so a single cold-import outlier does not flag.
- max_elapsed_seconds: parse must finish under this in median.
- min_pages_per_second: median pages/sec must meet or beat this.
- always_enforce (bool, default false): when true, perf is always checked.
Unless always_enforce is true, perf checks run only when
ZSGDP_REGRESSION_PERF=1 is set, so slow CI runners don't get noisy.
Floors should be catastrophic-regression guards: set them ~50–100x
slacker than your local median, not as tight perf bars. The point is to
catch "parsing a tiny markdown doc now takes 30 seconds," not to track
5% perf shifts.
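The gating and median comparison described above might look roughly like the following. Only the config keys and the ZSGDP_REGRESSION_PERF variable come from this document; the function names, the parse callable, and the absence of a separate warm-up run are assumptions of this sketch.

```python
import os
import statistics
import time

def perf_enabled(performance: dict, environ=os.environ) -> bool:
    """Perf runs when always_enforce is true or the opt-in variable is set to 1."""
    return bool(performance.get("always_enforce", False)) or environ.get("ZSGDP_REGRESSION_PERF") == "1"

def median_elapsed(parse, repeats: int = 2) -> float:
    """Time repeated calls to parse() and return the median in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def check_performance(performance: dict, parse, page_count: int, environ=os.environ) -> list[str]:
    """Return a message per violated limit, or an empty list when perf is not gated on."""
    if not perf_enabled(performance, environ):
        return []
    elapsed = median_elapsed(parse, performance.get("repeats", 2))
    failures = []
    limit = performance.get("max_elapsed_seconds")
    if limit is not None and elapsed > limit:
        failures.append(f"median {elapsed:.3f}s exceeds max_elapsed_seconds={limit}")
    floor = performance.get("min_pages_per_second")
    # Skip the throughput check if the clock reported zero elapsed time.
    if floor is not None and elapsed > 0 and page_count / elapsed < floor:
        failures.append(f"{page_count / elapsed:.2f} pages/s is below min_pages_per_second={floor}")
    return failures

# With the variable unset and always_enforce false, nothing is checked.
print(check_performance({"max_elapsed_seconds": 0.0}, lambda: None, 1, environ={}))
```

Passing environ as a parameter is a convenience of this sketch so the gate can be exercised without touching the real process environment.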
To set a baseline for a new fixture: parse it 5 times locally, take the
median, multiply by ~10–80x for the max_elapsed_seconds floor.
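As a worked example of that arithmetic, the snippet below takes five timings, finds the median, and applies a slack multiplier. The timings are invented for illustration and suggest_max_elapsed is a hypothetical helper, not part of the test suite.

```python
import statistics

def suggest_max_elapsed(timings: list[float], slack: float) -> float:
    """Median of the local timings multiplied by a generous slack factor."""
    return statistics.median(timings) * slack

# Five made-up local timings in seconds; one is a slow outlier that the median ignores.
local_timings = [0.021, 0.019, 0.020, 0.035, 0.020]
low = suggest_max_elapsed(local_timings, slack=10)
high = suggest_max_elapsed(local_timings, slack=80)
print(f"max_elapsed_seconds between {low:.2f} and {high:.2f}")
```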
When a regression fires
The failure message points at the specific tolerance that broke. Don't blindly loosen the tolerance — investigate whether the regression is real first (parser-version bump, repair-loop drift, chunk planner change). If the new behavior is intentional and better, regenerate the snapshot.