Contributing to zeroshotGPU
Thanks for working on this. Three things to know up front:
- Run `make preflight` before pushing. It's the same suite that runs in pre-push if you have the hooks installed (see below). A green preflight is the local signal that the branch is ready for the Space smoke checklist.
- Keep it dependency-light by default. New runtime dependencies need a corresponding entry in `pyproject.toml` extras and an explicit gate (config flag, lazy import, or feature-detection fallback). The `embedding` extra is the model: opt-in, lazy-imported on first use, raises a clean `RuntimeError` when missing. A minimal sketch of that gating pattern follows this list.
- Don't change schema shapes silently. Bump `zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of `parsed_document.json`, `chunks.jsonl`, etc. changes. See Schema versioning below.
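The sketch below shows the opt-in / lazy-import shape that the `embedding` extra follows; the function and class names here are illustrative, not the repo's actual code.

```python
# Illustrative only: lazy-import gate for an optional extra.

def _require_sentence_transformers():
    """Import the optional backend on first use; fail cleanly if the extra is absent."""
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as exc:
        raise RuntimeError(
            "The embedding retriever needs the 'embedding' extra: "
            "pip install -e '.[embedding]'"
        ) from exc
    return SentenceTransformer


class EmbeddingRetriever:
    def __init__(self, model_name: str):
        # Nothing heavyweight is imported until someone actually constructs the retriever,
        # so a plain `import zsgdp` stays dependency-light.
        self._model = _require_sentence_transformers()(model_name)
```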
Setup
git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"
Optional extras:
- `.[embedding]` – sentence-transformers + transformers for the embedding retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
- `.[gpu_repair]` – transformers for live GPU repair. Only needed when you set `repair.execute_gpu_escalations=true`.
- `.[spaces]` – mirrors the root `requirements.txt` so an editable install matches a Space deploy.
Set up .env for local secrets:
cp .env.example .env
# Fill in HF_TOKEN if you need gated models.
.env is gitignored. CLI and app.py load it automatically; pre-set
environment variables always win, so a Space's secrets never get
overridden by a stray local file.
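For illustration, this is the usual way to get that precedence with python-dotenv; whether the repo uses python-dotenv or its own loader is an assumption here.

```python
# Sketch only: ".env" fills gaps, but variables already set in the environment win.
import os
from dotenv import load_dotenv

load_dotenv(".env", override=False)  # override=False: never clobber pre-set env vars

token = os.environ.get("HF_TOKEN")  # on a Space this comes from the configured secret
```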
Pre-commit / pre-push hooks
python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
Two stages:
- pre-commit – fast static checks: trailing whitespace, end-of-file newline, JSON/YAML syntax, large-file guard, merge-conflict markers. Runs on every `git commit`.
- pre-push – runs `python -m zsgdp.cli preflight`. Same as `make preflight`. Failing this blocks the push.
Skip on a specific commit with git commit --no-verify if you genuinely
need to (e.g. WIP). Skip the pre-push gate with git push --no-verify,
but only if you have a separately verified preflight run.
Running tests
make test # full unittest discover
make regression # snapshot fixture suite
make preflight # everything except the benchmark smoke
make preflight-full # adds an end-to-end benchmark smoke
make benchmark # parses tests/regression/fixtures/ via the CLI
Or directly:
python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark
Performance regressions are gated behind ZSGDP_REGRESSION_PERF=1:
ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
See tests/regression/README.md for the
fixture format including the performance block.
Adding a regression fixture
- Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
- Parse it once locally and inspect the output: `python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity`
- Hand-write `tests/regression/fixtures/<name>.expected.json` with the tolerances you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
- Optional: add a `performance` block with `max_elapsed_seconds` set to ~50–100x your local median (catastrophic-regression guard, not a tight bar).
- Run `make regression` to confirm the fixture is picked up.
Adding a parser adapter
- Subclass `BaseParser` in `zsgdp/parsers/<name>_parser.py` (or extend `external.py` for shell-out adapters).
- Set `name`, `supported_file_types`, implement `available()` and `parse(path, profile, config, *, pages=None)` (a minimal skeleton follows this list).
- Register in `zsgdp/parsers/registry.py`.
- If the parser produces Markdown, write a normalizer under `zsgdp/normalize/normalize_<name>.py` that returns a `ParseCandidate` via `normalize_markdown_candidate(...)`.
- Add a config block to `configs/default.yaml` with `enabled: false` plus any CLI flags the adapter needs.
- Add the dependency to `pyproject.toml` as an optional extra. Don't pin it in the top-level `requirements.txt` unless it's free to install on every Space build.
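A skeleton of that adapter shape, assuming the interface named above; the import path for `BaseParser`, the hypothetical `mytool` dependency, and the return value are assumptions, so mirror an existing adapter in `zsgdp/parsers/` for the real contract.

```python
# Hypothetical skeleton for zsgdp/parsers/mytool_parser.py.
from pathlib import Path

from zsgdp.parsers import BaseParser  # assumed import location; adjust to the real one


class MyToolParser(BaseParser):
    name = "mytool"
    supported_file_types = (".pdf",)

    def available(self) -> bool:
        # Feature-detect the optional dependency instead of importing at module load.
        try:
            import mytool  # noqa: F401  (hypothetical optional dependency)
            return True
        except ImportError:
            return False

    def parse(self, path: Path, profile, config, *, pages=None):
        import mytool

        markdown = mytool.convert(str(path), pages=pages)  # hypothetical API
        return markdown  # hand the raw output to the normalize_<name> step
```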
Adding a metric
Pure metrics live under zsgdp/verify/:
- Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so the same metric works on per-parser candidate snapshots, not just the merged document. A small sketch follows this list.
- Pin definitions in the module docstring: exact denominator, handling of empty inputs, what each return key means.
- Surface in `zsgdp/benchmarks/parser_quality.py`:
  - Add per-document fields to the `doc_record`.
  - Add aggregated means to the top-level `summary` dict.
  - Add a per-document CSV writer if it has detail worth its own file.
- Add tests for: perfect input, no-match input, partial overlap, vacuous empty/empty case, and a benchmark-integration test that asserts the metric appears in `summary["documents"][0]`.
Schema versioning
`zsgdp.schema.SCHEMA_VERSION` lives in `zsgdp/schema/document.py`. It's surfaced into `artifact_manifest.json` as `parsed_document_schema_version` so a consumer reading old output can gate on it.
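For example, a consumer-side gate might look like this; the manifest key is the one named above, while the value format and everything else is illustrative.

```python
# Illustrative consumer-side gate on the schema version recorded in artifact_manifest.json.
import json
from pathlib import Path

EXPECTED_MAJOR = 1  # the major version this consumer was written against (example value)

manifest = json.loads(Path("artifact_manifest.json").read_text())
version = manifest["parsed_document_schema_version"]  # e.g. "1.1"
major = int(str(version).split(".")[0])

if major != EXPECTED_MAJOR:
    raise RuntimeError(f"Unsupported parsed_document schema version: {version}")
```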
Bump rules:
- Additive change (new optional field with a default) → bump the minor version (1.0 → 1.1).
- Breaking change (renamed/removed field, semantics changed) → bump the major version (1.0 → 2.0). Update the regression fixtures in the same PR; downstream consumers will need a migration.
- No change → leave it alone.
When you bump, add an entry to CHANGELOG.md under
"### Schema" with the version and what changed.
Logging
Use `from zsgdp.logging_config import get_logger` then `logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with structured `extra={...}` fields rather than f-string-formatted messages where possible; the JSON formatter promotes `extra` keys to top-level fields so the HF Spaces logs page is greppable.
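For example (the field names inside `extra` are just illustrative):

```python
from zsgdp.logging_config import get_logger

logger = get_logger(__name__)

# Structured fields become top-level JSON keys under ZSGDP_LOG_JSON=1,
# so they stay greppable in the Space logs.
logger.info(
    "parser finished",
    extra={"parser": "docling", "pages": 12, "elapsed_s": 3.4},  # example fields
)
```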
Default log level is WARNING (CLI summaries unaffected). Opt in with
ZSGDP_LOG_LEVEL=INFO and ZSGDP_LOG_JSON=1 for Space-style output.
Pull request checklist
Before opening a PR:
- `make preflight` passes locally.
- If you added a metric, an adapter, or changed the schema, you updated `CHANGELOG.md`.
- If you changed parser behavior, you ran `make regression` and any fixture drift is intentional (and the snapshot was regenerated explicitly).
- If your change touches GPU/model code paths, you flagged it for Space-side smoke testing in the PR description (the smoke checklist covers what to run).
- You did not commit `.env` or any secret. The `.gitignore` should catch this; if you suspect a leak, treat the token as compromised and rotate it.
Architecture quick map
- `zsgdp/profiling/` – page-level features and labels.
- `zsgdp/routing/` – deterministic page → expert mapping.
- `zsgdp/parsers/` – adapters; one canonical schema regardless of source.
- `zsgdp/normalize/` – convert each parser's output into the schema.
- `zsgdp/merge/` – align candidates, dedupe, detect conflicts.
- `zsgdp/verify/` – coverage, reading order, table/figure/formula/chunk quality, GT-comparison metrics (layout F1, table structure, formula CER, retrieval recall), parser disagreement and repair success rates.
- `zsgdp/repair/` – deterministic header/table fixes plus GPU escalation that dispatches to `gpu/worker.py`.
- `zsgdp/chunking/` – agentic planner + structure-aware / parent-child / table / figure / page chunk builders, with semantic / late / vision-guided / proposition deterministic stubs.
- `zsgdp/gpu/` – task planning, batching, dry-run worker, transformers and vLLM clients.
- `zsgdp/benchmarks/` – dataset loaders, metric runners, ablation, cross-dataset comparison, retrieval (lexical + embedding).
- `zsgdp/cli.py` – single entry point exposing all of the above.
- `app.py` – Gradio Space front-end.
The full spec lives in `zero_shot_gpu_document_parser_project_spec.md`. The 2000-line read isn't required to contribute, but §10 (schema) and §17 (chunking ladder) are worth skimming if you're touching those modules.