# Contributing to zeroshotGPU

Thanks for working on this. Three things to know up front:

1. **Run `make preflight` before pushing.** It's the same suite that runs in pre-push if you have the hooks installed (see below). A green preflight is the local signal that the branch is ready for the [Space smoke checklist](docs/space_smoke.md).
2. **Keep it dependency-light by default.** New runtime dependencies need a corresponding entry in `pyproject.toml` extras and an explicit gate (config flag, lazy import, or feature-detection fallback). The `embedding` extra is the model: opt-in, lazy-imported on first use, raises a clean `RuntimeError` when missing.
3. **Don't change schema shapes silently.** Bump `zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of `parsed_document.json`, `chunks.jsonl`, etc. changes. See [Schema versioning](#schema-versioning) below.

---

## Setup

```bash
git clone
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"
```

Optional extras:

- `.[embedding]` — sentence-transformers + transformers for the embedding retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
- `.[gpu_repair]` — transformers for live GPU repair. Only needed when you set `repair.execute_gpu_escalations=true`.
- `.[spaces]` — mirrors the root `requirements.txt` so an editable install matches a Space deploy.

Set up `.env` for local secrets:

```bash
cp .env.example .env
# Fill in HF_TOKEN if you need gated models.
```

`.env` is gitignored. The CLI and `app.py` load it automatically; pre-set environment variables always win, so a Space's secrets never get overridden by a stray local file.
---

## Pre-commit / pre-push hooks

```bash
python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
```

Two stages:

- **pre-commit** — fast static checks: trailing whitespace, end-of-file newline, JSON/YAML syntax, large-file guard, merge-conflict markers. Runs on every `git commit`.
- **pre-push** — runs `python -m zsgdp.cli preflight`. Same as `make preflight`. Failing this blocks the push.

Skip on a specific commit with `git commit --no-verify` if you genuinely need to (e.g. WIP). Skip the pre-push gate with `git push --no-verify`, but only if you have a separately verified preflight run.

---

## Running tests

```bash
make test            # full unittest discover
make regression      # snapshot fixture suite
make preflight       # everything except the benchmark smoke
make preflight-full  # adds an end-to-end benchmark smoke
make benchmark       # parses tests/regression/fixtures/ via the CLI
```

Or directly:

```bash
python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark
```

Performance regressions are gated behind `ZSGDP_REGRESSION_PERF=1`:

```bash
ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
```

See [tests/regression/README.md](tests/regression/README.md) for the fixture format, including the `performance` block.

---

## Adding a regression fixture

1. Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
2. Parse it once locally and inspect the output:

   ```bash
   python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
   ```

3. Hand-write `tests/regression/fixtures/<name>.expected.json` with the tolerances you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
4. Optional: add a `performance` block with `max_elapsed_seconds` set to ~50–100x your local median (a catastrophic-regression guard, not a tight bar).
5. Run `make regression` to confirm the fixture is picked up.
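To make the shape concrete, an `expected.json` along the lines of step 3 might look like this — the field names below are illustrative guesses, not the real fixture keys; the authoritative format (including the `performance` block) is documented in [tests/regression/README.md](tests/regression/README.md):

```json
{
  "block_count": {"min": 40, "max": 60},
  "table_count": 2,
  "performance": {"max_elapsed_seconds": 120}
}
```

Note the range for the noisy count and the exact value for the stable one — that's the "prefer ranges where reasonable variance exists" rule in practice.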
---

## Adding a parser adapter

1. Subclass `BaseParser` in `zsgdp/parsers/_parser.py` (or extend `external.py` for shell-out adapters).
2. Set `name` and `supported_file_types`; implement `available()` and `parse(path, profile, config, *, pages=None)`.
3. Register it in `zsgdp/parsers/registry.py`.
4. If the parser produces Markdown, write a normalizer under `zsgdp/normalize/normalize_<parser>.py` that returns a `ParseCandidate` via `normalize_markdown_candidate(...)`.
5. Add a config block to `configs/default.yaml` with `enabled: false` plus any CLI flags the adapter needs.
6. Add the dependency to `pyproject.toml` as an optional extra. Don't pin it in the top-level `requirements.txt` unless it's free to install on every Space build.

---

## Adding a metric

Pure metrics live under `zsgdp/verify/`:

1. Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so the same metric works on per-parser candidate snapshots, not just the merged document.
2. Pin definitions in the module docstring — exact denominator, handling of empty inputs, what each return key means.
3. Surface it in `zsgdp/benchmarks/parser_quality.py`:
   - Add per-document fields to the `doc_record`.
   - Add aggregated means to the top-level `summary` dict.
   - Add a per-document CSV writer if it has detail worth its own file.
4. Add tests for: perfect input, no-match input, partial overlap, the vacuous empty/empty case, and a benchmark-integration test that asserts the metric appears in `summary["documents"][0]`.

---

## Schema versioning

`zsgdp.schema.SCHEMA_VERSION` lives in [zsgdp/schema/document.py](zsgdp/schema/document.py). It's surfaced into `artifact_manifest.json` as `parsed_document_schema_version` so a consumer reading old output can gate on it.

Bump rules:

- **Additive change** (new optional field with a default) — bump the patch (1.0 → 1.1).
- **Breaking change** (renamed/removed field, semantics changed) — bump the major (1.0 → 2.0).
  Update the regression fixtures in the same PR; downstream consumers will need a migration.
- **No change** — leave it alone.

When you bump, add an entry to `CHANGELOG.md` under "### Schema" with the version and what changed.

---

## Logging

Use `from zsgdp.logging_config import get_logger`, then `logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with structured `extra={...}` fields rather than f-string-formatted messages where possible — the JSON formatter promotes `extra` keys to top-level fields, so the HF Spaces logs page is greppable.

Default log level is WARNING (CLI summaries unaffected). Opt in with `ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1` for Space-style output.

---

## Pull request checklist

Before opening a PR:

- [ ] `make preflight` passes locally.
- [ ] If you added a metric, an adapter, or changed the schema, you updated `CHANGELOG.md`.
- [ ] If you changed parser behavior, you ran `make regression` and any fixture drift is intentional (and the snapshot was regenerated explicitly).
- [ ] If your change touches GPU/model code paths, you flagged it for Space-side smoke testing in the PR description (the [smoke checklist](docs/space_smoke.md) covers what to run).
- [ ] You did **not** commit `.env` or any secret. The `.gitignore` should catch this; if you suspect a leak, treat the token as compromised and rotate it.

---

## Architecture quick map

- `zsgdp/profiling/` — page-level features and labels.
- `zsgdp/routing/` — deterministic page → expert mapping.
- `zsgdp/parsers/` — adapters; one canonical schema regardless of source.
- `zsgdp/normalize/` — convert each parser's output into the schema.
- `zsgdp/merge/` — align candidates, dedupe, detect conflicts.
- `zsgdp/verify/` — coverage, reading order, table/figure/formula/chunk quality, GT-comparison metrics (layout F1, table structure, formula CER, retrieval recall), parser disagreement and repair success rates.
- `zsgdp/repair/` — deterministic header/table fixes plus GPU escalation that dispatches to `gpu/worker.py`.
- `zsgdp/chunking/` — agentic planner + structure-aware / parent-child / table / figure / page chunk builders, with semantic / late / vision-guided / proposition deterministic stubs.
- `zsgdp/gpu/` — task planning, batching, dry-run worker, transformers and vLLM clients.
- `zsgdp/benchmarks/` — dataset loaders, metric runners, ablation, cross-dataset comparison, retrieval (lexical + embedding).
- `zsgdp/cli.py` — single entry point exposing all of the above.
- `app.py` — Gradio Space front-end.

The full spec lives in [zero_shot_gpu_document_parser_project_spec.md](zero_shot_gpu_document_parser_project_spec.md). The 2000-line read isn't required to contribute, but §10 (schema) and §17 (chunking ladder) are worth skimming if you're touching those modules.
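For orientation, the adapter contract described in "Adding a parser adapter" looks roughly like this in code. This is shape only: a plain class stands in for `BaseParser`, the tool name is made up, and the `parse` body is elided — the real base class and return types live in `zsgdp/parsers/_parser.py`:

```python
import shutil

class ExampleToolParser:
    """Hypothetical shell-out adapter. In the real tree this subclasses
    zsgdp.parsers.BaseParser and gets registered in zsgdp/parsers/registry.py."""

    name = "exampletool"
    supported_file_types = ("pdf",)

    def available(self):
        # Feature-detect the external binary instead of failing at import time,
        # in keeping with the dependency-gating rule from the top of this doc.
        return shutil.which("exampletool") is not None

    def parse(self, path, profile, config, *, pages=None):
        if not self.available():
            raise RuntimeError("exampletool is not installed")
        # Run the tool here, then feed its Markdown through the matching
        # normalize_markdown_candidate(...) normalizer to get a ParseCandidate.
        raise NotImplementedError
```

`available()` is what lets the registry skip the adapter cleanly on machines (and Space builds) where the optional dependency isn't present.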