# Contributing to zeroshotGPU

Thanks for working on this. Three things to know up front:

1. **Run `make preflight` before pushing.** It's the same suite that runs
   in pre-push if you have the hooks installed (see below). A green
   preflight is the local signal that the branch is ready for the
   [Space smoke checklist](docs/space_smoke.md).
2. **Keep it dependency-light by default.** New runtime dependencies need
   a corresponding entry in `pyproject.toml` extras and an explicit
   gate (config flag, lazy import, or feature-detection fallback). The
   `embedding` extra is the model: opt-in, lazy-imported on first use,
   raises a clean `RuntimeError` when missing.
3. **Don't change schema shapes silently.** Bump
   `zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of
   `parsed_document.json`, `chunks.jsonl`, etc. changes. See
   [Schema versioning](#schema-versioning) below.
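The lazy-import gate described in point 2 can be sketched generically. The helper name and error wording below are illustrative, not the actual `embedding` extra's code:

```python
import importlib


def load_optional(module_name: str, extra: str):
    """Import an optional dependency on first use.

    Raises a clean RuntimeError naming the pyproject extra to install,
    instead of letting a bare ImportError escape at call time.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Missing optional dependency {module_name!r}; "
            f"install it with: pip install 'zsgdp[{extra}]'"
        ) from exc
```

The point of the pattern is that the heavy import happens at call time, so merely installing the package stays cheap and the failure message tells the user exactly which extra to add.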
---

## Setup

```bash
git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"
```

Optional extras:

- `.[embedding]` – sentence-transformers + transformers for the embedding
  retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
- `.[gpu_repair]` – transformers for live GPU repair. Only needed when you
  set `repair.execute_gpu_escalations=true`.
- `.[spaces]` – mirrors the root `requirements.txt` so an editable install
  matches a Space deploy.
Set up `.env` for local secrets:

```bash
cp .env.example .env
# Fill in HF_TOKEN if you need gated models.
```

`.env` is gitignored. The CLI and `app.py` load it automatically; pre-set
environment variables always win, so a Space's secrets never get
overridden by a stray local file.
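That precedence rule (pre-set environment always wins) amounts to `setdefault` semantics. A minimal sketch, assuming a simple `KEY=VALUE` file format; the actual loader may differ:

```python
import os


def apply_dotenv_lines(lines):
    """Apply KEY=VALUE lines without overriding variables already set
    in the process environment, so deploy-time secrets always beat a
    stray local .env file."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```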
---

## Pre-commit / pre-push hooks

```bash
python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
```

Two stages:

- **pre-commit** – fast static checks: trailing whitespace, end-of-file
  newline, JSON/YAML syntax, large-file guard, merge-conflict markers.
  Runs on every `git commit`.
- **pre-push** – runs `python -m zsgdp.cli preflight`. Same as
  `make preflight`. Failing this blocks the push.

Skip on a specific commit with `git commit --no-verify` if you genuinely
need to (e.g. WIP). Skip the pre-push gate with `git push --no-verify`,
but only if you have a separately verified preflight run.

---
## Running tests

```bash
make test            # full unittest discover
make regression      # snapshot fixture suite
make preflight       # everything except the benchmark smoke
make preflight-full  # adds an end-to-end benchmark smoke
make benchmark       # parses tests/regression/fixtures/ via the CLI
```

Or directly:

```bash
python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark
```

Performance regressions are gated behind `ZSGDP_REGRESSION_PERF=1`:

```bash
ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
```

See [tests/regression/README.md](tests/regression/README.md) for the
fixture format, including the `performance` block.
---

## Adding a regression fixture

1. Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
2. Parse it once locally and inspect the output:

   ```bash
   python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
   ```

3. Hand-write `tests/regression/fixtures/<name>.expected.json` with the
   tolerances you want to lock down. Prefer ranges over exact counts
   where reasonable variance exists.
4. Optional: add a `performance` block with `max_elapsed_seconds` set to
   ~50–100x your local median (a catastrophic-regression guard, not a
   tight bar).
5. Run `make regression` to confirm the fixture is picked up.
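For orientation, an expected-fixture might look like the sketch below. The key names are illustrative guesses; the authoritative format (including the `performance` block) is documented in `tests/regression/README.md`:

```python
import json
import pathlib
import tempfile

# Illustrative fixture shape: exact values where stable, min/max ranges
# where reasonable variance exists, and a loose performance guard.
fixture = {
    "pages": {"min": 3, "max": 3},               # exact: page count is stable
    "chunks": {"min": 10, "max": 18},            # range: chunking can vary
    "performance": {"max_elapsed_seconds": 60},  # ~50-100x the local median
}

path = pathlib.Path(tempfile.mkdtemp()) / "sample.expected.json"
path.write_text(json.dumps(fixture, indent=2))
```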
---

## Adding a parser adapter

1. Subclass `BaseParser` in `zsgdp/parsers/<name>_parser.py` (or extend
   `external.py` for shell-out adapters).
2. Set `name` and `supported_file_types`; implement `available()` and
   `parse(path, profile, config, *, pages=None)`.
3. Register it in `zsgdp/parsers/registry.py`.
4. If the parser produces Markdown, write a normalizer under
   `zsgdp/normalize/normalize_<name>.py` that returns a `ParseCandidate`
   via `normalize_markdown_candidate(...)`.
5. Add a config block to `configs/default.yaml` with `enabled: false`
   plus any CLI flags the adapter needs.
6. Add the dependency to `pyproject.toml` as an optional extra. Don't
   pin it in the top-level `requirements.txt` unless it's free to
   install on every Space build.
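Steps 1–3 might look roughly like this. The stub `BaseParser` stands in for the real base class so the sketch is self-contained, and the `available()` check is a placeholder, so treat the details as assumptions drawn from the list above:

```python
import importlib.util


class BaseParser:  # stand-in for the real zsgdp.parsers base class
    name: str = ""
    supported_file_types: tuple = ()


class MyParser(BaseParser):
    name = "my_parser"
    supported_file_types = (".pdf",)

    def available(self) -> bool:
        # Feature-detect the backing library instead of importing it at
        # module load; "json" is a placeholder for the real dependency.
        return importlib.util.find_spec("json") is not None

    def parse(self, path, profile, config, *, pages=None):
        # A real adapter returns parsed content for the normalizer to
        # turn into a ParseCandidate; a plain dict stands in here.
        return {"source": str(path), "pages": pages}
```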
---

## Adding a metric

Pure metrics live under `zsgdp/verify/`:

1. Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so
   the same metric works on per-parser candidate snapshots, not just
   the merged document.
2. Pin the definitions in the module docstring – exact denominator,
   handling of empty inputs, what each return key means.
3. Surface it in `zsgdp/benchmarks/parser_quality.py`:
   - Add per-document fields to the `doc_record`.
   - Add aggregated means to the top-level `summary` dict.
   - Add a per-document CSV writer if it has detail worth its own file.
4. Add tests for: perfect input, no-match input, partial overlap, the
   vacuous empty/empty case, and a benchmark-integration test that
   asserts the metric appears in `summary["documents"][0]`.
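As a shape reference (not an actual `zsgdp/verify/` metric), a pure metric with its definitions pinned in the docstring looks like:

```python
def overlap_recall(predicted, expected):
    """Fraction of expected items that also appear in predicted.

    Denominator: len(expected). Empty inputs: the vacuous empty/empty
    case returns 1.0. Return value: a float in [0.0, 1.0].
    """
    if not expected:
        return 1.0
    hits = sum(1 for item in expected if item in predicted)
    return hits / len(expected)
```

The four unit cases from step 4 fall out directly: perfect input gives 1.0, no-match input gives 0.0, partial overlap lands in between, and empty/empty is 1.0.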
---

## Schema versioning

`zsgdp.schema.SCHEMA_VERSION` lives in
[zsgdp/schema/document.py](zsgdp/schema/document.py). It's surfaced into
`artifact_manifest.json` as `parsed_document_schema_version` so a
consumer reading old output can gate on it.

Bump rules:

- **Additive change** (new optional field with a default) → bump the
  minor version (1.0 → 1.1).
- **Breaking change** (renamed/removed field, changed semantics) → bump
  the major version (1.0 → 2.0). Update the regression fixtures in the
  same PR; downstream consumers will need a migration.
- **No change** → leave it alone.

When you bump, add an entry to `CHANGELOG.md` under
"### Schema" with the version and what changed.
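A downstream consumer gating on the manifest version might look like this sketch. The manifest key comes from this section; the major-version comparison policy is an assumption:

```python
def can_read(manifest: dict, supported_major: int = 1) -> bool:
    """Accept output whose schema major version matches what we support.

    Minor bumps (1.0 -> 1.1) are additive, so they pass; major bumps
    (1.0 -> 2.0) are breaking, so they are rejected.
    """
    version = manifest.get("parsed_document_schema_version", "1.0")
    major = int(str(version).split(".")[0])
    return major == supported_major
```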
---

## Logging

Use `from zsgdp.logging_config import get_logger`, then
`logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with
structured `extra={...}` fields rather than f-string-formatted messages
where possible – the JSON formatter promotes `extra` keys to top-level
fields, so the HF Spaces logs page stays greppable.

The default log level is WARNING (CLI summaries are unaffected). Opt in with
`ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1` for Space-style output.
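A structured call looks like the following; plain stdlib `logging` stands in for `get_logger` so the sketch is self-contained:

```python
import logging

# In zsgdp code this would be: logger = get_logger(__name__)
logger = logging.getLogger("zsgdp.example")

# Structured fields instead of an f-string: a JSON formatter can promote
# these `extra` keys to top-level fields, which keeps logs greppable.
logger.warning("parser fallback engaged", extra={"parser": "docling", "page": 7})
```

Note that `extra` keys become attributes on the `LogRecord`, which is what lets a formatter lift them out of the message string.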
---

## Pull request checklist

Before opening a PR:

- [ ] `make preflight` passes locally.
- [ ] If you added a metric, added an adapter, or changed the schema, you
      updated `CHANGELOG.md`.
- [ ] If you changed parser behavior, you ran `make regression` and any
      fixture drift is intentional (and the snapshot was regenerated
      explicitly).
- [ ] If your change touches GPU/model code paths, you flagged it for
      Space-side smoke testing in the PR description (the
      [smoke checklist](docs/space_smoke.md) covers what to run).
- [ ] You did **not** commit `.env` or any secret. The `.gitignore`
      should catch this; if you suspect a leak, treat the token as
      compromised and rotate it.
---

## Architecture quick map

- `zsgdp/profiling/` – page-level features and labels.
- `zsgdp/routing/` – deterministic page → expert mapping.
- `zsgdp/parsers/` – adapters; one canonical schema regardless of source.
- `zsgdp/normalize/` – convert each parser's output into the schema.
- `zsgdp/merge/` – align candidates, dedupe, detect conflicts.
- `zsgdp/verify/` – coverage, reading order, table/figure/formula/chunk
  quality, GT-comparison metrics (layout F1, table structure, formula
  CER, retrieval recall), parser disagreement and repair success rates.
- `zsgdp/repair/` – deterministic header/table fixes plus GPU
  escalation that dispatches to `gpu/worker.py`.
- `zsgdp/chunking/` – agentic planner + structure-aware / parent-child /
  table / figure / page chunk builders, with semantic / late /
  vision-guided / proposition deterministic stubs.
- `zsgdp/gpu/` – task planning, batching, dry-run worker, transformers
  and vLLM clients.
- `zsgdp/benchmarks/` – dataset loaders, metric runners, ablation,
  cross-dataset comparison, retrieval (lexical + embedding).
- `zsgdp/cli.py` – single entry point exposing all of the above.
- `app.py` – Gradio Space front-end.

The full spec lives in
[zero_shot_gpu_document_parser_project_spec.md](zero_shot_gpu_document_parser_project_spec.md).
The 2000-line read isn't required to contribute, but §10 (schema)
and §17 (chunking ladder) are worth skimming if you're touching those
modules.