# Contributing to zeroshotGPU
Thanks for working on this. Three things to know up front:
1. **Run `make preflight` before pushing.** It's the same suite that runs
in pre-push if you have the hooks installed (see below). A green
preflight is the local signal that the branch is ready for the
[Space smoke checklist](docs/space_smoke.md).
2. **Keep it dependency-light by default.** New runtime dependencies need
a corresponding entry in `pyproject.toml` extras and an explicit
gate (config flag, lazy import, or feature-detection fallback). The
`embedding` extra is the model: opt-in, lazy-imported on first use,
raises a clean `RuntimeError` when missing.
3. **Don't change schema shapes silently.** Bump
`zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of
`parsed_document.json`, `chunks.jsonl`, etc. changes. See
[Schema versioning](#schema-versioning) below.
---
## Setup
```bash
git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"
```
Optional extras:
- `.[embedding]` — sentence-transformers + transformers for the embedding
  retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
- `.[gpu_repair]` — transformers for live GPU repair. Only needed when you
  set `repair.execute_gpu_escalations=true`.
- `.[spaces]` — mirrors the root `requirements.txt` so an editable install
  matches a Space deploy.
Set up `.env` for local secrets:
```bash
cp .env.example .env
# Fill in HF_TOKEN if you need gated models.
```
`.env` is gitignored. CLI and `app.py` load it automatically; pre-set
environment variables always win, so a Space's secrets never get
overridden by a stray local file.
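The precedence rule amounts to "set only if unset". A hypothetical loader sketch (not the project's actual implementation):

```python
import os


def apply_dotenv(pairs):
    """Apply .env key/value pairs without clobbering pre-set variables.

    os.environ.setdefault gives real environment variables priority, so
    e.g. a Space's HF_TOKEN secret is never overridden by a local file.
    """
    for key, value in pairs.items():
        os.environ.setdefault(key, value)
```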
---
## Pre-commit / pre-push hooks
```bash
python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
```
Two stages:
- **pre-commit** — fast static checks: trailing whitespace, end-of-file
  newline, JSON/YAML syntax, large-file guard, merge-conflict markers.
  Runs on every `git commit`.
- **pre-push** — runs `python -m zsgdp.cli preflight`. Same as
  `make preflight`. Failing this blocks the push.
Skip the pre-commit checks on a specific commit with `git commit --no-verify`
if you genuinely need to (e.g. WIP). Skip the pre-push gate with
`git push --no-verify`, but only if you have a separately verified
preflight run.
---
## Running tests
```bash
make test # full unittest discover
make regression # snapshot fixture suite
make preflight # everything except the benchmark smoke
make preflight-full # adds an end-to-end benchmark smoke
make benchmark # parses tests/regression/fixtures/ via the CLI
```
Or directly:
```bash
python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark
```
Performance regressions are gated behind `ZSGDP_REGRESSION_PERF=1`:
```bash
ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
```
See [tests/regression/README.md](tests/regression/README.md) for the
fixture format including the `performance` block.
---
## Adding a regression fixture
1. Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
2. Parse it once locally and inspect the output:
```bash
python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
```
3. Hand-write `tests/regression/fixtures/<name>.expected.json` with the
tolerances you want to lock down. Prefer ranges over exact counts
where reasonable variance exists.
4. Optional: add a `performance` block with `max_elapsed_seconds` set to
   ~50–100x your local median (catastrophic-regression guard, not a
   tight bar).
5. Run `make regression` to confirm the fixture is picked up.
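For orientation, a hypothetical `expected.json` combining a range tolerance, an exact count, and a `performance` block might look like the following. The field names here are illustrative only; the actual fixture schema is defined in [tests/regression/README.md](tests/regression/README.md).

```json
{
  "blocks": {"min": 40, "max": 60},
  "tables": 2,
  "performance": {"max_elapsed_seconds": 120}
}
```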
---
## Adding a parser adapter
1. Subclass `BaseParser` in `zsgdp/parsers/<name>_parser.py` (or extend
`external.py` for shell-out adapters).
2. Set `name` and `supported_file_types`, and implement `available()` and
   `parse(path, profile, config, *, pages=None)`.
3. Register in `zsgdp/parsers/registry.py`.
4. If the parser produces Markdown, write a normalizer under
`zsgdp/normalize/normalize_<name>.py` that returns a `ParseCandidate`
via `normalize_markdown_candidate(...)`.
5. Add a config block to `configs/default.yaml` with `enabled: false`
plus any CLI flags the adapter needs.
6. Add the dependency to `pyproject.toml` as an optional extra. Don't
pin it in the top-level `requirements.txt` unless it's free to
install on every Space build.
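A skeleton of steps 1–2 might look like this. Everything below is a hypothetical sketch: the stand-in `BaseParser` is defined inline so the snippet is self-contained (the real one lives in `zsgdp/parsers/`), and `example_pdf_lib` is an invented dependency name.

```python
import importlib.util


class BaseParser:
    """Stand-in for zsgdp's BaseParser so this sketch runs on its own."""
    name = ""
    supported_file_types = ()

    def available(self):
        raise NotImplementedError

    def parse(self, path, profile, config, *, pages=None):
        raise NotImplementedError


class ExampleParser(BaseParser):
    name = "example"
    supported_file_types = (".pdf",)

    def available(self):
        # Feature-detect the optional dependency without importing it eagerly.
        return importlib.util.find_spec("example_pdf_lib") is not None

    def parse(self, path, profile, config, *, pages=None):
        if not self.available():
            raise RuntimeError(
                "example parser requires the 'example' extra: "
                "pip install 'zsgdp[example]'"
            )
        # A real adapter would call the library here and return its output
        # for normalization into a ParseCandidate.
        raise NotImplementedError
```

`find_spec` keeps `available()` cheap: it checks for the dependency without paying the import cost until `parse` actually needs it.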
---
## Adding a metric
Pure metrics live under `zsgdp/verify/`:
1. Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so
the same metric works on per-parser candidate snapshots, not just
the merged document.
2. Pin definitions in the module docstring — exact denominator,
handling of empty inputs, what each return key means.
3. Surface in `zsgdp/benchmarks/parser_quality.py`:
- Add per-document fields to the `doc_record`.
- Add aggregated means to the top-level `summary` dict.
- Add a per-document CSV writer if it has detail worth its own file.
4. Add tests for: perfect input, no-match input, partial overlap,
vacuous empty/empty case, and a benchmark-integration test that
asserts the metric appears in `summary["documents"][0]`.
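A toy metric following points 1–2 might look like this (the metric name, inputs, and return keys are all illustrative, not part of the project):

```python
def table_cell_recall(predicted_cells, expected_cells):
    """Fraction of expected table cells found among predicted cells.

    Pinned definitions:
    - denominator: len(expected_cells)
    - empty expected_cells -> recall 1.0 (vacuous pass)
    - returns {"recall": float, "matched": int, "expected": int}
    """
    if not expected_cells:
        return {"recall": 1.0, "matched": 0, "expected": 0}
    matched = sum(1 for cell in expected_cells if cell in predicted_cells)
    return {
        "recall": matched / len(expected_cells),
        "matched": matched,
        "expected": len(expected_cells),
    }
```

Plain lists in, plain dict out — nothing here assumes a `ParsedDocument`, so the same function works on per-parser candidate snapshots.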
---
## Schema versioning
`zsgdp.schema.SCHEMA_VERSION` lives in
[zsgdp/schema/document.py](zsgdp/schema/document.py). It's surfaced into
`artifact_manifest.json` as `parsed_document_schema_version` so a
consumer reading old output can gate.
Bump rules:
- **Additive change** (new optional field with a default) — bump the
  minor (1.0 → 1.1).
- **Breaking change** (renamed/removed field, semantics changed) — bump
  the major (1.0 → 2.0). Update the regression fixtures in the same
  PR; downstream consumers will need a migration.
- **No change** — leave it alone.
When you bump, add an entry to `CHANGELOG.md` under
"### Schema" with the version and what changed.
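A consumer-side gate on that manifest field could be sketched like this (hypothetical helper; only the `parsed_document_schema_version` key comes from this repo):

```python
def check_schema(manifest, supported_major=1):
    """Return (major, minor), refusing manifests with an unknown major."""
    raw = str(manifest["parsed_document_schema_version"])
    major, minor = (int(part) for part in raw.split("."))
    if major > supported_major:
        raise RuntimeError(
            f"parsed_document schema {raw} is newer than "
            f"supported major {supported_major}"
        )
    return major, minor
```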
---
## Logging
Use `from zsgdp.logging_config import get_logger` then
`logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with
structured `extra={...}` fields rather than f-string-formatted messages
where possible — the JSON formatter promotes `extra` keys to top-level
fields so the HF Spaces logs page is greppable.
Default log level is WARNING (CLI summaries unaffected). Opt in with
`ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1` for Space-style output.
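The `extra` mechanism is plain stdlib `logging`: each key becomes an attribute on the `LogRecord`, which is what lets a JSON formatter promote it to a top-level field. A self-contained sketch using stdlib logging directly (the capture handler and field names are illustrative, not zsgdp code):

```python
import logging


class _Capture(logging.Handler):
    """Collects emitted records so their structured fields can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("zsgdp.example")
handler = _Capture()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` keys land on the record as attributes, ready for a JSON
# formatter to promote — no f-string interpolation needed.
logger.info("parse fallback", extra={"parser": "docling", "page": 3})
```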
---
## Pull request checklist
Before opening a PR:
- [ ] `make preflight` passes locally.
- [ ] If you added a metric, an adapter, or changed the schema, you
updated `CHANGELOG.md`.
- [ ] If you changed parser behavior, you ran `make regression` and any
fixture drift is intentional (and the snapshot was regenerated
explicitly).
- [ ] If your change touches GPU/model code paths, you flagged it for
Space-side smoke testing in the PR description (the
[smoke checklist](docs/space_smoke.md) covers what to run).
- [ ] You did **not** commit `.env` or any secret. The `.gitignore`
should catch this; if you suspect a leak, treat the token as
compromised and rotate it.
---
## Architecture quick map
- `zsgdp/profiling/` — page-level features and labels.
- `zsgdp/routing/` — deterministic page → expert mapping.
- `zsgdp/parsers/` — adapters; one canonical schema regardless of source.
- `zsgdp/normalize/` — convert each parser's output into the schema.
- `zsgdp/merge/` — align candidates, dedupe, detect conflicts.
- `zsgdp/verify/` — coverage, reading order, table/figure/formula/chunk
  quality, GT-comparison metrics (layout F1, table structure, formula
  CER, retrieval recall), parser disagreement and repair success rates.
- `zsgdp/repair/` — deterministic header/table fixes plus GPU
  escalation that dispatches to `gpu/worker.py`.
- `zsgdp/chunking/` — agentic planner + structure-aware / parent-child /
  table / figure / page chunk builders, with deterministic stubs for
  semantic, late, vision-guided, and proposition chunking.
- `zsgdp/gpu/` — task planning, batching, dry-run worker, transformers
  and vLLM clients.
- `zsgdp/benchmarks/` — dataset loaders, metric runners, ablation,
  cross-dataset comparison, retrieval (lexical + embedding).
- `zsgdp/cli.py` — single entry point exposing all of the above.
- `app.py` — Gradio Space front-end.
The full spec lives in
[zero_shot_gpu_document_parser_project_spec.md](zero_shot_gpu_document_parser_project_spec.md).
The 2000-line read isn't required to contribute, but §10 (schema) and
§17 (chunking ladder) are worth skimming if you're touching those
modules.