

Contributing to zeroshotGPU

Thanks for working on this. Three things to know up front:

  1. Run make preflight before pushing. It's the same suite that runs in pre-push if you have the hooks installed (see below). A green preflight is the local signal that the branch is ready for the Space smoke checklist.
  2. Keep it dependency-light by default. New runtime dependencies need a corresponding entry in pyproject.toml extras and an explicit gate (config flag, lazy import, or feature-detection fallback). The embedding extra is the model: opt-in, lazy-imported on first use, raises a clean RuntimeError when missing.
  3. Don't change schema shapes silently. Bump zsgdp.schema.SCHEMA_VERSION whenever the on-disk shape of parsed_document.json, chunks.jsonl, etc. changes. See Schema versioning below.
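The gating pattern in point 2 can be sketched as a small helper. This is an illustrative sketch, not the project's actual code; `require_extra` and its error wording are hypothetical, but the shape (lazy import on first use, clean RuntimeError naming the extra) matches the rule above:

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Return the imported module, or raise a clean RuntimeError naming the extra.

    Call this at first use, not at module import time, so the dependency
    stays optional for everyone who never touches the feature.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is not installed; "
            f'install it with: pip install -e ".[{extra}]"'
        ) from exc
```

A backend would then call, e.g., `require_extra("sentence_transformers", "embedding")` inside the method that first needs the model.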

Setup

git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"

Optional extras:

  • .[embedding] - sentence-transformers + transformers for the embedding retriever. Only needed when you set benchmarks.retriever.backend=embedding.
  • .[gpu_repair] - transformers for live GPU repair. Only needed when you set repair.execute_gpu_escalations=true.
  • .[spaces] - mirrors the root requirements.txt so an editable install matches a Space deploy.

Set up .env for local secrets:

cp .env.example .env
# Fill in HF_TOKEN if you need gated models.

.env is gitignored. The CLI and app.py load it automatically; pre-set environment variables always win, so a Space's secrets are never overridden by a stray local file.
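The "pre-set variables always win" behavior comes down to one idiom. A minimal sketch of such a loader (the function name and parsing rules here are illustrative, not zsgdp's actual implementation):

```python
import os


def load_dotenv_if_present(path: str = ".env") -> None:
    """Load KEY=VALUE lines from a .env file without overriding existing env vars."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault means a pre-set environment variable always wins,
            # so a Space's secrets are never clobbered by a local file.
            os.environ.setdefault(key.strip(), value.strip())
```

The key design choice is `os.environ.setdefault` rather than assignment.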


Pre-commit / pre-push hooks

python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push

Two stages:

  • pre-commit - fast static checks: trailing whitespace, end-of-file newline, JSON/YAML syntax, large-file guard, merge-conflict markers. Runs on every git commit.
  • pre-push - runs python -m zsgdp.cli preflight. Same as make preflight. Failing this blocks the push.

Skip the pre-commit checks on a specific commit with git commit --no-verify if you genuinely need to (e.g. a WIP commit). Skip the pre-push gate with git push --no-verify, but only if you have a separately verified preflight run.


Running tests

make test                 # full unittest discover
make regression           # snapshot fixture suite
make preflight            # everything except the benchmark smoke
make preflight-full       # adds an end-to-end benchmark smoke
make benchmark            # parses tests/regression/fixtures/ via the CLI

Or directly:

python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark

Performance regressions are gated behind ZSGDP_REGRESSION_PERF=1:

ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression

See tests/regression/README.md for the fixture format including the performance block.


Adding a regression fixture

  1. Drop the input under tests/regression/fixtures/<name>.input.<ext>.
  2. Parse it once locally and inspect the output:
    python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
    
  3. Hand-write tests/regression/fixtures/<name>.expected.json with the tolerances you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
  4. Optional: add a performance block with max_elapsed_seconds set to ~50–100x your local median (catastrophic-regression guard, not a tight bar).
  5. Run make regression to confirm the fixture is picked up.
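Step 3's "prefer ranges over exact counts" could look like the sketch below. The fixture keys and the two-element-list range convention here are hypothetical, used only to illustrate the idea; see tests/regression/README.md for the real fixture format:

```python
def within(expected, actual) -> bool:
    """Check an expected value that is either exact or a [lo, hi] range."""
    if isinstance(expected, list) and len(expected) == 2:
        lo, hi = expected
        return lo <= actual <= hi
    return expected == actual


# Hypothetical expected.json fragment: a range where counts vary
# run-to-run, an exact value where they should not.
expected = {"num_chunks": [40, 60], "num_tables": 3}
actual = {"num_chunks": 52, "num_tables": 3}
assert all(within(v, actual[k]) for k, v in expected.items())
```

Ranges keep the fixture green under harmless variance while still catching a collapse to zero or an explosion in counts.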

Adding a parser adapter

  1. Subclass BaseParser in zsgdp/parsers/<name>_parser.py (or extend external.py for shell-out adapters).
  2. Set name and supported_file_types, and implement available() and parse(path, profile, config, *, pages=None).
  3. Register in zsgdp/parsers/registry.py.
  4. If the parser produces Markdown, write a normalizer under zsgdp/normalize/normalize_<name>.py that returns a ParseCandidate via normalize_markdown_candidate(...).
  5. Add a config block to configs/default.yaml with enabled: false plus any CLI flags the adapter needs.
  6. Add the dependency to pyproject.toml as an optional extra. Don't pin it in the top-level requirements.txt unless it's free to install on every Space build.
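Steps 1 and 2 can be sketched as a skeleton. BaseParser's real interface lives in zsgdp/parsers/; the class below stands alone for illustration (in practice you subclass BaseParser), and myformat_lib is a hypothetical optional dependency:

```python
class MyFormatParser:  # in practice: subclass zsgdp.parsers.BaseParser
    """Hypothetical adapter skeleton for a new input format."""

    name = "myformat"
    supported_file_types = (".myf",)

    def available(self) -> bool:
        # Feature-detect the optional dependency instead of importing
        # it at module load, so the adapter degrades cleanly.
        try:
            import myformat_lib  # optional extra declared in pyproject.toml
            return True
        except ImportError:
            return False

    def parse(self, path, profile, config, *, pages=None):
        # Produce a ParseCandidate here (via the normalizer for
        # Markdown-producing parsers; see step 4).
        raise NotImplementedError
```

With enabled: false in configs/default.yaml (step 5), the adapter stays inert until someone opts in.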

Adding a metric

Pure metrics live under zsgdp/verify/:

  1. Define inputs as plain dicts/lists (not ParsedDocument-keyed) so the same metric works on per-parser candidate snapshots, not just the merged document.
  2. Pin definitions in the module docstring: exact denominator, handling of empty inputs, what each return key means.
  3. Surface in zsgdp/benchmarks/parser_quality.py:
    • Add per-document fields to the doc_record.
    • Add aggregated means to the top-level summary dict.
    • Add a per-document CSV writer if it has detail worth its own file.
  4. Add tests for: perfect input, no-match input, partial overlap, vacuous empty/empty case, and a benchmark-integration test that asserts the metric appears in summary["documents"][0].
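A minimal sketch in this style (the metric itself is hypothetical; the point is the pinned docstring from step 2 and the plain-dict output from step 1):

```python
def table_cell_recall(expected_cells: list[str], found_cells: list[str]) -> dict:
    """Hypothetical metric in the zsgdp/verify/ style.

    Definitions pinned here, per the rules above:
    - recall = |set(expected) & set(found)| / |set(expected)|.
    - Empty expected input is vacuously perfect: recall = 1.0, matched = 0.
    - Return keys: recall (float), matched (int), expected (int).
    """
    expected = set(expected_cells)
    if not expected:
        return {"recall": 1.0, "matched": 0, "expected": 0}
    matched = len(expected & set(found_cells))
    return {
        "recall": matched / len(expected),
        "matched": matched,
        "expected": len(expected),
    }
```

Because the inputs are plain lists, the same function runs against a per-parser candidate snapshot or the merged document without adaptation.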

Schema versioning

zsgdp.schema.SCHEMA_VERSION lives in zsgdp/schema/document.py. It's surfaced into artifact_manifest.json as parsed_document_schema_version so a consumer reading old output can gate.

Bump rules:

  • Additive change (new optional field with a default) - bump the minor (1.0 → 1.1).
  • Breaking change (renamed/removed field, semantics changed) - bump the major (1.0 → 2.0). Update the regression fixtures in the same PR; downstream consumers will need a migration.
  • No change - leave it alone.

When you bump, add an entry to CHANGELOG.md under "### Schema" with the version and what changed.
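A consumer-side gate, sketched. The manifest key name is taken from this section; the rest (function name, error wording, the choice to gate only on the major) is an assumption consistent with the bump rules above:

```python
def check_schema(manifest: dict, supported_major: int = 1) -> None:
    """Gate on parsed_document_schema_version before reading old output.

    Minor bumps are additive, so any 1.x output is readable by a 1.x
    consumer; a different major means a migration is required.
    """
    version = manifest.get("parsed_document_schema_version", "0.0")
    major = int(str(version).split(".")[0])
    if major != supported_major:
        raise ValueError(
            f"unsupported schema major {major} (got {version}); migration required"
        )
```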


Logging

Use from zsgdp.logging_config import get_logger, then logger = get_logger(__name__). Call .info/.warning/.error with structured extra={...} fields rather than f-string-formatted messages where possible: the JSON formatter promotes extra keys to top-level fields, so the HF Spaces logs page stays greppable.

Default log level is WARNING (CLI summaries unaffected). Opt in with ZSGDP_LOG_LEVEL=INFO and ZSGDP_LOG_JSON=1 for Space-style output.
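The pattern looks like the stand-in below. This is a simplified sketch of the idea, not zsgdp.logging_config's actual code; the formatter, logger name, and promoted keys are illustrative:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, promoting extra keys to top level."""

    PROMOTED = ("doc_id", "parser")  # illustrative extra keys

    def format(self, record):
        payload = {"level": record.levelname, "msg": record.getMessage()}
        # extra={...} fields land as attributes on the record; promote
        # the known ones so log lines are greppable by field.
        for key in self.PROMOTED:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


logger = logging.getLogger("zsgdp.example")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Structured fields, not an f-string baked into the message:
logger.warning("parse_failed", extra={"doc_id": "fixture-1", "parser": "docling"})
```

The payoff: grepping the Spaces logs for "docling" finds a clean field, not a fragment of a formatted sentence.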


Pull request checklist

Before opening a PR:

  • make preflight passes locally.
  • If you added a metric, an adapter, or changed the schema, you updated CHANGELOG.md.
  • If you changed parser behavior, you ran make regression and any fixture drift is intentional (and the snapshot was regenerated explicitly).
  • If your change touches GPU/model code paths, you flagged it for Space-side smoke testing in the PR description (the smoke checklist covers what to run).
  • You did not commit .env or any secret. The .gitignore should catch this; if you suspect a leak, treat the token as compromised and rotate it.

Architecture quick map

  • zsgdp/profiling/ - page-level features and labels.
  • zsgdp/routing/ - deterministic page → expert mapping.
  • zsgdp/parsers/ - adapters; one canonical schema regardless of source.
  • zsgdp/normalize/ - convert each parser's output into the schema.
  • zsgdp/merge/ - align candidates, dedupe, detect conflicts.
  • zsgdp/verify/ - coverage, reading order, table/figure/formula/chunk quality, GT-comparison metrics (layout F1, table structure, formula CER, retrieval recall), parser disagreement and repair success rates.
  • zsgdp/repair/ - deterministic header/table fixes plus GPU escalation that dispatches to gpu/worker.py.
  • zsgdp/chunking/ - agentic planner + structure-aware / parent-child / table / figure / page chunk builders, with deterministic stubs for semantic / late / vision-guided / proposition chunking.
  • zsgdp/gpu/ - task planning, batching, dry-run worker, transformers and vLLM clients.
  • zsgdp/benchmarks/ - dataset loaders, metric runners, ablation, cross-dataset comparison, retrieval (lexical + embedding).
  • zsgdp/cli.py - single entry point exposing all of the above.
  • app.py - Gradio Space front-end.

The full spec lives in zero_shot_gpu_document_parser_project_spec.md. The 2000-line read isn't required to contribute, but §10 (schema) and §17 (chunking ladder) are worth skimming if you're touching those modules.