Spaces:

arjun10g
/

zeroshotGPU

Running on Zero

File size: 8,788 Bytes

db06ffa

# Contributing to zeroshotGPU

Thanks for working on this. Three things to know up front:

1. **Run `make preflight` before pushing.** It's the same suite that runs
   in pre-push if you have the hooks installed (see below). A green
   preflight is the local signal that the branch is ready for the
   [Space smoke checklist](docs/space_smoke.md).
2. **Keep it dependency-light by default.** New runtime dependencies need
   a corresponding entry in `pyproject.toml` extras and an explicit
   gate (config flag, lazy import, or feature-detection fallback). The
   `embedding` extra is the model: opt-in, lazy-imported on first use,
   raises a clean `RuntimeError` when missing.
3. **Don't change schema shapes silently.** Bump
   `zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of
   `parsed_document.json`, `chunks.jsonl`, etc. changes. See
   [Schema versioning](#schema-versioning) below.

---

## Setup

```bash
git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"
```

Optional extras:

- `.[embedding]` — sentence-transformers + transformers for the embedding
  retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
- `.[gpu_repair]` — transformers for live GPU repair. Only needed when you
  set `repair.execute_gpu_escalations=true`.
- `.[spaces]` — mirrors the root `requirements.txt` so an editable install
  matches a Space deploy.

Set up `.env` for local secrets:

```bash
cp .env.example .env
# Fill in HF_TOKEN if you need gated models.
```

`.env` is gitignored. CLI and `app.py` load it automatically; pre-set
environment variables always win, so a Space's secrets never get
overridden by a stray local file.

---

## Pre-commit / pre-push hooks

```bash
python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
```

Two stages:

- **pre-commit** — fast static checks: trailing whitespace, end-of-file
  newline, JSON/YAML syntax, large-file guard, merge-conflict markers.
  Runs on every `git commit`.
- **pre-push** — runs `python -m zsgdp.cli preflight`. Same as
  `make preflight`. Failing this blocks the push.

Skip on a specific commit with `git commit --no-verify` if you genuinely
need to (e.g. WIP). Skip the pre-push gate with `git push --no-verify`,
but only if you have a separately verified preflight run.

---

## Running tests

```bash
make test                 # full unittest discover
make regression           # snapshot fixture suite
make preflight            # everything except the benchmark smoke
make preflight-full       # adds an end-to-end benchmark smoke
make benchmark            # parses tests/regression/fixtures/ via the CLI
```

Or directly:

```bash
python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark
```

Performance regressions are gated behind `ZSGDP_REGRESSION_PERF=1`:

```bash
ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
```

See [tests/regression/README.md](tests/regression/README.md) for the
fixture format including the `performance` block.

---

## Adding a regression fixture

1. Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
2. Parse it once locally and inspect the output:
   ```bash
   python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
   ```
3. Hand-write `tests/regression/fixtures/<name>.expected.json` with the
   tolerances you want to lock down. Prefer ranges over exact counts
   where reasonable variance exists.
4. Optional: add a `performance` block with `max_elapsed_seconds` set to
   ~50–100x your local median (catastrophic-regression guard, not a
   tight bar).
5. Run `make regression` to confirm the fixture is picked up.

---

## Adding a parser adapter

1. Subclass `BaseParser` in `zsgdp/parsers/<name>_parser.py` (or extend
   `external.py` for shell-out adapters).
2. Set `name`, `supported_file_types`, implement `available()` and
   `parse(path, profile, config, *, pages=None)`.
3. Register in `zsgdp/parsers/registry.py`.
4. If the parser produces Markdown, write a normalizer under
   `zsgdp/normalize/normalize_<name>.py` that returns a `ParseCandidate`
   via `normalize_markdown_candidate(...)`.
5. Add a config block to `configs/default.yaml` with `enabled: false`
   plus any CLI flags the adapter needs.
6. Add the dependency to `pyproject.toml` as an optional extra. Don't
   pin it in the top-level `requirements.txt` unless it's free to
   install on every Space build.

---

## Adding a metric

Pure metrics live under `zsgdp/verify/`:

1. Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so
   the same metric works on per-parser candidate snapshots, not just
   the merged document.
2. Pin definitions in the module docstring — exact denominator,
   handling of empty inputs, what each return key means.
3. Surface in `zsgdp/benchmarks/parser_quality.py`:
   - Add per-document fields to the `doc_record`.
   - Add aggregated means to the top-level `summary` dict.
   - Add a per-document CSV writer if it has detail worth its own file.
4. Add tests for: perfect input, no-match input, partial overlap,
   vacuous empty/empty case, and a benchmark-integration test that
   asserts the metric appears in `summary["documents"][0]`.

---

## Schema versioning

`zsgdp.schema.SCHEMA_VERSION` lives in
[zsgdp/schema/document.py](zsgdp/schema/document.py). It's surfaced into
`artifact_manifest.json` as `parsed_document_schema_version` so a
consumer reading old output can gate.

Bump rules:

- **Additive change** (new optional field with a default) — bump the
  patch (1.0 → 1.1).
- **Breaking change** (renamed/removed field, semantics changed) — bump
  the major (1.0 → 2.0). Update the regression fixtures in the same
  PR; downstream consumers will need a migration.
- **No change** — leave it alone.

When you bump, add an entry to `CHANGELOG.md` under
"### Schema" with the version and what changed.

---

## Logging

Use `from zsgdp.logging_config import get_logger` then
`logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with
structured `extra={...}` fields rather than f-string-formatted messages
where possible — the JSON formatter promotes `extra` keys to top-level
fields so the HF Spaces logs page is greppable.

Default log level is WARNING (CLI summaries unaffected). Opt in with
`ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1` for Space-style output.

---

## Pull request checklist

Before opening a PR:

- [ ] `make preflight` passes locally.
- [ ] If you added a metric, an adapter, or changed the schema, you
      updated `CHANGELOG.md`.
- [ ] If you changed parser behavior, you ran `make regression` and any
      fixture drift is intentional (and the snapshot was regenerated
      explicitly).
- [ ] If your change touches GPU/model code paths, you flagged it for
      Space-side smoke testing in the PR description (the
      [smoke checklist](docs/space_smoke.md) covers what to run).
- [ ] You did **not** commit `.env` or any secret. The `.gitignore`
      should catch this; if you suspect a leak, treat the token as
      compromised and rotate it.

---

## Architecture quick map

- `zsgdp/profiling/` — page-level features and labels.
- `zsgdp/routing/` — deterministic page → expert mapping.
- `zsgdp/parsers/` — adapters; one canonical schema regardless of source.
- `zsgdp/normalize/` — convert each parser's output into the schema.
- `zsgdp/merge/` — align candidates, dedupe, detect conflicts.
- `zsgdp/verify/` — coverage, reading order, table/figure/formula/chunk
  quality, GT-comparison metrics (layout F1, table structure, formula
  CER, retrieval recall), parser disagreement and repair success rates.
- `zsgdp/repair/` — deterministic header/table fixes plus GPU
  escalation that dispatches to `gpu/worker.py`.
- `zsgdp/chunking/` — agentic planner + structure-aware / parent-child /
  table / figure / page chunk builders, with semantic / late /
  vision-guided / proposition deterministic stubs.
- `zsgdp/gpu/` — task planning, batching, dry-run worker, transformers
  and vLLM clients.
- `zsgdp/benchmarks/` — dataset loaders, metric runners, ablation,
  cross-dataset comparison, retrieval (lexical + embedding).
- `zsgdp/cli.py` — single entry point exposing all of the above.
- `app.py` — Gradio Space front-end.

The full spec lives in
[zero_shot_gpu_document_parser_project_spec.md](zero_shot_gpu_document_parser_project_spec.md).
The 2000-line read isn't required to contribute, but section §10 (schema)
and §17 (chunking ladder) are worth skimming if you're touching those
modules.