# Contributing to zeroshotGPU
Thanks for working on this. Three things to know up front:
1. **Run `make preflight` before pushing.** It's the same suite that runs
in pre-push if you have the hooks installed (see below). A green
preflight is the local signal that the branch is ready for the
[Space smoke checklist](docs/space_smoke.md).
2. **Keep it dependency-light by default.** New runtime dependencies need
a corresponding entry in `pyproject.toml` extras and an explicit
gate (config flag, lazy import, or feature-detection fallback). The
`embedding` extra is the model: opt-in, lazy-imported on first use,
raises a clean `RuntimeError` when missing.
3. **Don't change schema shapes silently.** Bump
`zsgdp.schema.SCHEMA_VERSION` whenever the on-disk shape of
`parsed_document.json`, `chunks.jsonl`, etc. changes. See
[Schema versioning](#schema-versioning) below.
---
## Setup
```bash
git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"
```
Optional extras:
- `.[embedding]` — sentence-transformers + transformers for the embedding
  retriever. Only needed when you set `benchmarks.retriever.backend=embedding`.
- `.[gpu_repair]` — transformers for live GPU repair. Only needed when you
  set `repair.execute_gpu_escalations=true`.
- `.[spaces]` — mirrors the root `requirements.txt` so an editable install
  matches a Space deploy.
Set up `.env` for local secrets:
```bash
cp .env.example .env
# Fill in HF_TOKEN if you need gated models.
```
`.env` is gitignored. CLI and `app.py` load it automatically; pre-set
environment variables always win, so a Space's secrets never get
overridden by a stray local file.
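The precedence rule amounts to "set only if unset". A hypothetical loader sketch (not the project's actual implementation):

```python
import os


def apply_dotenv(pairs):
    """Apply .env key/value pairs without clobbering pre-set variables.

    os.environ.setdefault gives real environment variables priority, so
    e.g. a Space's HF_TOKEN secret is never overridden by a local file.
    """
    for key, value in pairs.items():
        os.environ.setdefault(key, value)
```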
---
## Pre-commit / pre-push hooks
```bash
python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push
```
Two stages:
- **pre-commit** — fast static checks: trailing whitespace, end-of-file
  newline, JSON/YAML syntax, large-file guard, merge-conflict markers.
  Runs on every `git commit`.
- **pre-push** — runs `python -m zsgdp.cli preflight`. Same as
  `make preflight`. Failing this blocks the push.
Skip the pre-commit checks on a specific commit with `git commit --no-verify`
if you genuinely need to (e.g. WIP). Skip the pre-push gate with
`git push --no-verify`, but only if you have a separately verified
preflight run.
---
## Running tests
```bash
make test # full unittest discover
make regression # snapshot fixture suite
make preflight # everything except the benchmark smoke
make preflight-full # adds an end-to-end benchmark smoke
make benchmark # parses tests/regression/fixtures/ via the CLI
```
Or directly:
```bash
python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark
```
Performance regressions are gated behind `ZSGDP_REGRESSION_PERF=1`:
```bash
ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression
```
See [tests/regression/README.md](tests/regression/README.md) for the
fixture format including the `performance` block.
---
## Adding a regression fixture
1. Drop the input under `tests/regression/fixtures/<name>.input.<ext>`.
2. Parse it once locally and inspect the output:
```bash
python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
```
3. Hand-write `tests/regression/fixtures/<name>.expected.json` with the
tolerances you want to lock down. Prefer ranges over exact counts
where reasonable variance exists.
4. Optional: add a `performance` block with `max_elapsed_seconds` set to
   ~50–100x your local median (catastrophic-regression guard, not a
   tight bar).
5. Run `make regression` to confirm the fixture is picked up.
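For orientation, a hypothetical `expected.json` combining a range tolerance, an exact count, and a `performance` block might look like the following. The field names here are illustrative only; the actual fixture schema is defined in [tests/regression/README.md](tests/regression/README.md).

```json
{
  "blocks": {"min": 40, "max": 60},
  "tables": 2,
  "performance": {"max_elapsed_seconds": 120}
}
```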
---
## Adding a parser adapter
1. Subclass `BaseParser` in `zsgdp/parsers/<name>_parser.py` (or extend
`external.py` for shell-out adapters).
2. Set `name` and `supported_file_types`, and implement `available()` and
   `parse(path, profile, config, *, pages=None)`.
3. Register in `zsgdp/parsers/registry.py`.
4. If the parser produces Markdown, write a normalizer under
`zsgdp/normalize/normalize_<name>.py` that returns a `ParseCandidate`
via `normalize_markdown_candidate(...)`.
5. Add a config block to `configs/default.yaml` with `enabled: false`
plus any CLI flags the adapter needs.
6. Add the dependency to `pyproject.toml` as an optional extra. Don't
pin it in the top-level `requirements.txt` unless it's free to
install on every Space build.
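A skeleton of steps 1–2 might look like this. Everything below is a hypothetical sketch: the stand-in `BaseParser` is defined inline so the snippet is self-contained (the real one lives in `zsgdp/parsers/`), and `example_pdf_lib` is an invented dependency name.

```python
import importlib.util


class BaseParser:
    """Stand-in for zsgdp's BaseParser so this sketch runs on its own."""
    name = ""
    supported_file_types = ()

    def available(self):
        raise NotImplementedError

    def parse(self, path, profile, config, *, pages=None):
        raise NotImplementedError


class ExampleParser(BaseParser):
    name = "example"
    supported_file_types = (".pdf",)

    def available(self):
        # Feature-detect the optional dependency without importing it eagerly.
        return importlib.util.find_spec("example_pdf_lib") is not None

    def parse(self, path, profile, config, *, pages=None):
        if not self.available():
            raise RuntimeError(
                "example parser requires the 'example' extra: "
                "pip install 'zsgdp[example]'"
            )
        # A real adapter would call the library here and return its output
        # for normalization into a ParseCandidate.
        raise NotImplementedError
```

`find_spec` keeps `available()` cheap: it checks for the dependency without paying the import cost until `parse` actually needs it.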
---
## Adding a metric
Pure metrics live under `zsgdp/verify/`:
1. Define inputs as plain dicts/lists (not `ParsedDocument`-keyed) so
the same metric works on per-parser candidate snapshots, not just
the merged document.
2. Pin definitions in the module docstring — exact denominator,
handling of empty inputs, what each return key means.
3. Surface in `zsgdp/benchmarks/parser_quality.py`:
- Add per-document fields to the `doc_record`.
- Add aggregated means to the top-level `summary` dict.
- Add a per-document CSV writer if it has detail worth its own file.
4. Add tests for: perfect input, no-match input, partial overlap,
vacuous empty/empty case, and a benchmark-integration test that
asserts the metric appears in `summary["documents"][0]`.
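A toy metric following points 1–2 might look like this (the metric name, inputs, and return keys are all illustrative, not part of the project):

```python
def table_cell_recall(predicted_cells, expected_cells):
    """Fraction of expected table cells found among predicted cells.

    Pinned definitions:
    - denominator: len(expected_cells)
    - empty expected_cells -> recall 1.0 (vacuous pass)
    - returns {"recall": float, "matched": int, "expected": int}
    """
    if not expected_cells:
        return {"recall": 1.0, "matched": 0, "expected": 0}
    matched = sum(1 for cell in expected_cells if cell in predicted_cells)
    return {
        "recall": matched / len(expected_cells),
        "matched": matched,
        "expected": len(expected_cells),
    }
```

Plain lists in, plain dict out — nothing here assumes a `ParsedDocument`, so the same function works on per-parser candidate snapshots.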
---
## Schema versioning
`zsgdp.schema.SCHEMA_VERSION` lives in
[zsgdp/schema/document.py](zsgdp/schema/document.py). It's surfaced into
`artifact_manifest.json` as `parsed_document_schema_version` so a
consumer reading old output can gate.
Bump rules:
- **Additive change** (new optional field with a default) — bump the
  minor (1.0 → 1.1).
- **Breaking change** (renamed/removed field, semantics changed) — bump
  the major (1.0 → 2.0). Update the regression fixtures in the same
  PR; downstream consumers will need a migration.
- **No change** — leave it alone.
When you bump, add an entry to `CHANGELOG.md` under
"### Schema" with the version and what changed.
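A consumer-side gate on that manifest field could be sketched like this (hypothetical helper; only the `parsed_document_schema_version` key comes from this repo):

```python
def check_schema(manifest, supported_major=1):
    """Return (major, minor), refusing manifests with an unknown major."""
    raw = str(manifest["parsed_document_schema_version"])
    major, minor = (int(part) for part in raw.split("."))
    if major > supported_major:
        raise RuntimeError(
            f"parsed_document schema {raw} is newer than "
            f"supported major {supported_major}"
        )
    return major, minor
```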
---
## Logging
Use `from zsgdp.logging_config import get_logger` then
`logger = get_logger(__name__)`. Call `.info`/`.warning`/`.error` with
structured `extra={...}` fields rather than f-string-formatted messages
where possible — the JSON formatter promotes `extra` keys to top-level
fields so the HF Spaces logs page is greppable.
Default log level is WARNING (CLI summaries unaffected). Opt in with
`ZSGDP_LOG_LEVEL=INFO` and `ZSGDP_LOG_JSON=1` for Space-style output.
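The `extra` mechanism is plain stdlib `logging`: each key becomes an attribute on the `LogRecord`, which is what lets a JSON formatter promote it to a top-level field. A self-contained sketch using stdlib logging directly (the capture handler and field names are illustrative, not zsgdp code):

```python
import logging


class _Capture(logging.Handler):
    """Collects emitted records so their structured fields can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("zsgdp.example")
handler = _Capture()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` keys land on the record as attributes, ready for a JSON
# formatter to promote — no f-string interpolation needed.
logger.info("parse fallback", extra={"parser": "docling", "page": 3})
```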
---
## Pull request checklist
Before opening a PR:
- [ ] `make preflight` passes locally.
- [ ] If you added a metric, an adapter, or changed the schema, you
updated `CHANGELOG.md`.
- [ ] If you changed parser behavior, you ran `make regression` and any
fixture drift is intentional (and the snapshot was regenerated
explicitly).
- [ ] If your change touches GPU/model code paths, you flagged it for
Space-side smoke testing in the PR description (the
[smoke checklist](docs/space_smoke.md) covers what to run).
- [ ] You did **not** commit `.env` or any secret. The `.gitignore`
should catch this; if you suspect a leak, treat the token as
compromised and rotate it.
---
## Architecture quick map
- `zsgdp/profiling/` — page-level features and labels.
- `zsgdp/routing/` — deterministic page → expert mapping.
- `zsgdp/parsers/` — adapters; one canonical schema regardless of source.
- `zsgdp/normalize/` — convert each parser's output into the schema.
- `zsgdp/merge/` — align candidates, dedupe, detect conflicts.
- `zsgdp/verify/` — coverage, reading order, table/figure/formula/chunk
  quality, GT-comparison metrics (layout F1, table structure, formula
  CER, retrieval recall), parser disagreement and repair success rates.
- `zsgdp/repair/` — deterministic header/table fixes plus GPU
  escalation that dispatches to `gpu/worker.py`.
- `zsgdp/chunking/` — agentic planner + structure-aware / parent-child /
  table / figure / page chunk builders, with deterministic stubs for
  semantic, late, vision-guided, and proposition chunking.
- `zsgdp/gpu/` — task planning, batching, dry-run worker, transformers
  and vLLM clients.
- `zsgdp/benchmarks/` — dataset loaders, metric runners, ablation,
  cross-dataset comparison, retrieval (lexical + embedding).
- `zsgdp/cli.py` — single entry point exposing all of the above.
- `app.py` — Gradio Space front-end.
The full spec lives in
[zero_shot_gpu_document_parser_project_spec.md](zero_shot_gpu_document_parser_project_spec.md).
The 2000-line read isn't required to contribute, but §10 (schema) and
§17 (chunking ladder) are worth skimming if you're touching those
modules.