

Contributing to zeroshotGPU

Thanks for working on this. Three things to know up front:

  1. Run make preflight before pushing. It's the same suite that runs in pre-push if you have the hooks installed (see below). A green preflight is the local signal that the branch is ready for the Space smoke checklist.
  2. Keep it dependency-light by default. New runtime dependencies need a corresponding entry in pyproject.toml extras and an explicit gate (config flag, lazy import, or feature-detection fallback). The embedding extra is the model: opt-in, lazy-imported on first use, raises a clean RuntimeError when missing.
  3. Don't change schema shapes silently. Bump zsgdp.schema.SCHEMA_VERSION whenever the on-disk shape of parsed_document.json, chunks.jsonl, etc. changes. See Schema versioning below.
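The gating pattern in point 2 can be sketched as a small helper. This is an illustrative sketch, not the project's actual code; `require_extra` and its error wording are hypothetical, but the shape (lazy import on first use, clean RuntimeError naming the extra) matches the rule above:

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Return the imported module, or raise a clean RuntimeError naming the extra.

    Call this at first use, not at module import time, so the dependency
    stays optional for everyone who never touches the feature.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is not installed; "
            f'install it with: pip install -e ".[{extra}]"'
        ) from exc
```

A backend would then call, e.g., `require_extra("sentence_transformers", "embedding")` inside the method that first needs the model.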

Setup

git clone <repo>
cd "Document Parser"
python3.11 -m venv .venv && source .venv/bin/activate
python -m pip install -e ".[pdf,yaml,docling,dev]"

Optional extras:

  • .[embedding] - sentence-transformers + transformers for the embedding retriever. Only needed when you set benchmarks.retriever.backend=embedding.
  • .[gpu_repair] - transformers for live GPU repair. Only needed when you set repair.execute_gpu_escalations=true.
  • .[spaces] - mirrors the root requirements.txt so an editable install matches a Space deploy.

Set up .env for local secrets:

cp .env.example .env
# Fill in HF_TOKEN if you need gated models.

.env is gitignored. The CLI and app.py load it automatically; pre-set environment variables always win, so a Space's secrets are never overridden by a stray local file.
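The "pre-set variables always win" behavior comes down to one idiom. A minimal sketch of such a loader (the function name and parsing rules here are illustrative, not zsgdp's actual implementation):

```python
import os


def load_dotenv_if_present(path: str = ".env") -> None:
    """Load KEY=VALUE lines from a .env file without overriding existing env vars."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault means a pre-set environment variable always wins,
            # so a Space's secrets are never clobbered by a local file.
            os.environ.setdefault(key.strip(), value.strip())
```

The key design choice is `os.environ.setdefault` rather than assignment.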


Pre-commit / pre-push hooks

python -m pip install pre-commit
pre-commit install --install-hooks --hook-type pre-commit --hook-type pre-push

Two stages:

  • pre-commit - fast static checks: trailing whitespace, end-of-file newline, JSON/YAML syntax, large-file guard, merge-conflict markers. Runs on every git commit.
  • pre-push - runs python -m zsgdp.cli preflight. Same as make preflight. Failing this blocks the push.

Skip the pre-commit checks on a specific commit with git commit --no-verify if you genuinely need to (e.g. a WIP commit). Skip the pre-push gate with git push --no-verify, but only if you have a separately verified preflight run.


Running tests

make test                 # full unittest discover
make regression           # snapshot fixture suite
make preflight            # everything except the benchmark smoke
make preflight-full       # adds an end-to-end benchmark smoke
make benchmark            # parses tests/regression/fixtures/ via the CLI

Or directly:

python -m unittest discover
python -m unittest tests.regression.test_regression
python -m zsgdp.cli preflight --root . --benchmark

Performance regressions are gated behind ZSGDP_REGRESSION_PERF=1:

ZSGDP_REGRESSION_PERF=1 python -m unittest tests.regression.test_regression

See tests/regression/README.md for the fixture format including the performance block.


Adding a regression fixture

  1. Drop the input under tests/regression/fixtures/<name>.input.<ext>.
  2. Parse it once locally and inspect the output:
    python -m zsgdp.cli parse --input tests/regression/fixtures/<name>.input.<ext> --output /tmp/sanity
    
  3. Hand-write tests/regression/fixtures/<name>.expected.json with the tolerances you want to lock down. Prefer ranges over exact counts where reasonable variance exists.
  4. Optional: add a performance block with max_elapsed_seconds set to ~50–100x your local median (catastrophic-regression guard, not a tight bar).
  5. Run make regression to confirm the fixture is picked up.
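Step 3's "prefer ranges over exact counts" could look like the sketch below. The fixture keys and the two-element-list range convention here are hypothetical, used only to illustrate the idea; see tests/regression/README.md for the real fixture format:

```python
def within(expected, actual) -> bool:
    """Check an expected value that is either exact or a [lo, hi] range."""
    if isinstance(expected, list) and len(expected) == 2:
        lo, hi = expected
        return lo <= actual <= hi
    return expected == actual


# Hypothetical expected.json fragment: a range where counts vary
# run-to-run, an exact value where they should not.
expected = {"num_chunks": [40, 60], "num_tables": 3}
actual = {"num_chunks": 52, "num_tables": 3}
assert all(within(v, actual[k]) for k, v in expected.items())
```

Ranges keep the fixture green under harmless variance while still catching a collapse to zero or an explosion in counts.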

Adding a parser adapter

  1. Subclass BaseParser in zsgdp/parsers/<name>_parser.py (or extend external.py for shell-out adapters).
  2. Set name and supported_file_types, and implement available() and parse(path, profile, config, *, pages=None).
  3. Register in zsgdp/parsers/registry.py.
  4. If the parser produces Markdown, write a normalizer under zsgdp/normalize/normalize_<name>.py that returns a ParseCandidate via normalize_markdown_candidate(...).
  5. Add a config block to configs/default.yaml with enabled: false plus any CLI flags the adapter needs.
  6. Add the dependency to pyproject.toml as an optional extra. Don't pin it in the top-level requirements.txt unless it's free to install on every Space build.
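Steps 1 and 2 can be sketched as a skeleton. BaseParser's real interface lives in zsgdp/parsers/; the class below stands alone for illustration (in practice you subclass BaseParser), and myformat_lib is a hypothetical optional dependency:

```python
class MyFormatParser:  # in practice: subclass zsgdp.parsers.BaseParser
    """Hypothetical adapter skeleton for a new input format."""

    name = "myformat"
    supported_file_types = (".myf",)

    def available(self) -> bool:
        # Feature-detect the optional dependency instead of importing
        # it at module load, so the adapter degrades cleanly.
        try:
            import myformat_lib  # optional extra declared in pyproject.toml
            return True
        except ImportError:
            return False

    def parse(self, path, profile, config, *, pages=None):
        # Produce a ParseCandidate here (via the normalizer for
        # Markdown-producing parsers; see step 4).
        raise NotImplementedError
```

With enabled: false in configs/default.yaml (step 5), the adapter stays inert until someone opts in.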

Adding a metric

Pure metrics live under zsgdp/verify/:

  1. Define inputs as plain dicts/lists (not ParsedDocument-keyed) so the same metric works on per-parser candidate snapshots, not just the merged document.
  2. Pin definitions in the module docstring: exact denominator, handling of empty inputs, what each return key means.
  3. Surface in zsgdp/benchmarks/parser_quality.py:
    • Add per-document fields to the doc_record.
    • Add aggregated means to the top-level summary dict.
    • Add a per-document CSV writer if it has detail worth its own file.
  4. Add tests for: perfect input, no-match input, partial overlap, vacuous empty/empty case, and a benchmark-integration test that asserts the metric appears in summary["documents"][0].
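A minimal sketch in this style (the metric itself is hypothetical; the point is the pinned docstring from step 2 and the plain-dict output from step 1):

```python
def table_cell_recall(expected_cells: list[str], found_cells: list[str]) -> dict:
    """Hypothetical metric in the zsgdp/verify/ style.

    Definitions pinned here, per the rules above:
    - recall = |set(expected) & set(found)| / |set(expected)|.
    - Empty expected input is vacuously perfect: recall = 1.0, matched = 0.
    - Return keys: recall (float), matched (int), expected (int).
    """
    expected = set(expected_cells)
    if not expected:
        return {"recall": 1.0, "matched": 0, "expected": 0}
    matched = len(expected & set(found_cells))
    return {
        "recall": matched / len(expected),
        "matched": matched,
        "expected": len(expected),
    }
```

Because the inputs are plain lists, the same function runs against a per-parser candidate snapshot or the merged document without adaptation.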

Schema versioning

zsgdp.schema.SCHEMA_VERSION lives in zsgdp/schema/document.py. It's surfaced into artifact_manifest.json as parsed_document_schema_version so a consumer reading old output can gate.

Bump rules:

  • Additive change (new optional field with a default) - bump the minor (1.0 → 1.1).
  • Breaking change (renamed/removed field, semantics changed) - bump the major (1.0 → 2.0). Update the regression fixtures in the same PR; downstream consumers will need a migration.
  • No change - leave it alone.

When you bump, add an entry to CHANGELOG.md under "### Schema" with the version and what changed.
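A consumer-side gate, sketched. The manifest key name is taken from this section; the rest (function name, error wording, the choice to gate only on the major) is an assumption consistent with the bump rules above:

```python
def check_schema(manifest: dict, supported_major: int = 1) -> None:
    """Gate on parsed_document_schema_version before reading old output.

    Minor bumps are additive, so any 1.x output is readable by a 1.x
    consumer; a different major means a migration is required.
    """
    version = manifest.get("parsed_document_schema_version", "0.0")
    major = int(str(version).split(".")[0])
    if major != supported_major:
        raise ValueError(
            f"unsupported schema major {major} (got {version}); migration required"
        )
```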


Logging

Use from zsgdp.logging_config import get_logger, then logger = get_logger(__name__). Call .info/.warning/.error with structured extra={...} fields rather than f-string-formatted messages where possible: the JSON formatter promotes extra keys to top-level fields, so the HF Spaces logs page stays greppable.

Default log level is WARNING (CLI summaries unaffected). Opt in with ZSGDP_LOG_LEVEL=INFO and ZSGDP_LOG_JSON=1 for Space-style output.
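The pattern looks like the stand-in below. This is a simplified sketch of the idea, not zsgdp.logging_config's actual code; the formatter, logger name, and promoted keys are illustrative:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, promoting extra keys to top level."""

    PROMOTED = ("doc_id", "parser")  # illustrative extra keys

    def format(self, record):
        payload = {"level": record.levelname, "msg": record.getMessage()}
        # extra={...} fields land as attributes on the record; promote
        # the known ones so log lines are greppable by field.
        for key in self.PROMOTED:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


logger = logging.getLogger("zsgdp.example")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Structured fields, not an f-string baked into the message:
logger.warning("parse_failed", extra={"doc_id": "fixture-1", "parser": "docling"})
```

The payoff: grepping the Spaces logs for "docling" finds a clean field, not a fragment of a formatted sentence.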


Pull request checklist

Before opening a PR:

  • make preflight passes locally.
  • If you added a metric, an adapter, or changed the schema, you updated CHANGELOG.md.
  • If you changed parser behavior, you ran make regression and any fixture drift is intentional (and the snapshot was regenerated explicitly).
  • If your change touches GPU/model code paths, you flagged it for Space-side smoke testing in the PR description (the smoke checklist covers what to run).
  • You did not commit .env or any secret. The .gitignore should catch this; if you suspect a leak, treat the token as compromised and rotate it.

Architecture quick map

  • zsgdp/profiling/ - page-level features and labels.
  • zsgdp/routing/ - deterministic page → expert mapping.
  • zsgdp/parsers/ - adapters; one canonical schema regardless of source.
  • zsgdp/normalize/ - convert each parser's output into the schema.
  • zsgdp/merge/ - align candidates, dedupe, detect conflicts.
  • zsgdp/verify/ - coverage, reading order, table/figure/formula/chunk quality, GT-comparison metrics (layout F1, table structure, formula CER, retrieval recall), parser disagreement and repair success rates.
  • zsgdp/repair/ - deterministic header/table fixes plus GPU escalation that dispatches to gpu/worker.py.
  • zsgdp/chunking/ - agentic planner + structure-aware / parent-child / table / figure / page chunk builders, with deterministic stubs for semantic / late / vision-guided / proposition chunking.
  • zsgdp/gpu/ - task planning, batching, dry-run worker, transformers and vLLM clients.
  • zsgdp/benchmarks/ - dataset loaders, metric runners, ablation, cross-dataset comparison, retrieval (lexical + embedding).
  • zsgdp/cli.py - single entry point exposing all of the above.
  • app.py - Gradio Space front-end.

The full spec lives in zero_shot_gpu_document_parser_project_spec.md. The 2000-line read isn't required to contribute, but §10 (schema) and §17 (chunking ladder) are worth skimming if you're touching those modules.