| # pdfsys-core | |
| Shared data contracts for the pdfsys pipeline. Every other package depends on this one. | |
| ## What's in here | |
| - **Enums**: `RegionType` (TEXT / IMAGE / TABLE / FORMULA), `Backend` (MUPDF / PIPELINE / VLM / DEFERRED). | |
| - **PdfRecord**: Frozen dataclass for per-PDF metadata (sha256, source_uri, size, provenance). | |
| - **Layout schema**: `BBox` (normalized [0,1]), `LayoutRegion`, `LayoutPage`, `LayoutDocument` — the contract between layout-analyser and every parser backend. | |
| - **ExtractedDoc / Segment**: Backend-agnostic output schema. All three parser backends emit these. | |
| - **LayoutCache**: Content-addressable on-disk cache for LayoutDocuments, keyed by `sha256 + model_tag`. | |
| - **PdfsysConfig**: Hierarchical configuration (paths, router, layout, per-backend settings, runtime). | |
| - **Serde**: Generic `to_dict()` / `from_dict()` for all the above dataclasses. | |
| ## Key design rule | |
| This package has **zero external dependencies** — stdlib only. Do not add pymupdf, torch, or anything else here. The types must be importable everywhere without pulling in heavy ML libraries. | |