# pdfsys-core
Shared data contracts for the pdfsys pipeline. Every other package depends on this one.
## What's in here
- **Enums**: `RegionType` (TEXT / IMAGE / TABLE / FORMULA), `Backend` (MUPDF / PIPELINE / VLM / DEFERRED).
- **PdfRecord**: Frozen dataclass for per-PDF metadata (sha256, source_uri, size, provenance).
- **Layout schema**: `BBox` (normalized [0,1]), `LayoutRegion`, `LayoutPage`, `LayoutDocument` — the contract between layout-analyser and every parser backend.
- **ExtractedDoc / Segment**: Backend-agnostic output schema. All three parser backends emit these.
- **LayoutCache**: Content-addressable on-disk cache for LayoutDocuments, keyed by `sha256 + model_tag`.
- **PdfsysConfig**: Hierarchical configuration (paths, router, layout, per-backend settings, runtime).
- **Serde**: Generic `to_dict()` / `from_dict()` helpers for all of the dataclasses above.
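
As a rough illustration of how these pieces might fit together, here is a minimal stdlib-only sketch. It is not the package's actual code: the field names (`x0`, `y0`, `x1`, `y1`, `region_type`, `bbox`), the enum string values, the helper `layout_region_from_dict`, and the `cache_path` directory layout are all assumptions made for the example.

```python
from __future__ import annotations

import dataclasses
import enum
from dataclasses import dataclass
from pathlib import Path
from typing import Any


class RegionType(enum.Enum):
    # Member names come from this README; the string values are assumed.
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"
    FORMULA = "formula"


@dataclass(frozen=True)
class BBox:
    """Box in page-normalized coordinates (hypothetical field names)."""

    x0: float
    y0: float
    x1: float
    y1: float

    def __post_init__(self) -> None:
        # Assumed invariant: corners are ordered and lie inside [0, 1].
        ok_x = 0.0 <= self.x0 <= self.x1 <= 1.0
        ok_y = 0.0 <= self.y0 <= self.y1 <= 1.0
        if not (ok_x and ok_y):
            raise ValueError(f"BBox outside normalized range: {self}")


@dataclass(frozen=True)
class LayoutRegion:
    region_type: RegionType
    bbox: BBox


def to_dict(obj: Any) -> Any:
    """Recursively turn dataclasses and enums into JSON-friendly values."""
    if isinstance(obj, enum.Enum):
        return obj.value
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: to_dict(getattr(obj, f.name)) for f in dataclasses.fields(obj)}
    if isinstance(obj, (list, tuple)):
        return [to_dict(v) for v in obj]
    return obj


def layout_region_from_dict(data: dict[str, Any]) -> LayoutRegion:
    """Hand-written inverse of to_dict() for one type (hypothetical helper)."""
    return LayoutRegion(
        region_type=RegionType(data["region_type"]),
        bbox=BBox(**data["bbox"]),
    )


def cache_path(root: Path, sha256: str, model_tag: str) -> Path:
    """Content-addressed location keyed by sha256 + model_tag (assumed layout)."""
    return root / sha256[:2] / f"{sha256}.{model_tag}.json"
```

Under these assumptions, `layout_region_from_dict(to_dict(region))` returns a value equal to `region`, and changing either the PDF hash or the model tag yields a different cache path.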
## Key design rule
This package has **zero external dependencies** — stdlib only. Do not add pymupdf, torch, or anything else here. The types must be importable everywhere without pulling in heavy ML libraries.