A newer version of the Gradio SDK is available: 6.14.0
pdfsys-core
Shared data contracts for the pdfsys pipeline. Every other package depends on this one.
What's in here
- Enums:
RegionType(TEXT / IMAGE / TABLE / FORMULA),Backend(MUPDF / PIPELINE / VLM / DEFERRED). - PdfRecord: Frozen dataclass for per-PDF metadata (sha256, source_uri, size, provenance).
- Layout schema:
BBox(normalized [0,1]),LayoutRegion,LayoutPage,LayoutDocument— the contract between layout-analyser and every parser backend. - ExtractedDoc / Segment: Backend-agnostic output schema. All three parser backends emit these.
- LayoutCache: Content-addressable on-disk cache for LayoutDocuments, keyed by
sha256 + model_tag. - PdfsysConfig: Hierarchical configuration (paths, router, layout, per-backend settings, runtime).
- Serde: Generic
to_dict()/from_dict()for all the above dataclasses.
Key design rule
This package has zero external dependencies — stdlib only. Do not add pymupdf, torch, or anything else here. The types must be importable everywhere without pulling in heavy ML libraries.