---
title: XmLLM
emoji: "\U0001F4C4"
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: "Document structure engine: OCR output to ALTO XML & PAGE XML"
---
# XmLLM
**Canonical-first document structure engine** that converts OCR/VLM provider outputs into validated **ALTO XML** and **PAGE XML**, via an internal canonical representation.
XmLLM is not tied to any specific OCR model. It is centered on a **canonical document contract**: a normalized, provenance-tracked, geometry-aware internal model that absorbs heterogeneous provider outputs and produces standards-compliant XML exports.
## Key features
- **Dual native export**: ALTO XML v4 and PAGE XML 2019 from the same canonical model
- **Provider-agnostic**: adapters for PaddleOCR (word+polygon), line-level OCR, and text-only mLLM
- **Full provenance**: every node tracks how its data was obtained (native, derived, repaired, manual)
- **Geometry subsystem**: normalization, transforms, quantization, tolerance-based containment
- **Validation pipeline**: structural checks, readiness assessment, export eligibility with configurable policy
- **Enrichment pipeline**: polygon-to-bbox, language propagation, reading order inference, hyphenation detection
- **Job orchestration**: 13-step pipeline with event logging, artifact persistence, state machine
- **REST API**: FastAPI with OpenAPI docs, provider management, job lifecycle, export downloads
- **Deployable anywhere**: same code runs locally, in Docker, and on Hugging Face Spaces
## Architecture
The system is organized in four concentric layers (rings):
```
┌───────────────────────────────────────────────────────────┐
│  D: Presentation (frontend, viewer)                       │
│  ┌─────────────────────────────────────────────────────┐  │
│  │  C: API (FastAPI routes, request/response)          │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │  B: Execution (providers, jobs, persistence)  │  │  │
│  │  │  ┌─────────────────────────────────────────┐  │  │  │
│  │  │  │  A: Domain (models, geometry,           │  │  │  │
│  │  │  │     validators, enrichers, serializers) │  │  │  │
│  │  │  └─────────────────────────────────────────┘  │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
```
**Dependencies flow inward only.** The domain layer has no dependency on FastAPI, the database, or the frontend.
### Three internal objects
| Object | Role | Never used for |
|---|---|---|
| `RawProviderPayload` | Source truth: raw provider output, stored for audit | Export, rendering |
| `CanonicalDocument` | Business truth: normalized, validated, enriched | Direct UI rendering |
| `ViewerProjection` | Rendering truth: lightweight overlays for the viewer | Validation, export decisions |
### Processing pipeline
```
Input (image / raw JSON)
  → Provider Runtime (local / hub / api)
  → Raw Provider Payload (stored)
  → Adapter / Normalization (provider-specific → canonical)
  → CanonicalDocument
  → Enrichers (polygon→bbox, lang propagation, reading order, hyphenation)
  → Validators (structural, readiness, export eligibility)
  → Document Policy (strict / standard / permissive)
  → ALTO XML Serializer → alto.xml
  → PAGE XML Serializer → page.xml
  → Viewer Projection → viewer.json
  → Persistence (SQLite + filesystem)
```
## Quick start
### Local (Python)
```bash
# Clone and install
git clone https://github.com/maribakulj/XmLLM.git
cd XmLLM
pip install -e ".[dev]"
# Configure
cp .env.example .env
# Run
uvicorn src.app.main:app --host 0.0.0.0 --port 7860
# Open http://localhost:7860 for the web UI
# Open http://localhost:7860/docs for the API documentation
```
### Docker
```bash
docker compose up --build
# Open http://localhost:7860
```
### Hugging Face Spaces
The same Docker image runs as a Docker Space. Enable persistent storage so that `/data` is available, then set:
```
HF_HOME=/data/.huggingface
STORAGE_ROOT=/data
```
The application auto-detects Space mode via the `SPACE_ID` environment variable.
## API reference
All endpoints are documented at `/docs` (Swagger UI) when the server is running.
### Health
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Status, version, execution mode |
### Providers
| Method | Path | Description |
|---|---|---|
| `POST` | `/providers` | Register a provider profile |
| `GET` | `/providers` | List all registered providers |
| `GET` | `/providers/{id}` | Get provider details |
| `DELETE` | `/providers/{id}` | Delete a provider |
### Jobs
| Method | Path | Description |
|---|---|---|
| `POST` | `/jobs` | Create and run a job (upload raw payload JSON) |
| `GET` | `/jobs` | List jobs (with pagination) |
| `GET` | `/jobs/{id}` | Get job details and status |
| `GET` | `/jobs/{id}/logs` | Get pipeline event log |
### Exports
| Method | Path | Description |
|---|---|---|
| `GET` | `/jobs/{id}/raw` | Download raw provider payload |
| `GET` | `/jobs/{id}/canonical` | Download canonical document JSON |
| `GET` | `/jobs/{id}/alto` | Download ALTO XML |
| `GET` | `/jobs/{id}/pagexml` | Download PAGE XML |
| `GET` | `/jobs/{id}/viewer` | Get viewer projection JSON |
### Example: run a job via API
```bash
curl -X POST "http://localhost:7860/jobs?provider_id=paddleocr&provider_family=word_box_json&image_width=2480&image_height=3508" \
-F "raw_payload_file=@paddle_output.json"
```
## Canonical document
The `CanonicalDocument` is the central model. It represents **what the system knows about the page**, not what a specific model produced.
### Hierarchy
```
CanonicalDocument
└── Page[]
    ├── TextRegion[] (blocks)
    │   └── TextLine[]
    │       └── Word[]
    └── NonTextRegion[] (illustrations, tables, separators)
```
### Every node carries
- **geometry**: `bbox: (x, y, width, height)` + optional `polygon` + `status` (exact / inferred / repaired / unknown)
- **provenance**: `provider`, `adapter`, `source_ref`, `evidence_type` (provider_native / derived / repaired / manual), `derived_from`
- **metadata**: extensible `dict` for future fields without schema changes
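
As a rough illustration of the three attributes listed above, here is a minimal sketch using standard-library dataclasses. The project itself validates with Pydantic v2, so this is not its actual model code; the `id` and `text` fields, the default values, and the example `source_ref` string are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Geometry:
    # (x, y, width, height) in pixels, origin at the top-left corner
    bbox: tuple[float, float, float, float]
    polygon: Optional[list[tuple[float, float]]] = None
    status: str = "exact"  # exact / inferred / repaired / unknown


@dataclass(frozen=True)
class Provenance:
    provider: str
    adapter: str
    source_ref: str
    evidence_type: str = "provider_native"  # or derived / repaired / manual
    derived_from: tuple[str, ...] = ()


@dataclass(frozen=True)
class Word:
    id: str    # hypothetical field name
    text: str  # hypothetical field name
    geometry: Geometry
    provenance: Provenance
    metadata: dict[str, Any] = field(default_factory=dict)


word = Word(
    id="w1",
    text="Hello",
    geometry=Geometry(bbox=(10.0, 20.0, 50.0, 12.0)),
    provenance=Provenance(
        provider="paddleocr",
        adapter="word_box_json",
        source_ref="words[0]",  # illustrative reference format
    ),
)
```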
### Geometry conventions
| Convention | Value |
|---|---|
| bbox format | `(x, y, width, height)` |
| Coordinate origin | `top_left` |
| Unit | `px` |
| Polygon | `list[tuple[float, float]]` or `None` |
Providers returning `(x1, y1, x2, y2)` are converted in their adapter. No serializer performs implicit geometry conversion.
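
The conversions an adapter performs can be sketched as follows. The function names are hypothetical and are not taken from `normalization.py`; the arithmetic simply follows the `(x, y, width, height)` convention in the table above.

```python
def xyxy_to_xywh(x1: float, y1: float, x2: float, y2: float) -> tuple[float, float, float, float]:
    """Convert corner coordinates to (x, y, width, height)."""
    if x2 < x1 or y2 < y1:
        raise ValueError("expected x2 >= x1 and y2 >= y1")
    return (x1, y1, x2 - x1, y2 - y1)


def polygon_to_xywh(points: list[tuple[float, float]]) -> tuple[float, float, float, float]:
    """Return the axis-aligned box enclosing a polygon (for example, a 4-point quad)."""
    if not points:
        raise ValueError("polygon has no points")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


box = xyxy_to_xywh(10, 20, 110, 50)
quad_box = polygon_to_xywh([(10, 20), (110, 22), (108, 52), (8, 50)])
```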
## Provider system
The provider system separates three concerns:
| Layer | Question | Examples |
|---|---|---|
| **Runtime** | How do I execute it? | `local`, `hub`, `api` |
| **Family** | What shape is the output? | `word_box_json`, `line_box_json`, `text_only` |
| **Profile** | What is this specific instance? | PaddleOCR local at `/models/paddle`, Qwen API at `https://...` |
### Adapter families
| Family | Output shape | Geometry | ALTO export |
|---|---|---|---|
| `word_box_json` | Words with 4-point polygons (PaddleOCR) | Exact | Full |
| `line_box_json` | Lines with bboxes, no word segmentation | Exact (line-level) | Full (1 word per line) |
| `text_only` | Structured text, no coordinates (mLLM) | Unknown | Refused (honest) |
### Capability matrix
Each provider profile declares a `CapabilityMatrix`:
```
block_geometry, line_geometry, word_geometry, polygon_geometry,
baseline, reading_order, text_confidence, language,
non_text_regions, tables, rotation_info
```
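One way to picture such a matrix is a set of boolean flags, one per capability named above. This is a sketch only: the defaults, the `has_any_geometry` helper, and the two example declarations (including `text_confidence` for a word-box provider) are assumptions, not the contents of `capabilities.py`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CapabilityMatrix:
    block_geometry: bool = False
    line_geometry: bool = False
    word_geometry: bool = False
    polygon_geometry: bool = False
    baseline: bool = False
    reading_order: bool = False
    text_confidence: bool = False
    language: bool = False
    non_text_regions: bool = False
    tables: bool = False
    rotation_info: bool = False

    def has_any_geometry(self) -> bool:
        # Hypothetical helper: true if any coordinate-bearing capability is declared.
        return any(
            (self.block_geometry, self.line_geometry,
             self.word_geometry, self.polygon_geometry)
        )


# Illustrative declarations for two adapter families (assumed values).
word_box_caps = CapabilityMatrix(
    word_geometry=True, polygon_geometry=True, text_confidence=True
)
text_only_caps = CapabilityMatrix()
```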
## Validation and policy
### Four validators
| Validator | Checks |
|---|---|
| **Schema** | Pydantic v2 model validation with structured error report |
| **Structural** | ID uniqueness, reading order references, bbox containment (configurable tolerance) |
| **Readiness** | Per-page ALTO/PAGE readiness: full, partial, degraded, or none |
| **Export eligibility** | Independent go/no-go for ALTO, PAGE, and viewer |
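
The bbox containment check with a tolerance can be sketched as below. The function name and signature are hypothetical; the default of 5 px mirrors the `BBOX_CONTAINMENT_TOLERANCE` default, and boxes use the `(x, y, width, height)` convention.

```python
BBox = tuple[float, float, float, float]  # (x, y, width, height)


def contains_with_tolerance(parent: BBox, child: BBox, tolerance: float = 5.0) -> bool:
    """True if child lies inside parent, allowing `tolerance` pixels of overflow per side."""
    px, py, pw, ph = parent
    cx, cy, cw, ch = child
    return (
        cx >= px - tolerance
        and cy >= py - tolerance
        and cx + cw <= px + pw + tolerance
        and cy + ch <= py + ph + tolerance
    )


line = (100.0, 100.0, 200.0, 50.0)
slightly_out = (97.0, 100.0, 50.0, 20.0)  # overflows the left edge by 3 px
far_out = (90.0, 100.0, 50.0, 20.0)       # overflows the left edge by 10 px
```

With the default tolerance, `slightly_out` passes and `far_out` fails; raising the tolerance to 10 px accepts both.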
### Document policy
Three modes controlling what the system may do:
| Mode | Inference | Partial exports | Tolerance |
|---|---|---|---|
| `strict` | No polygon-to-bbox, no lang propagation, no reading order inference | Refused | 5px |
| `standard` (default) | Polygon-to-bbox, lang propagation, reading order, hyphenation | Allowed | 5px |
| `permissive` | All enrichments enabled | Allowed | 10px |
All modes enforce: **no text invention**, **no bbox invention**.
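
The mode table can be read as a small lookup structure. The sketch below is an assumption about shape, not the code in `document_policy.py`: the class, field, and function names are hypothetical, and hyphenation is left out because the table does not state how `strict` treats it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentPolicy:
    polygon_to_bbox: bool
    lang_propagation: bool
    reading_order_inference: bool
    partial_exports: bool
    tolerance_px: float


POLICIES = {
    "strict": DocumentPolicy(
        polygon_to_bbox=False, lang_propagation=False,
        reading_order_inference=False, partial_exports=False, tolerance_px=5.0,
    ),
    "standard": DocumentPolicy(
        polygon_to_bbox=True, lang_propagation=True,
        reading_order_inference=True, partial_exports=True, tolerance_px=5.0,
    ),
    "permissive": DocumentPolicy(
        polygon_to_bbox=True, lang_propagation=True,
        reading_order_inference=True, partial_exports=True, tolerance_px=10.0,
    ),
}


def allowed_enrichers(mode: str = "standard") -> list[str]:
    """Names of the inference enrichers a mode permits, in table order."""
    policy = POLICIES[mode]
    flags = {
        "polygon_to_bbox": policy.polygon_to_bbox,
        "lang_propagation": policy.lang_propagation,
        "reading_order_simple": policy.reading_order_inference,
    }
    return [name for name, enabled in flags.items() if enabled]
```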
## Enrichers
Enrichers run after normalization, before validation. Each produces a new immutable document.
| Enricher | What it does | Provenance |
|---|---|---|
| `polygon_to_bbox` | Derives bbox from polygon when geometry is `unknown` | `inferred` |
| `bbox_repair_light` | Clips bboxes overflowing page boundaries | `repaired` |
| `lang_propagation` | Propagates language from region/line to child nodes | unchanged |
| `reading_order_simple` | Infers reading order by spatial position (top-to-bottom, left-to-right) | `inferred` |
| `hyphenation_basic` | Detects word-ending `-` at line boundary with lowercase continuation | `inferred` |
| `text_consistency` | Warns on blank or suspiciously long words (>100 chars) | warnings only |
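
To make the `hyphenation_basic` rule concrete, here is a minimal sketch that works on plain lists of word strings rather than canonical nodes. The function name and return shape are hypothetical; the rule itself (trailing `-` on the last word of a line, lowercase start on the next line) is the one described in the table.

```python
def detect_hyphenation(lines: list[list[str]]) -> list[tuple[int, str]]:
    """Return (line_index, joined_word) for each probable hyphenated line break."""
    hits = []
    for i in range(len(lines) - 1):
        current, following = lines[i], lines[i + 1]
        if not current or not following:
            continue  # an empty line cannot take part in a hyphenation
        last, first = current[-1], following[0]
        # A lone "-" is treated as punctuation, not as a word fragment.
        if len(last) > 1 and last.endswith("-") and first[:1].islower():
            hits.append((i, last[:-1] + first))
    return hits


page = [
    ["The", "docu-"],
    ["ment", "is", "ready"],
    ["A", "well-"],
    ["Known", "fact"],
]
hits = detect_hyphenation(page)
```

Only the first break is reported, because the continuation `Known` starts with an uppercase letter. The joined form is the kind of value an ALTO `SUBS_CONTENT` attribute carries.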
## Export formats
### ALTO XML v4
- Namespace: `http://www.loc.gov/standards/alto/ns-v4#`
- Mapping: `Page` / `TextBlock` / `TextLine` / `String`
- Attributes: `HPOS`, `VPOS`, `WIDTH`, `HEIGHT` (integer), `CONTENT`, `WC`, `LANG`
- Hyphenation: `SUBS_TYPE` (HypPart1/HypPart2), `SUBS_CONTENT`
- Includes `<Description>` with measurement unit, filename, processing software
### PAGE XML 2019
- Namespace: `http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15`
- Mapping: `TextRegion` / `TextLine` / `Word`
- Coordinates: `<Coords points="x1,y1 x2,y2 ...">` (preserves polygons when available)
- `<TextEquiv><Unicode>` at region, line, and word levels
- `<ReadingOrder>` / `<OrderedGroup>` / `<RegionRefIndexed>`
- Region `@type` mapped from block role (paragraph, heading, footnote, etc.)
### Key difference
ALTO uses axis-aligned bounding boxes (integers). PAGE XML uses polygons (preserves original quadrilateral geometry from providers like PaddleOCR).
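
The difference shows up directly in how one word is written out. The sketch below uses `xml.etree.ElementTree` and is not the project's serializer: the function names are hypothetical, namespaces and element IDs are omitted for brevity, and Python's built-in `round` stands in for whichever quantization strategy the project applies.

```python
import xml.etree.ElementTree as ET


def alto_string(text: str, bbox: tuple[float, float, float, float]) -> ET.Element:
    """Build an ALTO-style String element from an (x, y, width, height) box."""
    x, y, w, h = bbox
    return ET.Element("String", {
        "HPOS": str(round(x)),
        "VPOS": str(round(y)),
        "WIDTH": str(round(w)),
        "HEIGHT": str(round(h)),
        "CONTENT": text,
    })


def page_word(text: str, polygon: list[tuple[float, float]]) -> ET.Element:
    """Build a PAGE-style Word element that keeps the full polygon."""
    word = ET.Element("Word")
    coords = ET.SubElement(word, "Coords")
    coords.set("points", " ".join(f"{round(px)},{round(py)}" for px, py in polygon))
    equiv = ET.SubElement(word, "TextEquiv")
    ET.SubElement(equiv, "Unicode").text = text
    return word


alto = alto_string("Hello", (10.4, 20.6, 50.0, 12.2))
page = page_word("Hello", [(10, 20), (60, 21), (59, 33), (9, 32)])
```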
## Project structure
```
XmLLM/
pyproject.toml # Dependencies and build config
Dockerfile # Deployable container
docker-compose.yml # Local dev with volume
AGENTS.md # Architecture rules (non-negotiable)
.env.example # Configuration reference
src/app/
main.py # FastAPI app entry point
settings.py # SettingsService (auto-detects local vs Space)
api/ # Ring C: HTTP routes
routes_health.py
routes_providers.py
routes_jobs.py
routes_exports.py
routes_viewer.py
domain/ # Ring A: Pure domain
models/
canonical_document.py # CanonicalDocument, Word, TextLine, TextRegion, Page
geometry.py # Point, BBox, Polygon, Baseline, Geometry, GeometryContext
provenance.py # Provenance with conditional validation
readiness.py # AltoReadiness, PageXmlReadiness, ExportEligibility
status.py # 12 domain enums
raw_payload.py # RawProviderPayload
viewer_projection.py # OverlayItem, InspectionData, ViewerProjection
errors/ # ValidationReport, ValidationEntry, Severity
geometry/ # Geometric operations
bbox.py # contains, intersects, union, iou, expand (12 ops)
polygon.py # polygon<->bbox, area, centroid, validation (7 ops)
baseline.py # length, angle, interpolation
transforms.py # rescale, clip, rotate, translate
normalization.py # xyxy->xywh, 4-point->bbox, pixel<->normalized
quantization.py # float->int strategies, tolerance checks
providers/ # Ring B: Provider system
registry.py # Central adapter + runtime index
resolver.py # Profile -> runtime + adapter
profiles.py # ProviderProfile model
capabilities.py # CapabilityMatrix
runtimes/
base.py # BaseRuntime ABC
local_runtime.py
hub_runtime.py
api_runtime.py
adapters/
base.py # BaseAdapter ABC
word_box_json.py # PaddleOCR format
line_box_json.py # Line-level OCR
text_only.py # mLLM without geometry
normalization/
pipeline.py # Raw -> CanonicalDocument orchestration
canonical_builder.py # Fluent builder for CanonicalDocument
enrichers/
__init__.py # BaseEnricher ABC + EnricherPipeline
polygon_to_bbox.py
bbox_repair_light.py
lang_propagation.py
reading_order_simple.py
hyphenation_basic.py
text_consistency.py
validators/
schema_validator.py
structural_validator.py
readiness_validator.py
export_eligibility_validator.py
policies/
document_policy.py # Strict / standard / permissive modes
export_policy.py # Per-format go/no-go decisions
serializers/
alto_xml.py # CanonicalDocument -> ALTO XML v4
page_xml.py # CanonicalDocument -> PAGE XML 2019
viewer/
projection_builder.py # CanonicalDocument -> ViewerProjection
overlays.py # Node -> OverlayItem/InspectionData
jobs/
models.py # Job model (5-state machine)
events.py # EventLog with timed steps
service.py # JobService (13-step pipeline orchestrator)
persistence/
db.py # SQLite (jobs + providers)
file_store.py # Filesystem artifact store
frontend/static/
index.html # Single-page web UI
tests/
fixtures/ # 7 test fixtures (simple, columns, noisy, etc.)
unit/ # 24 unit test modules
integration/ # 5 integration test modules
```
## Configuration
Copy `.env.example` to `.env` and adjust:
| Variable | Default | Description |
|---|---|---|
| `APP_MODE` | `local` | `local` or `space` (auto-detected from `SPACE_ID`) |
| `STORAGE_ROOT` | `./data` | Root for all persistent data |
| `DB_NAME` | `app.db` | SQLite database filename |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `7860` | Server port |
| `MAX_UPLOAD_SIZE` | `52428800` | Max upload size in bytes (50 MB) |
| `ALLOWED_MIME_TYPES` | `image/png,jpeg,tiff,webp` | Accepted upload types |
| `PROVIDER_TIMEOUT` | `120` | Provider execution timeout (seconds) |
| `BBOX_CONTAINMENT_TOLERANCE` | `5` | Pixels of allowed bbox overflow |
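
A settings loader along these lines could read a subset of the variables above. This is a sketch under assumptions, not the code in `settings.py`: the `Settings` class and `load_settings` function are hypothetical, and letting `SPACE_ID` take precedence over `APP_MODE` is a guess at how the auto-detection behaves.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    app_mode: str
    storage_root: str
    db_name: str
    port: int
    provider_timeout: int
    bbox_containment_tolerance: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, falling back to the documented defaults."""
    # Assumption: a non-empty SPACE_ID forces "space" mode regardless of APP_MODE.
    mode = "space" if env.get("SPACE_ID") else env.get("APP_MODE", "local")
    return Settings(
        app_mode=mode,
        storage_root=env.get("STORAGE_ROOT", "./data"),
        db_name=env.get("DB_NAME", "app.db"),
        port=int(env.get("PORT", "7860")),
        provider_timeout=int(env.get("PROVIDER_TIMEOUT", "120")),
        bbox_containment_tolerance=float(env.get("BBOX_CONTAINMENT_TOLERANCE", "5")),
    )


defaults = load_settings({})
on_space = load_settings({"SPACE_ID": "user/space", "STORAGE_ROOT": "/data"})
```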
## Testing
```bash
# Run all tests
pytest
# With coverage
pytest --cov=src --cov-report=term-missing
# Only unit tests
pytest tests/unit/
# Only integration tests
pytest tests/integration/
```
**497 tests** covering:
- Domain models (validation, rejection, JSON round-trips)
- Geometry operations (all transforms, containment, quantization)
- Adapters (PaddleOCR, line-box, text-only formats)
- Serializers (ALTO structure, PAGE structure, hyphenation, polygons)
- Validators (structural, readiness, export eligibility)
- Enrichers (all 6 enrichers + pipeline + policy control)
- Persistence (file store, SQLite CRUD)
- API routes (providers, jobs, exports, viewer)
- End-to-end fixtures (simple page, double column, noisy page, title+body, hyphenation, text-only)
## V1 scope
### Included
- Single image input
- Local, Hub, and API runtimes (skeleton only; raw payloads are provided directly in V1)
- 3 adapter families (word_box_json, line_box_json, text_only)
- Full CanonicalDocument with provenance and geometry
- ALTO XML v4 and PAGE XML 2019 native export
- 6 enrichers with policy control
- 4 validators with configurable tolerance
- Job orchestration with event logging
- SQLite + filesystem persistence
- REST API with OpenAPI docs
- Web UI for upload, job management, and export download
- Docker deployment
### Excluded (V2+)
- PDF multipage input
- Live model execution (currently raw payloads are provided externally)
- Manual editing of canonical documents
- Multi-user collaboration
- Batch processing
- Fine-tuning
- Advanced table extraction
- OpenSeadragon interactive viewer with overlays
- Authentication
## License
Apache 2.0. See [LICENSE](LICENSE).