# PP-DocLayoutV3 TensorRT Layout Service

Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, orchestration, Hugging Face download, or engine build.

The model is mounted into the container as a TensorRT engine:

```text
host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
```

Runtime contract:

```text
rendered page image or cached benchmark request
  -> Rust HTTP server
  -> dynamic batch queue
  -> C++ TensorRT wrapper
  -> mounted PP-DocLayoutV3 engine
  -> layout boxes JSON
```

The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.

## Build

```bash
cd pp-doclayout-server
docker compose build doclayout
```

The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.

## Run

Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host, then start the service with the engine mounted:

```bash
cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose up
```

Health:

```bash
curl http://localhost:18082/health
```

Metrics:

```bash
curl http://localhost:18082/metrics
```

The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default.

## Endpoints

`POST /v1/infer`

Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost.

```bash
curl -s http://localhost:18082/v1/infer \
  -H 'content-type: application/json' \
  -d '{"return_boxes": false}'
```

`POST /v1/layout`

Convenience image endpoint for integration testing.
Send already-rendered page images as multipart fields named `file` or `files`.

The request body limit is explicit. `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default multipart limit is only 2 MB. This is a total request-body limit, not a per-page model batch limit.

```bash
curl -s http://localhost:18082/v1/layout \
  -F files=@inputs/sample.png \
  > outputs/sample_layout.json

curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
  -F files=@inputs/sample.png \
  > outputs/sample_layout_scan_recall.json
```

`POST /v1/layout_chw_u8`

Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.

```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
  -H 'content-type: application/octet-stream' \
  --data-binary @page_800_chw_u8.bin
```

Batched raw endpoint for server/orchestrator experiments:

```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
  -H 'content-type: application/octet-stream' \
  --data-binary @pages_b8_800_chw_u8.bin
```

The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`. This endpoint is only for layout boxes; there is no secondary text detector or OCR fallback in this service.

Rust client example:

```bash
cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
```

The example decodes a rendered page, resizes with the same 800x800 Triangle filter expected by the service, packs CHW `u8`, sends one request, and prints the JSON response.
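For orchestrators not written in Rust, the raw body layout can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the service's own client code: the helper names (`interleaved_to_planar`, `batch_body`, `batch_url`) are hypothetical, decoding and the 800x800 resize are assumed to have happened already, and the channel order is not stated in this document, so the sketch simply preserves whatever order the input pixels use.

```python
from urllib.parse import urlencode

SIDE = 800  # model input side, from the 3x800x800 contract described above


def interleaved_to_planar(pixels: bytes, width: int, height: int) -> bytes:
    """Repack interleaved 3-channel HWC bytes into planar CHW bytes."""
    if len(pixels) != width * height * 3:
        raise ValueError("expected width * height * 3 interleaved bytes")
    # Every third byte starting at offset c is channel c in row-major order,
    # so three strided slices yield the three planes.
    return pixels[0::3] + pixels[1::3] + pixels[2::3]


def batch_body(pages_chw: list) -> bytes:
    """Concatenate already-packed CHW pages into one contiguous body."""
    expected = 3 * SIDE * SIDE
    for index, page in enumerate(pages_chw):
        if len(page) != expected:
            raise ValueError(f"page {index} has {len(page)} bytes, expected {expected}")
    return b"".join(pages_chw)


def batch_url(server: str, batch: int, original_width: int, original_height: int) -> str:
    """Build the batched endpoint URL with the query parameters shown above."""
    query = urlencode(
        {
            "batch": batch,
            "width": SIDE,
            "height": SIDE,
            "original_width": original_width,
            "original_height": original_height,
        }
    )
    return f"{server}/v1/layout_chw_u8_batch?{query}"
```

A body built this way would be sent with `content-type: application/octet-stream`, as in the `curl` examples. The sketch sends a single `original_width`/`original_height` pair per batch only because that is what the example URL shows; whether a batch may mix page sizes is not covered here.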
Response shape:

```json
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {
          "label": "table",
          "class_id": 21,
          "score": 0.91,
          "bbox": [72, 140, 530, 420],
          "order": 3
        }
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
```

## Throughput Mode

Current clean compose result on RTX 4090 with the mounted validated engine:

```text
DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms
```

For lower latency, use `DOC_LAYOUT_WORKERS=2` and client concurrency around 32. That reproduced around 292 pages/s with p50 around 106 ms.

The raw TensorRT engine with host/device transfers enabled is around 316 pages/s on this machine, so the 3-worker server is now close to the practical engine ceiling.

Run the benchmark against a running service:

```bash
python scripts/bench_http.py \
  --url http://localhost:18082/v1/infer \
  --concurrency 48 \
  --requests 1920
```

The benchmark helper is a host-side client tool. It is not part of the model-server container.

## Memory Checks

CUDA leak check with the mounted engine:

```bash
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose run --rm \
  --entrypoint /usr/local/cuda/bin/compute-sanitizer \
  -e DOC_LAYOUT_SELF_TEST_ITERS=2 \
  -e DOC_LAYOUT_SELF_TEST_BATCH=2 \
  doclayout \
  --tool memcheck --leak-check full --error-exitcode 88 \
  /usr/local/bin/doclayout-rust-batcher
```

Expected result:

```text
LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
```

CPU leak check with `valgrind` is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so `compute-sanitizer` is the required CUDA gate.
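If the sanitizer output is captured to a log, the two summary lines from the expected result can also be checked by a small host-side script. The following is a sketch only: `sanitizer_log_is_clean` is a hypothetical helper, and the patterns are taken from the two summary lines shown in the expected result and searched anywhere in the text, since real output may prefix them. The `--error-exitcode 88` exit status remains the primary gate.

```python
import re

# Patterns mirror the two summary lines in the expected result.
LEAK_RE = re.compile(r"LEAK SUMMARY: (\d+) bytes leaked in (\d+) allocations")
ERROR_RE = re.compile(r"ERROR SUMMARY: (\d+) errors?")


def sanitizer_log_is_clean(log: str) -> bool:
    """Return True only if both summaries are present and report zero."""
    leak = LEAK_RE.search(log)
    error = ERROR_RE.search(log)
    if leak is None or error is None:
        # A missing summary is treated as a failure rather than a pass.
        return False
    leaked_bytes = int(leak.group(1))
    leaked_allocations = int(leak.group(2))
    errors = int(error.group(1))
    return leaked_bytes == 0 and leaked_allocations == 0 and errors == 0
```

Treating a missing summary as a failure means a run that crashed before printing its report cannot pass the check by accident.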