PP-DocLayoutV3 TensorRT Layout Service

Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, orchestration, Hugging Face download, or engine build.

The model is mounted into the container as a TensorRT engine:

host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine

Runtime contract:

rendered page image or cached benchmark request
-> Rust HTTP server
-> dynamic batch queue
-> C++ TensorRT wrapper
-> mounted PP-DocLayoutV3 engine
-> layout boxes JSON

The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.

Build

cd pp-doclayout-server
docker compose build doclayout

The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.

Run

Set DOC_LAYOUT_ENGINE_HOST to the engine file on the host:

cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose up

Health:

curl http://localhost:18082/health

Metrics:

curl http://localhost:18082/metrics

The default layout score threshold is 0.5. Override per deployment with DOC_LAYOUT_SCORE_THRESHOLD, or per request with score_threshold. For example, scanned forms may use 0.35 while clean papers can keep the default.
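The precedence described above can be sketched as a small helper. This is an illustration only, not the server's Rust code: the function name resolve_score_threshold is hypothetical, and the assumption that a per-request score_threshold wins over DOC_LAYOUT_SCORE_THRESHOLD is inferred from the wording above rather than confirmed.

```python
import os
from typing import Mapping, Optional

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in this README


def resolve_score_threshold(
    request_value: Optional[float] = None,
    env: Mapping[str, str] = os.environ,
) -> float:
    """Hypothetical helper: request value first, then the env var, then 0.5."""
    # Assumption: a per-request score_threshold overrides the deployment setting.
    if request_value is not None:
        return float(request_value)
    raw = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    if raw is not None and raw.strip():
        return float(raw)
    return DEFAULT_SCORE_THRESHOLD
```

With this sketch, a scanned-forms deployment would set DOC_LAYOUT_SCORE_THRESHOLD=0.35 once, and individual requests could still pass their own value.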

Endpoints

POST /v1/infer

Uses the configured DOC_LAYOUT_SAMPLE_IMAGE and is intended for model-server throughput benchmarking without upload/render cost.

curl -s http://localhost:18082/v1/infer \
  -H 'content-type: application/json' \
  -d '{"return_boxes": false}'

POST /v1/layout

Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named file or files.

The request body limit is set explicitly: DOC_LAYOUT_MAX_UPLOAD_MB defaults to 512, because Axum's default body limit, which also applies to multipart requests, is only 2 MB. The limit covers the total request body; it is not a per-page limit or a model batch limit.

curl -s http://localhost:18082/v1/layout \
  -F files=@inputs/sample.png \
  > outputs/sample_layout.json

curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
  -F files=@inputs/sample.png \
  > outputs/sample_layout_scan_recall.json

POST /v1/layout_chw_u8

Production-oriented endpoint for an external orchestrator. Send a raw 3x800x800 CHW u8 body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.

curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
  -H 'content-type: application/octet-stream' \
  --data-binary @page_800_chw_u8.bin
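For orchestrators that prepare the raw body themselves, the packing step can be sketched as follows. It assumes the page has already been resized to 800x800 and is held as interleaved 8-bit RGB bytes; the RGB channel order and the helper name hwc_rgb_to_chw are assumptions, since this README does not state which order the engine expects.

```python
def hwc_rgb_to_chw(pixels: bytes, width: int, height: int) -> bytes:
    """Repack interleaved RGBRGB... bytes into three contiguous channel planes."""
    expected = width * height * 3
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    # Every third byte starting at offset c belongs to channel c, so three
    # strided slices yield the R plane, then the G plane, then the B plane.
    return pixels[0::3] + pixels[1::3] + pixels[2::3]
```

For an 800x800 page the result is 3 * 800 * 800 = 1,920,000 bytes, which is the body size the single-page endpoint expects. Resizing is deliberately left out of the sketch because the filter must match the one the service expects.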

Batched raw endpoint for server/orchestrator experiments:

curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
  -H 'content-type: application/octet-stream' \
  --data-binary @pages_b8_800_chw_u8.bin

The request body holds batch pages back to back, each a 3x800x800 CHW u8 buffer (1,920,000 bytes per page). This endpoint returns layout boxes only; the service has no secondary text detector or OCR fallback.
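A minimal sketch of assembling a batched request on the client side is shown below. The helper name build_batch_request is hypothetical. It mirrors the curl example by sending one original_width/original_height pair for the whole batch; whether the endpoint accepts per-page original sizes is not stated here, so the sketch assumes all pages share one size.

```python
from urllib.parse import urlencode

PAGE_BYTES = 3 * 800 * 800  # one 3x800x800 CHW u8 page


def build_batch_request(base_url, pages, original_width, original_height):
    """Return (url, body) for /v1/layout_chw_u8_batch from a list of CHW pages."""
    for i, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {i}: expected {PAGE_BYTES} bytes, got {len(page)}")
    query = urlencode({
        "batch": len(pages),
        "width": 800,
        "height": 800,
        "original_width": original_width,
        "original_height": original_height,
    })
    # Pages are concatenated in order with no header or padding between them.
    return f"{base_url}/v1/layout_chw_u8_batch?{query}", b"".join(pages)
```

The returned body can be sent with any HTTP client using content-type application/octet-stream, as in the curl command above.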

Rust client example:

cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082

The example decodes a rendered page, resizes it to 800x800 with the Triangle filter the service expects, packs the pixels as CHW u8, sends one request, and prints the JSON response.

Response shape:

{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {
          "label": "table",
          "class_id": 21,
          "score": 0.91,
          "bbox": [72, 140, 530, 420],
          "order": 3
        }
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
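An orchestrator consuming this response might filter and sort the boxes along these lines. The sketch relies only on the field names shown above; the helper name boxes_by_order is hypothetical, and it assumes results holds one entry per page.

```python
import json


def boxes_by_order(response_text, page=0, label=None, min_score=0.0):
    """Return one page's boxes, optionally filtered, sorted by the "order" field."""
    payload = json.loads(response_text)
    # Assumption: results[i] corresponds to the i-th page of the request.
    boxes = payload["results"][page]["boxes"]
    kept = [
        b for b in boxes
        if (label is None or b["label"] == label) and b["score"] >= min_score
    ]
    return sorted(kept, key=lambda b: b["order"])
```

For example, boxes_by_order(text, label="table", min_score=0.8) would select confident table regions to route to downstream table processing.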

Throughput Mode

Current clean compose result on RTX 4090 with the mounted validated engine:

DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms

For lower latency, use DOC_LAYOUT_WORKERS=2 and client concurrency around 32. That configuration measured around 292 pages/s with p50 around 106 ms. The raw TensorRT engine, with host/device transfers enabled, reaches around 316 pages/s on this machine, so the 3-worker server at ~308 pages/s is within about 3% of the practical engine ceiling.
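As a sanity check on these figures, Little's law (throughput = in-flight requests / latency) can be applied to the reported numbers. This is a rough sketch: it uses p50 as a stand-in for mean latency and assumes each benchmark request carries one page and that the client keeps all connections busy.

```python
def littles_law_throughput(concurrency: int, latency_ms: float) -> float:
    """Pages/s implied by `concurrency` in-flight requests that each take latency_ms."""
    return concurrency / (latency_ms / 1000.0)


# (workers, client concurrency, p50 in ms, measured pages/s) from this section.
for workers, concurrency, p50_ms, measured in [(3, 48, 150, 308), (2, 32, 106, 292)]:
    implied = littles_law_throughput(concurrency, p50_ms)
    print(f"workers={workers}: implied {implied:.0f} pages/s, measured ~{measured}")

# Share of the raw engine ceiling reached by the 3-worker server.
print(f"3-worker server vs raw engine: {308 / 316:.1%}")
```

The implied values (about 320 and 302 pages/s) sit slightly above the measured 308 and 292, which is the expected direction when the latency distribution has a tail above p50.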

Run the benchmark against a running service:

python scripts/bench_http.py \
  --url http://localhost:18082/v1/infer \
  --concurrency 48 \
  --requests 1920

The benchmark helper is a host-side client tool. It is not part of the model-server container.

Memory Checks

CUDA leak check with the mounted engine:

DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose run --rm \
  --entrypoint /usr/local/cuda/bin/compute-sanitizer \
  -e DOC_LAYOUT_SELF_TEST_ITERS=2 \
  -e DOC_LAYOUT_SELF_TEST_BATCH=2 \
  doclayout \
  --tool memcheck --leak-check full --error-exitcode 88 \
  /usr/local/bin/doclayout-rust-batcher

Expected result:

LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
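The command above already fails with exit code 88 on errors, so the exit code is the primary gate. If a CI job also wants to assert on the captured log, a check along these lines may be enough. The helper name memcheck_passed is hypothetical, and the patterns only match the two summary lines quoted above.

```python
import re

LEAK_OK = re.compile(r"LEAK SUMMARY: 0 bytes leaked in 0 allocations")
ERROR_OK = re.compile(r"ERROR SUMMARY: 0 errors")


def memcheck_passed(log_text: str) -> bool:
    """True only if both clean summary lines appear somewhere in the log."""
    # search() tolerates any prefix the tool prints before each summary line.
    return bool(LEAK_OK.search(log_text)) and bool(ERROR_OK.search(log_text))
```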

CPU leak check with valgrind is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so compute-sanitizer is the required CUDA gate.