# PP-DocLayoutV3 TensorRT Layout Service

Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, or OCR, and it does not render PDFs, orchestrate pipelines, download models from Hugging Face, or build the engine.

The model is mounted into the container as a TensorRT engine:

```text
host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
```

Runtime contract:

```text
rendered page image or cached benchmark request
-> Rust HTTP server
-> dynamic batch queue
-> C++ TensorRT wrapper
-> mounted PP-DocLayoutV3 engine
-> layout boxes JSON
```
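
The dynamic batch queue step can be illustrated with a small Python sketch. This is not the Rust implementation; it assumes the common policy where a batch is flushed once it holds `DOC_LAYOUT_MAX_BATCH` requests or once `DOC_LAYOUT_MAX_DELAY_US` has elapsed since the first request arrived (both settings are listed under Throughput Mode). The name `collect_batch` is hypothetical.

```python
import queue
import time


def collect_batch(q, max_batch=8, max_delay_us=1000):
    """Wait for one request, then gather more until the batch is full
    or the delay budget (measured from the first request) runs out."""
    batch = [q.get()]  # block until at least one request is available
    deadline = time.monotonic() + max_delay_us / 1_000_000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget spent: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


# Demo: 11 queued pages split into a full batch and a partial one.
pending = queue.Queue()
for page_id in range(11):
    pending.put(page_id)
full = collect_batch(pending, max_batch=8, max_delay_us=50_000)
partial = collect_batch(pending, max_batch=8, max_delay_us=50_000)
print(len(full), len(partial))
```

Under this assumed policy, a larger delay trades latency for fuller batches, and a larger batch cap only helps when enough requests are in flight to fill it.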

The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.

## Build

```bash
cd pp-doclayout-server
docker compose build doclayout
```

The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.

## Run

Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host:

```bash
cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose up
```

Health:

```bash
curl http://localhost:18082/health
```

Metrics:

```bash
curl http://localhost:18082/metrics
```

The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default.
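
The override rules above can be sketched in Python. This is an illustration, not the server's code: it assumes a per-request value takes precedence over `DOC_LAYOUT_SCORE_THRESHOLD`, that thresholds lie in `[0, 1]`, and that a box is kept when its score is greater than or equal to the threshold. The function names are hypothetical.

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in this README


def resolve_score_threshold(request_value=None, env=None):
    """Assumed precedence: request value, then env override, then default."""
    env = os.environ if env is None else env
    if request_value is not None:
        value = float(request_value)
    elif env.get("DOC_LAYOUT_SCORE_THRESHOLD"):
        value = float(env["DOC_LAYOUT_SCORE_THRESHOLD"])
    else:
        value = DEFAULT_SCORE_THRESHOLD
    if not 0.0 <= value <= 1.0:  # assumed valid range
        raise ValueError(f"score threshold out of range: {value}")
    return value


def filter_boxes(boxes, threshold):
    """Keep boxes whose score meets the threshold (inclusive by assumption)."""
    return [box for box in boxes if box["score"] >= threshold]


boxes = [{"label": "table", "score": 0.91}, {"label": "text", "score": 0.40}]
print(len(filter_boxes(boxes, resolve_score_threshold(env={}))))      # default 0.5
print(len(filter_boxes(boxes, resolve_score_threshold(0.35, env={}))))  # scan recall
```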

## Endpoints

`POST /v1/infer`

Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost.

```bash
curl -s http://localhost:18082/v1/infer \
  -H 'content-type: application/json' \
  -d '{"return_boxes": false}'
```

`POST /v1/layout`

Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named `file` or `files`.

The request body limit is set explicitly: `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default request-body limit is only 2 MB. This is a total request-body limit, not a per-page model batch limit.

```bash
curl -s http://localhost:18082/v1/layout \
  -F files=@inputs/sample.png \
  > outputs/sample_layout.json

curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
  -F files=@inputs/sample.png \
  > outputs/sample_layout_scan_recall.json
```
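
For callers without `curl`, the same multipart upload can be sketched with the Python standard library. This is a minimal example, not a supported client: `build_multipart` and `post_layout` are hypothetical names, and the per-part `Content-Type` guessed from the file name is an assumption about what the server accepts.

```python
import json
import mimetypes
import urllib.request
import uuid


def build_multipart(pages, field="files"):
    """Encode (filename, data) pairs as one multipart/form-data body."""
    boundary = uuid.uuid4().hex
    chunks = []
    for filename, data in pages:
        mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: {mime}\r\n\r\n"
        )
        chunks.append(head.encode("utf-8") + data + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def post_layout(base_url, pages, score_threshold=None):
    """POST rendered page images to /v1/layout and return the decoded JSON."""
    body, content_type = build_multipart(pages)
    url = f"{base_url}/v1/layout"
    if score_threshold is not None:
        url += f"?score_threshold={score_threshold}"
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would look like `post_layout("http://localhost:18082", [("sample.png", open("inputs/sample.png", "rb").read())], score_threshold=0.35)` against a running service.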


`POST /v1/layout_chw_u8`

Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.

```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
  -H 'content-type: application/octet-stream' \
  --data-binary @page_800_chw_u8.bin
```
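
A sketch of how an orchestrator might produce the raw body and request URL, using only the Python standard library. It assumes the page has already been resized to 800x800 and is available as interleaved RGB bytes (HWC); the RGB channel order is an assumption, and `hwc_to_chw` and `layout_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

SIDE = 800  # model input side length stated above


def hwc_to_chw(rgb, width=SIDE, height=SIDE):
    """Reorder interleaved RGB bytes (HWC) into three channel planes (CHW)."""
    if len(rgb) != width * height * 3:
        raise ValueError("expected width * height * 3 interleaved bytes")
    # Every third byte starting at offset c belongs to channel c.
    return b"".join(bytes(rgb[c::3]) for c in range(3))


def layout_url(base, original_width, original_height, score_threshold=None):
    """Build the /v1/layout_chw_u8 URL with the documented query parameters."""
    params = {
        "width": SIDE,
        "height": SIDE,
        "original_width": original_width,
        "original_height": original_height,
    }
    if score_threshold is not None:
        params["score_threshold"] = score_threshold
    return f"{base}/v1/layout_chw_u8?{urlencode(params)}"


blank_page = bytes(SIDE * SIDE * 3)      # all-black 800x800 RGB image
body = hwc_to_chw(blank_page)
print(len(body))                          # 1920000 bytes per page
print(layout_url("http://localhost:18082", 1587, 2243, 0.35))
```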

`POST /v1/layout_chw_u8_batch`

Batched raw endpoint for server/orchestrator experiments.

```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
  -H 'content-type: application/octet-stream' \
  --data-binary @pages_b8_800_chw_u8.bin
```

The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`. This endpoint is only for layout boxes; there is no secondary text detector or OCR fallback in this service.
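
The contiguous layout means the body length must equal `batch * 3 * 800 * 800` bytes. A minimal Python sketch of packing and checking such a body follows; `pack_batch` and `split_batch` are hypothetical helpers, not part of this repository.

```python
PAGE_BYTES = 3 * 800 * 800  # one CHW u8 page: 1,920,000 bytes


def pack_batch(pages):
    """Concatenate CHW u8 pages into one contiguous request body."""
    for index, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(
                f"page {index} has {len(page)} bytes, expected {PAGE_BYTES}"
            )
    return b"".join(pages)


def split_batch(body):
    """Inverse of pack_batch: slice a body back into per-page buffers."""
    if len(body) % PAGE_BYTES:
        raise ValueError("body length is not a whole number of pages")
    return [body[i:i + PAGE_BYTES] for i in range(0, len(body), PAGE_BYTES)]


pages = [bytes(PAGE_BYTES) for _ in range(8)]
body = pack_batch(pages)
print(len(body))               # 15360000 bytes for batch=8
print(len(split_batch(body)))  # 8
```

The `batch` query parameter should match the number of pages packed into the body.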

Rust client example:

```bash
cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
```

The example decodes a rendered page, resizes it to 800x800 with the Triangle filter the service expects, packs it as CHW `u8`, sends one request, and prints the JSON response.

Response shape:

```json
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {
          "label": "table",
          "class_id": 21,
          "score": 0.91,
          "bbox": [72, 140, 530, 420],
          "order": 3
        }
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
```
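
A short Python sketch of consuming this response on the orchestrator side. It assumes `results` holds one entry per page and that `order` is a per-page ordering index suitable for sorting; neither is spelled out above. `boxes_in_order` is a hypothetical helper.

```python
import json


def boxes_in_order(response_text, label=None):
    """Return (page_index, label, score, bbox) tuples sorted by `order`,
    optionally restricted to a single label."""
    payload = json.loads(response_text)
    selected = []
    for page_index, result in enumerate(payload["results"]):
        boxes = result["boxes"]
        if label is not None:
            boxes = [box for box in boxes if box["label"] == label]
        for box in sorted(boxes, key=lambda b: b["order"]):
            selected.append((page_index, box["label"], box["score"], box["bbox"]))
    return selected


sample = """
{"pages": 1, "results": [{"boxes": [
  {"label": "table", "class_id": 21, "score": 0.91,
   "bbox": [72, 140, 530, 420], "order": 3}],
  "batch_size": 1, "queue_wait_us": 1000, "infer_us": 25000}]}
"""
print(boxes_in_order(sample, label="table"))
```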

## Throughput Mode

Current clean compose result on RTX 4090 with the mounted validated engine:

```text
DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms
```

For lower latency, use `DOC_LAYOUT_WORKERS=2` and client concurrency around 32. That reproduced around 292 pages/s with p50 around 106 ms. The raw TensorRT engine with host/device transfers enabled is around 316 pages/s on this machine, so the 3-worker server reaches roughly 97% of the practical engine ceiling.
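
These figures can be sanity-checked with Little's law (mean latency = requests in flight / throughput). The sketch below assumes a closed-loop client that keeps exactly `concurrency` single-page requests in flight, which may not match the benchmark helper precisely.

```python
def expected_latency_ms(concurrency, pages_per_second):
    """Little's law: mean time in system = items in flight / throughput."""
    return 1000.0 * concurrency / pages_per_second


three_workers = expected_latency_ms(48, 308)
two_workers = expected_latency_ms(32, 292)
ceiling_fraction = 308 / 316  # server throughput vs. raw engine throughput

print(f"{three_workers:.1f} ms")   # 155.8 ms, between the reported p50 and p95
print(f"{two_workers:.1f} ms")     # 109.6 ms, near the reported p50 of ~106 ms
print(f"{ceiling_fraction:.1%}")   # 97.5%
```

The predicted means sit close to the measured percentiles, which suggests latency at these settings is dominated by queueing behind the GPU rather than by per-request overhead.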

Run the benchmark against a running service:

```bash
python scripts/bench_http.py \
  --url http://localhost:18082/v1/infer \
  --concurrency 48 \
  --requests 1920
```

The benchmark helper is a host-side client tool. It is not part of the model-server container.

## Memory Checks

CUDA leak check with the mounted engine:

```bash
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose run --rm \
  --entrypoint /usr/local/cuda/bin/compute-sanitizer \
  -e DOC_LAYOUT_SELF_TEST_ITERS=2 \
  -e DOC_LAYOUT_SELF_TEST_BATCH=2 \
  doclayout \
  --tool memcheck --leak-check full --error-exitcode 88 \
  /usr/local/bin/doclayout-rust-batcher
```

Expected result:

```text
LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
```
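
If the sanitizer log is captured to a file, a CI step could double-check the summary lines in addition to the `--error-exitcode 88` gate. The following Python sketch assumes the log contains the two summary lines in the form shown above, possibly with a prefix; `sanitizer_is_clean` is a hypothetical helper.

```python
import re


def sanitizer_is_clean(log_text):
    """True only if both summary lines are present and report zero."""
    leak = re.search(
        r"LEAK SUMMARY: (\d+) bytes leaked in (\d+) allocations", log_text
    )
    errors = re.search(r"ERROR SUMMARY: (\d+) errors?", log_text)
    if leak is None or errors is None:
        return False  # a missing summary is treated as a failure
    return (
        int(leak.group(1)) == 0
        and int(leak.group(2)) == 0
        and int(errors.group(1)) == 0
    )


expected = (
    "LEAK SUMMARY: 0 bytes leaked in 0 allocations\n"
    "ERROR SUMMARY: 0 errors\n"
)
print(sanitizer_is_clean(expected))
```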

CPU leak check with `valgrind` is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so `compute-sanitizer` is the required CUDA gate.