# PP-DocLayoutV3 TensorRT Layout Service
Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, or orchestration, and it does not download models from Hugging Face or build TensorRT engines.
The model is mounted into the container as a TensorRT engine:
```text
host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
```
Runtime contract:
```text
rendered page image or cached benchmark request
-> Rust HTTP server
-> dynamic batch queue
-> C++ TensorRT wrapper
-> mounted PP-DocLayoutV3 engine
-> layout boxes JSON
```
The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.
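The dynamic batch queue stage in the runtime contract can be pictured with a small sketch. This is an illustration in Python, not the Rust server's actual code: the function name `collect_batch` is hypothetical, and the defaults mirror the `DOC_LAYOUT_MAX_BATCH=8` and `DOC_LAYOUT_MAX_DELAY_US=1000` values listed under Throughput Mode. The assumed policy is: wait for the first request, then keep collecting until the batch is full or the delay has elapsed.

```python
import queue
import time


def collect_batch(q, max_batch=8, max_delay_us=1000):
    """Block for the first item, then gather more until full or the delay expires."""
    batch = [q.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_delay_us / 1_000_000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget used up: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


# Demo: 11 queued page ids. A generous delay keeps the demo deterministic.
q = queue.Queue()
for page_id in range(11):
    q.put(page_id)

first = collect_batch(q, max_delay_us=100_000)
second = collect_batch(q, max_delay_us=100_000)
print(len(first), len(second))  # 8 3
```

A longer delay tends to produce fuller batches at the cost of added queue wait, which is the trade-off the two environment variables control.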
## Build
```bash
cd pp-doclayout-server
docker compose build doclayout
```
The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.
## Run
Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host:
```bash
cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
docker compose up
```
Health:
```bash
curl http://localhost:18082/health
```
Metrics:
```bash
curl http://localhost:18082/metrics
```
The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default.
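The precedence described above (per-request value, then deployment variable, then the built-in default) can be sketched as follows. The helper names `resolve_score_threshold` and `filter_boxes` are hypothetical, and keeping boxes whose score is greater than or equal to the threshold is an assumption; the service may use a strict comparison.

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in the text


def resolve_score_threshold(request_value=None, env=None):
    """Per-request value wins, then DOC_LAYOUT_SCORE_THRESHOLD, then the default."""
    env = os.environ if env is None else env
    if request_value is not None:
        return float(request_value)
    env_value = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    if env_value:
        return float(env_value)
    return DEFAULT_SCORE_THRESHOLD


def filter_boxes(boxes, threshold):
    """Keep boxes at or above the threshold (>= is an assumption)."""
    return [b for b in boxes if b["score"] >= threshold]


# Illustrative boxes, not real model output.
boxes = [{"label": "table", "score": 0.91}, {"label": "text", "score": 0.42}]

clean_paper = resolve_score_threshold(env={})
scanned_form = resolve_score_threshold("0.35", env={"DOC_LAYOUT_SCORE_THRESHOLD": "0.6"})
print(clean_paper, len(filter_boxes(boxes, clean_paper)))    # 0.5 1
print(scanned_form, len(filter_boxes(boxes, scanned_form)))  # 0.35 2
```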
## Endpoints
`POST /v1/infer`
Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost.
```bash
curl -s http://localhost:18082/v1/infer \
-H 'content-type: application/json' \
-d '{"return_boxes": false}'
```
`POST /v1/layout`
Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named `file` or `files`.
The request body limit is explicit. `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default multipart limit is only 2 MB. This is a total request-body limit, not a per-page model batch limit.
```bash
curl -s http://localhost:18082/v1/layout \
-F files=@inputs/sample.png \
> outputs/sample_layout.json
curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
-F files=@inputs/sample.png \
> outputs/sample_layout_scan_recall.json
```
`POST /v1/layout_chw_u8`
Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.
```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
-H 'content-type: application/octet-stream' \
--data-binary @page_800_chw_u8.bin
```
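For an orchestrator that holds an already-resized 800x800 page as interleaved bytes, the request body and URL might be prepared as in the sketch below. The helper names are hypothetical. The channel order (RGB or BGR) and any normalization are not stated here, so the sketch simply preserves whatever interleaved order it is given and sends raw `u8` values.

```python
from urllib.parse import urlencode

SIDE = 800  # model input side length stated in the text


def hwc_to_chw(pixels: bytes, width: int, height: int) -> bytes:
    """Convert interleaved 3-channel (HWC) bytes into planar CHW bytes."""
    if len(pixels) != width * height * 3:
        raise ValueError("expected width*height*3 bytes of interleaved pixels")
    # pixels[c::3] takes every third byte starting at channel c: one full plane.
    return b"".join(pixels[c::3] for c in range(3))


def layout_url(base, original_width, original_height, score_threshold=None):
    """Build the /v1/layout_chw_u8 URL with the query parameters shown above."""
    params = {
        "width": SIDE,
        "height": SIDE,
        "original_width": original_width,
        "original_height": original_height,
    }
    if score_threshold is not None:
        params["score_threshold"] = score_threshold
    return f"{base}/v1/layout_chw_u8?{urlencode(params)}"


# Tiny 2x1 image to show the reordering: pixels (1,2,3) and (4,5,6).
tiny = hwc_to_chw(bytes([1, 2, 3, 4, 5, 6]), width=2, height=1)
print(list(tiny))  # [1, 4, 2, 5, 3, 6]

page = hwc_to_chw(bytes(SIDE * SIDE * 3), SIDE, SIDE)  # all-zero placeholder page
print(len(page))  # 1920000
url = layout_url("http://localhost:18082", 1587, 2243, 0.35)
print(url)
```

The resulting `page` bytes are what `--data-binary` sends in the curl example, with `content-type: application/octet-stream`.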
`POST /v1/layout_chw_u8_batch`
Batched raw endpoint for server/orchestrator experiments.
```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
-H 'content-type: application/octet-stream' \
--data-binary @pages_b8_800_chw_u8.bin
```
The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`. This endpoint is only for layout boxes; there is no secondary text detector or OCR fallback in this service.
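The batch body layout is plain concatenation, so a client-side length check is cheap. The sketch below uses hypothetical helper names and placeholder page contents; it only illustrates the byte arithmetic, assuming pages are laid out back to back with no header or padding.

```python
PAGE_BYTES = 3 * 800 * 800  # one CHW u8 page: 1,920,000 bytes


def pack_batch(pages):
    """Concatenate CHW u8 pages into one contiguous request body."""
    for index, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {index} has {len(page)} bytes, expected {PAGE_BYTES}")
    return b"".join(pages)


def split_batch(body, batch):
    """Inverse of pack_batch: slice the body back into `batch` pages."""
    if len(body) != batch * PAGE_BYTES:
        raise ValueError("body length does not match batch * 3 * 800 * 800")
    return [body[i * PAGE_BYTES:(i + 1) * PAGE_BYTES] for i in range(batch)]


# Eight placeholder pages, each filled with its own index value.
pages = [bytes([i]) * PAGE_BYTES for i in range(8)]
body = pack_batch(pages)
print(len(body))  # 15360000

restored = split_batch(body, batch=8)
print(restored[5][0])  # 5
```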
Rust client example:
```bash
cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
```
The example decodes a rendered page, resizes it to 800x800 with the Triangle filter that the service expects, packs the result as CHW `u8`, sends one request, and prints the JSON response.
Response shape:
```json
{
"pages": 1,
"results": [
{
"boxes": [
{
"label": "table",
"class_id": 21,
"score": 0.91,
"bbox": [72, 140, 530, 420],
"order": 3
}
],
"batch_size": 1,
"queue_wait_us": 1000,
"infer_us": 25000
}
]
}
```
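A downstream consumer might read this response as in the sketch below. It assumes `order` is a per-page ordering index that can be sorted on, which the response shape suggests but does not define. The second box in the sample is made up for illustration, including its `class_id`, and the helper name `ordered_boxes` is hypothetical.

```python
import json

# Sample payload in the documented shape; the "text" box is invented for the demo.
SAMPLE = """
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {"label": "text", "class_id": 0, "score": 0.88, "bbox": [72, 440, 530, 700], "order": 4},
        {"label": "table", "class_id": 21, "score": 0.91, "bbox": [72, 140, 530, 420], "order": 3}
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
"""


def ordered_boxes(response, label=None):
    """Return one list of boxes per page, sorted by `order`, optionally filtered by label."""
    pages = []
    for result in response["results"]:
        boxes = [b for b in result["boxes"] if label is None or b["label"] == label]
        pages.append(sorted(boxes, key=lambda b: b["order"]))
    return pages


response = json.loads(SAMPLE)
pages = ordered_boxes(response)
print([b["label"] for b in pages[0]])  # ['table', 'text']

tables = ordered_boxes(response, label="table")
print(tables[0][0]["bbox"])  # [72, 140, 530, 420]

# Server-side time for this page: queue wait plus inference, in milliseconds.
timing = response["results"][0]
server_ms = (timing["queue_wait_us"] + timing["infer_us"]) / 1000
print(server_ms)  # 26.0
```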
## Throughput Mode
Current result from a clean compose run on an RTX 4090 with the mounted, validated engine:
```text
DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms
```
For lower latency, use `DOC_LAYOUT_WORKERS=2` and a client concurrency of about 32; that configuration reproduced about 292 pages/s with a p50 of about 106 ms. The raw TensorRT engine with host/device transfers enabled reaches about 316 pages/s on this machine, so the 3-worker server at about 308 pages/s is within roughly 3% of the practical engine ceiling.
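These figures can be sanity-checked with Little's law: for a closed-loop client that keeps a fixed number of requests in flight, mean latency is approximately concurrency divided by throughput. The sketch assumes one page per request and a client that keeps every concurrency slot busy; it estimates mean latency, so it should only land near the reported p50, not match it exactly.

```python
def expected_latency_ms(concurrency: int, pages_per_s: float) -> float:
    """Little's law: in-flight requests / throughput = mean time in system."""
    return concurrency / pages_per_s * 1000.0


three_workers = expected_latency_ms(48, 308)
two_workers = expected_latency_ms(32, 292)
print(round(three_workers))  # 156, reported p50 is about 150 ms
print(round(two_workers))    # 110, reported p50 is about 106 ms

# Share of the raw engine ceiling reached by the 3-worker server.
ceiling_share = 308 / 316 * 100
print(round(ceiling_share, 1))  # 97.5
```

The same relation explains why raising client concurrency beyond the point where throughput saturates only adds latency.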
Run the benchmark against a running service:
```bash
python scripts/bench_http.py \
--url http://localhost:18082/v1/infer \
--concurrency 48 \
--requests 1920
```
The benchmark helper is a host-side client tool. It is not part of the model-server container.
## Memory Checks
CUDA leak check with the mounted engine:
```bash
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
docker compose run --rm \
--entrypoint /usr/local/cuda/bin/compute-sanitizer \
-e DOC_LAYOUT_SELF_TEST_ITERS=2 \
-e DOC_LAYOUT_SELF_TEST_BATCH=2 \
doclayout \
--tool memcheck --leak-check full --error-exitcode 88 \
/usr/local/bin/doclayout-rust-batcher
```
Expected result:
```text
LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
```
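In CI, the gate can be checked from the process exit code and the two summary lines above. The sketch below is one possible check, with a hypothetical function name; it relies only on the `--error-exitcode 88` flag used in the command and on the expected summary text, and it uses substring matching so that any line prefix the tool prints does not matter.

```python
REQUIRED_LINES = (
    "LEAK SUMMARY: 0 bytes leaked in 0 allocations",
    "ERROR SUMMARY: 0 errors",
)


def sanitizer_gate_passed(exit_code: int, output: str) -> bool:
    """Pass only on a zero exit code with both clean summary lines present."""
    return exit_code == 0 and all(line in output for line in REQUIRED_LINES)


# Illustrative captured outputs, not real logs.
clean_output = "\n".join(REQUIRED_LINES)
dirty_output = "LEAK SUMMARY: 64 bytes leaked in 1 allocations\nERROR SUMMARY: 1 errors"

print(sanitizer_gate_passed(0, clean_output))   # True
print(sanitizer_gate_passed(88, dirty_output))  # False
```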
CPU leak check with `valgrind` is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so `compute-sanitizer` is the required CUDA gate.