| # PP-DocLayoutV3 TensorRT Layout Service | |
| Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, orchestration, Hugging Face download, or engine build. | |
| The model is mounted into the container as a TensorRT engine: | |
| ```text | |
| host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine | |
| ``` | |
| Runtime contract: | |
| ```text | |
| rendered page image or cached benchmark request | |
| -> Rust HTTP server | |
| -> dynamic batch queue | |
| -> C++ TensorRT wrapper | |
| -> mounted PP-DocLayoutV3 engine | |
| -> layout boxes JSON | |
| ``` | |
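The dynamic batch queue step can be illustrated with a minimal Python sketch. This is not the Rust implementation; it assumes the common policy of blocking for a first request and then collecting more until either the batch is full or a delay window expires. The function name `collect_batch` is hypothetical, and the parameters only mirror the `DOC_LAYOUT_MAX_BATCH` and `DOC_LAYOUT_MAX_DELAY_US` settings listed under Throughput Mode.

```python
import queue
import time


def collect_batch(requests, max_batch=8, max_delay_us=1000):
    """Block for one request, then gather more until the batch is full
    or max_delay_us has elapsed since the first request was taken.

    Assumed policy for illustration; the real server may differ.
    """
    batch = [requests.get()]  # wait indefinitely for the first item
    deadline = time.monotonic() + max_delay_us / 1_000_000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with a partial batch
    return batch


# Demo: 11 queued page ids. A generous 50 ms window keeps the demo deterministic.
q = queue.Queue()
for page_id in range(11):
    q.put(page_id)

first = collect_batch(q, max_batch=8, max_delay_us=50_000)
second = collect_batch(q, max_batch=8, max_delay_us=50_000)
print(len(first), len(second))
```

Under this policy a larger delay window trades queue wait time for fuller batches, which is the trade-off the `queue_wait_us` and `batch_size` response fields make visible.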
| The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work. | |
| ## Build | |
| ```bash | |
| cd pp-doclayout-server | |
| docker compose build doclayout | |
| ``` | |
| The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image. | |
| ## Run | |
| Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host: | |
| ```bash | |
| cd pp-doclayout-server | |
| DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \ | |
| docker compose up | |
| ``` | |
| Health: | |
| ```bash | |
| curl http://localhost:18082/health | |
| ``` | |
| Metrics: | |
| ```bash | |
| curl http://localhost:18082/metrics | |
| ``` | |
| The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default. | |
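The override order can be sketched in Python as follows. The precedence shown (per-request value, then `DOC_LAYOUT_SCORE_THRESHOLD`, then the `0.5` default) is an assumption consistent with the description above, not a copy of the server logic. The helper names are hypothetical, and whether the server keeps boxes with `score >= threshold` or `score > threshold` is not stated.

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in this README


def resolve_score_threshold(request_value=None, env=None):
    """Pick the threshold: request value, else env var, else default.

    Assumed precedence for illustration only.
    """
    env = os.environ if env is None else env
    if request_value is not None:
        return float(request_value)
    raw = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    return float(raw) if raw else DEFAULT_SCORE_THRESHOLD


def keep_boxes(boxes, threshold):
    """Keep boxes at or above the threshold (inclusive comparison assumed)."""
    return [b for b in boxes if b["score"] >= threshold]


boxes = [{"label": "table", "score": 0.91}, {"label": "text", "score": 0.40}]
clean_paper = keep_boxes(boxes, resolve_score_threshold(env={}))
scanned_form = keep_boxes(boxes, resolve_score_threshold(request_value=0.35, env={}))
print(len(clean_paper), len(scanned_form))
```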
| ## Endpoints | |
| `POST /v1/infer` | |
| Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost. | |
| ```bash | |
| curl -s http://localhost:18082/v1/infer \ | |
| -H 'content-type: application/json' \ | |
| -d '{"return_boxes": false}' | |
| ``` | |
| `POST /v1/layout` | |
| Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named `file` or `files`. | |
| The request body limit is set explicitly: `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default multipart limit is only 2 MB. This caps the total request body; it is not a per-page limit or a model batch limit. | |
| ```bash | |
| curl -s http://localhost:18082/v1/layout \ | |
| -F files=@inputs/sample.png \ | |
| > outputs/sample_layout.json | |
| curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \ | |
| -F files=@inputs/sample.png \ | |
| > outputs/sample_layout_scan_recall.json | |
| ``` | |
| `POST /v1/layout_chw_u8` | |
| Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body (1,920,000 bytes), with the tensor dimensions and the original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates. | |
| ```bash | |
| curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \ | |
| -H 'content-type: application/octet-stream' \ | |
| --data-binary @page_800_chw_u8.bin | |
| ``` | |
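A client has to produce that raw body itself. The sketch below shows one way to do it in Python with only the standard library, assuming the page has already been decoded and resized to 800x800 as interleaved 8-bit pixels. The channel order (RGB here) is an assumption; confirm it against the Rust client example. `hwc_to_chw`, `layout_url`, and `post_page` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urlencode

SIDE = 800  # model input is 3x800x800 per the endpoint description


def hwc_to_chw(pixels: bytes, width: int, height: int) -> bytes:
    """Convert interleaved (HWC) bytes to planar (CHW) bytes."""
    if len(pixels) != width * height * 3:
        raise ValueError("expected width*height*3 bytes of 8-bit pixel data")
    # Every third byte starting at offset c is channel c in row-major order.
    return b"".join(pixels[c::3] for c in range(3))


def layout_url(server, original_width, original_height, score_threshold=None):
    """Build the /v1/layout_chw_u8 URL with the query parameters shown above."""
    params = {
        "width": SIDE,
        "height": SIDE,
        "original_width": original_width,
        "original_height": original_height,
    }
    if score_threshold is not None:
        params["score_threshold"] = score_threshold
    return f"{server}/v1/layout_chw_u8?{urlencode(params)}"


def post_page(url: str, chw: bytes) -> dict:
    """POST one packed page and return the decoded JSON response."""
    req = urllib.request.Request(
        url, data=chw, method="POST",
        headers={"content-type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Tiny 2x1 image: pixels (1,2,3) and (4,5,6) become planes [1,4] [2,5] [3,6].
tiny = hwc_to_chw(bytes([1, 2, 3, 4, 5, 6]), width=2, height=1)
url = layout_url("http://localhost:18082", 1587, 2243, score_threshold=0.35)
print(list(tiny), url)
```

Calling `post_page(url, chw)` with a full 1,920,000-byte page requires a running service, so it is left out of the demo.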
| Batched raw endpoint for server/orchestrator experiments: | |
| ```bash | |
| curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \ | |
| -H 'content-type: application/octet-stream' \ | |
| --data-binary @pages_b8_800_chw_u8.bin | |
| ``` | |
| The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`, so it is `batch` x 1,920,000 bytes (15,360,000 bytes for `batch=8`). This endpoint returns layout boxes only; there is no secondary text detector or OCR fallback in this service. | |
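Assembling a batched body is plain concatenation. The following sketch checks page sizes before joining them; `pack_batch` and `split_batch` are hypothetical helpers. Note that the example request carries a single `original_width`/`original_height` pair, and this README does not say whether per-page original sizes are supported.

```python
PAGE_BYTES = 3 * 800 * 800  # one 3x800x800 CHW u8 page = 1,920,000 bytes


def pack_batch(pages):
    """Concatenate already-packed CHW pages into one request body."""
    for index, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {index} has {len(page)} bytes, expected {PAGE_BYTES}")
    return b"".join(pages)


def split_batch(body, batch):
    """Inverse of pack_batch, useful for validating a body before sending it."""
    if len(body) != batch * PAGE_BYTES:
        raise ValueError(f"body has {len(body)} bytes, expected {batch * PAGE_BYTES}")
    return [body[i * PAGE_BYTES:(i + 1) * PAGE_BYTES] for i in range(batch)]


# Eight dummy pages, each filled with its own index so the order is checkable.
pages = [bytes([i]) * PAGE_BYTES for i in range(8)]
body = pack_batch(pages)
print(len(body))
```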
| Rust client example: | |
| ```bash | |
| cd rust-batcher | |
| cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082 | |
| ``` | |
| The example decodes a rendered page, resizes it to 800x800 with the Triangle filter the service expects, packs it as CHW `u8`, sends one request, and prints the JSON response. | |
| Response shape: | |
| ```json | |
| { | |
| "pages": 1, | |
| "results": [ | |
| { | |
| "boxes": [ | |
| { | |
| "label": "table", | |
| "class_id": 21, | |
| "score": 0.91, | |
| "bbox": [72, 140, 530, 420], | |
| "order": 3 | |
| } | |
| ], | |
| "batch_size": 1, | |
| "queue_wait_us": 1000, | |
| "infer_us": 25000 | |
| } | |
| ] | |
| } | |
| ``` | |
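An orchestrator will typically filter and sort these boxes. The sketch below does that in Python under two loudly stated assumptions that this README does not confirm: `bbox` is `[x0, y0, x1, y1]` in original page coordinates, and `order` is a reading-order index. The helper names are hypothetical.

```python
import json

SAMPLE = """
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {"label": "table", "class_id": 21, "score": 0.91,
         "bbox": [72, 140, 530, 420], "order": 3}
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
"""


def boxes_in_order(response, page=0, label=None):
    """Return one page's boxes sorted by `order`, optionally filtered by label."""
    boxes = response["results"][page]["boxes"]
    if label is not None:
        boxes = [b for b in boxes if b["label"] == label]
    return sorted(boxes, key=lambda b: b["order"])


def bbox_area(bbox):
    """Area of a box, assuming the [x0, y0, x1, y1] corner format."""
    x0, y0, x1, y1 = bbox
    return max(0, x1 - x0) * max(0, y1 - y0)


response = json.loads(SAMPLE)
tables = boxes_in_order(response, label="table")
server_ms = (response["results"][0]["queue_wait_us"]
             + response["results"][0]["infer_us"]) / 1000
print(len(tables), bbox_area(tables[0]["bbox"]), server_ms)
```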
| ## Throughput Mode | |
| Current result from a clean `docker compose` run on an RTX 4090 with the mounted, validated engine: | |
| ```text | |
| DOC_LAYOUT_WORKERS=3 | |
| DOC_LAYOUT_MAX_BATCH=8 | |
| DOC_LAYOUT_MAX_DELAY_US=1000 | |
| DOC_LAYOUT_MAX_UPLOAD_MB=512 | |
| client concurrency = 48 | |
| throughput ~= 308 pages/s | |
| p50 ~= 150 ms | |
| p95 ~= 167 ms | |
| ``` | |
| For lower latency, use `DOC_LAYOUT_WORKERS=2` with client concurrency around 32; that configuration reproduced about 292 pages/s with p50 around 106 ms. The raw TensorRT engine with host/device transfers enabled reaches about 316 pages/s on this machine, so the 3-worker server, at about 308 pages/s (roughly 97% of that), is close to the practical engine ceiling. | |
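These figures can be sanity-checked with Little's law: for a closed-loop client, mean latency is approximately concurrency divided by throughput. The sketch assumes each request carries one page and that the client keeps all connections busy. It is a consistency check on the reported numbers, not a new measurement.

```python
def mean_latency_ms(concurrency, pages_per_s):
    """Little's law for a closed loop: latency = concurrency / throughput."""
    return 1000.0 * concurrency / pages_per_s


three_workers = mean_latency_ms(48, 308)  # compare with p50 ~150 ms, p95 ~167 ms
two_workers = mean_latency_ms(32, 292)    # compare with p50 ~106 ms
print(round(three_workers, 1), round(two_workers, 1))
```

Both estimates land close to the reported p50 values, which suggests the measured latency is dominated by queueing at the chosen concurrency rather than by per-request overhead.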
| Run the benchmark against a running service: | |
| ```bash | |
| python scripts/bench_http.py \ | |
| --url http://localhost:18082/v1/infer \ | |
| --concurrency 48 \ | |
| --requests 1920 | |
| ``` | |
| The benchmark helper is a host-side client tool. It is not part of the model-server container. | |
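For readers without the repository at hand, a closed-loop HTTP benchmark of this kind can be sketched with the Python standard library. This is not the content of `scripts/bench_http.py`; the function names, the nearest-rank percentile, and the reporting format are all assumptions made for illustration.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_infer(url):
    """Send one /v1/infer request with the JSON body from the curl example."""
    body = json.dumps({"return_boxes": False}).encode()
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    index = max(0, math.ceil(q / 100 * len(sorted_values)) - 1)
    return sorted_values[index]


def run_bench(send, concurrency, requests):
    """Run `requests` calls of `send` on `concurrency` threads and summarize."""
    def timed(_):
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(requests)))
    wall = time.perf_counter() - wall_start
    return {
        "requests": len(latencies),
        "throughput_per_s": requests / wall,
        "p50_ms": 1000 * percentile(latencies, 50),
        "p95_ms": 1000 * percentile(latencies, 95),
    }


# Offline demo with a stand-in request that just sleeps for about 1 ms.
stats = run_bench(lambda: time.sleep(0.001), concurrency=4, requests=20)
print(stats)

# Against a running service (not executed here):
# run_bench(lambda: post_infer("http://localhost:18082/v1/infer"), 48, 1920)
```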
| ## Memory Checks | |
| CUDA leak check with the mounted engine: | |
| ```bash | |
| DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \ | |
| docker compose run --rm \ | |
| --entrypoint /usr/local/cuda/bin/compute-sanitizer \ | |
| -e DOC_LAYOUT_SELF_TEST_ITERS=2 \ | |
| -e DOC_LAYOUT_SELF_TEST_BATCH=2 \ | |
| doclayout \ | |
| --tool memcheck --leak-check full --error-exitcode 88 \ | |
| /usr/local/bin/doclayout-rust-batcher | |
| ``` | |
| Expected result: | |
| ```text | |
| LEAK SUMMARY: 0 bytes leaked in 0 allocations | |
| ERROR SUMMARY: 0 errors | |
| ``` | |
| CPU leak check with `valgrind` is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so `compute-sanitizer` is the required CUDA gate. | |