PP-DocLayoutV3 TensorRT Layout Service
Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, or orchestration, and it does not download from Hugging Face or build engines.
The model is mounted into the container as a TensorRT engine:
host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
Runtime contract:
rendered page image or cached benchmark request
-> Rust HTTP server
-> dynamic batch queue
-> C++ TensorRT wrapper
-> mounted PP-DocLayoutV3 engine
-> layout boxes JSON
The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.
Build
cd pp-doclayout-server
docker compose build doclayout
The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.
Run
Set DOC_LAYOUT_ENGINE_HOST to the engine file on the host:
cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
docker compose up
Health:
curl http://localhost:18082/health
Metrics:
curl http://localhost:18082/metrics
The default layout score threshold is 0.5. Override per deployment with DOC_LAYOUT_SCORE_THRESHOLD, or per request with score_threshold. For example, scanned forms may use 0.35 while clean papers can keep the default.
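The override order described above (per-request value first, then the deployment variable, then the 0.5 default) can be sketched as follows. This is a minimal illustration, not the server's actual code: the function name resolve_score_threshold and the range check are assumptions, and only DOC_LAYOUT_SCORE_THRESHOLD, score_threshold, and the 0.5 default come from this README.

```python
import os
from typing import Mapping, Optional

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in this README


def resolve_score_threshold(
    request_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Pick the threshold: per-request value, else env var, else 0.5."""
    env = os.environ if env is None else env
    raw = request_value
    if raw is None:
        raw = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    if raw is None:
        return DEFAULT_SCORE_THRESHOLD
    value = float(raw)
    # Assumption: scores are probabilities, so reject values outside [0, 1].
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score threshold out of range: {value}")
    return value
```

With this sketch, a deployment set to 0.35 for scanned forms still lets an individual request pass its own score_threshold.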
Endpoints
POST /v1/infer
Uses the configured DOC_LAYOUT_SAMPLE_IMAGE and is intended for model-server throughput benchmarking without upload/render cost.
curl -s http://localhost:18082/v1/infer \
-H 'content-type: application/json' \
-d '{"return_boxes": false}'
POST /v1/layout
Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named file or files.
The request body limit is set explicitly: DOC_LAYOUT_MAX_UPLOAD_MB defaults to 512, because Axum's default multipart limit is only 2 MB. This is a limit on the total request body, not a per-page model batch limit.
curl -s http://localhost:18082/v1/layout \
-F files=@inputs/sample.png \
> outputs/sample_layout.json
curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
-F files=@inputs/sample.png \
> outputs/sample_layout_scan_recall.json
POST /v1/layout_chw_u8
Production-oriented endpoint for an external orchestrator. Send a raw 3x800x800 CHW u8 body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.
curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
-H 'content-type: application/octet-stream' \
--data-binary @page_800_chw_u8.bin
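For orchestrators that hold pixels in interleaved (HWC) order, a minimal sketch of producing the planar CHW u8 body might look like the following. The helper name pack_chw_u8 is hypothetical, and the sketch assumes the page has already been resized to 800x800 with three u8 channels. The channel order the engine expects (RGB or BGR) is not stated in this README, so the code keeps whatever order the input already has.

```python
def pack_chw_u8(hwc: bytes, width: int = 800, height: int = 800) -> bytes:
    """Convert interleaved HWC u8 pixels into planar CHW u8 bytes."""
    channels = 3
    expected = width * height * channels
    if len(hwc) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(hwc)}")
    # Every third byte starting at offset c is channel c in row-major order,
    # so concatenating the three strided slices yields the CHW layout.
    return b"".join(hwc[c::channels] for c in range(channels))
```

A single 3x800x800 page packed this way is 1,920,000 bytes, which is the body size the curl example above sends.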
Batched raw endpoint for server/orchestrator experiments:
curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
-H 'content-type: application/octet-stream' \
--data-binary @pages_b8_800_chw_u8.bin
The request body is the concatenation of batch contiguous pages, each a 3x800x800 CHW u8 buffer, so its length must equal batch x 1,920,000 bytes. This endpoint returns layout boxes only; there is no secondary text detector or OCR fallback in this service.
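As a rough sketch of the batched body layout, the helpers below concatenate already-packed pages and split a body back into pages. The names build_batch_body and split_batch_body are hypothetical, and the length checks are an assumption about what a careful client would validate, not a description of the server's behavior.

```python
from typing import List, Sequence, Tuple

PAGE_BYTES = 3 * 800 * 800  # one 3x800x800 CHW u8 page


def build_batch_body(pages: Sequence[bytes]) -> Tuple[int, bytes]:
    """Return (batch, body) for a list of packed CHW u8 pages."""
    for index, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {index} has {len(page)} bytes, expected {PAGE_BYTES}")
    return len(pages), b"".join(pages)


def split_batch_body(body: bytes, batch: int) -> List[bytes]:
    """Split a contiguous batch body back into per-page buffers."""
    if len(body) != batch * PAGE_BYTES:
        raise ValueError(f"body has {len(body)} bytes, expected {batch * PAGE_BYTES}")
    return [body[i * PAGE_BYTES:(i + 1) * PAGE_BYTES] for i in range(batch)]
```

For batch=8, as in the curl example, the body is 15,360,000 bytes.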
Rust client example:
cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
The example decodes a rendered page, resizes with the same 800x800 Triangle filter expected by the service, packs CHW u8, sends one request, and prints the JSON response.
Response shape:
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {
          "label": "table",
          "class_id": 21,
          "score": 0.91,
          "bbox": [72, 140, 530, 420],
          "order": 3
        }
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
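A downstream consumer could filter and sort the boxes in this response along the following lines. This is a sketch under assumptions: the helper name select_boxes is hypothetical, and treating order as a sortable index is inferred from the field name, not confirmed by this README.

```python
import json
from typing import Any, Dict, List, Optional


def select_boxes(
    response_text: str,
    label: Optional[str] = None,
    min_score: float = 0.0,
) -> List[List[Dict[str, Any]]]:
    """Return, per page, the boxes matching label/min_score sorted by "order"."""
    payload = json.loads(response_text)
    per_page = []
    for result in payload["results"]:
        boxes = [
            box
            for box in result["boxes"]
            if (label is None or box["label"] == label) and box["score"] >= min_score
        ]
        # Assumption: "order" is an index that can be sorted ascending.
        boxes.sort(key=lambda box: box["order"])
        per_page.append(boxes)
    return per_page
```

For example, select_boxes(text, label="table") keeps only table boxes, which is the kind of "decide which boxes matter" step this README leaves to the orchestrator.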
Throughput Mode
Current clean compose result on RTX 4090 with the mounted validated engine:
DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms
For lower latency, use DOC_LAYOUT_WORKERS=2 and client concurrency around 32. That reproduced around 292 pages/s with p50 around 106 ms. The raw TensorRT engine with host/device transfers enabled reaches around 316 pages/s on this machine, so the 3-worker server at about 308 pages/s runs at roughly 97% of the practical engine ceiling.
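The latency and throughput figures above are consistent with Little's law (mean latency is approximately in-flight requests divided by throughput). The sketch below is only a sanity check, assuming one page per request and that the client keeps all of its connections busy. The function name expected_latency_ms is hypothetical.

```python
def expected_latency_ms(concurrency: int, pages_per_second: float) -> float:
    """Little's law estimate of mean latency for a saturated closed-loop client."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return 1000.0 * concurrency / pages_per_second


# 3 workers, concurrency 48, ~308 pages/s (measured p50 ~150 ms)
three_worker_ms = expected_latency_ms(48, 308)
# 2 workers, concurrency 32, ~292 pages/s (measured p50 ~106 ms)
two_worker_ms = expected_latency_ms(32, 292)
```

The estimates come out near 156 ms and 110 ms, close to the reported p50 values. This suggests that most of the observed latency is time spent queued behind other in-flight requests, which is why lowering client concurrency reduces latency at only a small cost in throughput.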
Run the benchmark against a running service:
python scripts/bench_http.py \
--url http://localhost:18082/v1/infer \
--concurrency 48 \
--requests 1920
The benchmark helper is a host-side client tool. It is not part of the model-server container.
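For readers without the helper script, a closed-loop load test of the same general shape can be sketched with the standard library. This is not the implementation of scripts/bench_http.py. The names percentile, make_infer_sender, and run_load are hypothetical, and the nearest-rank percentile and the 30-second timeout are assumptions.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def percentile(sorted_values: List[float], q: float) -> float:
    """Nearest-rank percentile of an ascending list (q in 0..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def make_infer_sender(url: str) -> Callable[[], None]:
    """Build a callable that POSTs the /v1/infer benchmark request once."""
    body = json.dumps({"return_boxes": False}).encode("utf-8")

    def send() -> None:
        request = urllib.request.Request(
            url, data=body, headers={"content-type": "application/json"}, method="POST"
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()

    return send


def run_load(send: Callable[[], None], concurrency: int, requests: int) -> Dict[str, float]:
    """Issue `requests` calls with `concurrency` threads and summarize timings."""

    def timed(_: int) -> float:
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(requests)))
    wall = time.perf_counter() - wall_start
    return {
        "requests": len(latencies),
        "throughput_rps": requests / wall,
        "p50_ms": 1000.0 * percentile(latencies, 50),
        "p95_ms": 1000.0 * percentile(latencies, 95),
    }
```

Against a running service, run_load(make_infer_sender("http://localhost:18082/v1/infer"), 48, 1920) would mirror the command above. Python threads add client-side overhead, so numbers from this sketch may come out lower than those from the real helper.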
Memory Checks
CUDA leak check with the mounted engine:
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
docker compose run --rm \
--entrypoint /usr/local/cuda/bin/compute-sanitizer \
-e DOC_LAYOUT_SELF_TEST_ITERS=2 \
-e DOC_LAYOUT_SELF_TEST_BATCH=2 \
doclayout \
--tool memcheck --leak-check full --error-exitcode 88 \
/usr/local/bin/doclayout-rust-batcher
Expected result:
LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
CPU leak check with valgrind is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so compute-sanitizer is the required CUDA gate.