# PP-DocLayoutV3 TensorRT Layout Service
Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, orchestration, Hugging Face download, or engine build.
The model is mounted into the container as a TensorRT engine:
```text
host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
```
Runtime contract:
```text
rendered page image or cached benchmark request
-> Rust HTTP server
-> dynamic batch queue
-> C++ TensorRT wrapper
-> mounted PP-DocLayoutV3 engine
-> layout boxes JSON
```
The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.
## Build
```bash
cd pp-doclayout-server
docker compose build doclayout
```
The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.
## Run
Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host:
```bash
cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
docker compose up
```
Health:
```bash
curl http://localhost:18082/health
```
Metrics:
```bash
curl http://localhost:18082/metrics
```
The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default.
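The override order described above can be sketched in Python. This is a minimal illustration, not the server's Rust code: the helper names `effective_threshold` and `filter_boxes` are hypothetical, the precedence (request over deployment over default) is inferred from the wording above, and the inclusive `>=` comparison is an assumption.

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in this README


def effective_threshold(request_value=None, env=None):
    """Resolve the threshold: per-request value, then env var, then default."""
    env = os.environ if env is None else env
    if request_value is not None:
        return float(request_value)
    env_value = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    if env_value not in (None, ""):
        return float(env_value)
    return DEFAULT_SCORE_THRESHOLD


def filter_boxes(boxes, threshold):
    """Keep boxes scoring at or above the threshold (inclusive bound is assumed)."""
    return [box for box in boxes if box["score"] >= threshold]
```

With this ordering, a deployment tuned for scanned forms can set `DOC_LAYOUT_SCORE_THRESHOLD=0.35` once, while a single request can still pass its own `score_threshold`.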
## Endpoints
`POST /v1/infer`
Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost.
```bash
curl -s http://localhost:18082/v1/infer \
-H 'content-type: application/json' \
-d '{"return_boxes": false}'
```
`POST /v1/layout`
Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named `file` or `files`.
The request body limit is explicit. `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default multipart limit is only 2 MB. This is a total request-body limit, not a per-page model batch limit.
```bash
curl -s http://localhost:18082/v1/layout \
-F files=@inputs/sample.png \
> outputs/sample_layout.json
curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
-F files=@inputs/sample.png \
> outputs/sample_layout_scan_recall.json
```
`POST /v1/layout_chw_u8`
Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.
```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
-H 'content-type: application/octet-stream' \
--data-binary @page_800_chw_u8.bin
```
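For orchestrators not written in Rust, the raw body could be prepared along these lines. This is a sketch using only the Python standard library: `hwc_to_chw` and `build_layout_request` are hypothetical helper names, the page is assumed to be already resized to 800x800, and the RGB channel order is an assumption that should be checked against the service.

```python
import urllib.parse
import urllib.request

SIDE = 800  # model input side length from this README


def hwc_to_chw(rgb: bytes, width: int, height: int) -> bytes:
    """Convert interleaved (HWC) 8-bit pixels to planar CHW bytes."""
    if len(rgb) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of pixel data")
    # rgb[c::3] takes every third byte starting at channel c: one full plane.
    return b"".join(rgb[c::3] for c in range(3))


def build_layout_request(base_url, chw, original_width, original_height,
                         score_threshold=None):
    """Build (but do not send) a POST request for /v1/layout_chw_u8."""
    if len(chw) != 3 * SIDE * SIDE:
        raise ValueError("body must be exactly 3x800x800 bytes")
    params = {
        "width": SIDE,
        "height": SIDE,
        "original_width": original_width,
        "original_height": original_height,
    }
    if score_threshold is not None:
        params["score_threshold"] = score_threshold
    url = f"{base_url}/v1/layout_chw_u8?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        data=chw,
        method="POST",
        headers={"content-type": "application/octet-stream"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. A single page body is 3 x 800 x 800 = 1,920,000 bytes.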
`POST /v1/layout_chw_u8_batch`
Batched raw endpoint for server/orchestrator experiments:
```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
-H 'content-type: application/octet-stream' \
--data-binary @pages_b8_800_chw_u8.bin
```
The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`. This endpoint is only for layout boxes; there is no secondary text detector or OCR fallback in this service.
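The batch body layout can be sketched as a plain concatenation. `pack_batch` and `batch_query` are hypothetical helpers, and the single `original_width`/`original_height` pair mirrors the curl example above, which suggests (but does not confirm) that all pages in one batch share the same original size.

```python
import urllib.parse

PAGE_BYTES = 3 * 800 * 800  # one 3x800x800 CHW u8 page


def pack_batch(pages):
    """Concatenate CHW u8 pages into one contiguous request body."""
    for index, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {index} has {len(page)} bytes, expected {PAGE_BYTES}")
    return b"".join(pages)


def batch_query(batch, original_width, original_height):
    """Build the query string used by the curl example above."""
    return urllib.parse.urlencode({
        "batch": batch,
        "width": 800,
        "height": 800,
        "original_width": original_width,
        "original_height": original_height,
    })
```

For `batch=8` the body is 8 x 1,920,000 = 15,360,000 bytes. Checking the length on the client before sending is a cheap way to catch packing mistakes.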
Rust client example:
```bash
cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
```
The example decodes a rendered page, resizes with the same 800x800 Triangle filter expected by the service, packs CHW `u8`, sends one request, and prints the JSON response.
Response shape:
```json
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {
          "label": "table",
          "class_id": 21,
          "score": 0.91,
          "bbox": [72, 140, 530, 420],
          "order": 3
        }
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
```
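An orchestrator deciding which boxes matter might consume this response as follows. This is a sketch with a hypothetical `select_boxes` helper. It assumes `results` holds one entry per page and that `order` is a per-page ordering index, neither of which is stated explicitly here.

```python
import json


def select_boxes(response_text, label=None, min_score=0.0):
    """Return (page_index, boxes) pairs, filtered and sorted by `order`."""
    payload = json.loads(response_text)
    selected = []
    for page_index, result in enumerate(payload["results"]):
        boxes = [
            box for box in result["boxes"]
            if (label is None or box["label"] == label)
            and box["score"] >= min_score
        ]
        boxes.sort(key=lambda box: box["order"])
        selected.append((page_index, boxes))
    return selected
```

For example, `select_boxes(text, label="table")` keeps only table regions, which could then be routed to downstream table processing outside this service.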
## Throughput Mode
Current clean compose result on RTX 4090 with the mounted validated engine:
```text
DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms
```
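The names `DOC_LAYOUT_MAX_BATCH` and `DOC_LAYOUT_MAX_DELAY_US` suggest a size-or-deadline batching policy. The sketch below illustrates that general technique in Python. It is not the server's Rust implementation, `collect_batch` is a hypothetical name, and the exact semantics of the two settings are an assumption.

```python
import queue
import time


def collect_batch(requests, max_batch=8, max_delay_us=1000):
    """Wait for one item, then gather more until the batch is full
    or the delay budget (measured from the first item) runs out."""
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_delay_us / 1_000_000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Under this policy a larger delay budget tends to fill batches more completely at the cost of extra queue wait, which is one plausible reading of the `queue_wait_us` field in the response.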
For lower latency, use `DOC_LAYOUT_WORKERS=2` and client concurrency around 32; that configuration reproduced about 292 pages/s with p50 around 106 ms. The raw TensorRT engine with host/device transfers enabled reaches about 316 pages/s on this machine, so the 3-worker server (about 308 pages/s) runs at roughly 97% of the practical engine ceiling.
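The reported numbers can be sanity-checked with Little's law (mean latency ≈ concurrency / throughput). The sketch below assumes one page per request and a closed-loop client that keeps all connections busy. It estimates mean latency, so it should land near the reported p50 rather than match it exactly.

```python
def expected_latency_ms(concurrency, throughput_pages_per_s):
    """Little's law: mean time in system = items in flight / throughput."""
    return concurrency / throughput_pages_per_s * 1000.0


def ceiling_fraction(server_pages_per_s, engine_pages_per_s):
    """Share of the raw engine throughput that the server achieves."""
    return server_pages_per_s / engine_pages_per_s


if __name__ == "__main__":
    print(f"3 workers, concurrency 48: {expected_latency_ms(48, 308):.0f} ms")
    print(f"2 workers, concurrency 32: {expected_latency_ms(32, 292):.0f} ms")
    print(f"server vs raw engine: {ceiling_fraction(308, 316):.1%}")
```

The estimates come out at about 156 ms and 110 ms, in line with the measured p50 values of 150 ms and 106 ms, which suggests the benchmark figures are internally consistent.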
Run the benchmark against a running service:
```bash
python scripts/bench_http.py \
--url http://localhost:18082/v1/infer \
--concurrency 48 \
--requests 1920
```
The benchmark helper is a host-side client tool. It is not part of the model-server container.
## Memory Checks
CUDA leak check with the mounted engine:
```bash
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
docker compose run --rm \
--entrypoint /usr/local/cuda/bin/compute-sanitizer \
-e DOC_LAYOUT_SELF_TEST_ITERS=2 \
-e DOC_LAYOUT_SELF_TEST_BATCH=2 \
doclayout \
--tool memcheck --leak-check full --error-exitcode 88 \
/usr/local/bin/doclayout-rust-batcher
```
Expected result:
```text
LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
```
CPU leak check with `valgrind` is useful only for definite/indirect leaks in our code. TensorRT/CUDA libraries emit reachable/possibly-lost allocations and uninitialized-value noise, so `compute-sanitizer` is the required CUDA gate.