# PP-DocLayoutV3 TensorRT Layout Service

Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, orchestration, Hugging Face download, or engine build.

The model is mounted into the container as a TensorRT engine:

```text
host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
```

Runtime contract:

```text
rendered page image or cached benchmark request
-> Rust HTTP server
-> dynamic batch queue
-> C++ TensorRT wrapper
-> mounted PP-DocLayoutV3 engine
-> layout boxes JSON
```
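The `dynamic batch queue` stage can be pictured with a small sketch. This is not the Rust server's code: `collect_batch` is a hypothetical helper, and its two parameters only mirror the roles that `DOC_LAYOUT_MAX_BATCH` and `DOC_LAYOUT_MAX_DELAY_US` appear to play. It assumes a batch is flushed when it is full or when the delay budget since the first request runs out, whichever comes first.

```python
import queue
import time


def collect_batch(q, max_batch=8, max_delay_us=1000):
    """Wait for one request, then gather more until the batch is full or the delay expires."""
    batch = [q.get()]  # block until at least one request is available
    deadline = time.monotonic() + max_delay_us / 1_000_000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget spent: flush a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

Under this model, a larger delay trades per-request latency for fuller batches, which is the trade-off the throughput settings tune.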

The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.

## Build

```bash
cd pp-doclayout-server
docker compose build doclayout
```

The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.

## Run

Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host:

```bash
cd pp-doclayout-server
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose up
```

Health:

```bash
curl http://localhost:18082/health
```

Metrics:

```bash
curl http://localhost:18082/metrics
```

The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default.
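One way to read the override rules is as a precedence chain. The sketch below assumes a per-request value wins over the deployment variable, which in turn wins over the built-in default. The function name `resolve_threshold` is hypothetical, and the real server may resolve or validate the value differently.

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.5  # default stated in this README


def resolve_threshold(request_value=None, env=os.environ):
    """Pick the effective threshold: request, then deployment env var, then default (assumed order)."""
    if request_value is not None:
        return float(request_value)
    env_value = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    if env_value is not None:
        return float(env_value)
    return DEFAULT_SCORE_THRESHOLD
```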

## Endpoints

`POST /v1/infer`

Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost.

```bash
curl -s http://localhost:18082/v1/infer \
  -H 'content-type: application/json' \
  -d '{"return_boxes": false}'
```

`POST /v1/layout`

Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named `file` or `files`.

The request body limit is explicit. `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default multipart limit is only 2 MB. This is a total request-body limit, not a per-page model batch limit.

```bash
curl -s http://localhost:18082/v1/layout \
  -F files=@inputs/sample.png \
  > outputs/sample_layout.json

curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
  -F files=@inputs/sample.png \
  > outputs/sample_layout_scan_recall.json
```
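For callers that cannot shell out to `curl`, the same multipart upload can be built with the Python standard library. This is a minimal sketch: `build_layout_request` is a hypothetical helper, and the `image/png` part type is an assumption, since this README does not say whether the server inspects it.

```python
import uuid
import urllib.request


def build_layout_request(url, pages):
    """Build a multipart POST. pages is a list of (filename, image_bytes) tuples."""
    boundary = uuid.uuid4().hex
    parts = []
    for filename, data in pages:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            "Content-Type: image/png\r\n\r\n"  # assumed part type
        ).encode("ascii")
        parts.append(header + data + b"\r\n")
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. The total body must stay under `DOC_LAYOUT_MAX_UPLOAD_MB`.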


`POST /v1/layout_chw_u8`

Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.

```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
  -H 'content-type: application/octet-stream' \
  --data-binary @page_800_chw_u8.bin
```
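The body layout can be sketched in a few lines of Python. The helper names `hwc_to_chw_u8` and `layout_chw_url` are hypothetical. The sketch assumes the page has already been resized to 800x800 and is held as interleaved 8-bit pixels. It keeps the input's channel order unchanged, because the order the engine expects is not stated here.

```python
from urllib.parse import urlencode


def hwc_to_chw_u8(hwc, width=800, height=800):
    """Convert interleaved HWC bytes (e.g. RGBRGB...) into planar CHW bytes."""
    expected = width * height * 3
    if len(hwc) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(hwc)}")
    # hwc[c::3] takes every third byte, i.e. one full channel plane in row-major order.
    return b"".join(hwc[c::3] for c in range(3))


def layout_chw_url(base, original_width, original_height, score_threshold=None):
    """Build the query string used by the curl example."""
    params = {
        "width": 800,
        "height": 800,
        "original_width": original_width,
        "original_height": original_height,
    }
    if score_threshold is not None:
        params["score_threshold"] = score_threshold
    return f"{base}/v1/layout_chw_u8?{urlencode(params)}"
```

A single page body is `3 * 800 * 800 = 1,920,000` bytes.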

`POST /v1/layout_chw_u8_batch`

Batched raw endpoint for server/orchestrator experiments:

```bash
curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
  -H 'content-type: application/octet-stream' \
  --data-binary @pages_b8_800_chw_u8.bin
```

The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`. This endpoint is only for layout boxes; there is no secondary text detector or OCR fallback in this service.
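As a sanity check on body size, the batch body is the per-page planes concatenated in order. In the sketch below, `pack_batch` is a hypothetical helper and each page is assumed to be packed as CHW `u8` already.

```python
PAGE_BYTES = 3 * 800 * 800  # one CHW u8 page: 1,920,000 bytes


def pack_batch(pages):
    """Concatenate already-packed CHW u8 pages into one contiguous request body."""
    for index, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {index}: expected {PAGE_BYTES} bytes, got {len(page)}")
    return b"".join(pages)
```

With `batch=8` the body is `8 * 1,920,000 = 15,360,000` bytes, and the `batch` query parameter should equal `len(pages)`.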

Rust client example:

```bash
cd rust-batcher
cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
```

The example decodes a rendered page, resizes with the same 800x800 Triangle filter expected by the service, packs CHW `u8`, sends one request, and prints the JSON response.

Response shape:

```json
{
  "pages": 1,
  "results": [
    {
      "boxes": [
        {
          "label": "table",
          "class_id": 21,
          "score": 0.91,
          "bbox": [72, 140, 530, 420],
          "order": 3
        }
      ],
      "batch_size": 1,
      "queue_wait_us": 1000,
      "infer_us": 25000
    }
  ]
}
```
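An orchestrator would typically walk this structure to pick the regions it cares about. The sketch below is hypothetical: `boxes_with_label` and `timings_ms` are not part of the service. It relies only on the field names in the sample response, and it treats `order` as a sortable per-page reading order, which is an assumption.

```python
import json


def boxes_with_label(response, label):
    """Return (page_index, box) pairs for one label, sorted by page and then by order."""
    found = []
    for page_index, result in enumerate(response["results"]):
        for box in result["boxes"]:
            if box["label"] == label:
                found.append((page_index, box))
    found.sort(key=lambda item: (item[0], item[1]["order"]))
    return found


def timings_ms(result):
    """Convert the microsecond timing fields of one result to milliseconds."""
    return result["queue_wait_us"] / 1000, result["infer_us"] / 1000


sample = json.loads(
    '{"pages": 1, "results": [{"boxes": [{"label": "table", "class_id": 21,'
    ' "score": 0.91, "bbox": [72, 140, 530, 420], "order": 3}],'
    ' "batch_size": 1, "queue_wait_us": 1000, "infer_us": 25000}]}'
)
tables = boxes_with_label(sample, "table")
```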

## Throughput Mode

Current result from a clean `docker compose` run on an RTX 4090 with the mounted, validated engine:

```text
DOC_LAYOUT_WORKERS=3
DOC_LAYOUT_MAX_BATCH=8
DOC_LAYOUT_MAX_DELAY_US=1000
DOC_LAYOUT_MAX_UPLOAD_MB=512
client concurrency = 48
throughput ~= 308 pages/s
p50 ~= 150 ms
p95 ~= 167 ms
```

For lower latency, use `DOC_LAYOUT_WORKERS=2` with client concurrency around 32; that configuration reproduced around 292 pages/s with p50 around 106 ms. The raw TensorRT engine with host/device transfers enabled reaches around 316 pages/s on this machine, so the 3-worker server at about 308 pages/s runs at roughly 97% of the practical engine ceiling.
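The reported latencies can be cross-checked against the throughput figures with Little's law. The sketch assumes a closed-loop client that keeps `concurrency` requests in flight, with one page per request. It estimates mean latency, so it will only roughly match a measured p50.

```python
def expected_latency_ms(concurrency, pages_per_s):
    """Little's law: mean time in system = requests in flight / throughput."""
    return concurrency / pages_per_s * 1000.0


three_workers = expected_latency_ms(48, 308)  # about 155.8 ms, measured p50 ~150 ms
two_workers = expected_latency_ms(32, 292)    # about 109.6 ms, measured p50 ~106 ms
```

Both estimates land within a few milliseconds of the measured medians, which suggests the benchmark numbers are internally consistent.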

Run the benchmark against a running service:

```bash
python scripts/bench_http.py \
  --url http://localhost:18082/v1/infer \
  --concurrency 48 \
  --requests 1920
```

The benchmark helper is a host-side client tool. It is not part of the model-server container.

## Memory Checks

CUDA leak check with the mounted engine:

```bash
DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
  docker compose run --rm \
  --entrypoint /usr/local/cuda/bin/compute-sanitizer \
  -e DOC_LAYOUT_SELF_TEST_ITERS=2 \
  -e DOC_LAYOUT_SELF_TEST_BATCH=2 \
  doclayout \
  --tool memcheck --leak-check full --error-exitcode 88 \
  /usr/local/bin/doclayout-rust-batcher
```

Expected result:

```text
LEAK SUMMARY: 0 bytes leaked in 0 allocations
ERROR SUMMARY: 0 errors
```
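In CI, this gate could be automated by checking both the exit code and the two summary lines. The helper `sanitizer_passed` is hypothetical, and it assumes the summaries appear verbatim in the captured output, possibly behind a line prefix.

```python
def sanitizer_passed(output, returncode):
    """Pass only on a zero exit code plus clean leak and error summaries."""
    return (
        returncode == 0
        and "LEAK SUMMARY: 0 bytes leaked in 0 allocations" in output
        and "ERROR SUMMARY: 0 errors" in output
    )
```

Checking the text as well as the exit code guards against a run that exits cleanly without ever reaching the leak summary.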

A CPU leak check with `valgrind` is useful only for finding definite or indirect leaks in our own code. The TensorRT/CUDA libraries emit reachable and possibly-lost allocations plus uninitialized-value noise, so `compute-sanitizer` is the required gate for CUDA memory.