	# PP-DocLayoutV3 TensorRT Layout Service

	Standalone PP-DocLayoutV3 layout model server. It does not run Paddle, PaddleX, OCR, PDF rendering, or orchestration, and it does not download models from Hugging Face or build the engine.

	The model is mounted into the container as a TensorRT engine:

	```text
	host pp_doclayout_v3.engine -> /models/pp_doclayout_v3.engine
	```

	Runtime contract:

	```text
	rendered page image or cached benchmark request
	-> Rust HTTP server
	-> dynamic batch queue
	-> C++ TensorRT wrapper
	-> mounted PP-DocLayoutV3 engine
	-> layout boxes JSON
	```
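
The dynamic batch queue step can be illustrated with a small host-side sketch. This is not the Rust implementation; it only shows the general technique of waiting for a first request and then collecting more until either a maximum batch size or a maximum delay is reached. The function name `collect_batch` and its parameters are hypothetical, with defaults chosen to mirror `DOC_LAYOUT_MAX_BATCH=8` and `DOC_LAYOUT_MAX_DELAY_US=1000`.

```python
import queue
import time


def collect_batch(q, max_batch=8, max_delay_s=0.001):
    """Block for the first item, then gather more until the batch is full
    or max_delay_s has elapsed since the first item arrived."""
    batch = [q.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget spent: run a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch
```

With this policy a busy queue produces full batches, while a lone request waits at most `max_delay_s` before it runs on its own.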

	The orchestrator stays outside this container. It should render PDFs, manage model artifacts, decide which boxes matter, and route downstream OCR/table work.

	## Build

	```bash
	cd pp-doclayout-server
	docker compose build doclayout
	```

	The runtime image does not install Python packages and does not copy Python code. It contains the Rust server binary and links to TensorRT/CUDA libraries from the NVIDIA TensorRT base image.

	## Run

	Set `DOC_LAYOUT_ENGINE_HOST` to the engine file on the host:

	```bash
	cd pp-doclayout-server
	DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
	docker compose up
	```

	Health:

	```bash
	curl http://localhost:18082/health
	```

	Metrics:

	```bash
	curl http://localhost:18082/metrics
	```

	The default layout score threshold is `0.5`. Override per deployment with `DOC_LAYOUT_SCORE_THRESHOLD`, or per request with `score_threshold`. For example, scanned forms may use `0.35` while clean papers can keep the default.
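
The override order described above can be sketched as follows. This is an illustration, not the server's code: it assumes a per-request value takes precedence over the environment variable, and that boxes scoring exactly at the threshold are kept. The helper names `resolve_threshold` and `keep_boxes` are hypothetical.

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.5  # documented default


def resolve_threshold(request_value=None, env=None):
    """Per-request value wins, then DOC_LAYOUT_SCORE_THRESHOLD, then 0.5.
    The precedence order is an assumption, not confirmed by the service."""
    env = os.environ if env is None else env
    if request_value is not None:
        return float(request_value)
    raw = env.get("DOC_LAYOUT_SCORE_THRESHOLD")
    return float(raw) if raw else DEFAULT_SCORE_THRESHOLD


def keep_boxes(boxes, threshold):
    """Keep boxes whose score is at or above the threshold (>= is assumed)."""
    return [b for b in boxes if b["score"] >= threshold]
```

For example, a box with score `0.4` is dropped at the default threshold but kept when a scanned-form request passes `0.35`.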

	## Endpoints

	`POST /v1/infer`

	Uses the configured `DOC_LAYOUT_SAMPLE_IMAGE` and is intended for model-server throughput benchmarking without upload/render cost.

	```bash
	curl -s http://localhost:18082/v1/infer \
	  -H 'content-type: application/json' \
	  -d '{"return_boxes": false}'
	```

	`POST /v1/layout`

	Convenience image endpoint for integration testing. Send already-rendered page images as multipart fields named `file` or `files`.

	The request body limit is set explicitly: `DOC_LAYOUT_MAX_UPLOAD_MB` defaults to `512`, because Axum's default request-body limit is only 2 MB. The limit applies to the total request body, not to each page or to the model batch.

	```bash
	curl -s http://localhost:18082/v1/layout \
	  -F files=@inputs/sample.png \
	  > outputs/sample_layout.json

	curl -s 'http://localhost:18082/v1/layout?score_threshold=0.35' \
	  -F files=@inputs/sample.png \
	  > outputs/sample_layout_scan_recall.json
	```
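
Because the upload limit covers the whole request body, a client that sends many pages to `/v1/layout` may want to split them into several requests. The sketch below groups file sizes greedily so each group stays under the limit. It assumes the limit is counted in MiB and adds a rough per-file allowance for multipart headers; both are assumptions, and `group_by_upload_limit` is a hypothetical helper.

```python
def group_by_upload_limit(sizes, limit_mb=512, overhead_per_file=1024):
    """Greedily group file indices so each group's total stays under the limit.

    sizes: file sizes in bytes, in upload order.
    overhead_per_file: assumed allowance for multipart part headers.
    """
    limit = limit_mb * 1024 * 1024  # assumes MB means MiB
    groups, current, used = [], [], 0
    for idx, size in enumerate(sizes):
        cost = size + overhead_per_file
        if cost > limit:
            raise ValueError(f"file {idx} alone exceeds the upload limit")
        if current and used + cost > limit:
            groups.append(current)  # close the group before it overflows
            current, used = [], 0
        current.append(idx)
        used += cost
    if current:
        groups.append(current)
    return groups
```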


	`POST /v1/layout_chw_u8`

	Production-oriented endpoint for an external orchestrator. Send a raw `3x800x800` CHW `u8` body, with dimensions and original page size in query parameters. This avoids image codec work inside the model service while still returning boxes in original page coordinates.

	```bash
	curl -s 'http://localhost:18082/v1/layout_chw_u8?width=800&height=800&original_width=1587&original_height=2243&score_threshold=0.35' \
	  -H 'content-type: application/octet-stream' \
	  --data-binary @page_800_chw_u8.bin
	```
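
A minimal Python sketch of building such a request, using only the standard library. It assumes the caller already holds an 800x800 page as interleaved 3-channel bytes (HWC); the channel order and any normalization the engine expects are not stated here, so treat those as open questions. The helper names `hwc_to_chw`, `layout_url`, and `post_page` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

SIDE = 800  # model input is 3x800x800 per the endpoint description


def hwc_to_chw(hwc, width=SIDE, height=SIDE):
    """Convert interleaved 3-channel pixels (HWC) into planar CHW bytes."""
    if len(hwc) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes")
    # Every third byte starting at offset c is channel c in row-major order.
    return b"".join(hwc[c::3] for c in range(3))


def layout_url(base, original_width, original_height, score_threshold=None):
    """Build the query string used in the curl example above."""
    params = {
        "width": SIDE,
        "height": SIDE,
        "original_width": original_width,
        "original_height": original_height,
    }
    if score_threshold is not None:
        params["score_threshold"] = score_threshold
    return f"{base}/v1/layout_chw_u8?{urlencode(params)}"


def post_page(base, chw, original_width, original_height, score_threshold=None):
    """POST one packed page and return the decoded JSON response."""
    req = urllib.request.Request(
        layout_url(base, original_width, original_height, score_threshold),
        data=chw,
        headers={"content-type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```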

	`POST /v1/layout_chw_u8_batch`

	Batched raw endpoint for server/orchestrator experiments.

	```bash
	curl -s 'http://localhost:18082/v1/layout_chw_u8_batch?batch=8&width=800&height=800&original_width=1587&original_height=2243' \
	  -H 'content-type: application/octet-stream' \
	  --data-binary @pages_b8_800_chw_u8.bin
	```

	The request body is `batch` contiguous pages, each `3x800x800` CHW `u8`. This endpoint is only for layout boxes; there is no secondary text detector or OCR fallback in this service.
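
The batch body layout can be sketched in a few lines: each page is `3 * 800 * 800 = 1,920,000` bytes, and the pages are concatenated in order. `pack_batch` and `split_batch` are hypothetical helper names used only for illustration.

```python
PAGE_BYTES = 3 * 800 * 800  # 1,920,000 bytes per CHW u8 page


def pack_batch(pages):
    """Concatenate CHW u8 pages into one contiguous request body."""
    for i, page in enumerate(pages):
        if len(page) != PAGE_BYTES:
            raise ValueError(f"page {i}: expected {PAGE_BYTES} bytes, got {len(page)}")
    return b"".join(pages)


def split_batch(body, batch):
    """Inverse of pack_batch: slice a body back into `batch` pages."""
    if len(body) != batch * PAGE_BYTES:
        raise ValueError("body length does not match batch * 3 * 800 * 800")
    return [body[i * PAGE_BYTES:(i + 1) * PAGE_BYTES] for i in range(batch)]
```

A batch of 8 pages is therefore a 15,360,000-byte body, and the `batch` query parameter should match the number of pages packed.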

	Rust client example:

	```bash
	cd rust-batcher
	cargo run --example optimized_client -- /path/to/rendered_page.png --server http://localhost:18082
	```

	The example decodes a rendered page, resizes it to 800x800 with the Triangle filter the service expects, packs the pixels as CHW `u8`, sends one request, and prints the JSON response.

	Response shape:

	```json
	{
	  "pages": 1,
	  "results": [
	    {
	      "boxes": [
	        {
	          "label": "table",
	          "class_id": 21,
	          "score": 0.91,
	          "bbox": [72, 140, 530, 420],
	          "order": 3
	        }
	      ],
	      "batch_size": 1,
	      "queue_wait_us": 1000,
	      "infer_us": 25000
	    }
	  ]
	}
	```
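
A client might consume this response along the following lines. The sketch assumes `order` is a per-page ordering index that can be sorted ascending; the meaning of `bbox` beyond "four numbers in original page coordinates" is left uninterpreted. Both function names are hypothetical.

```python
def boxes_in_order(response, page=0, labels=None):
    """Return one page's boxes sorted by `order`, optionally filtered by label."""
    boxes = response["results"][page]["boxes"]
    if labels is not None:
        boxes = [b for b in boxes if b["label"] in labels]
    return sorted(boxes, key=lambda b: b["order"])


def timings_ms(response, page=0):
    """Convert the per-page microsecond timings to milliseconds."""
    result = response["results"][page]
    return {
        "queue_wait_ms": result["queue_wait_us"] / 1000.0,
        "infer_ms": result["infer_us"] / 1000.0,
    }
```

An orchestrator could, for instance, call `boxes_in_order(resp, labels={"table"})` to pick the regions it routes to a downstream table model.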

	## Throughput Mode

	Current result from a clean `docker compose` run on an RTX 4090 with the mounted, validated engine:

	```text
	DOC_LAYOUT_WORKERS=3
	DOC_LAYOUT_MAX_BATCH=8
	DOC_LAYOUT_MAX_DELAY_US=1000
	DOC_LAYOUT_MAX_UPLOAD_MB=512
	client concurrency = 48
	throughput ~= 308 pages/s
	p50 ~= 150 ms
	p95 ~= 167 ms
	```

	For lower latency, use `DOC_LAYOUT_WORKERS=2` and a client concurrency of around 32. That configuration reached about 292 pages/s with p50 around 106 ms. The raw TensorRT engine with host/device transfers enabled reaches about 316 pages/s on this machine, so the 3-worker server, at about 308 pages/s (roughly 97% of that figure), is close to the practical engine ceiling.
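
These figures can be sanity-checked with Little's law (requests in flight is roughly throughput times latency). The sketch below is only arithmetic on the numbers quoted above; it uses p50 as a stand-in for mean latency and assumes each request carries one page, both of which are approximations.

```python
def implied_concurrency(pages_per_s, latency_ms):
    """Little's law: in-flight requests ~= throughput * latency."""
    return pages_per_s * latency_ms / 1000.0


def fraction_of_ceiling(server_pages_per_s, engine_pages_per_s):
    """Share of the raw engine throughput that the server reaches."""
    return server_pages_per_s / engine_pages_per_s


# (label, throughput in pages/s, p50 in ms, client concurrency used)
runs = [("3 workers", 308, 150, 48), ("2 workers", 292, 106, 32)]
for label, tput, p50, clients in runs:
    print(label, round(implied_concurrency(tput, p50), 1), "in flight vs", clients, "clients")
print("server / engine:", round(fraction_of_ceiling(308, 316), 3))
```

The implied in-flight counts (about 46 and about 31) sit just under the client concurrency of 48 and 32, which is consistent with the clients keeping the queue full in both runs.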

	Run the benchmark against a running service:

	```bash
	python scripts/bench_http.py \
	  --url http://localhost:18082/v1/infer \
	  --concurrency 48 \
	  --requests 1920
	```

	The benchmark helper is a host-side client tool. It is not part of the model-server container.
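
For readers without the repository at hand, the shape of such a host-side benchmark can be sketched with the standard library. This is not `scripts/bench_http.py`; the names `run_bench`, `percentile`, and `post_infer` are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def run_bench(send, concurrency, requests):
    """Call send() `requests` times from `concurrency` threads and summarize."""
    def timed(_):
        start = time.perf_counter()
        send()
        return (time.perf_counter() - start) * 1000.0  # latency in ms

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(requests)))
    wall = time.perf_counter() - wall_start
    return {
        "requests_per_s": requests / wall,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }


def post_infer(url="http://localhost:18082/v1/infer"):
    """Send one benchmark request with the same body as the curl example."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"return_boxes": False}).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

Against a running service, `run_bench(post_infer, concurrency=48, requests=1920)` would mirror the command above. Python threads add client-side overhead, so numbers from a sketch like this may understate what the server can sustain.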

	## Memory Checks

	CUDA leak check with the mounted engine:

	```bash
	DOC_LAYOUT_ENGINE_HOST=/path/to/pp_doclayout_v3.engine \
	docker compose run --rm \
	  --entrypoint /usr/local/cuda/bin/compute-sanitizer \
	  -e DOC_LAYOUT_SELF_TEST_ITERS=2 \
	  -e DOC_LAYOUT_SELF_TEST_BATCH=2 \
	  doclayout \
	  --tool memcheck --leak-check full --error-exitcode 88 \
	  /usr/local/bin/doclayout-rust-batcher
	```

	Expected result:

	```text
	LEAK SUMMARY: 0 bytes leaked in 0 allocations
	ERROR SUMMARY: 0 errors
	```
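
If this check runs in CI, the captured log can be verified alongside the exit code. The sketch below searches for the two summary lines shown above and treats a missing line as a failure; `sanitizer_clean` is a hypothetical helper, and the exact log prefix emitted by `compute-sanitizer` is not assumed.

```python
import re


def sanitizer_clean(log_text):
    """Return True only if both summaries are present and report zero."""
    leak = re.search(r"LEAK SUMMARY: (\d+) bytes leaked in (\d+) allocations", log_text)
    errors = re.search(r"ERROR SUMMARY: (\d+) errors?", log_text)
    if leak is None or errors is None:
        return False  # a missing summary means the check did not complete
    return (
        int(leak.group(1)) == 0
        and int(leak.group(2)) == 0
        and int(errors.group(1)) == 0
    )
```

The `--error-exitcode 88` flag in the command above already makes the run fail on errors, so a log check like this is a second line of defense rather than a replacement.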

	A CPU leak check with `valgrind` is useful only for finding definite or indirect leaks in our own code. The TensorRT/CUDA libraries produce reachable and possibly-lost allocations plus uninitialized-value noise, so `compute-sanitizer` is the required gate for CUDA memory.