---
license: apache-2.0
library_name: onnx
tags:
- onnx
- object-detection
- layout-analysis
- document-understanding
- paddleocr
base_model: PaddlePaddle/PP-DocLayoutV3_safetensors
---
# PP-DocLayoutV3: ONNX export
ONNX export of [PaddlePaddle/PP-DocLayoutV3_safetensors](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3_safetensors), the layout-detection model used in the PaddleOCR-VL-1.5 pipeline.
This export preserves all four model heads: classification logits, bounding boxes, instance-segmentation masks, and reading-order logits. The original PaddlePaddle release outputs polygon points and reading order via a postprocessor that consumes these four tensors.
## Files
| file | size | purpose |
|---|---|---|
| `PP-DocLayoutV3.onnx` | ~5 MB | model graph (references external weights) |
| `PP-DocLayoutV3.onnx.data` | ~137 MB | weight tensors (must sit alongside `.onnx`) |
| `config.json` | - | original model config (HuggingFace-style) |
| `preprocessor_config.json` | - | image preprocessing parameters (800×800 resize, normalize) |
| `inference.yml` | - | original PaddlePaddle inference config (reference only) |
## Inputs / outputs
**Input** (single tensor):
| name | shape | dtype | notes |
|---|---|---|---|
| `pixel_values` | `(B, 3, 800, 800)` | `float32` | Resize image to 800×800, rescale by `1/255`, mean=`[0,0,0]`, std=`[1,1,1]` (matches `preprocessor_config.json`). |
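The preprocessing described in the table can be sketched in NumPy. This is a minimal sketch, not the official preprocessor: the function names `load_resized_rgb` and `to_pixel_values` are hypothetical, Pillow is an assumed dependency, and the bicubic filter is a guess that should be checked against the `resample` setting in `preprocessor_config.json`.

```python
import numpy as np


def load_resized_rgb(path, size=800):
    """Load an image file and resize it to size x size RGB, as the table describes."""
    from PIL import Image  # assumed dependency, imported lazily

    # Bicubic is an assumption; match the resample value in preprocessor_config.json.
    img = Image.open(path).convert("RGB")
    img = img.resize((size, size), Image.Resampling.BICUBIC)
    return np.asarray(img)


def to_pixel_values(rgb_uint8):
    """Convert an (H, W, 3) uint8 RGB array to a (1, 3, H, W) float32 tensor."""
    x = rgb_uint8.astype(np.float32) / 255.0  # rescale by 1/255
    # mean=[0,0,0] and std=[1,1,1] make normalization a no-op, so it is skipped.
    x = np.transpose(x, (2, 0, 1))  # HWC -> CHW
    return np.ascontiguousarray(x[None, ...])  # add the batch dimension
```

Calling `to_pixel_values(load_resized_rgb("page.png"))` (the file name is a placeholder) should then produce an array that matches the `pixel_values` input shape.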
**Outputs** (four tensors):
| name | shape | notes |
|---|---|---|
| `logits` | `(B, 300, 25)` | per-query class logits over 25 layout classes |
| `pred_boxes` | `(B, 300, 4)` | normalized `(cx, cy, w, h)`; convert via standard DETR decoding |
| `out_masks` | `(B, 300, 200, 200)` | per-query instance-segmentation masks; cv2 contour extraction yields polygon points |
| `order_logits` | `(B, 300, 300)` | per-query permutation logits for reading order; argmax / Sinkhorn for ordering |
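For the `pred_boxes` row, DETR-style decoding amounts to converting center format to corner format and scaling by the target image size. The sketch below shows that step together with a simple per-query score. The helper names are hypothetical, and sigmoid scoring with a 0.5 threshold is an assumption borrowed from RT-DETR-style detectors; the official postprocessor may select queries differently.

```python
import numpy as np


def decode_boxes(pred_boxes, target_w, target_h):
    """Convert normalized (cx, cy, w, h) boxes to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = np.moveaxis(pred_boxes, -1, 0)
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    scale = np.array([target_w, target_h, target_w, target_h], dtype=xyxy.dtype)
    return xyxy * scale


def score_queries(logits, threshold=0.5):
    """Return per-query best score, best label, and a keep mask.

    Sigmoid scoring and the threshold value are assumptions, not confirmed
    behavior of the official postprocessor.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    labels = probs.argmax(axis=-1)
    scores = probs.max(axis=-1)
    return scores, labels, scores >= threshold
```

Both helpers keep the leading `(B, 300)` dimensions, so the keep mask can index the decoded boxes directly.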
## Postprocessing
The official postprocessor lives in `transformers.models.pp_doclayout_v3.image_processing_pp_doclayout_v3.PPDocLayoutV3ImageProcessor.post_process_object_detection`. It takes the four output tensors plus a `target_sizes` tensor and returns:
```
{
  "scores": (N,) float32
  "labels": (N,) int64
  "boxes": (N, 4) float32, axis-aligned (x1, y1, x2, y2) in target coords
  "polygon_points": list[N], each (P, 2) int polygon vertices in target coords
  "order_seq": (N,) int64, reading-order index
}
```
You can use that postprocessor directly (`transformers >= 5.4`, requires `torch` and `cv2`) or port it to numpy + cv2 for a torch-free runtime.
## Loading
```python
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("PP-DocLayoutV3.onnx", providers=["CPUExecutionProvider"])
# preprocess to 800x800 RGB float32, normalize per preprocessor_config.json
pixel_values = ... # shape (1, 3, 800, 800), float32
logits, pred_boxes, out_masks, order_logits = sess.run(
["logits", "pred_boxes", "out_masks", "order_logits"],
{"pixel_values": pixel_values},
)
```
The `.onnx.data` sidecar is loaded automatically by onnxruntime via the relative `location` reference embedded in the graph. Both files must sit in the same directory.
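Because a missing sidecar only surfaces as a load error, a small pre-flight check can be convenient. The helper below is a hypothetical sketch that relies only on the two file names listed in this README.

```python
from pathlib import Path


def missing_model_files(model_dir, stem="PP-DocLayoutV3"):
    """Return the names of required model files that are absent from model_dir."""
    required = [f"{stem}.onnx", f"{stem}.onnx.data"]
    return [name for name in required if not (Path(model_dir) / name).is_file()]
```

An empty list means both the graph and its weights are in place, so the `InferenceSession` call above has what it needs.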
## How this was exported
1. `pip install transformers==5.6.2 torch==2.11 onnx==1.21 onnxscript`
2. `model = AutoModelForObjectDetection.from_pretrained("PaddlePaddle/PP-DocLayoutV3_safetensors").eval()`
3. Wrap the model so `forward(pixel_values)` returns `(logits, pred_boxes, out_masks, order_logits)`.
4. `torch.onnx.export(wrapped, (pixel_values,), "PP-DocLayoutV3.onnx", opset_version=18, dynamo=True, dynamic_axes={"pixel_values": {0: "batch"}})`
5. Re-save with `onnx.save(..., save_as_external_data=True, location="PP-DocLayoutV3.onnx.data")` to standardize the sidecar filename.
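Step 3 above could be implemented with a thin `torch.nn.Module` such as the one sketched here. The class name `FourHeadWrapper` is hypothetical, and the sketch assumes the model's output object exposes attributes named after the four exported tensors; the wrapper actually used for this export may differ.

```python
import torch


class FourHeadWrapper(torch.nn.Module):
    """Hypothetical wrapper that returns the four heads as a plain tuple."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values):
        out = self.model(pixel_values=pixel_values)
        # Assumption: the output attributes match the exported output names.
        return out.logits, out.pred_boxes, out.out_masks, out.order_logits
```

Returning a flat tuple of tensors gives the exporter a fixed, ordered set of graph outputs instead of a structured output object.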
Numerical parity vs torch (random `(1, 3, 800, 800)` input):
| output | max absolute diff |
|---|---|
| `logits` | 1.32e-4 |
| `pred_boxes` | 1.57e-5 |
| `out_masks` | 1.62e-3 |
| `order_logits` | 3.96e-2 |
The `order_logits` deviation reflects accumulated floating-point drift in the decoder's attention; argmax-based reading order is unaffected on the test images we checked.
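A parity check of this kind can be reproduced with two small NumPy helpers. This is a sketch with hypothetical function names; it assumes the torch outputs have already been converted to NumPy arrays, and the argmax axis is a parameter because the README does not state which axis the reading-order argmax uses.

```python
import numpy as np


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    return float(np.abs(reference - candidate).max())


def argmax_agreement(reference, candidate, axis=-1):
    """Fraction of positions where both arrays share the same argmax along axis."""
    same = np.argmax(reference, axis=axis) == np.argmax(candidate, axis=axis)
    return float(np.mean(same))
```

An agreement of `1.0` on `order_logits` would indicate that the drift does not change any argmax decision for that input.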
## Inference speed
CPU (Apple M-series, single page, 800×800 input): **~480 ms/page** with `CPUExecutionProvider`.
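To measure latency on your own hardware, a timing loop like the one below is one option. The helper name, warm-up count, and repeat count are assumptions for illustration; the README does not state how the figure above was measured.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=2, repeats=10):
    """Median wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

With the session from the Loading section, this could be called as `median_latency_ms(lambda: sess.run(None, {"pixel_values": pixel_values}))`.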
## Source
- Original weights: [PaddlePaddle/PP-DocLayoutV3_safetensors](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3_safetensors)
- Original PaddlePaddle release: [PaddlePaddle/PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3)
- Paper: [PaddleOCR-VL-1.5 (arXiv:2601.21957)](https://arxiv.org/abs/2601.21957)
## License
Apache-2.0 (inherited from PaddlePaddle/PP-DocLayoutV3_safetensors).