---
license: apache-2.0
library_name: onnx
tags:
  - onnx
  - object-detection
  - layout-analysis
  - document-understanding
  - paddleocr
base_model: PaddlePaddle/PP-DocLayoutV3_safetensors
---

# PP-DocLayoutV3 — ONNX export

ONNX export of [PaddlePaddle/PP-DocLayoutV3_safetensors](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3_safetensors), the layout-detection model used in the PaddleOCR-VL-1.5 pipeline.

This export preserves all four model outputs: classification logits, bounding boxes, instance-segmentation masks, and reading-order logits. The original PaddlePaddle release turns these four tensors into polygon points and a reading order in a postprocessor; that step is not part of the ONNX graph and has to run separately (see Postprocessing).

## Files

| file | size | purpose |
|---|---|---|
| `PP-DocLayoutV3.onnx` | ~5 MB | model graph (references external weights) |
| `PP-DocLayoutV3.onnx.data` | ~137 MB | weight tensors (must sit alongside `.onnx`) |
| `config.json` | — | original model config (HuggingFace-style) |
| `preprocessor_config.json` | — | image preprocessing parameters (800×800 resize, normalize) |
| `inference.yml` | — | original PaddlePaddle inference config (reference only) |

## Inputs / outputs

**Input** (single tensor):

| name | shape | dtype | notes |
|---|---|---|---|
| `pixel_values` | `(B, 3, 800, 800)` | `float32` | Resize image to 800×800, rescale by `1/255`, mean=`[0,0,0]`, std=`[1,1,1]` (matches `preprocessor_config.json`). |

**Outputs** (four tensors):

| name | shape | notes |
|---|---|---|
| `logits` | `(B, 300, 25)` | per-query class logits over 25 layout classes |
| `pred_boxes` | `(B, 300, 4)` | normalized `(cx, cy, w, h)` — convert via standard DETR decoding |
| `out_masks` | `(B, 300, 200, 200)` | per-query instance-segmentation masks; cv2 contour extraction yields polygon points |
| `order_logits` | `(B, 300, 300)` | per-query permutation logits for reading order; decode with argmax or Sinkhorn normalization |
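
For callers who only need boxes, the `logits` and `pred_boxes` rows above can be decoded in plain numpy. The sketch below handles one image at a time. It assumes per-class sigmoid scores (as in RT-DETR-style detectors) rather than a softmax with a background class, and the `decode_detections` name and `threshold` default are hypothetical; the official postprocessor remains the reference.

```python
import numpy as np


def decode_detections(logits, pred_boxes, target_size, threshold=0.5):
    """Decode one image's (300, 25) logits and (300, 4) boxes to pixel coordinates."""
    height, width = target_size
    # Assumption: independent sigmoid per class, best class per query.
    probs = 1.0 / (1.0 + np.exp(-logits))
    labels = probs.argmax(axis=-1)
    scores = probs.max(axis=-1)

    # Normalized (cx, cy, w, h) -> pixel-space (x1, y1, x2, y2).
    cx, cy, w, h = pred_boxes.T
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    boxes = boxes * np.array([width, height, width, height], dtype=boxes.dtype)

    keep = scores >= threshold  # drop low-confidence queries
    return scores[keep], labels[keep], boxes[keep]
```

For a batch, call it per image, for example `decode_detections(logits[0], pred_boxes[0], (page_height, page_width))`.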

## Postprocessing

The official postprocessor lives in `transformers.models.pp_doclayout_v3.image_processing_pp_doclayout_v3.PPDocLayoutV3ImageProcessor.post_process_object_detection`. It takes the four output tensors plus a `target_sizes` tensor and returns:

```
{
  "scores":         (N,)      float32
  "labels":         (N,)      int64
  "boxes":          (N, 4)    float32 — axis-aligned (x1, y1, x2, y2) in target coords
  "polygon_points": list[N]   each (P, 2) int polygon vertices in target coords
  "order_seq":      (N,)      int64   — reading-order index
}
```

You can use that postprocessor directly (`transformers >= 5.4`, requires `torch` and `cv2`) or port it to numpy + cv2 for a torch-free runtime.

## Loading

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("PP-DocLayoutV3.onnx", providers=["CPUExecutionProvider"])
# preprocess to 800x800 RGB float32, normalize per preprocessor_config.json
pixel_values = ...  # shape (1, 3, 800, 800), float32
logits, pred_boxes, out_masks, order_logits = sess.run(
    ["logits", "pred_boxes", "out_masks", "order_logits"],
    {"pixel_values": pixel_values},
)
```

The `.onnx.data` sidecar is loaded automatically by onnxruntime via the relative `location` reference embedded in the graph. Both files must sit in the same directory.
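
A missing sidecar typically surfaces as a load error when the session is created, so a small check up front can give a clearer message. This is a sketch using only the standard library; `check_model_files` is a hypothetical helper, and it assumes the sidecar is named after the graph file with `.data` appended, as in the Files table.

```python
from pathlib import Path


def check_model_files(model_path: str) -> Path:
    """Raise FileNotFoundError unless the graph and its `.data` sidecar sit together."""
    graph = Path(model_path)
    sidecar = graph.with_name(graph.name + ".data")  # e.g. PP-DocLayoutV3.onnx.data
    for path in (graph, sidecar):
        if not path.is_file():
            raise FileNotFoundError(f"missing model file: {path}")
    return graph
```

Call it before `ort.InferenceSession(...)` and pass the returned path (as a string) to the session.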

## How this was exported

1. `pip install transformers==5.6.2 torch==2.11 onnx==1.21 onnxscript`
2. `model = AutoModelForObjectDetection.from_pretrained("PaddlePaddle/PP-DocLayoutV3_safetensors").eval()`
3. Wrap the model so `forward(pixel_values)` returns `(logits, pred_boxes, out_masks, order_logits)`.
4. `torch.onnx.export(wrapped, (pixel_values,), "PP-DocLayoutV3.onnx", opset_version=18, dynamo=True, dynamic_axes={"pixel_values": {0: "batch"}})`
5. Re-save with `onnx.save(..., save_as_external_data=True, location="PP-DocLayoutV3.onnx.data")` to standardize the sidecar filename.
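
Step 3 can be sketched as a thin `torch.nn.Module` that unpacks the model output into a tuple. The class name `FourHeadWrapper` is hypothetical, and the attribute names on the output object are assumed to match the ONNX output names used in this card; verify them against the model's actual output class before exporting.

```python
import torch


class FourHeadWrapper(torch.nn.Module):
    """Expose the four output tensors as a plain tuple for export."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, pixel_values: torch.Tensor):
        outputs = self.model(pixel_values=pixel_values)
        # Assumption: the output object carries attributes with these names.
        return (
            outputs.logits,
            outputs.pred_boxes,
            outputs.out_masks,
            outputs.order_logits,
        )
```

Returning a flat tuple keeps the exported graph's outputs in a fixed, predictable order.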

Numerical parity against the PyTorch model on a random `(1, 3, 800, 800)` input:

| output | max absolute diff |
|---|---|
| `logits` | 1.32e-4 |
| `pred_boxes` | 1.57e-5 |
| `out_masks` | 1.62e-3 |
| `order_logits` | 3.96e-2 |

The `order_logits` deviation reflects accumulated floating-point drift in the decoder's attention; argmax-based reading order is unaffected on the test images we checked.
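
A parity check of this kind can be reproduced with a few lines of numpy. The sketch below is not the script used for the table above; `max_abs_diffs` is a hypothetical helper that expects both sets of outputs as array-likes in the order `logits`, `pred_boxes`, `out_masks`, `order_logits`.

```python
import numpy as np

OUTPUT_NAMES = ("logits", "pred_boxes", "out_masks", "order_logits")


def max_abs_diffs(reference, candidate):
    """Map each output name to max |reference - candidate| over all elements."""
    diffs = {}
    for name, ref, cand in zip(OUTPUT_NAMES, reference, candidate):
        ref64 = np.asarray(ref, dtype=np.float64)  # compare in float64
        cand64 = np.asarray(cand, dtype=np.float64)
        diffs[name] = float(np.max(np.abs(ref64 - cand64)))
    return diffs
```

Convert the torch outputs with `.detach().numpy()` first, then pass them as `reference` and the `sess.run(...)` results as `candidate`.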

## Inference speed

CPU (Apple M-series, single page, 800×800 input): **~480 ms/page** with `CPUExecutionProvider`.
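
To measure latency on your own hardware, a small timing harness is enough. This is a sketch and not necessarily how the figure above was obtained; `time_ms`, `warmup`, and `repeats` are hypothetical names and defaults.

```python
import statistics
import time


def time_ms(run, warmup=2, repeats=10):
    """Return the median wall-clock time of `run()` in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not timed
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

For example, `time_ms(lambda: sess.run(None, {"pixel_values": pixel_values}))` times inference only and excludes preprocessing and postprocessing.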

## Source

- Original weights: [PaddlePaddle/PP-DocLayoutV3_safetensors](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3_safetensors)
- Original PaddlePaddle release: [PaddlePaddle/PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3)
- Paper: [PaddleOCR-VL-1.5 (arXiv:2601.21957)](https://arxiv.org/abs/2601.21957)

## License

Apache-2.0 (inherited from PaddlePaddle/PP-DocLayoutV3_safetensors).