---
license: mit
datasets:
- ds4sd/DocLayNet
language:
- en
library_name: ultralytics
base_model:
- Ultralytics/YOLO11
pipeline_tag: object-detection
tags:
- object-detection
- document-layout
- yolov11
- ultralytics
- document-layout-analysis
- document-ai
---

# YOLOv11 for Advanced Document Layout Analysis

<p align="center">
  <img src="images/logo.png" alt="Logo" width="100%"/>
</p>

This repository hosts three YOLOv11 models (**nano, small, and medium**) fine-tuned for high-performance **Document Layout Analysis** on the challenging [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet).

The goal is to accurately detect and classify key layout elements in a document, such as text, tables, figures, and titles. This is a fundamental task for document understanding and information extraction pipelines.

### ✨ Model Highlights
*   **🚀 Three Powerful Variants:** Choose from `nano`, `small`, and `medium` models to fit your performance needs.
*   **🎯 High Accuracy:** Trained on the comprehensive DocLayNet dataset to recognize 11 distinct layout types.
*   **⚡ Optimized for Efficiency:** The recommended **`yolo11n` (nano) model** offers an exceptional balance of speed and accuracy, making it ideal for production environments.

---

## 🚀 Get Started

Get up and running with just a few lines of code.

### 1. Installation

First, install the necessary libraries.

```bash
pip install ultralytics huggingface_hub
```

### 2. Inference Example

This Python snippet shows how to download a model from the Hub and run inference on a local document image.

```python
from pathlib import Path
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

# Define the local directory to save models
DOWNLOAD_PATH = Path("./models")
DOWNLOAD_PATH.mkdir(exist_ok=True)

# Choose which model to use
# 0: nano, 1: small, 2: medium
model_files = [
    "yolo11n_doc_layout.pt",
    "yolo11s_doc_layout.pt",
    "yolo11m_doc_layout.pt",
]
selected_model_file = model_files[0] # Using the recommended nano model

# Download the model from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="Armaggheddon/yolo11-document-layout",
    filename=selected_model_file,
    repo_type="model",
    local_dir=DOWNLOAD_PATH,
)

# Initialize the YOLO model
model = YOLO(model_path)

# Run inference on an image
# Replace 'path/to/your/document.jpg' with your file
results = model('path/to/your/document.jpg')

# Process and display results
print(results[0].boxes)  # Print the detected boxes, classes, and confidences
results[0].show()        # Display the image with bounding boxes
```
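If you need the detections as structured data rather than a rendered image, you can read them straight off the `Results` object. Below is a minimal sketch continuing from the snippet above; the attributes used (`boxes`, `names`, `cls`, `conf`, `xyxy`) are standard Ultralytics, but the label-to-id mapping comes from the checkpoint itself, so check `results[0].names` rather than assuming an order:

```python
# Extract each detection as (label, confidence, bounding box)
boxes = results[0].boxes
names = results[0].names  # class-id -> label mapping stored in the checkpoint
for box in boxes:
    cls_id = int(box.cls)                  # class index
    score = float(box.conf)                # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # box corners in pixel coordinates
    print(f"{names[cls_id]}: {score:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
```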

---

## 📊 Model Performance & Evaluation

We fine-tuned three YOLOv11 variants, allowing you to choose the best model for your use case.

*   **`yolo11n_doc_layout.pt` (train4)**: **Recommended.** The nano model offers the best trade-off between speed and accuracy.
*   **`yolo11s_doc_layout.pt` (train5)**: A larger, slightly more accurate model.
*   **`yolo11m_doc_layout.pt` (train6)**: The largest model, providing the highest accuracy with a corresponding increase in computational cost.

As shown in the analysis below, performance gains are marginal when moving from the `small` to the `medium` model, making the `nano` and `small` variants the most practical choices.
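If you want to verify this trade-off on your own hardware, a rough timing sketch along these lines works (the image path is a placeholder, and absolute timings will vary with your device and backend):

```python
import time

from huggingface_hub import hf_hub_download
from ultralytics import YOLO

TEST_IMAGE = "path/to/your/document.jpg"  # replace with a real document image

for filename in ("yolo11n_doc_layout.pt", "yolo11s_doc_layout.pt", "yolo11m_doc_layout.pt"):
    weights = hf_hub_download(repo_id="Armaggheddon/yolo11-document-layout", filename=filename)
    model = YOLO(weights)
    model(TEST_IMAGE, imgsz=1280, verbose=False)  # warm-up run
    start = time.perf_counter()
    result = model(TEST_IMAGE, imgsz=1280, verbose=False)[0]
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{filename}: {len(result.boxes)} detections in {elapsed_ms:.1f} ms")
```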

### Nano vs. Small vs. Medium Comparison

Here's how the three models stack up across key metrics. The plots compare their performance for each document layout label.

| **mAP@50-95** (Strict IoU) | **mAP@50** (Standard IoU) |
| :---: | :---: |
| <img src="images/nsm_map50_95_per_label.png" alt="mAP@50-95" width="400"> | <img src="images/nsm_map50_per_label.png" alt="mAP@50" width="400"> |

| **Precision** (Box Quality) | **Recall** (Detection Coverage) |
| :---: | :---: |
| <img src="images/nsm_box_precision_per_label.png" alt="Precision" width="400"> | <img src="images/nsm_recall_per_label.png" alt="Recall" width="400"> |

<details>
<summary><b>Click to see detailed Training Metrics & Confusion Matrices</b></summary>

| Model | Training Metrics | Normalized Confusion Matrix |
| :---: | :---: | :---: |
| **`yolo11n`** (train4) | <img src="images/t4_results.png" alt="train4 results" height="200"> | <img src="images/t4_confusion_mat_normalized.png" alt="train4 confusion matrix" height="200"> |
| **`yolo11s`** (train5) | <img src="images/t5_results.png" alt="train5 results" height="200"> | <img src="images/t5_confusion_mat_normalized.png" alt="train5 confusion matrix" height="200"> |
| **`yolo11m`** (train6) | <img src="images/t6_results.png" alt="train6 results" height="200"> | <img src="images/t6_confusion_mat_normalized.png" alt="train6 confusion matrix" height="200"> |

</details>

### πŸ† The Champion: Why `train4` (Nano) is the Best Choice

While all nano-family models performed well, a deeper analysis revealed that **`train4`** stands out for its superior **localization quality**.

We compared it against `train9` (another strong nano contender), which achieved a slightly higher recall by sacrificing bounding box precision. For applications where data integrity and accurate object boundaries are critical, `train4` is the clear winner.

**Key Advantages of `train4`:**
1.  **Superior Box Precision:** It delivered significantly more accurate bounding boxes, with a **+9.0%** precision improvement for the `title` class and strong gains for `section-header` and `table`.
2.  **Higher Quality Detections:** It achieved a **+2.4%** mAP50 and **+2.05%** mAP50-95 improvement for the difficult `footnote` class, proving its ability to meet stricter IoU thresholds.

| Box Precision Improvement | mAP50 Improvement | mAP50-95 Improvement |
| :---: | :---: | :---: |
| <img src="images/nbest_box_precision_percentage_improvement_per_label.png" alt="Box Precision Improvement"> | <img src="images/nbest_map50_percentage_improvement_per_label.png" alt="mAP50 Improvement"> | <img src="images/nbest_map50_95_percentage_improvement_per_label.png" alt="mAP50-95 Improvement"> |

In short, `train4` prioritizes **quality over quantity**, making it the most reliable and optimal choice for production systems.
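Relatedly, you can push the precision/recall trade-off further at inference time: Ultralytics' `conf` argument drops low-confidence boxes before they reach you. The 0.5 below is purely illustrative (the library default is 0.25):

```python
# Favor precision: keep only detections with confidence >= 0.5
results = model("path/to/your/document.jpg", conf=0.5)
```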

---

## 📚 About the Dataset: DocLayNet

The models were trained on the [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet), which provides a rich and diverse collection of document images annotated with 11 layout categories:

*   **Text**, **Title**, **Section-header**
*   **Table**, **Picture**, **Caption**
*   **List-item**, **Formula**
*   **Page-header**, **Page-footer**, **Footnote**
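To confirm the exact label set and class-id order stored in a downloaded checkpoint (the ordering is not documented here, so treat any assumed mapping as provisional), you can inspect the model's `names` attribute:

```python
from ultralytics import YOLO

# Path assumes the download location used in the Get Started example
model = YOLO("models/yolo11n_doc_layout.pt")
print(model.names)  # prints the class-id -> label mapping for the 11 categories
```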

**Training Resolution:** All models were trained at **1280x1280** resolution. Initial tests at the default 640x640 resulted in a significant performance drop, especially for smaller elements like `footnote` and `caption`.
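In practice this means you should pass `imgsz=1280` explicitly when predicting, since the library would otherwise default to 640. A one-line sketch reusing the `model` object from the Get Started example:

```python
# Run at the 1280x1280 training resolution; the 640 default noticeably
# hurts small classes such as footnote and caption
results = model("path/to/your/document.jpg", imgsz=1280)
```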

<img src="images/class_distribution.jpg" alt="DocLayNet class distribution" width="500px"/>

---

## 💻 Code & Training Details

This model card focuses on results and usage. For the complete end-to-end pipeline, including training scripts, dataset conversion utilities, and detailed examples, please visit the main GitHub repository:

➡️ **[GitHub Repo: yolo11_doc_layout](https://github.com/Armaggheddon/yolo11_doc_layout)**