|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- ds4sd/DocLayNet |
|
|
language: |
|
|
- en |
|
|
library_name: ultralytics |
|
|
base_model: |
|
|
- Ultralytics/YOLO11 |
|
|
pipeline_tag: object-detection |
|
|
tags: |
|
|
- object-detection |
|
|
- document-layout |
|
|
- yolov11 |
|
|
- ultralytics |
|
|
- document-layout-analysis |
|
|
- document-ai |
|
|
--- |
|
|
|
|
|
# YOLOv11 for Advanced Document Layout Analysis |
|
|
|
|
|
<p align="center"> |
|
|
<img src="images/logo.png" alt="Logo" width="100%"/> |
|
|
</p> |
|
|
|
|
|
This repository hosts three YOLOv11 models (**nano, small, and medium**) fine-tuned for high-performance **Document Layout Analysis** on the challenging [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet). |
|
|
|
|
|
The goal is to accurately detect and classify key layout elements in a document, such as text, tables, figures, and titles. This is a fundamental task for document understanding and information extraction pipelines. |
|
|
|
|
|
### β¨ Model Highlights |
|
|
* **π Three Powerful Variants:** Choose between `nano`, `small`, and `medium` models to fit your performance needs. |
|
|
* **π― High Accuracy:** Trained on the comprehensive DocLayNet dataset to recognize 11 distinct layout types. |
|
|
* β‘ **Optimized for Efficiency:** The recommended **`yolo11n` (nano) model** offers an exceptional balance of speed and accuracy, making it ideal for production environments. |
|
|
|
|
|
--- |
|
|
|
|
|
## π Get Started |
|
|
|
|
|
Get up and running with just a few lines of code. |
|
|
|
|
|
### 1. Installation |
|
|
|
|
|
First, install the necessary libraries. |
|
|
|
|
|
```bash |
|
|
pip install ultralytics huggingface_hub |
|
|
``` |
|
|
|
|
|
### 2. Inference Example |
|
|
|
|
|
This Python snippet shows how to download a model from the Hub and run inference on a local document image. |
|
|
|
|
|
```python |
|
|
from pathlib import Path |
|
|
from huggingface_hub import hf_hub_download |
|
|
from ultralytics import YOLO |
|
|
|
|
|
# Define the local directory to save models |
|
|
DOWNLOAD_PATH = Path("./models") |
|
|
DOWNLOAD_PATH.mkdir(exist_ok=True) |
|
|
|
|
|
# Choose which model to use |
|
|
# 0: nano, 1: small, 2: medium |
|
|
model_files = [ |
|
|
"yolo11n_doc_layout.pt", |
|
|
"yolo11s_doc_layout.pt", |
|
|
"yolo11m_doc_layout.pt", |
|
|
] |
|
|
selected_model_file = model_files[0] # Using the recommended nano model |
|
|
|
|
|
# Download the model from the Hugging Face Hub |
|
|
model_path = hf_hub_download( |
|
|
repo_id="Armaggheddon/yolo11-document-layout", |
|
|
filename=selected_model_file, |
|
|
repo_type="model", |
|
|
local_dir=DOWNLOAD_PATH, |
|
|
) |
|
|
|
|
|
# Initialize the YOLO model |
|
|
model = YOLO(model_path) |
|
|
|
|
|
# Run inference on an image |
|
|
# Replace 'path/to/your/document.jpg' with your file |
|
|
results = model('path/to/your/document.jpg') |
|
|
|
|
|
# Process and display results |
|
|
results[0].print() # Print detection details |
|
|
results[0].show() # Display the image with bounding boxes |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## π Model Performance & Evaluation |
|
|
|
|
|
We fine-tuned three YOLOv11 variants, allowing you to choose the best model for your use case. |
|
|
|
|
|
* **`yolo11n_doc_layout.pt` (train4)**: **Recommended.** The nano model offers the best trade-off between speed and accuracy. |
|
|
* **`yolo11s_doc_layout.pt` (train5)**: A larger, slightly more accurate model. |
|
|
* **`yolo11m_doc_layout.pt` (train6)**: The largest model, providing the highest accuracy with a corresponding increase in computational cost. |
|
|
|
|
|
As shown in the analysis below, performance gains are marginal when moving from the `small` to the `medium` model, making the `nano` and `small` variants the most practical choices. |
|
|
|
|
|
### Nano vs. Small vs. Medium Comparison |
|
|
|
|
|
Here's how the three models stack up across key metrics. The plots compare their performance for each document layout label. |
|
|
|
|
|
| **mAP@50-95** (Strict IoU) | **mAP@50** (Standard IoU) | |
|
|
| :---: | :---: | |
|
|
| <img src="images/nsm_map50_95_per_label.png" alt="mAP@50-95" width="400"> | <img src="images/nsm_map50_per_label.png" alt="mAP@50" width="400"> | |
|
|
|
|
|
| **Precision** (Box Quality) | **Recall** (Detection Coverage) | |
|
|
| :---: | :---: | |
|
|
| <img src="images/nsm_box_precision_per_label.png" alt="Precision" width="400"> | <img src="images/nsm_recall_per_label.png" alt="Recall" width="400"> | |
|
|
|
|
|
<details> |
|
|
<summary><b>Click to see detailed Training Metrics & Confusion Matrices</b></summary> |
|
|
|
|
|
| Model | Training Metrics | Normalized Confusion Matrix | |
|
|
| :---: | :---: | :---: | |
|
|
| **`yolo11n`** (train4) | <img src="images/t4_results.png" alt="train4 results" height="200"> | <img src="images/t4_confusion_mat_normalized.png" alt="train4 confusion matrix" height="200"> | |
|
|
| **`yolo11s`** (train5) | <img src="images/t5_results.png" alt="train5 results" height="200"> | <img src="images/t5_confusion_mat_normalized.png" alt="train5 confusion matrix" height="200"> | |
|
|
| **`yolo11m`** (train6) | <img src="images/t6_results.png" alt="train6 results" height="200"> | <img src="images/t6_confusion_mat_normalized.png" alt="train6 confusion matrix" height="200"> | |
|
|
|
|
|
</details> |
|
|
|
|
|
### π The Champion: Why `train4` (Nano) is the Best Choice |
|
|
|
|
|
While all nano-family models performed well, a deeper analysis revealed that **`train4`** stands out for its superior **localization quality**. |
|
|
|
|
|
We compared it against `train9` (another strong nano contender), which achieved a slightly higher recall by sacrificing bounding box precision. For applications where data integrity and accurate object boundaries are critical, `train4` is the clear winner. |
|
|
|
|
|
**Key Advantages of `train4`:** |
|
|
1. **Superior Box Precision:** It delivered significantly more accurate bounding boxes, with a **+9.0%** precision improvement for the `title` class and strong gains for `section-header` and `table`. |
|
|
2. **Higher Quality Detections:** It achieved a **+2.4%** mAP50 and **+2.05%** mAP50-95 improvement for the difficult `footnote` class, proving its ability to meet stricter IoU thresholds. |
|
|
|
|
|
| Box Precision Improvement | mAP50 Improvement | mAP50-95 Improvement | |
|
|
| :---: | :---: | :---: | |
|
|
| <img src="images/nbest_box_precision_percentage_improvement_per_label.png" alt="Box Precision Improvement"> | <img src="images/nbest_map50_percentage_improvement_per_label.png" alt="mAP50 Improvement"> | <img src="images/nbest_map50_95_percentage_improvement_per_label.png" alt="mAP50-95 Improvement"> | |
|
|
|
|
|
In short, `train4` prioritizes **quality over quantity**, making it the most reliable and optimal choice for production systems. |
|
|
|
|
|
--- |
|
|
|
|
|
## π About the Dataset: DocLayNet |
|
|
|
|
|
The models were trained on the [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet), which provides a rich and diverse collection of document images annotated with 11 layout categories: |
|
|
|
|
|
* **Text**, **Title**, **Section-header** |
|
|
* **Table**, **Picture**, **Caption** |
|
|
* **List-item**, **Formula** |
|
|
* **Page-header**, **Page-footer**, **Footnote** |
|
|
|
|
|
**Training Resolution:** All models were trained at **1280x1280** resolution. Initial tests at the default 640x640 resulted in a significant performance drop, especially for smaller elements like `footnote` and `caption`. |
|
|
|
|
|
<img src="images/class_distribution.jpg" alt="DocLayNet Samples" width="500px"/> |
|
|
|
|
|
--- |
|
|
|
|
|
## π» Code & Training Details |
|
|
|
|
|
This model card focuses on results and usage. For the complete end-to-end pipeline, including training scripts, dataset conversion utilities, and detailed examples, please visit the main GitHub repository: |
|
|
|
|
|
β‘οΈ **[GitHub Repo: yolo11_doc_layout](https://github.com/Armaggheddon/yolo11_doc_layout)** |