🔍 D-FINE X — Object Detection for Document Layout

D-FINE HGNetv2-X fine-tuned for document layout analysis

Overview

This D-FINE HGNetv2-X model is fine-tuned for document layout analysis. It uses the D-FINE architecture, whose Fine-grained Distribution Refinement (FDR) mechanism refines bounding boxes, combined with the HGNetv2-B5 backbone.

Property	Value
Base model	D-FINE (HGNetv2-X)
Task	Object Detection
Domain	Document Layout
Input resolution	640 × 640
Training	80 epochs (72 Stage 1 + 8 Stage 2)

Classes

ID	Label	Description
0	`Object-detection`	Generic detectable region
1	`Figure`	Diagrams, illustrations, charts
2	`Icon`	Small symbolic graphics
3	`Table`	Tabular data structures

Benchmark

✅ Best validation metrics (Stage 2)

Metric	Value
mAP @ [0.50 : 0.95]	0.8375
mAP @ 0.50	0.9182
mAP @ 0.75	0.8966
AP (Medium)	0.8703
AP (Large)	0.8384
AR @ 100	0.9348
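
The metric names above read most naturally under the standard COCO evaluation convention, where mAP @ [0.50 : 0.95] averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below assumes that convention and shows the IoU computation for two xyxy boxes; `box_iou` and `COCO_IOU_THRESHOLDS` are illustrative names, not part of D-FINE.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by mAP @ [0.50 : 0.95].
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """IoU thresholds at which a prediction would count as a match."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]
```

A prediction that overlaps its ground-truth box only loosely contributes to mAP @ 0.50 but not to the stricter thresholds, which is why mAP @ [0.50 : 0.95] is lower than mAP @ 0.50.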

Usage

Option 1 — HuggingFace Transformers (recommended)

pip install transformers torch torchvision
# Clone D-FINE source code (required)
git clone https://github.com/Peterande/D-FINE.git
cd D-FINE
pip install -r requirements.txt

from transformers import AutoModel, AutoImageProcessor
from PIL import Image

processor = AutoImageProcessor.from_pretrained(
    "ducnhan0804/dfine-x-obj-detection",
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "ducnhan0804/dfine-x-obj-detection",
    trust_remote_code=True
)

image = Image.open("path/to/your/document.png")
inputs = processor.preprocess(image)
outputs = model(inputs["images"], threshold=0.4)
results = processor.post_process_object_detection(outputs, threshold=0.4)

# results[0]["boxes"]  → xyxy coordinates (tensor)
# results[0]["scores"] → confidence scores (tensor)
# results[0]["labels"] → class ids (tensor)
print(results)

Note: trust_remote_code=True is required because this model uses custom modeling_dfine.py. The checkpoint (~1GB) is downloaded automatically to `~/.cache/huggingface/` on first run.

Important: You must run this code from the D-FINE root directory so that the source code (src/) can be imported.
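
To turn the raw output into something readable, the class ids can be mapped back to the labels from the Classes table. This is a minimal sketch that assumes `results[0]` has the `boxes`, `scores` and `labels` fields described in the comments above; `ID2LABEL` and `summarize_detections` are hypothetical helpers, not part of the model's remote code.

```python
# Class ids and labels copied from the Classes table of this model card.
ID2LABEL = {0: "Object-detection", 1: "Figure", 2: "Icon", 3: "Table"}


def _to_list(values):
    # Accept tensors or arrays (anything with .tolist()) as well as plain lists.
    return values.tolist() if hasattr(values, "tolist") else list(values)


def summarize_detections(result, min_score=0.4):
    """Return detections above min_score, sorted by descending confidence."""
    boxes = _to_list(result["boxes"])
    scores = _to_list(result["scores"])
    labels = _to_list(result["labels"])

    detections = []
    for box, score, label in zip(boxes, scores, labels):
        if score < min_score:
            continue
        detections.append({
            "label": ID2LABEL.get(int(label), f"unknown({int(label)})"),
            "score": float(score),
            "box": tuple(float(v) for v in box),  # (x1, y1, x2, y2)
        })
    detections.sort(key=lambda d: d["score"], reverse=True)
    return detections
```

With the inference code above, `summarize_detections(results[0])` would give one dictionary per kept detection.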

Option 2 — D-FINE directly

git clone https://github.com/Peterande/D-FINE.git
cd D-FINE
pip install -r requirements.txt

# Download checkpoint from HuggingFace
from huggingface_hub import hf_hub_download

weights = hf_hub_download(
    repo_id="ducnhan0804/dfine-x-obj-detection",
    filename="best_stg2.pth"
)

# Run inference using D-FINE's built-in script
# python tools/inference/torch_inf.py -c configs/dfine/custom/dfine_hgnetv2_x_custom.yml -r <weights_path> -i <image_path> -d cuda:0
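
The download step and the commented command can be joined in one script. The sketch below only assembles the command shown above, with the same `-c`, `-r`, `-i` and `-d` flags, and runs it with `subprocess`. It assumes it is launched from the D-FINE root directory and that the config file exists at the path given in the comment; `build_inference_command` and `run_inference` are hypothetical helper names.

```python
import subprocess
import sys

# Config path taken from the command in this model card.
DEFAULT_CONFIG = "configs/dfine/custom/dfine_hgnetv2_x_custom.yml"


def build_inference_command(weights_path, image_path,
                            device="cuda:0", config=DEFAULT_CONFIG):
    """Build the argument list for D-FINE's torch_inf.py script."""
    return [
        sys.executable, "tools/inference/torch_inf.py",
        "-c", config,
        "-r", str(weights_path),
        "-i", str(image_path),
        "-d", device,
    ]


def run_inference(weights_path, image_path, device="cuda:0"):
    """Run the script from the D-FINE root; raises if it exits with an error."""
    cmd = build_inference_command(weights_path, image_path, device)
    subprocess.run(cmd, check=True)
```

Calling `run_inference(weights, "path/to/your/document.png")` after `hf_hub_download` would then be equivalent to typing the command by hand. Pass `device="cpu"` if no GPU is available.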

Training Configuration

model:
  architecture: D-FINE (HGNetv2-X)
  backbone: HGNetv2-B5

training:
  total_epochs: 35
  stage1_epochs: 30 (heavy augmentation)
  stage2_epochs: 5 (fine-tuning, no heavy augmentation)
  resolution: 640
  device: cuda

optimizer:
  type: AdamW
  lr: 0.00025
  backbone_lr: 0.0000025
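
The optimizer section gives the backbone a learning rate 100 times smaller than the rest of the model. One common way to express this in PyTorch is with per-parameter groups. The sketch below is an illustration of that idea, not the D-FINE training code: it assumes backbone parameters can be recognised by the substring "backbone" in their names, and `build_param_groups` is a hypothetical helper.

```python
def build_param_groups(named_params, lr=0.00025, backbone_lr=0.0000025):
    """Split (name, parameter) pairs into backbone and non-backbone groups.

    Assumption: backbone parameters contain "backbone" in their name.
    """
    backbone, rest = [], []
    for name, param in named_params:
        (backbone if "backbone" in name else rest).append(param)

    groups = []
    if backbone:
        groups.append({"params": backbone, "lr": backbone_lr})
    if rest:
        groups.append({"params": rest, "lr": lr})
    return groups
```

With a PyTorch model, the result could be passed straight to the optimizer, for example `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.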

Dataset split (COCO format)

datasets/
├── train/
│   ├── _annotations.coco.json
│   └── *.jpg
└── valid/
    ├── _annotations.coco.json
    └── *.jpg
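
Before training or evaluating on a dataset laid out like this, it can help to check how many annotations each class has. The sketch below reads a `_annotations.coco.json` file using only the standard COCO keys (`categories`, `annotations`, `category_id`); `count_annotations_per_category` is an illustrative helper, not part of D-FINE.

```python
import json
from collections import Counter
from pathlib import Path


def count_annotations_per_category(annotation_file):
    """Count annotations per category name in a COCO-format JSON file."""
    data = json.loads(Path(annotation_file).read_text(encoding="utf-8"))

    # Map category ids to names; fall back to the raw id if a name is missing.
    id_to_name = {c["id"]: c["name"] for c in data.get("categories", [])}
    counts = Counter(
        id_to_name.get(a["category_id"], str(a["category_id"]))
        for a in data.get("annotations", [])
    )
    return dict(counts)
```

For example, `count_annotations_per_category("datasets/train/_annotations.coco.json")` returns a dictionary from label to annotation count, which makes class imbalance easy to spot.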

Repository Structure

ducnhan0804/dfine-x-obj-detection
├── README.md                        # Model card
├── config.json                      # HuggingFace model config
├── preprocessor_config.json         # Image processor config
├── modeling_dfine.py                # Custom modeling code (trust_remote_code)
├── best_stg2.pth                    # Best checkpoint (Stage 2)
├── dfine_hgnetv2_x_custom.yml      # D-FINE training config
└── custom_detection.yml             # Dataset config

Limitations

Trained on a specific document layout dataset; generalization to other document types is not guaranteed.
Requires D-FINE source code to be cloned and available locally for inference.
The checkpoint file is ~1GB in size.

License

This model is released under the Apache License 2.0, inherited from the D-FINE base model.

Citation

Base model (D-FINE):

@article{peng2024d,
  title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
  author={Peng, Yansong and Li, Hebei and Wu, Peixi and Zhang, Yueyi and Sun, Xiaoyan and Wu, Feng},
  journal={arXiv preprint arXiv:2410.13842},
  year={2024}
}
