Model Card: nevernever69/dit-doclaynet-segmentation
Model Overview
This model is a fine-tuned version of microsoft/dit-base for semantic segmentation of document layouts. It was trained on nevernever69/small-DocLayNet-v1.1, a small subset of the DocLayNet dataset, and assigns each pixel of a scanned document image to one of 11 classes: background plus ten layout categories such as title, paragraph, table, and footer.
Intended Uses
- Segment document images into structured layout elements
- Assist in downstream tasks like document OCR, archiving, and automatic annotation
- Support researchers and developers working in document AI or digital humanities
Labels (11 Classes)
| ID | Label | Color |
|----|-------|-------|
| 0 | Background | Black |
| 1 | Title | Red |
| 2 | Paragraph | Green |
| 3 | Figure | Blue |
| 4 | Table | Yellow |
| 5 | List | Magenta |
| 6 | Header | Cyan |
| 7 | Footer | Dark Red |
| 8 | Page Number | Dark Green |
| 9 | Footnote | Dark Blue |
| 10 | Caption | Olive |
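The label table can be written down as a lookup for post-processing. The following is a minimal sketch, not the author's code: `ID2LABEL` and `class_coverage` are hypothetical names, and the mask is assumed to be a 2-D grid of integer class IDs (for example, a NumPy mask converted with `.tolist()`).

```python
from collections import Counter

# Class IDs and names copied from the label table.
ID2LABEL = {
    0: "Background", 1: "Title", 2: "Paragraph", 3: "Figure",
    4: "Table", 5: "List", 6: "Header", 7: "Footer",
    8: "Page Number", 9: "Footnote", 10: "Caption",
}

def class_coverage(mask_rows):
    """Return {label: fraction of pixels} for every class present in the mask.

    mask_rows is a 2-D sequence of integer class IDs (rows of pixels).
    """
    counts = Counter(class_id for row in mask_rows for class_id in row)
    total = sum(counts.values())
    return {ID2LABEL[class_id]: n / total for class_id, n in sorted(counts.items())}

# Tiny 2x4 example: mostly background, one title pixel, two table pixels.
example = [[0, 0, 1, 0],
           [4, 4, 0, 0]]
print(class_coverage(example))
```

A summary like this gives a quick sanity check on a prediction, for instance to spot pages where nearly every pixel is labelled background.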
Training Details
- Base model: microsoft/dit-base
- Dataset: nevernever69/small-DocLayNet-v1.1
- Input size: 1025×1025 (resized to 56×56 masks during training)
- Batch size: 8
- Epochs: 2
- Learning rate: 5e-5
- Loss function: Cross-entropy
- Hardware: Trained with mixed precision (fp16) on GPU
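The input-size entry implies that full-resolution label masks are shrunk to a 56×56 grid before the loss is computed. The card does not say how this is done. The sketch below assumes nearest-neighbour sampling, a common choice for label masks because it never produces class IDs that were absent from the original. `resize_mask_nearest` is a hypothetical helper written in plain Python for clarity.

```python
def resize_mask_nearest(mask_rows, out_h, out_w):
    """Resize a 2-D grid of class IDs by picking the source pixel at the scaled index.

    Assumption: this is an illustration, not the resizing used to train the model.
    """
    in_h, in_w = len(mask_rows), len(mask_rows[0])
    return [
        [mask_rows[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]

# 4x4 mask with a Table region (ID 4) in the top-right quadrant
# and a Paragraph region (ID 2) in the bottom-left, shrunk to 2x2.
mask = [[0, 0, 4, 4],
        [0, 0, 4, 4],
        [2, 2, 0, 0],
        [2, 2, 0, 0]]
small = resize_mask_nearest(mask, 2, 2)
print(small)  # [[0, 4], [2, 0]]
```

Interpolating modes such as bilinear would blend neighbouring IDs into values that correspond to no real class, which is why they are usually avoided for masks.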
Evaluation
Evaluation so far is qualitative. On a validation subset, the model separates distinct document elements with clear boundaries, and overlay visualizations show the segmentation holding up on both dense and sparse regions of historical and modern documents.
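To complement visual checks like these, per-class intersection over union (IoU) is a common metric for semantic segmentation. The following is a minimal sketch under that assumption, not a measurement of this model: `per_class_iou` is a hypothetical helper, and both inputs are assumed to be 2-D grids of class IDs with the same shape.

```python
def per_class_iou(pred_rows, target_rows, num_classes=11):
    """Return {class_id: IoU} for classes that occur in the prediction or the target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred_rows, target_rows):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel counts once toward both intersection and union of its class.
                intersection[p] += 1
                union[p] += 1
            else:
                # Mismatch: the pixel enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    return {c: intersection[c] / union[c] for c in range(num_classes) if union[c] > 0}

# Toy data for illustration only.
pred = [[0, 0, 1, 1],
        [0, 2, 2, 2]]
target = [[0, 0, 1, 0],
          [0, 2, 2, 2]]
ious = per_class_iou(pred, target)
mean_iou = sum(ious.values()) / len(ious)
print(ious, mean_iou)
```

Classes that appear in neither grid are left out of the result so they do not distort the mean.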
How to Use
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

# Use a GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (batch, num_labels, height, width)

# Upsample logits to the original image size; PIL size is (width, height), so reverse it
upsampled = torch.nn.functional.interpolate(logits, size=image.size[::-1], mode="bilinear", align_corners=False)
# Per-pixel class IDs as a 2-D NumPy array
mask = upsampled.argmax(dim=1).squeeze().cpu().numpy()
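For downstream steps such as OCR, it can help to turn the per-pixel mask into regions. The sketch below is one simple option and not part of the model's API: `class_bounding_boxes` is a hypothetical helper that takes the mask as nested lists (for example `mask.tolist()`) and returns one box per class covering all of that class's pixels. It does not separate multiple instances of the same class.

```python
def class_bounding_boxes(mask_rows, ignore_id=0):
    """Return {class_id: (x_min, y_min, x_max, y_max)} in inclusive pixel coordinates.

    Pixels whose ID equals ignore_id (background by default) are skipped.
    """
    boxes = {}
    for y, row in enumerate(mask_rows):
        for x, class_id in enumerate(row):
            if class_id == ignore_id:
                continue
            if class_id not in boxes:
                boxes[class_id] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[class_id]
                boxes[class_id] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

# Toy mask: a Title strip (ID 1) at the top and a Table block (ID 4) below it.
example = [[0, 1, 1, 0],
           [0, 0, 0, 0],
           [4, 4, 4, 0],
           [4, 4, 4, 0]]
print(class_bounding_boxes(example))
```

Each box can then be used to crop the original image before passing the region to an OCR engine.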
Author
Created by Never @nevernever69.
Feel free to open issues or discuss improvements on the Hugging Face hub.
Citation
If you use this model in your work, please consider citing:
@misc{never2025doclaynetseg,
author = {Never},
title = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
year = {2025},
howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-segmentation", model="nevernever69/dit-doclaynet-segmentation")