---
library_name: transformers
license: mit
datasets:
- nevernever69/small-DocLayNet-v1.1
pipeline_tag: image-segmentation
---
# Model Card: `nevernever69/dit-doclaynet-segmentation`
## Model Overview
This model is a fine-tuned version of [microsoft/dit-base](https://huggingface.co/microsoft/dit-base) for **document layout semantic segmentation** on the [DocLayNet](https://huggingface.co/datasets/ibm/DocLayNet) dataset (small subset: `nevernever69/small-DocLayNet-v1.1`). It segments scanned document images into 11 layout categories such as title, paragraph, table, and footer.
## Intended Uses
- Segment document images into structured layout elements
- Assist in downstream tasks like document OCR, archiving, and automatic annotation
- Useful for researchers and developers working in document AI or digital humanities
## Labels (11 Classes)
| ID | Label | Color |
|----|--------------|--------------|
| 0 | Background | Black |
| 1 | Title | Red |
| 2 | Paragraph | Green |
| 3 | Figure | Blue |
| 4 | Table | Yellow |
| 5 | List | Magenta |
| 6 | Header | Cyan |
| 7 | Footer | Dark Red |
| 8 | Page Number | Dark Green |
| 9 | Footnote | Dark Blue |
| 10 | Caption | Olive |
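To render a predicted mask with the colors in the table above, a lookup-table approach is usually enough. The sketch below is illustrative only: the RGB triples are assumed values for the color names listed (the exact shades used by the author are not documented), and `PALETTE` and `colorize_mask` are hypothetical names.

```python
import numpy as np

# Assumed RGB values for the color names in the label table;
# the exact shades are not specified in this model card.
PALETTE = np.array([
    [0, 0, 0],        # 0  Background  (Black)
    [255, 0, 0],      # 1  Title       (Red)
    [0, 255, 0],      # 2  Paragraph   (Green)
    [0, 0, 255],      # 3  Figure      (Blue)
    [255, 255, 0],    # 4  Table       (Yellow)
    [255, 0, 255],    # 5  List        (Magenta)
    [0, 255, 255],    # 6  Header      (Cyan)
    [128, 0, 0],      # 7  Footer      (Dark Red)
    [0, 128, 0],      # 8  Page Number (Dark Green)
    [0, 0, 128],      # 9  Footnote    (Dark Blue)
    [128, 128, 0],    # 10 Caption     (Olive)
], dtype=np.uint8)


def colorize_mask(mask):
    """Map an (H, W) array of class IDs to an (H, W, 3) uint8 RGB image."""
    mask = np.asarray(mask)
    if mask.min() < 0 or mask.max() >= len(PALETTE):
        raise ValueError("mask contains class IDs outside 0-10")
    # Integer-array indexing looks up one RGB row per pixel
    return PALETTE[mask]
```

The result can be passed to `PIL.Image.fromarray` and blended with the source page to produce an overlay.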
## Training Details
- **Base model**: `microsoft/dit-base`
- **Dataset**: [`nevernever69/small-DocLayNet-v1.1`](https://huggingface.co/datasets/nevernever69/small-DocLayNet-v1.1)
- **Input size**: 1025×1025 (resized to 56×56 masks during training)
- **Batch size**: 8
- **Epochs**: 2
- **Learning rate**: 5e-5
- **Loss function**: Cross-entropy
- **Hardware**: Trained with mixed precision (`fp16`) on GPU
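The list above mentions that label masks are resized to 56×56 during training. How this was done is not stated; a common choice, assumed here, is nearest-neighbor resampling, because interpolating class IDs would create values that belong to no class. The helper name `downsize_mask` is hypothetical.

```python
import numpy as np
from PIL import Image


def downsize_mask(mask, size=(56, 56)):
    """Resize an (H, W) class-ID mask to `size` (width, height).

    Nearest-neighbor sampling is an assumption, not the documented
    training procedure; it keeps every output pixel a valid class ID.
    """
    img = Image.fromarray(np.asarray(mask, dtype=np.uint8))  # 8-bit grayscale
    small = img.resize(size, resample=Image.NEAREST)
    return np.asarray(small)
```

For example, a 1025×1025 mask containing only classes 0 and 4 yields a 56×56 array that still contains only those two IDs.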
## Evaluation
Qualitative inspection on a validation subset suggests the model separates distinct document elements with clear boundaries. Overlay visualizations show plausible semantic segmentation of both dense and sparse regions in historical and modern documents. No quantitative metrics (for example, mean IoU) are reported.
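Since the results above are described qualitatively, readers may want a number to compare against their own data. A minimal sketch of mean intersection-over-union (mIoU), a standard semantic segmentation metric, follows. It is not the author's evaluation code, and `mean_iou` is a hypothetical helper; classes absent from both prediction and ground truth are skipped.

```python
import numpy as np


def mean_iou(pred, target, num_classes=11):
    """Mean IoU over classes that appear in `pred` or `target`."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions
    cm = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    present = union > 0  # ignore classes missing from both masks
    return float((intersection[present] / union[present]).mean())
```

Both arguments are arrays of class IDs with the same shape, such as a predicted mask and a ground-truth mask resized to the same resolution.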
## How to Use
```python
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

# Use a GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and image processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

# Load and preprocess image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # (batch, num_classes, h, w)

# Upsample logits to the original image size (PIL size is (width, height))
upsampled = torch.nn.functional.interpolate(logits, size=image.size[::-1], mode="bilinear", align_corners=False)
mask = upsampled.argmax(dim=1).squeeze().cpu().numpy()  # (height, width) array of class IDs
```
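Once `mask` is available, a quick sanity check is to see which layout classes were predicted and how much of the page each covers. The sketch below is one way to do that; `ID2LABEL` is copied from the label table in this card, and `layout_summary` is a hypothetical helper rather than part of the model's API.

```python
import numpy as np

# Class names copied from the label table in this model card
ID2LABEL = {
    0: "Background", 1: "Title", 2: "Paragraph", 3: "Figure",
    4: "Table", 5: "List", 6: "Header", 7: "Footer",
    8: "Page Number", 9: "Footnote", 10: "Caption",
}


def layout_summary(mask, id2label=ID2LABEL):
    """Return {label: fraction of pixels} for each class present in `mask`."""
    flat = np.asarray(mask, dtype=np.int64).ravel()
    counts = np.bincount(flat, minlength=len(id2label))
    total = counts.sum()
    return {
        id2label[i]: float(counts[i] / total)
        for i in range(len(id2label))
        if counts[i] > 0
    }
```

A page dominated by `Background` with small shares for `Paragraph` and `Title` is what one would typically expect from a text-heavy scan.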
## Author
Created by **Never** [`@nevernever69`](https://huggingface.co/nevernever69).
Feel free to open issues or discuss improvements on the Hugging Face hub.
## Citation
If you use this model in your work, please consider citing:
```bibtex
@misc{never2025doclaynetseg,
author = {Never},
title = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
year = {2025},
howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
```