---
library_name: transformers
license: mit
datasets:
- nevernever69/small-DocLayNet-v1.1
pipeline_tag: image-segmentation
---

# 🧾 Model Card: `nevernever69/dit-doclaynet-segmentation`

## 🧠 Model Overview

This model is a fine-tuned version of [microsoft/dit-base](https://huggingface.co/microsoft/dit-base) for **document layout semantic segmentation** on the [DocLayNet](https://huggingface.co/datasets/ibm/DocLayNet) dataset (small subset: `nevernever69/small-DocLayNet-v1.1`). It segments scanned document images into 11 layout categories such as title, paragraph, table, and footer.

## πŸ“š Intended Uses

- Segment document images into structured layout elements
- Assist in downstream tasks like document OCR, archiving, and automatic annotation
- Useful for researchers and developers working in document AI or digital humanities

## 🏷️ Labels (11 Classes)

| ID | Label        | Color        |
|----|--------------|--------------|
| 0  | Background   | Black        |
| 1  | Title        | Red          |
| 2  | Paragraph    | Green        |
| 3  | Figure       | Blue         |
| 4  | Table        | Yellow       |
| 5  | List         | Magenta      |
| 6  | Header       | Cyan         |
| 7  | Footer       | Dark Red     |
| 8  | Page Number  | Dark Green   |
| 9  | Footnote     | Dark Blue    |
| 10 | Caption      | Olive        |
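
For programmatic use, the table above can be written as `id2label` and palette dictionaries. This is an illustrative sketch: the RGB triples below approximate the color names in the table and are not taken from the model's `config.json`, which should be treated as the authoritative source for the label mapping.

```python
# Class-ID to label mapping, mirroring the table above.
ID2LABEL = {
    0: "Background", 1: "Title", 2: "Paragraph", 3: "Figure",
    4: "Table", 5: "List", 6: "Header", 7: "Footer",
    8: "Page Number", 9: "Footnote", 10: "Caption",
}

# Approximate RGB values for the color names above (illustrative only).
PALETTE = {
    0: (0, 0, 0),       # Black
    1: (255, 0, 0),     # Red
    2: (0, 255, 0),     # Green
    3: (0, 0, 255),     # Blue
    4: (255, 255, 0),   # Yellow
    5: (255, 0, 255),   # Magenta
    6: (0, 255, 255),   # Cyan
    7: (139, 0, 0),     # Dark Red
    8: (0, 100, 0),     # Dark Green
    9: (0, 0, 139),     # Dark Blue
    10: (128, 128, 0),  # Olive
}
```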

## πŸ§ͺ Training Details

- **Base model**: `microsoft/dit-base`
- **Dataset**: [`nevernever69/small-DocLayNet-v1.1`](https://huggingface.co/datasets/nevernever69/small-DocLayNet-v1.1)
- **Input size**: 1025×1025 images (label masks resized to 56×56 during training)
- **Batch size**: 8
- **Epochs**: 2
- **Learning rate**: 5e-5
- **Loss function**: Cross-entropy
- **Hardware**: Trained with mixed precision (`fp16`) on GPU
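
As a rough sketch of how the objective described above could be wired up (hypothetical function and variable names; downsampling the ground-truth masks to the 56×56 logit grid with nearest-neighbor interpolation is an assumption, not a detail confirmed by the card):

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, masks):
    """Cross-entropy between predicted logits and integer label masks.

    logits: (B, num_classes, 56, 56) raw model outputs
    masks:  (B, H, W) class-ID masks at input resolution
    """
    # Downsample ground-truth masks to the logit resolution.
    # Nearest-neighbor keeps the values valid integer class IDs.
    masks = F.interpolate(
        masks.unsqueeze(1).float(), size=logits.shape[-2:], mode="nearest"
    ).squeeze(1).long()
    return F.cross_entropy(logits, masks)

# Example with random tensors (2 images, 11 classes)
logits = torch.randn(2, 11, 56, 56)
masks = torch.randint(0, 11, (2, 1024, 1024))
loss = segmentation_loss(logits, masks)
```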

## πŸ“Š Evaluation

Qualitative evaluation on a validation subset shows the model capturing distinct document elements with clear boundaries; overlay visualizations indicate coherent segmentation of both dense and sparse regions in historical and modern documents. Quantitative metrics (e.g., mIoU) have not yet been reported.

## πŸš€ How to Use

```python
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

# Load and preprocess image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # (1, num_labels, h, w)
    # Upsample logits to the original image size.
    # PIL's image.size is (W, H), so reverse it to get (H, W).
    upsampled = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False
    )
    mask = upsampled.argmax(dim=1).squeeze().cpu().numpy()
```
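
To inspect the prediction, the class-ID mask can be colorized and blended with the page image. This is a minimal sketch using hypothetical helper names; the palette entries approximate the color names in the labels table and should be extended to all 11 classes.

```python
import numpy as np
from PIL import Image

def colorize_mask(mask, palette):
    """Map a (H, W) array of class IDs to an RGB PIL image."""
    rgb = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for class_id, color in palette.items():
        rgb[mask == class_id] = color
    return Image.fromarray(rgb)

# Example: blend a dummy mask with a blank page at 50% opacity
palette = {0: (0, 0, 0), 1: (255, 0, 0)}  # extend with the remaining classes
mask = np.zeros((64, 64), dtype=np.int64)
mask[:16, :] = 1  # pretend the top strip is a Title
overlay = Image.blend(
    Image.new("RGB", (64, 64), "white"), colorize_mask(mask, palette), 0.5
)
```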

## πŸ§‘β€πŸŽ“ Author

Created by **Never** [`@nevernever69`](https://huggingface.co/nevernever69).  
Feel free to open issues or discuss improvements on the Hugging Face Hub.

## πŸ“ Citation

If you use this model in your work, please consider citing:

```bibtex
@misc{never2025doclaynetseg,
  author = {Never},
  title = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
  year = {2025},
  howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
```