---
library_name: transformers
license: mit
datasets:
- nevernever69/small-DocLayNet-v1.1
pipeline_tag: image-segmentation
---

# 🧾 Model Card: `nevernever69/dit-doclaynet-segmentation`

## 🧠 Model Overview

This model is a fine-tuned version of [microsoft/dit-base](https://huggingface.co/microsoft/dit-base) for **document layout semantic segmentation**. It was trained on a small subset of the [DocLayNet](https://huggingface.co/datasets/ibm/DocLayNet) dataset (`nevernever69/small-DocLayNet-v1.1`). It assigns every pixel of a document image to one of 11 classes: background plus 10 layout categories such as title, paragraph, table, and footer.

## 📚 Intended Uses

- Segment document images into structured layout elements
- Assist in downstream tasks like document OCR, archiving, and automatic annotation
- Support researchers and developers working in document AI or digital humanities

## 🏷️ Labels (11 Classes)

| ID | Label        | Color        |
|----|--------------|--------------|
| 0  | Background   | Black        |
| 1  | Title        | Red          |
| 2  | Paragraph    | Green        |
| 3  | Figure       | Blue         |
| 4  | Table        | Yellow       |
| 5  | List         | Magenta      |
| 6  | Header       | Cyan         |
| 7  | Footer       | Dark Red     |
| 8  | Page Number  | Dark Green   |
| 9  | Footnote     | Dark Blue    |
| 10 | Caption      | Olive        |
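
The table above can be turned into a lookup for coloring predicted masks. The following is a minimal sketch: the IDs and label names come from the table, but the exact RGB values are an assumption chosen to match the color names, since this card does not list them. `ID2LABEL`, `PALETTE`, and `colorize` are illustrative names, not part of the model repository.

```python
import numpy as np

# Class IDs and names as listed in the label table.
ID2LABEL = {
    0: "Background", 1: "Title", 2: "Paragraph", 3: "Figure", 4: "Table",
    5: "List", 6: "Header", 7: "Footer", 8: "Page Number", 9: "Footnote",
    10: "Caption",
}

# ASSUMPTION: RGB values picked to match the color names in the table;
# the card does not specify exact values.
PALETTE = np.array([
    [0, 0, 0],        # 0  Background  (Black)
    [255, 0, 0],      # 1  Title       (Red)
    [0, 255, 0],      # 2  Paragraph   (Green)
    [0, 0, 255],      # 3  Figure      (Blue)
    [255, 255, 0],    # 4  Table       (Yellow)
    [255, 0, 255],    # 5  List        (Magenta)
    [0, 255, 255],    # 6  Header      (Cyan)
    [128, 0, 0],      # 7  Footer      (Dark Red)
    [0, 128, 0],      # 8  Page Number (Dark Green)
    [0, 0, 128],      # 9  Footnote    (Dark Blue)
    [128, 128, 0],    # 10 Caption     (Olive)
], dtype=np.uint8)


def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer array of class IDs to an (H, W, 3) uint8 RGB image."""
    return PALETTE[mask]
```

Calling `colorize(mask)` on an integer mask of class IDs returns an RGB array that can be passed to `PIL.Image.fromarray` for saving or overlaying on the page image.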

## 🧪 Training Details

- **Base model**: `microsoft/dit-base`
- **Dataset**: [`nevernever69/small-DocLayNet-v1.1`](https://huggingface.co/datasets/nevernever69/small-DocLayNet-v1.1)
- **Input size**: 1025×1025 source images; segmentation masks resized to 56×56 during training
- **Batch size**: 8
- **Epochs**: 2
- **Learning rate**: 5e-5
- **Loss function**: Cross-entropy
- **Hardware**: GPU, with mixed-precision (`fp16`) training
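
The mask resizing mentioned under input size can be sketched as follows. This card does not say which interpolation was used, so the sketch assumes nearest-neighbor sampling, which is the usual choice for label masks because it never produces class IDs that were absent from the original. `resize_mask_nearest` is a hypothetical helper, not the actual training code.

```python
import numpy as np


def resize_mask_nearest(mask: np.ndarray, out_h: int = 56, out_w: int = 56) -> np.ndarray:
    """Resize an (H, W) class-ID mask by nearest-neighbor sampling.

    ASSUMPTION: nearest-neighbor is used here for illustration; the
    interpolation used during training is not documented.
    """
    in_h, in_w = mask.shape
    # For each output row/column, pick the source index at the same relative position.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    # Broadcasting the two index vectors selects an (out_h, out_w) grid of pixels.
    return mask[rows[:, None], cols[None, :]]
```

Averaging interpolations such as bilinear would blend neighboring IDs (for example, turning a boundary between classes 2 and 4 into class 3), which is why they are normally avoided for ground-truth masks.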

## 📊 Evaluation

Evaluation so far is qualitative. On a validation subset, overlay visualizations show the predicted masks separating distinct layout elements with clear boundaries, in both dense and sparse regions of historical and modern documents. This card does not report quantitative metrics such as mean IoU.
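
To put numbers on a qualitative assessment like the one above, a common metric for semantic segmentation is per-class intersection over union (IoU). The sketch below assumes you have a predicted mask and a ground-truth mask of the same shape, both holding class IDs from 0 to 10; `per_class_iou` and `mean_iou` are illustrative helpers, and no results from them are reported here.

```python
import numpy as np


def per_class_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 11) -> np.ndarray:
    """Return the IoU for each class; classes absent from both masks are NaN."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union > 0:
            ious[c] = np.logical_and(pred_c, target_c).sum() / union
    return ious


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 11) -> float:
    """Average the per-class IoU over the classes that actually occur."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))
```

Classes that appear in neither mask are skipped in the mean, so a page without tables, for instance, does not drag the score down.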

## 🚀 How to Use

```python
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

# Load the model and its image processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")

# Use a GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Load and preprocess the image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_classes, h, w)
    # PIL's image.size is (width, height); interpolate expects (height, width)
    upsampled = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False
    )
    # Per-pixel class IDs as an (H, W) NumPy array
    mask = upsampled.argmax(dim=1).squeeze(0).cpu().numpy()
```

## 🧑‍🎓 Author

Created by **Never** [`@nevernever69`](https://huggingface.co/nevernever69).  
Feel free to open issues or discuss improvements on the Hugging Face Hub.

## 📝 Citation

If you use this model in your work, please consider citing:

```bibtex
@misc{never2025doclaynetseg,
  author = {Never},
  title = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
  year = {2025},
  howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
```