---
library_name: transformers
license: mit
datasets:
- nevernever69/small-DocLayNet-v1.1
pipeline_tag: image-segmentation
---
|
|
|
|
|
# 🧾 Model Card: `nevernever69/dit-doclaynet-segmentation`
|
|
|
|
|
## 🧠 Model Overview
|
|
|
|
|
This model is a fine-tuned version of [microsoft/dit-base](https://huggingface.co/microsoft/dit-base) for **document layout semantic segmentation** on the [DocLayNet](https://huggingface.co/datasets/ibm/DocLayNet) dataset (small subset: `nevernever69/small-DocLayNet-v1.1`). It segments scanned document images into 11 layout categories such as title, paragraph, table, and footer. |
|
|
|
|
|
## 📌 Intended Uses
|
|
|
|
|
- Segment document images into structured layout elements |
|
|
- Assist in downstream tasks like document OCR, archiving, and automatic annotation |
|
|
- Useful for researchers and developers working in document AI or digital humanities |
|
|
|
|
|
## 🏷️ Labels (11 Classes)
|
|
|
|
|
| ID | Label       | Color      |
|----|-------------|------------|
| 0  | Background  | Black      |
| 1  | Title       | Red        |
| 2  | Paragraph   | Green      |
| 3  | Figure      | Blue       |
| 4  | Table       | Yellow     |
| 5  | List        | Magenta    |
| 6  | Header      | Cyan       |
| 7  | Footer      | Dark Red   |
| 8  | Page Number | Dark Green |
| 9  | Footnote    | Dark Blue  |
| 10 | Caption     | Olive      |
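
These colors translate directly into a lookup palette for rendering predicted masks. A minimal sketch; the ID-to-color pairing comes from the table above, but the exact RGB values and the `colorize` helper are illustrative assumptions:

```python
import numpy as np

# Illustrative RGB palette matching the label table; the exact shades are
# assumptions, only the ID-to-color pairing is given by the model card.
PALETTE = np.array([
    (0, 0, 0),       # 0  Background: black
    (255, 0, 0),     # 1  Title: red
    (0, 255, 0),     # 2  Paragraph: green
    (0, 0, 255),     # 3  Figure: blue
    (255, 255, 0),   # 4  Table: yellow
    (255, 0, 255),   # 5  List: magenta
    (0, 255, 255),   # 6  Header: cyan
    (139, 0, 0),     # 7  Footer: dark red
    (0, 100, 0),     # 8  Page Number: dark green
    (0, 0, 139),     # 9  Footnote: dark blue
    (128, 128, 0),   # 10 Caption: olive
], dtype=np.uint8)

def colorize(mask):
    """Map an (H, W) array of class IDs to an (H, W, 3) RGB image."""
    return PALETTE[mask]
```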
|
|
|
|
|
## 🧪 Training Details
|
|
|
|
|
- **Base model**: `microsoft/dit-base` |
|
|
- **Dataset**: [`nevernever69/small-DocLayNet-v1.1`](https://huggingface.co/datasets/nevernever69/small-DocLayNet-v1.1) |
|
|
- **Input size**: 1025×1025 pixels (ground-truth masks resized to 56×56 during training)
|
|
- **Batch size**: 8 |
|
|
- **Epochs**: 2 |
|
|
- **Learning rate**: 5e-5 |
|
|
- **Loss function**: Cross-entropy (see the fine-tuning sketch below)
|
|
- **Hardware**: Trained with mixed precision (`fp16`) on GPU |
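
Taken together, these settings correspond to a short fine-tuning loop like the sketch below. This is a reconstruction under stated assumptions, not the author's actual script: `train_loader` and its `pixel_values`/`labels` keys are hypothetical, while the base checkpoint, class count, learning rate, epoch count, loss, and `fp16` setup follow the list above.

```python
import torch
from transformers import BeitForSemanticSegmentation

# DiT uses the BEiT architecture, so the BEiT segmentation head applies;
# the decode head is freshly initialized for the 11 layout classes.
device = "cuda"
model = BeitForSemanticSegmentation.from_pretrained(
    "microsoft/dit-base", num_labels=11
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision

model.train()
for epoch in range(2):
    # Hypothetical loader: batches of 8 with "pixel_values" (B, 3, H, W)
    # and "labels" (B, 56, 56) int64 class masks.
    for batch in train_loader:
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            # BeitForSemanticSegmentation upsamples its logits to the label
            # resolution and computes cross-entropy internally.
            outputs = model(
                pixel_values=batch["pixel_values"].to(device),
                labels=batch["labels"].to(device),
            )
        scaler.scale(outputs.loss).backward()
        scaler.step(optimizer)
        scaler.update()
```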
|
|
|
|
|
## 📊 Evaluation
|
|
|
|
|
The model shows promising qualitative results on a validation subset, capturing distinct document elements with clear boundaries. Overlay visualizations show clean separation of dense and sparse layout regions in both historical and modern documents; no quantitative metrics have been reported yet.
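
To put numbers behind the qualitative inspection, per-class intersection-over-union (IoU) is the standard metric for layout segmentation. A minimal sketch, assuming `pred` and `target` are integer class masks of equal shape:

```python
import numpy as np

def per_class_iou(pred, target, num_classes=11):
    """IoU for each class, given integer class masks of equal shape."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            ious.append(float("nan"))  # class absent from both masks
        else:
            ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return ious

# Mean IoU over the classes that actually occur:
# miou = np.nanmean(per_class_iou(pred_mask, gt_mask))
```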
|
|
|
|
|
## 🚀 How to Use
|
|
|
|
|
```python
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model and its image processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

# Load and preprocess the document image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Upsample logits to the original resolution; image.size is (width, height),
# while interpolate expects (height, width)
logits = outputs.logits
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1).squeeze().cpu().numpy()  # (H, W) class IDs
```
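
To inspect the result visually, the class mask can be colorized and blended over the page. A short continuation of the snippet above, reusing the illustrative `colorize` palette helper sketched in the labels section:

```python
from PIL import Image

# `image` and `mask` come from the snippet above; `colorize` is the
# illustrative palette helper from the labels section.
color_mask = Image.fromarray(colorize(mask))         # (H, W, 3) RGB mask
overlay = Image.blend(image, color_mask, alpha=0.5)  # 50/50 blend
overlay.save("overlay.png")
```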
|
|
|
|
|
## 🧑‍🎓 Author
|
|
|
|
|
Created by **Never** [`@nevernever69`](https://huggingface.co/nevernever69). |
|
|
Feel free to open issues or discuss improvements on the Hugging Face Hub.
|
|
|
|
|
## 📚 Citation
|
|
|
|
|
If you use this model in your work, please consider citing: |
|
|
|
|
|
```bibtex
@misc{never2025doclaynetseg,
  author       = {Never},
  title        = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
```