---
library_name: transformers
license: mit
datasets:
- nevernever69/small-DocLayNet-v1.1
pipeline_tag: image-segmentation
---
|
|
|
|
|
# 🧾 Model Card: `nevernever69/dit-doclaynet-segmentation`
|
|
|
|
|
## 🧠 Model Overview
|
|
|
|
|
This model is a fine-tuned version of [microsoft/dit-base](https://huggingface.co/microsoft/dit-base) for **document layout semantic segmentation** on the [DocLayNet](https://huggingface.co/datasets/ibm/DocLayNet) dataset (small subset: `nevernever69/small-DocLayNet-v1.1`). It segments scanned document images into 11 layout categories such as title, paragraph, table, and footer. |
|
|
|
|
|
## 📌 Intended Uses
|
|
|
|
|
- Segment document images into structured layout elements |
|
|
- Assist in downstream tasks like document OCR, archiving, and automatic annotation |
|
|
- Useful for researchers and developers working in document AI or digital humanities |
|
|
|
|
|
## 🏷️ Labels (11 Classes)
|
|
|
|
|
| ID | Label       | Color      |
|----|-------------|------------|
| 0  | Background  | Black      |
| 1  | Title       | Red        |
| 2  | Paragraph   | Green      |
| 3  | Figure      | Blue       |
| 4  | Table       | Yellow     |
| 5  | List        | Magenta    |
| 6  | Header      | Cyan       |
| 7  | Footer      | Dark Red   |
| 8  | Page Number | Dark Green |
| 9  | Footnote    | Dark Blue  |
| 10 | Caption     | Olive      |
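
These colors translate directly into a lookup palette for rendering predicted masks. A minimal sketch; the ID-to-color pairing comes from the table above, but the exact RGB values and the `colorize` helper are illustrative assumptions:

```python
import numpy as np

# Illustrative RGB palette matching the label table; the exact shades are
# assumptions, only the ID-to-color pairing is given by the model card.
PALETTE = np.array([
    (0, 0, 0),       # 0  Background: black
    (255, 0, 0),     # 1  Title: red
    (0, 255, 0),     # 2  Paragraph: green
    (0, 0, 255),     # 3  Figure: blue
    (255, 255, 0),   # 4  Table: yellow
    (255, 0, 255),   # 5  List: magenta
    (0, 255, 255),   # 6  Header: cyan
    (139, 0, 0),     # 7  Footer: dark red
    (0, 100, 0),     # 8  Page Number: dark green
    (0, 0, 139),     # 9  Footnote: dark blue
    (128, 128, 0),   # 10 Caption: olive
], dtype=np.uint8)

def colorize(mask):
    """Map an (H, W) array of class IDs to an (H, W, 3) RGB image."""
    return PALETTE[mask]
```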
|
|
|
|
|
## 🧪 Training Details
|
|
|
|
|
- **Base model**: `microsoft/dit-base` |
|
|
- **Dataset**: [`nevernever69/small-DocLayNet-v1.1`](https://huggingface.co/datasets/nevernever69/small-DocLayNet-v1.1) |
|
|
- **Input size**: 1025×1025 pixels (ground-truth masks resized to 56×56 during training)
|
|
- **Batch size**: 8 |
|
|
- **Epochs**: 2 |
|
|
- **Learning rate**: 5e-5 |
|
|
- **Loss function**: Cross-entropy (see the fine-tuning sketch below)
|
|
- **Hardware**: Trained with mixed precision (`fp16`) on GPU |
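
Taken together, these settings correspond to a short fine-tuning loop like the sketch below. This is a reconstruction under stated assumptions, not the author's actual script: `train_loader` and its `pixel_values`/`labels` keys are hypothetical, while the base checkpoint, class count, learning rate, epoch count, loss, and `fp16` setup follow the list above.

```python
import torch
from transformers import BeitForSemanticSegmentation

# DiT uses the BEiT architecture, so the BEiT segmentation head applies;
# the decode head is freshly initialized for the 11 layout classes.
device = "cuda"
model = BeitForSemanticSegmentation.from_pretrained(
    "microsoft/dit-base", num_labels=11
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision

model.train()
for epoch in range(2):
    # Hypothetical loader: batches of 8 with "pixel_values" (B, 3, H, W)
    # and "labels" (B, 56, 56) int64 class masks.
    for batch in train_loader:
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            # BeitForSemanticSegmentation upsamples its logits to the label
            # resolution and computes cross-entropy internally.
            outputs = model(
                pixel_values=batch["pixel_values"].to(device),
                labels=batch["labels"].to(device),
            )
        scaler.scale(outputs.loss).backward()
        scaler.step(optimizer)
        scaler.update()
```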
|
|
|
|
|
## 📊 Evaluation
|
|
|
|
|
The model shows promising qualitative results on a validation subset, capturing distinct document elements with clear boundaries. Overlay visualizations show clean separation of dense and sparse layout regions in both historical and modern documents; no quantitative metrics have been reported yet.
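
To put numbers behind the qualitative inspection, per-class intersection-over-union (IoU) is the standard metric for layout segmentation. A minimal sketch, assuming `pred` and `target` are integer class masks of equal shape:

```python
import numpy as np

def per_class_iou(pred, target, num_classes=11):
    """IoU for each class, given integer class masks of equal shape."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            ious.append(float("nan"))  # class absent from both masks
        else:
            ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return ious

# Mean IoU over the classes that actually occur:
# miou = np.nanmean(per_class_iou(pred_mask, gt_mask))
```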
|
|
|
|
|
## 🚀 How to Use
|
|
|
|
|
```python
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model and its image processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

# Load and preprocess the document image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Upsample logits to the original resolution; image.size is (width, height),
# while interpolate expects (height, width)
logits = outputs.logits
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1).squeeze().cpu().numpy()  # (H, W) class IDs
```
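
To inspect the result visually, the class mask can be colorized and blended over the page. A short continuation of the snippet above, reusing the illustrative `colorize` palette helper sketched in the labels section:

```python
from PIL import Image

# `image` and `mask` come from the snippet above; `colorize` is the
# illustrative palette helper from the labels section.
color_mask = Image.fromarray(colorize(mask))         # (H, W, 3) RGB mask
overlay = Image.blend(image, color_mask, alpha=0.5)  # 50/50 blend
overlay.save("overlay.png")
```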
|
|
|
|
|
## 🧑‍🎓 Author
|
|
|
|
|
Created by **Never** [`@nevernever69`](https://huggingface.co/nevernever69). |
|
|
Feel free to open issues or discuss improvements on the Hugging Face Hub.
|
|
|
|
|
## 📚 Citation
|
|
|
|
|
If you use this model in your work, please consider citing: |
|
|
|
|
|
```bibtex
@misc{never2025doclaynetseg,
  author       = {Never},
  title        = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
```