---
library_name: transformers
license: mit
datasets:
- nevernever69/small-DocLayNet-v1.1
pipeline_tag: image-segmentation
---

# 🧾 Model Card: `nevernever69/dit-doclaynet-segmentation`

## 🧠 Model Overview

This model is a fine-tuned version of [microsoft/dit-base](https://huggingface.co/microsoft/dit-base) for **document layout semantic segmentation** on the [DocLayNet](https://huggingface.co/datasets/ibm/DocLayNet) dataset (small subset: `nevernever69/small-DocLayNet-v1.1`). It segments scanned document images into 11 layout categories such as title, paragraph, table, and footer.

## 📚 Intended Uses

- Segment document images into structured layout elements
- Assist in downstream tasks like document OCR, archiving, and automatic annotation
- Useful for researchers and developers working in document AI or digital humanities

## 🏷️ Labels (11 Classes)

| ID | Label        | Color        |
|----|--------------|--------------|
| 0  | Background   | Black        |
| 1  | Title        | Red          |
| 2  | Paragraph    | Green        |
| 3  | Figure       | Blue         |
| 4  | Table        | Yellow       |
| 5  | List         | Magenta      |
| 6  | Header       | Cyan         |
| 7  | Footer       | Dark Red     |
| 8  | Page Number  | Dark Green   |
| 9  | Footnote     | Dark Blue    |
| 10 | Caption      | Olive        |

## 🧪 Training Details

- **Base model**: `microsoft/dit-base`
- **Dataset**: [`nevernever69/small-DocLayNet-v1.1`](https://huggingface.co/datasets/nevernever69/small-DocLayNet-v1.1)
- **Input size**: 1025×1025 (resized to 56×56 masks during training)
- **Batch size**: 8
- **Epochs**: 2
- **Learning rate**: 5e-5
- **Loss function**: Cross-entropy
- **Hardware**: Trained with mixed precision (`fp16`) on GPU

## 📊 Evaluation

The model shows promising results on a validation subset, capturing distinct document elements with clear boundaries. Overlay visualizations confirm precise semantic segmentation of dense and sparse regions in historical and modern documents.
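
An overlay like the ones mentioned above can be produced by mapping each class ID from the label table to a color and blending the result over the page image. The following is only a minimal sketch: the card names the colors but not their RGB values, so the triples in `PALETTE` are assumptions, and `colorize_mask` and `overlay` are hypothetical helper names, not part of this model or of `transformers`.

```python
import numpy as np

# Assumed RGB triples for the color names in the label table.
# The card gives only color names, so these exact values are illustrative.
PALETTE = np.array([
    [0, 0, 0],        # 0  Background   - Black
    [255, 0, 0],      # 1  Title        - Red
    [0, 255, 0],      # 2  Paragraph    - Green
    [0, 0, 255],      # 3  Figure       - Blue
    [255, 255, 0],    # 4  Table        - Yellow
    [255, 0, 255],    # 5  List         - Magenta
    [0, 255, 255],    # 6  Header       - Cyan
    [128, 0, 0],      # 7  Footer       - Dark Red
    [0, 128, 0],      # 8  Page Number  - Dark Green
    [0, 0, 128],      # 9  Footnote     - Dark Blue
    [128, 128, 0],    # 10 Caption      - Olive
], dtype=np.uint8)


def colorize_mask(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer array of class IDs to an (H, W, 3) uint8 RGB image."""
    return PALETTE[mask]


def overlay(image_rgb: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the colorized mask over an (H, W, 3) uint8 image.

    alpha=0.0 returns the image unchanged, alpha=1.0 returns only the mask colors.
    """
    color = colorize_mask(mask).astype(np.float32)
    blended = (1.0 - alpha) * image_rgb.astype(np.float32) + alpha * color
    return blended.round().astype(np.uint8)


# Tiny synthetic example: a 2x2 white "page" and a mask with four classes
demo_image = np.full((2, 2, 3), 255, dtype=np.uint8)
demo_mask = np.array([[0, 1], [4, 10]])
demo_overlay = overlay(demo_image, demo_mask, alpha=0.5)
```

With a real prediction, `mask` would be the `(H, W)` array of class IDs produced by the model and `image_rgb` would be `np.asarray(image)` for the RGB page image of the same size.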
## 🚀 How to Use

```python
from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

# Use the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and image processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

# Load and preprocess image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (1, num_classes, h, w)

# Upsample logits to the original image size; PIL's size is (width, height)
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1).squeeze(0).cpu().numpy()  # (H, W) array of class IDs
```

## 🧑‍🎓 Author

Created by **Never** ([`@nevernever69`](https://huggingface.co/nevernever69)). Feel free to open issues or discuss improvements on the Hugging Face Hub.

## 📝 Citation

If you use this model in your work, please consider citing:

```bibtex
@misc{never2025doclaynetseg,
  author = {Never},
  title = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
  year = {2025},
  howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}
```