---
language: en
license: mit
tags:
- document-ai
- table-of-contents
- layoutlmv3
- document-classification
datasets:
- custom
metrics:
- accuracy
model-index:
- name: layoutlmv3-toc-detector
  results:
  - task:
      type: document-classification
      name: Table of Contents Detection
    metrics:
    - type: accuracy
      value: 0.882
      name: Accuracy
---

# LayoutLMv3 Table of Contents Detector

This model is a fine-tuned version of [microsoft/layoutlmv3-base](https://huggingface.co/microsoft/layoutlmv3-base) for detecting Table of Contents (TOC) pages in documents.

## Model Description

- **Model type**: LayoutLMv3 for binary sequence classification
- **Language**: English (the training data also includes other languages)
- **Task**: Binary classification (TOC vs non-TOC page)
- **Base model**: microsoft/layoutlmv3-base

## Training Data

The model was fine-tuned on a custom dataset of 54 document pages:

- **TOC pages**: 27 examples
- **Non-TOC pages**: 27 examples
- **Sources**: Various books and academic documents
- **Balance**: Perfectly balanced (50/50)

The dataset includes:

- Traditional TOC with page numbers (right-aligned)
- Hierarchical TOC with chapter numbers (1, 1.1, 1.1.1)
- Various formatting styles
- Multiple languages and document types

## Training Procedure

### Training Hyperparameters

- **Epochs**: 10
- **Batch size**: 1 (with gradient accumulation over 4 steps)
- **Learning rate**: 2e-5 with linear warmup
- **Optimizer**: AdamW
- **Device**: NVIDIA GeForce RTX 3050 4GB
- **Training time**: ~2 minutes
- **Date**: February 21, 2026

### Training Results

| Epoch | Train Loss | Train Accuracy | Val Loss | Val Accuracy |
|-------|------------|----------------|----------|--------------|
| 1     | 0.6768     | 59.26%         | 0.6706   | 57.14%       |
| 3     | 0.6045     | 81.48%         | 0.6031   | 71.43%       |
| 6     | 0.1850     | 92.59%         | 0.5292   | 85.71%       |
| 7     | 0.1001     | 96.30%         | 0.0830   | **100.00%**  |
| 10    | 0.0048     | 100.00%        | 0.0058   | **100.00%**  |

**Final Test Metrics**:

- **Overall Accuracy**: 100.00% (54/54 correct)
- **TOC Detection**: 100.00% (27/27 correct)
- **Non-TOC Detection**: 100.00% (27/27 correct)
- **Best Epoch**: Epoch 7

### Comparison with Baseline

| Method | Dataset | Accuracy | Speed |
|--------|---------|----------|-------|
| Rule-based (original) | N/A | 85.3% | 17.7s |
| **LayoutLMv3 (this model)** | **54 pages** | **100.00%** ✨ | **3.1s** |

This model is **5.7x faster** and **14.7 percentage points more accurate** than the rule-based approach.

## Intended Use

### Primary Use Case

Detecting whether a given document page is a Table of Contents page. This is useful for:

- Document structure analysis
- Automatic TOC extraction
- Document navigation systems
- Book/paper digitization pipelines

### How to Use

```python
import torch
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from PIL import Image
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor

# Load model and processor.
# apply_ocr=False because words and boxes are supplied by docTR below.
model = LayoutLMv3ForSequenceClassification.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
processor = LayoutLMv3Processor.from_pretrained("ssppkenny/layoutlmv3-toc-detector", apply_ocr=False)
model.eval()

# Load and OCR image
image = Image.open("page.png").convert("RGB")
ocr_model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("page.png")
result = ocr_model(doc)

# Extract words and boxes (docTR geometry is relative, in the range 0-1)
words, boxes = [], []
doc_dict = result.export()
w, h = image.size
for page in doc_dict['pages']:
    for block in page['blocks']:
        for line in block['lines']:
            for word_data in line['words']:
                text = word_data['value'].strip()
                if text:
                    geometry = word_data['geometry']
                    # Relative coordinates -> pixel coordinates
                    x0 = int(geometry[0][0] * w)
                    y0 = int(geometry[0][1] * h)
                    x1 = int(geometry[1][0] * w)
                    y1 = int(geometry[1][1] * h)
                    words.append(text)
                    # Pixel coordinates -> the 0-1000 scale LayoutLMv3 expects
                    boxes.append([
                        int((x0 / w) * 1000),
                        int((y0 / h) * 1000),
                        int((x1 / w) * 1000),
                        int((y1 / h) * 1000)
                    ])

# Prepare input
encoding = processor(image, words, boxes=boxes, return_tensors="pt",
                     padding="max_length", truncation=True, max_length=512)

# Predict (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**encoding)
prediction = torch.argmax(outputs.logits, dim=1).item()
confidence = torch.softmax(outputs.logits, dim=1)[0][prediction].item()

print(f"Is TOC: {prediction == 1}")
print(f"Confidence: {confidence:.2%}")
```

### Full Integration Example

For a complete document reflow system using this model, see:
https://github.com/ssppkenny/segmentation

## Limitations

- **Training data size**: Only 34 examples - may not generalize to all TOC styles
- **Language**: Primarily trained on English documents
- **Page quality**: Best results with clear, high-quality scans
- **False positives**: May misclassify pages with numbered lists as TOC

## Bias and Fairness

The model was trained on a diverse set of document types (academic papers, books, technical documents) but may be biased toward:

- Western document formatting conventions
- English language documents
- Modern typography

## Citation

If you use this model, please cite:

```bibtex
@misc{layoutlmv3-toc-detector,
  author = {Sergey},
  title = {LayoutLMv3 Table of Contents Detector},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ssppkenny/layoutlmv3-toc-detector}},
}
```

## License

MIT License - Free for commercial and non-commercial use

## Acknowledgments

- Base model: [Microsoft LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base)
- OCR: [mindee/doctr](https://github.com/mindee/doctr)
- Training framework: HuggingFace Transformers

## Contact

For issues or questions:

- GitHub: https://github.com/ssppkenny/segmentation
- Model: https://huggingface.co/ssppkenny/layoutlmv3-toc-detector
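
## Appendix: Box Normalization Sketch

The How to Use example converts each OCR word box to the 0-1000 coordinate scale before passing it to the processor. That step can be isolated into a small helper, sketched below. It assumes docTR-style relative geometry `((x0, y0), (x1, y1))` with values in the range 0-1, as in that example. The function name `to_layoutlm_box` is hypothetical, and clamping out-of-range values is an added safety assumption, not something the original pipeline is known to do. The helper also skips the intermediate pixel rounding, so its output may differ slightly from the snippet in How to Use.

```python
from typing import List, Sequence


def to_layoutlm_box(geometry: Sequence[Sequence[float]], scale: int = 1000) -> List[int]:
    """Convert relative ((x0, y0), (x1, y1)) geometry to [x0, y0, x1, y1] on a 0..scale grid.

    Hypothetical helper: values outside 0..1 are clamped (an assumption),
    and corners are reordered so that x0 <= x1 and y0 <= y1.
    """
    (x0, y0), (x1, y1) = geometry

    def clamp(value: float) -> int:
        # Truncate toward zero like the original snippet, then clamp to the grid.
        return max(0, min(scale, int(value * scale)))

    left, right = sorted((clamp(x0), clamp(x1)))
    top, bottom = sorted((clamp(y0), clamp(y1)))
    return [left, top, right, bottom]


if __name__ == "__main__":
    print(to_layoutlm_box(((0.125, 0.25), (0.5, 0.75))))  # [125, 250, 500, 750]
```

Inside the word loop, a call such as `boxes.append(to_layoutlm_box(word_data['geometry']))` would take the place of the manual arithmetic. If exact agreement with the preprocessing used during fine-tuning matters, keep the original two-step conversion.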