---
language: en
license: mit
tags:
- document-ai
- table-of-contents
- layoutlmv3
- document-classification
datasets:
- custom
metrics:
- accuracy
model-index:
- name: layoutlmv3-toc-detector
  results:
  - task:
      type: document-classification
      name: Table of Contents Detection
    metrics:
    - type: accuracy
      value: 0.882
      name: Accuracy
---

# LayoutLMv3 Table of Contents Detector

This model is a fine-tuned version of [microsoft/layoutlmv3-base](https://huggingface.co/microsoft/layoutlmv3-base) that detects Table of Contents (TOC) pages in documents.

## Model Description

- **Model type**: LayoutLMv3 for binary sequence classification
- **Language**: English (primarily; the training set also includes some pages in other languages)
- **Task**: Binary classification (TOC vs non-TOC page)
- **Base model**: microsoft/layoutlmv3-base

## Training Data

The model was fine-tuned on a custom dataset of 54 document pages:
- **TOC pages**: 27 examples
- **Non-TOC pages**: 27 examples
- **Sources**: Various books and academic documents
- **Balance**: Perfectly balanced (50/50)

The dataset includes:
- Traditional TOC with page numbers (right-aligned)
- Hierarchical TOC with chapter numbers (1, 1.1, 1.1.1)
- Various formatting styles
- Multiple languages and document types

## Training Procedure

### Training Hyperparameters

- **Epochs**: 10
- **Batch size**: 1, with gradient accumulation over 4 steps (effective batch size 4)
- **Learning rate**: 2e-5 with linear warmup
- **Optimizer**: AdamW
- **Device**: NVIDIA GeForce RTX 3050 4GB
- **Training time**: ~2 minutes
- **Date**: February 21, 2026

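To make the batch-size and learning-rate settings above concrete, here is a minimal sketch of how gradient accumulation and a linear warmup schedule interact. The number of warmup steps, the decay to zero after warmup, and the training-set size used below are assumptions for illustration; the card does not state them.

```python
import math


def effective_batch_size(per_step_batch: int, accumulation_steps: int) -> int:
    """Number of examples that contribute to each optimizer update."""
    return per_step_batch * accumulation_steps


def optimizer_steps(num_examples: int, per_step_batch: int,
                    accumulation_steps: int, epochs: int) -> int:
    """Total optimizer updates, rounding partial accumulation windows up."""
    batches_per_epoch = math.ceil(num_examples / per_step_batch)
    return math.ceil(batches_per_epoch / accumulation_steps) * epochs


def linear_warmup_lr(step: int, base_lr: float,
                     warmup_steps: int, total_steps: int) -> float:
    """Linear ramp from 0 to base_lr, then linear decay back to 0 (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# Values from the card: batch size 1, accumulation 4, 10 epochs, lr 2e-5.
# HYPOTHETICAL values: 40 training examples and 10 warmup steps.
batch = effective_batch_size(1, 4)
total = optimizer_steps(num_examples=40, per_step_batch=1,
                        accumulation_steps=4, epochs=10)
schedule = [linear_warmup_lr(s, 2e-5, warmup_steps=10, total_steps=total)
            for s in range(total + 1)]
```

With a batch size of 1, accumulation mainly smooths the gradient estimate: each update averages over four pages instead of one, at no extra memory cost on a 4GB GPU.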
### Training Results

| Epoch | Train Loss | Train Accuracy | Val Loss | Val Accuracy |
|-------|------------|----------------|----------|--------------|
| 1     | 0.6768     | 59.26%         | 0.6706   | 57.14%       |
| 3     | 0.6045     | 81.48%         | 0.6031   | 71.43%       |
| 6     | 0.1850     | 92.59%         | 0.5292   | 85.71%       |
| 7     | 0.1001     | 96.30%         | 0.0830   | **100.00%**  |
| 10    | 0.0048     | 100.00%        | 0.0058   | **100.00%**  |

**Final Test Metrics**:
- **Overall Accuracy**: 100.00% (54/54 correct)
- **TOC Detection**: 100.00% (27/27 correct)
- **Non-TOC Detection**: 100.00% (27/27 correct)
- **Best Epoch**: Epoch 7

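The overall and per-class figures above can be computed from label and prediction lists along these lines. This is a sketch, not the author's evaluation script; the label convention (1 = TOC, 0 = non-TOC) is taken from the usage snippet in this card, and the prediction list is a stand-in that simply mirrors the reported 54/54 result.

```python
from collections import Counter


def per_class_accuracy(labels, predictions):
    """Return {class_label: fraction of that class predicted correctly}."""
    correct, total = Counter(), Counter()
    for label, predicted in zip(labels, predictions):
        total[label] += 1
        if label == predicted:
            correct[label] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


def overall_accuracy(labels, predictions):
    """Fraction of all examples predicted correctly."""
    hits = sum(1 for label, predicted in zip(labels, predictions) if label == predicted)
    return hits / len(labels)


# 27 TOC pages (label 1) and 27 non-TOC pages (label 0), as in the card.
labels = [1] * 27 + [0] * 27
# Stand-in predictions that reproduce the reported perfect score.
predictions = list(labels)

by_class = per_class_accuracy(labels, predictions)
overall = overall_accuracy(labels, predictions)
```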
### Comparison with Baseline

| Method | Dataset | Accuracy | Runtime |
|--------|---------|----------|---------|
| Rule-based (original) | N/A | 85.3% | 17.7 s |
| **LayoutLMv3 (this model)** | **54 pages** | **100.00%** ✨ | **3.1 s** |

This model is **5.7x faster** and **14.7 percentage points more accurate** than the rule-based approach.

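The two headline numbers follow directly from the table. The short check below reproduces them and also shows the relative accuracy gain, which is a different quantity from the percentage-point gain; only the four table values are taken from the card.

```python
# Values copied from the comparison table.
baseline_accuracy = 85.3   # percent
model_accuracy = 100.0     # percent
baseline_seconds = 17.7
model_seconds = 3.1

# Speedup is a ratio of runtimes.
speedup = baseline_seconds / model_seconds

# Absolute gain is a difference in percentage points.
gain_points = model_accuracy - baseline_accuracy

# Relative gain expresses that difference as a share of the baseline.
relative_gain = gain_points / baseline_accuracy * 100

print(f"speedup: {speedup:.1f}x")
print(f"gain: {gain_points:.1f} percentage points ({relative_gain:.1f}% relative)")
```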
## Intended Use

### Primary Use Case

Detecting whether a given document page is a Table of Contents page. This is useful for:
- Document structure analysis
- Automatic TOC extraction
- Document navigation systems
- Book/paper digitization pipelines

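In a pipeline such as TOC extraction, the per-page predictions usually need to be grouped, since a table of contents often spans several consecutive pages. The helper below is a hypothetical sketch of that post-processing step (the function name and the 0-based page indexing are assumptions, not part of this model or its repository).

```python
def toc_page_ranges(page_is_toc):
    """Group consecutive TOC pages into inclusive (first, last) index ranges.

    `page_is_toc` is a list of booleans, one per page, in document order.
    """
    ranges = []
    start = None
    for index, flag in enumerate(page_is_toc):
        if flag and start is None:
            start = index                      # a new run of TOC pages begins
        elif not flag and start is not None:
            ranges.append((start, index - 1))  # the run ended on the previous page
            start = None
    if start is not None:                      # the document ends inside a run
        ranges.append((start, len(page_is_toc) - 1))
    return ranges


# Example: pages 1-2 and page 4 were classified as TOC.
example_ranges = toc_page_ranges([False, True, True, False, True])
```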
### How to Use

```python
import torch
from PIL import Image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor

# Load model and processor. apply_ocr=False because the words and boxes are
# supplied by docTR below rather than by the processor's built-in OCR.
model = LayoutLMv3ForSequenceClassification.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
processor = LayoutLMv3Processor.from_pretrained("ssppkenny/layoutlmv3-toc-detector", apply_ocr=False)
model.eval()

# Load and OCR image
image = Image.open("page.png").convert("RGB")
ocr_model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("page.png")
result = ocr_model(doc)

# Extract words and boxes
words, boxes = [], []
doc_dict = result.export()
w, h = image.size

for page in doc_dict['pages']:
    for block in page['blocks']:
        for line in block['lines']:
            for word_data in line['words']:
                text = word_data['value'].strip()
                if text:
                    # docTR geometry is relative: ((x_min, y_min), (x_max, y_max))
                    geometry = word_data['geometry']
                    x0 = int(geometry[0][0] * w)
                    y0 = int(geometry[0][1] * h)
                    x1 = int(geometry[1][0] * w)
                    y1 = int(geometry[1][1] * h)
                    words.append(text)
                    # LayoutLMv3 expects boxes on a 0-1000 grid
                    boxes.append([
                        int((x0 / w) * 1000),
                        int((y0 / h) * 1000),
                        int((x1 / w) * 1000),
                        int((y1 / h) * 1000)
                    ])

# Prepare input
encoding = processor(image, words, boxes=boxes, return_tensors="pt",
                     padding="max_length", truncation=True, max_length=512)

# Predict (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**encoding)
prediction = torch.argmax(outputs.logits, dim=1).item()
confidence = torch.softmax(outputs.logits, dim=1)[0][prediction].item()

print(f"Is TOC: {prediction == 1}")
print(f"Confidence: {confidence:.2%}")
```

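The box conversion in the snippet above can be factored into a small helper. This is a sketch under the assumption, consistent with how the snippet indexes `geometry`, that each word's geometry is a pair of relative corner points `((x_min, y_min), (x_max, y_max))` in the range 0 to 1. The name `normalize_box` and the clamping step are additions for illustration; clamping guards against coordinates that fall slightly outside the page.

```python
def normalize_box(geometry, scale=1000):
    """Map relative corner points to integer [x0, y0, x1, y1] on a 0..scale grid."""
    (x_min, y_min), (x_max, y_max) = geometry

    def to_grid(value):
        # Truncate to an integer and clamp into the valid range.
        return max(0, min(scale, int(value * scale)))

    return [to_grid(x_min), to_grid(y_min), to_grid(x_max), to_grid(y_max)]


# A word covering the region from (12.5%, 25%) to (50%, 75%) of the page.
example_box = normalize_box(((0.125, 0.25), (0.5, 0.75)))
```

Because it works directly on relative coordinates, the helper does not need the image width and height.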
### Full Integration Example

For a complete document reflow system using this model, see:
https://github.com/ssppkenny/segmentation

## Limitations

- **Training data size**: Only 54 examples, so the model may not generalize to all TOC styles
- **Language**: Primarily trained on English documents
- **Page quality**: Best results with clear, high-quality scans
- **False positives**: May misclassify pages with numbered lists as TOC pages

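One common way to reduce false positives such as numbered lists is to accept a TOC prediction only when the model is confident. The sketch below applies a threshold to the two class logits; the threshold of 0.9 is a hypothetical value that would need tuning on held-out pages, and the class index 1 for TOC follows the usage snippet in this card.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def is_toc(logits, threshold=0.9):
    """Return True only if the TOC class (index 1) reaches the threshold.

    The default threshold is an illustrative assumption, not a tuned value.
    """
    toc_probability = softmax(logits)[1]
    return toc_probability >= threshold


confident = is_toc([0.0, 5.0])   # strong TOC logit
borderline = is_toc([0.0, 1.0])  # TOC is the argmax, but below the threshold
```

Raising the threshold trades recall for precision: fewer list-like pages are flagged, at the cost of missing some unusual TOC layouts.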
## Bias and Fairness

The model was trained on a diverse set of document types (academic papers, books, technical documents) but may have biases toward:
- Western document formatting conventions
- English language documents
- Modern typography

## Citation

If you use this model, please cite:

```bibtex
@misc{layoutlmv3-toc-detector,
  author = {Sergey},
  title = {LayoutLMv3 Table of Contents Detector},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ssppkenny/layoutlmv3-toc-detector}},
}
```

## License

MIT License - Free for commercial and non-commercial use

## Acknowledgments

- Base model: [Microsoft LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base)
- OCR: [mindee/doctr](https://github.com/mindee/doctr)
- Training framework: HuggingFace Transformers

## Contact

For issues or questions:
- GitHub: https://github.com/ssppkenny/segmentation
- Model: https://huggingface.co/ssppkenny/layoutlmv3-toc-detector