LayoutLMv3 Table of Contents Detector

This model is a fine-tuned version of microsoft/layoutlmv3-base for detecting Table of Contents (TOC) pages in documents.

Model Description

  • Model type: LayoutLMv3 for binary sequence classification
  • Language: Primarily English (the training set also includes some pages in other languages)
  • Task: Binary classification (TOC vs non-TOC page)
  • Base model: microsoft/layoutlmv3-base

Training Data

The model was fine-tuned on a custom dataset of 54 document pages:

  • TOC pages: 27 examples
  • Non-TOC pages: 27 examples
  • Sources: Various books and academic documents
  • Balance: Perfectly balanced (50/50)

The dataset includes:

  • Traditional TOC with page numbers (right-aligned)
  • Hierarchical TOC with chapter numbers (1, 1.1, 1.1.1)
  • Various formatting styles
  • Multiple languages and document types

Training Procedure

Training Hyperparameters

  • Epochs: 10
  • Batch size: 1 (with gradient accumulation of 4 steps)
  • Learning rate: 2e-5 with linear warmup
  • Optimizer: AdamW
  • Device: NVIDIA GeForce RTX 3050 4GB
  • Training time: ~2 minutes
  • Date: February 21, 2026
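
How these settings translate into an update schedule can be sketched in plain Python. The helper names, the 40-page training split, and the 10 warmup steps are illustrative assumptions: the card states neither the split size nor the warmup length. The linear decay after warmup is also an assumption, modeled on the common warmup-then-decay schedule (as in `get_linear_schedule_with_warmup` from transformers).

```python
import math

def optimizer_steps(num_examples, batch_size=1, accum_steps=4, epochs=10):
    """Number of optimizer updates when gradients are accumulated over micro-batches."""
    micro_batches = math.ceil(num_examples / batch_size)
    return math.ceil(micro_batches / accum_steps) * epochs

def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical split size; batch size 1 with 4 accumulation steps gives an
# effective batch size of 4.
total = optimizer_steps(num_examples=40)
for step in (0, 5, 10, total):
    print(step, linear_warmup_lr(step, total, warmup_steps=10))
```

With batch size 1 and 4 accumulation steps, each optimizer update averages gradients over 4 pages, which keeps memory use low on a 4GB GPU.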

Training Results

Epoch   Train Loss   Train Accuracy   Val Loss   Val Accuracy
1       0.6768       59.26%           0.6706     57.14%
3       0.6045       81.48%           0.6031     71.43%
6       0.1850       92.59%           0.5292     85.71%
7       0.1001       96.30%           0.0830     100.00%
10      0.0048       100.00%          0.0058     100.00%

Final Test Metrics (all 54 pages of the dataset):

  • Overall Accuracy: 100.00% (54/54 correct)
  • TOC Detection: 100.00% (27/27 correct)
  • Non-TOC Detection: 100.00% (27/27 correct)
  • Best Epoch: Epoch 7
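
One selection rule consistent with the reported best epoch is "highest validation accuracy, earliest epoch on ties". The sketch below applies that rule to the validation figures from the results table. The rule itself is an assumption, since the card does not state how the best epoch was chosen, and only the epochs listed in the table are included.

```python
# (epoch, val_loss, val_accuracy in percent), copied from the training results
history = [
    (1, 0.6706, 57.14),
    (3, 0.6031, 71.43),
    (6, 0.5292, 85.71),
    (7, 0.0830, 100.00),
    (10, 0.0058, 100.00),
]

def pick_best_epoch(rows):
    # Highest validation accuracy wins; ties go to the earliest epoch (assumed rule)
    return max(rows, key=lambda r: (r[2], -r[0]))[0]

best_epoch = pick_best_epoch(history)
print(best_epoch)  # 7
```

Selecting by lowest validation loss instead would pick epoch 10 from the rows shown.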

Comparison with Baseline

Method                    Dataset    Accuracy   Speed
Rule-based (original)     N/A        85.3%      17.7 s
LayoutLMv3 (this model)   54 pages   100.00%    3.1 s

Based on these figures, this model is about 5.7x faster and 14.7 percentage points more accurate than the rule-based approach.
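
The two headline numbers follow directly from the comparison figures, as the short calculation below shows. The variable names are illustrative. The baseline row lists no dataset, so the two accuracies may not have been measured on the same pages.

```python
baseline_acc, model_acc = 85.3, 100.0    # accuracy in percent
baseline_time, model_time = 17.7, 3.1    # seconds

speedup = baseline_time / model_time
acc_gain_points = model_acc - baseline_acc

print(f"{speedup:.1f}x faster")                                 # 5.7x faster
print(f"{acc_gain_points:.1f} percentage points more accurate")  # 14.7 percentage points more accurate
```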

Intended Use

Primary Use Case

Detecting whether a given document page is a Table of Contents page. This is useful for:

  • Document structure analysis
  • Automatic TOC extraction
  • Document navigation systems
  • Book/paper digitization pipelines

How to Use

import torch
from transformers import LayoutLMv3Processor, LayoutLMv3ForSequenceClassification
from PIL import Image
from doctr.models import ocr_predictor
from doctr.io import DocumentFile

# Load model and processor
model = LayoutLMv3ForSequenceClassification.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
processor = LayoutLMv3Processor.from_pretrained("ssppkenny/layoutlmv3-toc-detector")

# Load and OCR image
image = Image.open("page.png").convert("RGB")
ocr_model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("page.png")
result = ocr_model(doc)

# Extract words and boxes. docTR returns relative coordinates in [0, 1];
# LayoutLMv3 expects integer boxes on a 0-1000 scale.
words, boxes = [], []
doc_dict = result.export()

for page in doc_dict['pages']:
    for block in page['blocks']:
        for line in block['lines']:
            for word_data in line['words']:
                text = word_data['value'].strip()
                if text:
                    (x0, y0), (x1, y1) = word_data['geometry']
                    words.append(text)
                    # Clamp to [0, 1000] in case OCR reports a box slightly off-page
                    boxes.append([
                        min(max(int(v * 1000), 0), 1000)
                        for v in (x0, y0, x1, y1)
                    ])

# Prepare input
encoding = processor(image, words, boxes=boxes, return_tensors="pt", 
                     padding="max_length", truncation=True, max_length=512)

# Predict (no gradient tracking is needed for inference)
with torch.no_grad():
    outputs = model(**encoding)
prediction = torch.argmax(outputs.logits, dim=1).item()
confidence = torch.softmax(outputs.logits, dim=1)[0][prediction].item()

print(f"Is TOC: {prediction == 1}")
print(f"Confidence: {confidence:.2%}")
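
To make the last step concrete, here is a minimal pure-Python sketch of how a label and a confidence are derived from two class logits. It mirrors the `torch.argmax` and `torch.softmax` calls above and uses the same convention that label 1 means TOC. The helper names and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (predicted label, probability assigned to that label)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

label, confidence = classify([-1.2, 2.3])  # hypothetical logits for one page
print(f"Is TOC: {label == 1}")
print(f"Confidence: {confidence:.2%}")
```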

Full Integration Example

For a complete document reflow system using this model, see: https://github.com/ssppkenny/segmentation

Limitations

  • Training data size: Only 54 examples - may not generalize to all TOC styles
  • Language: Primarily trained on English documents
  • Page quality: Best results with clear, high-quality scans
  • False positives: May misclassify pages with numbered lists as TOC
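
One possible way to reduce false positives in a pipeline is to accept a TOC prediction only above a confidence threshold. The sketch below assumes per-page `(label, confidence)` pairs like those produced in the usage example. The function name and the 0.9 threshold are hypothetical and have not been tuned on this model.

```python
def select_toc_pages(page_results, min_confidence=0.9):
    """Return indices of pages predicted as TOC (label 1) with enough confidence.

    page_results: list of (label, confidence) tuples, one per page, in page order.
    """
    return [
        index
        for index, (label, confidence) in enumerate(page_results)
        if label == 1 and confidence >= min_confidence
    ]

# Made-up predictions for a four-page document
results = [(0, 0.99), (1, 0.97), (1, 0.62), (0, 0.80)]
print(select_toc_pages(results))  # [1]
```

Pages that fall below the threshold could be routed to the rule-based check or to manual review rather than being discarded.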

Bias and Fairness

The model was trained on a diverse set of document types (academic papers, books, technical documents) but may have biases toward:

  • Western document formatting conventions
  • English language documents
  • Modern typography

Citation

If you use this model, please cite:

@misc{layoutlmv3-toc-detector,
  author = {Sergey},
  title = {LayoutLMv3 Table of Contents Detector},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ssppkenny/layoutlmv3-toc-detector}},
}

License

MIT License - Free for commercial and non-commercial use

Acknowledgments

Contact

For issues or questions:
