LayoutLMv3 Table of Contents Detector
This model is a fine-tuned version of microsoft/layoutlmv3-base for detecting Table of Contents (TOC) pages in documents.
Model Description
- Model type: LayoutLMv3 for binary sequence classification
- Language: Primarily English (the training data also includes pages in other languages)
- Task: Binary classification (TOC vs non-TOC page)
- Base model: microsoft/layoutlmv3-base
Training Data
The model was fine-tuned on a custom dataset of 54 document pages:
- TOC pages: 27 examples
- Non-TOC pages: 27 examples
- Sources: Various books and academic documents
- Balance: Perfectly balanced (50/50)
The dataset includes:
- Traditional TOC with page numbers (right-aligned)
- Hierarchical TOC with chapter numbers (1, 1.1, 1.1.1)
- Various formatting styles
- Multiple languages and document types
Training Procedure
Training Hyperparameters
- Epochs: 10
- Batch size: 1 (with gradient accumulation of 4 steps)
- Learning rate: 2e-5 with linear warmup
- Optimizer: AdamW
- Device: NVIDIA GeForce RTX 3050 4GB
- Training time: ~2 minutes
- Date: February 21, 2026
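The batch-size and learning-rate settings above can be illustrated with a small, dependency-free sketch. It assumes the schedule shape that Transformers provides as `get_linear_schedule_with_warmup` (linear ramp to the peak rate, then linear decay to zero). The card only states "linear warmup", so the decay phase, `WARMUP_STEPS`, and `TOTAL_STEPS` are hypothetical values chosen for illustration.

```python
PEAK_LR = 2e-5          # from the card
BATCH_SIZE = 1          # from the card
GRAD_ACCUM_STEPS = 4    # from the card

# Gradients from 4 single-page batches are accumulated before each
# optimizer step, so every weight update effectively sees 4 pages.
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRAD_ACCUM_STEPS


def linear_warmup_lr(step, warmup_steps, total_steps, peak_lr=PEAK_LR):
    """Learning rate at optimizer step `step` (0-indexed).

    Assumed shape: linear ramp from 0 to peak_lr over warmup_steps,
    then linear decay to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical values: the card does not state them.
WARMUP_STEPS = 10
TOTAL_STEPS = 100

schedule = [linear_warmup_lr(s, WARMUP_STEPS, TOTAL_STEPS)
            for s in range(TOTAL_STEPS)]
print(f"Effective batch size: {EFFECTIVE_BATCH_SIZE}")
print(f"Peak learning rate reached at step {schedule.index(max(schedule))}")
```

With a batch size of 1, accumulation is what keeps the update noise manageable on a 4GB GPU: memory use stays that of a single page while each update averages over four.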
Training Results
| Epoch | Train Loss | Train Acc | Val Loss | Val Accuracy |
|---|---|---|---|---|
| 1 | 0.6768 | 59.26% | 0.6706 | 57.14% |
| 3 | 0.6045 | 81.48% | 0.6031 | 71.43% |
| 6 | 0.1850 | 92.59% | 0.5292 | 85.71% |
| 7 | 0.1001 | 96.30% | 0.0830 | 100.00% |
| 10 | 0.0048 | 100.00% | 0.0058 | 100.00% |
Final Test Metrics (measured on all 54 pages of the dataset described above):
- Overall Accuracy: 100.00% (54/54 correct)
- TOC Detection: 100.00% (27/27 correct)
- Non-TOC Detection: 100.00% (27/27 correct)
- Best Epoch: Epoch 7
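For reference, overall and per-class accuracy of the kind reported above can be computed as in the following minimal sketch. The helper name `per_class_accuracy` is hypothetical, and the input lists are illustrative stand-ins that reproduce the reported counts (27 TOC pages as label 1, 27 non-TOC pages as label 0); they are not the actual evaluation data.

```python
def per_class_accuracy(labels, predictions):
    """Return (overall accuracy, {class label: accuracy for that class})."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = {}
    total = {}
    for y, p in zip(labels, predictions):
        total[y] = total.get(y, 0) + 1
        correct[y] = correct.get(y, 0) + (1 if y == p else 0)
    overall = sum(correct.values()) / len(labels)
    by_class = {c: correct[c] / total[c] for c in total}
    return overall, by_class


# Illustrative input matching the reported counts, with every page
# predicted correctly.
labels = [1] * 27 + [0] * 27
predictions = list(labels)

overall, by_class = per_class_accuracy(labels, predictions)
print(f"Overall: {overall:.2%}")
print(f"TOC: {by_class[1]:.2%}, non-TOC: {by_class[0]:.2%}")
```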
Comparison with Baseline
| Method | Dataset | Accuracy | Speed |
|---|---|---|---|
| Rule-based (original) | N/A | 85.3% | 17.7s |
| LayoutLMv3 (this model) | 54 pages | 100.00% ✨ | 3.1s |
This model is about 5.7x faster (3.1s vs 17.7s) and 14.7 percentage points more accurate than the rule-based approach.
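The two comparison figures follow directly from the table. A short check of the arithmetic, using only the numbers reported there (the variable names are illustrative):

```python
rule_based_seconds = 17.7
model_seconds = 3.1
rule_based_accuracy = 85.3   # percent
model_accuracy = 100.0       # percent

speedup = rule_based_seconds / model_seconds
accuracy_gain_points = model_accuracy - rule_based_accuracy

print(f"Speedup: {speedup:.1f}x")                                      # Speedup: 5.7x
print(f"Accuracy gain: {accuracy_gain_points:.1f} percentage points")  # Accuracy gain: 14.7 percentage points
```

The accuracy difference is an absolute gap in percentage points; expressed relative to the baseline it would be 14.7 / 85.3, roughly 17%.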
Intended Use
Primary Use Case
Detecting whether a given document page is a Table of Contents page. This is useful for:
- Document structure analysis
- Automatic TOC extraction
- Document navigation systems
- Book/paper digitization pipelines
How to Use
import torch
from PIL import Image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor

# Load model and processor
model = LayoutLMv3ForSequenceClassification.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
processor = LayoutLMv3Processor.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
model.eval()

# Load the page image and run OCR on it
image = Image.open("page.png").convert("RGB")
ocr_model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("page.png")
result = ocr_model(doc)

# Extract words and boxes (doctr geometry is relative, in the range 0-1)
words, boxes = [], []
doc_dict = result.export()
w, h = image.size
for page in doc_dict['pages']:
    for block in page['blocks']:
        for line in block['lines']:
            for word_data in line['words']:
                text = word_data['value'].strip()
                if text:
                    geometry = word_data['geometry']
                    # Relative coordinates -> pixels
                    x0 = int(geometry[0][0] * w)
                    y0 = int(geometry[0][1] * h)
                    x1 = int(geometry[1][0] * w)
                    y1 = int(geometry[1][1] * h)
                    words.append(text)
                    # Pixels -> the 0-1000 scale LayoutLMv3 expects
                    boxes.append([
                        int((x0 / w) * 1000),
                        int((y0 / h) * 1000),
                        int((x1 / w) * 1000),
                        int((y1 / h) * 1000)
                    ])

# Prepare input (passing words and boxes requires a processor with apply_ocr=False)
encoding = processor(image, words, boxes=boxes, return_tensors="pt",
                     padding="max_length", truncation=True, max_length=512)

# Predict (label 1 = TOC page)
with torch.no_grad():
    outputs = model(**encoding)
prediction = torch.argmax(outputs.logits, dim=1).item()
confidence = torch.softmax(outputs.logits, dim=1)[0][prediction].item()
print(f"Is TOC: {prediction == 1}")
print(f"Confidence: {confidence:.2%}")
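The box conversion in the snippet above can also be factored into a small helper. The sketch below is one possible version, not part of the published code: the name `normalize_box` is hypothetical, it assumes doctr's straight-box geometry of the form `((x0, y0), (x1, y1))` with relative coordinates, and it adds clamping to the 0-1000 range as a safety measure. Because it skips the intermediate pixel step, its output may differ from the snippet above by one unit due to integer truncation.

```python
def normalize_box(geometry, scale=1000):
    """Convert a relative ((x0, y0), (x1, y1)) box to integer [x0, y0, x1, y1].

    Coordinates are clamped to [0, scale] so slightly out-of-range OCR
    output cannot produce a box outside the expected range.
    """
    (x0, y0), (x1, y1) = geometry

    def to_int(value):
        return max(0, min(scale, int(value * scale)))

    return [to_int(x0), to_int(y0), to_int(x1), to_int(y1)]


# A word covering the middle of the page
print(normalize_box(((0.25, 0.5), (0.75, 1.0))))
```

Inside the OCR loop, `boxes.append(normalize_box(word_data['geometry']))` would then replace the eight-line pixel computation.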
Full Integration Example
For a complete document reflow system using this model, see: https://github.com/ssppkenny/segmentation
Limitations
- Training data size: Only 34 examples - may not generalize to all TOC styles
- Language: Primarily trained on English documents
- Page quality: Best results with clear, high-quality scans
- False positives: May misclassify pages with numbered lists as TOC
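Where false positives on numbered lists are a concern, one possible mitigation is to require a higher TOC probability than the default argmax decision. The sketch below is an assumption for illustration, not a validated part of this model: `is_toc` and the 0.9 threshold are hypothetical, and the logits are expected in the order `[non-TOC, TOC]`, matching the `prediction == 1` check in the usage example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_toc(logits, threshold=0.9):
    """Flag a page as TOC only if P(class 1) reaches `threshold`.

    The 0.9 default is a hypothetical value, not one tuned on this model.
    Returns (decision, TOC probability).
    """
    p_toc = softmax(logits)[1]
    return p_toc >= threshold, p_toc


# Illustrative logits, not real model output
decision, p = is_toc([0.0, 1.0])
print(f"TOC: {decision} (p = {p:.3f})")
```

A stricter threshold trades missed TOC pages for fewer false alarms, so any value would need to be checked against held-out pages before use.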
Bias and Fairness
The model was trained on a diverse set of document types (academic papers, books, technical documents) but may have biases toward:
- Western document formatting conventions
- English language documents
- Modern typography
Citation
If you use this model, please cite:
@misc{layoutlmv3-toc-detector,
  author = {Sergey},
  title = {LayoutLMv3 Table of Contents Detector},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ssppkenny/layoutlmv3-toc-detector}},
}
License
MIT License - Free for commercial and non-commercial use
Acknowledgments
- Base model: Microsoft LayoutLMv3
- OCR: mindee/doctr
- Training framework: HuggingFace Transformers
Contact
For issues or questions:
Evaluation results
- Accuracy (self-reported): 0.882