Upload fine-tuned LayoutLMv3 TOC detector (88.2% accuracy)


Files changed (6)

README.md +193 -0
config.json +40 -0
model.safetensors +3 -0
processor_config.json +28 -0
tokenizer.json +0 -0
tokenizer_config.json +37 -0

README.md ADDED

	@@ -0,0 +1,193 @@

+---
+language: en
+license: mit
+tags:
+- document-ai
+- table-of-contents
+- layoutlmv3
+- document-classification
+datasets:
+- custom
+metrics:
+- accuracy
+model-index:
+- name: layoutlmv3-toc-detector
+  results:
+  - task:
+      type: document-classification
+      name: Table of Contents Detection
+    metrics:
+    - type: accuracy
+      value: 0.882
+      name: Accuracy
+---
+# LayoutLMv3 Table of Contents Detector
+This model is a fine-tuned version of [microsoft/layoutlmv3-base](https://huggingface.co/microsoft/layoutlmv3-base) for detecting Table of Contents (TOC) pages in documents.
+## Model Description
+- **Model type**: LayoutLMv3 for binary sequence classification
+- **Language**: English (primary training language; other languages may work, see Limitations)
+- **Task**: Binary classification (TOC vs non-TOC page)
+- **Base model**: microsoft/layoutlmv3-base
+## Training Data
+The model was fine-tuned on a custom dataset of 34 document pages:
+- **TOC pages**: 17 examples
+- **Non-TOC pages**: 17 examples
+- **Sources**: Various books and academic documents
+The dataset includes:
+- Traditional TOC with page numbers (right-aligned)
+- Hierarchical TOC with chapter numbers (1, 1.1, 1.1.1)
+- Various formatting styles
+## Training Procedure
+### Training Hyperparameters
+- **Epochs**: 10
+- **Batch size**: 1 (gradient accumulation over 4 steps, effective batch size 4)
+- **Learning rate**: 2e-5 with linear warmup
+- **Optimizer**: AdamW
+- **Device**: NVIDIA GeForce RTX 3050 4GB
+- **Training time**: ~10-15 minutes
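As a rough illustration of the hyperparameters listed above: a batch size of 1 with 4 accumulation steps means each optimizer update sees 4 pages, and a linear-warmup schedule ramps the learning rate up to 2e-5. The warmup length, the total step count, and the linear decay after warmup are assumptions (the README does not state them), and `linear_warmup_lr` is a hypothetical helper, not the author's training code.

```python
BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 4
BASE_LR = 2e-5

# Number of examples contributing to each optimizer step
effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS


def linear_warmup_lr(step, warmup_steps, total_steps, base_lr=BASE_LR):
    """Linear ramp from 0 to base_lr, then linear decay to 0 (the decay is an assumption)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical schedule: 10 warmup steps out of 100 optimizer steps
for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_lr(step, warmup_steps=10, total_steps=100))
```

With these illustrative numbers the rate peaks at step 10 and reaches zero at step 100.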
+### Training Results
+| Epoch | Train Loss | Val Loss | Val Accuracy |
+|-------|------------|----------|--------------|
+| 1     | 0.6893     | 0.6521   | 52.9%        |
+| 5     | 0.2145     | 0.3124   | 82.4%        |
+| 10    | 0.0892     | 0.2876   | **88.2%**    |
+**Final Test Metrics**:
+- **Overall Accuracy**: 88.2% (30/34 correct)
+- **TOC Detection**: 82.4% (14/17 correct)
+- **Non-TOC Detection**: 94.1% (16/17 correct)
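The reported percentages can be reproduced from the per-class counts above. The sketch below treats TOC as the positive class; precision and F1 are derived from those counts and are not figures reported in the README.

```python
# Counts taken from the README: 14/17 TOC pages and 16/17 non-TOC pages correct
tp, fn = 14, 3   # TOC pages: detected / missed
tn, fp = 16, 1   # non-TOC pages: correctly rejected / flagged as TOC

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # "TOC Detection"
specificity = tn / (tn + fp)     # "Non-TOC Detection"
precision = tp / (tp + fp)       # derived, not reported in the README
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.1%} recall={recall:.1%} "
      f"specificity={specificity:.1%} precision={precision:.1%} f1={f1:.3f}")
```

With only 17 pages per class, a single page moves each per-class figure by about 5.9 percentage points.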
+### Comparison with Baseline
+| Method | Accuracy | Speed |
+|--------|----------|-------|
+| Rule-based (original) | 85.3% | 17.7s |
+| **LayoutLMv3 (this model)** | **88.2%** | **3.1s** |
+This model is about **5.7x faster** (17.7s vs 3.1s) and **2.9 percentage points more accurate** than the rule-based approach.
+## Intended Use
+### Primary Use Case
+Detecting whether a given document page is a Table of Contents page. This is useful for:
+- Document structure analysis
+- Automatic TOC extraction
+- Document navigation systems
+- Book/paper digitization pipelines
+### How to Use
+```python
+import torch
+from transformers import LayoutLMv3Processor, LayoutLMv3ForSequenceClassification
+from PIL import Image
+from doctr.models import ocr_predictor
+from doctr.io import DocumentFile
+# Load model and processor
+model = LayoutLMv3ForSequenceClassification.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
+processor = LayoutLMv3Processor.from_pretrained("ssppkenny/layoutlmv3-toc-detector")
+model.eval()  # inference mode (disables dropout)
+# Load and OCR image
+image = Image.open("page.png").convert("RGB")
+ocr_model = ocr_predictor(pretrained=True)
+doc = DocumentFile.from_images("page.png")
+result = ocr_model(doc)
+# Extract words and boxes
+words, boxes = [], []
+doc_dict = result.export()
+for page in doc_dict['pages']:
+    for block in page['blocks']:
+        for line in block['lines']:
+            for word_data in line['words']:
+                text = word_data['value'].strip()
+                if text:
+                    # docTR geometry is relative ((xmin, ymin), (xmax, ymax)) in [0, 1];
+                    # LayoutLMv3 expects integer boxes on a 0-1000 scale
+                    geometry = word_data['geometry']
+                    x0 = int(geometry[0][0] * 1000)
+                    y0 = int(geometry[0][1] * 1000)
+                    x1 = int(geometry[1][0] * 1000)
+                    y1 = int(geometry[1][1] * 1000)
+                    words.append(text)
+                    boxes.append([x0, y0, x1, y1])
+# Prepare input (apply_ocr is false in the processor config, so words and boxes are passed explicitly)
+encoding = processor(image, words, boxes=boxes, return_tensors="pt",
+                     padding="max_length", truncation=True, max_length=512)
+# Predict without tracking gradients
+with torch.no_grad():
+    outputs = model(**encoding)
+prediction = torch.argmax(outputs.logits, dim=1).item()
+confidence = torch.softmax(outputs.logits, dim=1)[0][prediction].item()
+# Label 1 is treated as the TOC class
+print(f"Is TOC: {prediction == 1}")
+print(f"Confidence: {confidence:.2%}")
+```
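The box conversion in the snippet above can be isolated into a small helper. This is a minimal sketch, assuming straight (non-rotated) boxes given as relative `((xmin, ymin), (xmax, ymax))` pairs; `to_layoutlm_box` is a hypothetical name, and the clamping and ordering guards are additions that the README code does not perform.

```python
def to_layoutlm_box(geometry):
    """Map a relative ((xmin, ymin), (xmax, ymax)) box to integer 0-1000 coordinates."""
    (xmin, ymin), (xmax, ymax) = geometry

    def scale(value):
        # Clamp first so slightly out-of-range OCR coordinates stay valid
        return int(min(1.0, max(0.0, value)) * 1000)

    x0, y0, x1, y1 = scale(xmin), scale(ymin), scale(xmax), scale(ymax)
    # Guard against inverted boxes
    return [min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)]


print(to_layoutlm_box(((0.125, 0.25), (0.5, 0.75))))   # [125, 250, 500, 750]
print(to_layoutlm_box(((-0.01, 0.0), (1.02, 1.0))))    # [0, 0, 1000, 1000]
```

Keeping every coordinate inside 0-1000 matters because the model's 2D position embeddings are indexed by these values.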
+### Full Integration Example
+For a complete document reflow system using this model, see:
+https://github.com/ssppkenny/segmentation
+## Limitations
+- **Training data size**: Only 34 examples - may not generalize to all TOC styles
+- **Language**: Primarily trained on English documents
+- **Page quality**: Best results with clear, high-quality scans
+- **False positives**: May misclassify pages with numbered lists as TOC
+## Bias and Fairness
+The model was trained on a diverse set of document types (academic papers, books, technical documents) but may have biases toward:
+- Western document formatting conventions
+- English language documents
+- Modern typography
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{layoutlmv3-toc-detector,
+  author = {Sergey},
+  title = {LayoutLMv3 Table of Contents Detector},
+  year = {2026},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/ssppkenny/layoutlmv3-toc-detector}},
+}
+```
+## License
+MIT License - Free for commercial and non-commercial use
+## Acknowledgments
+- Base model: [Microsoft LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base)
+- OCR: [mindee/doctr](https://github.com/mindee/doctr)
+- Training framework: HuggingFace Transformers
+## Contact
+For issues or questions:
+- GitHub: https://github.com/ssppkenny/segmentation
+- Model: https://huggingface.co/ssppkenny/layoutlmv3-toc-detector

config.json ADDED

	@@ -0,0 +1,40 @@

+{
+  "architectures": [
+    "LayoutLMv3ForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "coordinate_size": 128,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "has_relative_attention_bias": true,
+  "has_spatial_attention_bias": true,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "input_size": 224,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_2d_position_embeddings": 1024,
+  "max_position_embeddings": 514,
+  "max_rel_2d_pos": 256,
+  "max_rel_pos": 128,
+  "model_type": "layoutlmv3",
+  "num_attention_heads": 12,
+  "num_channels": 3,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "patch_size": 16,
+  "problem_type": "single_label_classification",
+  "rel_2d_pos_bins": 64,
+  "rel_pos_bins": 32,
+  "second_input_size": 112,
+  "shape_size": 128,
+  "text_embed": true,
+  "transformers_version": "5.2.0",
+  "type_vocab_size": 1,
+  "visual_embed": true,
+  "vocab_size": 50265
+}
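A few quantities follow directly from the values in this config. The sketch below uses a hand-copied subset of the fields; the position-embedding arithmetic assumes RoBERTa-style position ids that start at `pad_token_id + 1`, which is an assumption about the implementation rather than something stated in the diff.

```python
import json

# Subset of config.json, copied by hand from the file contents shown above
config = json.loads("""
{
  "hidden_size": 768,
  "num_attention_heads": 12,
  "input_size": 224,
  "patch_size": 16,
  "max_position_embeddings": 514,
  "pad_token_id": 1
}
""")

# Width of each attention head
head_dim = config["hidden_size"] // config["num_attention_heads"]
# Number of image patches for a square input
patches_per_side = config["input_size"] // config["patch_size"]
num_patches = patches_per_side ** 2
# Assumption: position ids start at pad_token_id + 1 (RoBERTa-style offset)
max_text_tokens = config["max_position_embeddings"] - config["pad_token_id"] - 1

print(head_dim, num_patches, max_text_tokens)  # 64 196 512
```

Under that assumption the usable text length comes out at 512, which matches `model_max_length` in tokenizer_config.json.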

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1216a370d0ae81f060bdc52c4483893d4271f186934160e97f85706d37f13157
+size 503702720

processor_config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "image_processor": {
+    "apply_ocr": false,
+    "data_format": "channels_first",
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_processor_type": "LayoutLMv3ImageProcessorFast",
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "resample": 2,
+    "rescale_factor": 0.00392156862745098,
+    "size": {
+      "height": 224,
+      "width": 224
+    },
+    "tesseract_config": ""
+  },
+  "processor_class": "LayoutLMv3Processor"
+}
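The image settings above amount to a per-channel affine map: rescale 8-bit values by 1/255, then subtract the mean and divide by the standard deviation. This is a minimal sketch of that arithmetic for a single channel value, assuming rescaling is applied before normalization; `normalize_pixel` is a hypothetical helper, not the library's implementation.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1/255, from "rescale_factor"
IMAGE_MEAN = 0.5                      # same for all three channels
IMAGE_STD = 0.5


def normalize_pixel(value):
    """Rescale an 8-bit channel value to [0, 1], then standardize with mean/std."""
    return (value * RESCALE_FACTOR - IMAGE_MEAN) / IMAGE_STD


for value in (0, 128, 255):
    print(value, round(normalize_pixel(value), 4))
```

With a mean and standard deviation of 0.5, the model's pixel inputs end up in roughly the range [-1, 1].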

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "add_prefix_space": true,
+  "apply_ocr": false,
+  "backend": "tokenizers",
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "cls_token_box": [
+    0,
+    0,
+    0,
+    0
+  ],
+  "eos_token": "</s>",
+  "errors": "replace",
+  "is_local": false,
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "only_label_first_subword": true,
+  "pad_token": "<pad>",
+  "pad_token_box": [
+    0,
+    0,
+    0,
+    0
+  ],
+  "pad_token_label": -100,
+  "processor_class": "LayoutLMv3Processor",
+  "sep_token": "</s>",
+  "sep_token_box": [
+    0,
+    0,
+    0,
+    0
+  ],
+  "tokenizer_class": "LayoutLMv3Tokenizer",
+  "unk_token": "<unk>"
+}
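The `cls_token_box`, `sep_token_box`, and `pad_token_box` entries above describe which bounding box each special token receives when word boxes are expanded to token boxes. The sketch below illustrates the idea with a toy fixed-size splitter standing in for the real BPE tokenizer; `toy_subwords` and `align_boxes` are hypothetical names, and the truncation behaviour is simplified.

```python
CLS_BOX = [0, 0, 0, 0]
SEP_BOX = [0, 0, 0, 0]
PAD_BOX = [0, 0, 0, 0]


def toy_subwords(word, size=4):
    """Hypothetical stand-in for the real BPE tokenizer: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def align_boxes(words, boxes, max_length):
    """Expand word-level boxes to token-level boxes with special-token boxes."""
    tokens, token_boxes = ["<s>"], [CLS_BOX]
    for word, box in zip(words, boxes):
        for piece in toy_subwords(word):
            tokens.append(piece)
            token_boxes.append(box)  # every subword inherits its word's box
    # Truncate, leaving room for the closing </s>
    tokens, token_boxes = tokens[:max_length - 1], token_boxes[:max_length - 1]
    tokens.append("</s>")
    token_boxes.append(SEP_BOX)
    pad = max_length - len(tokens)
    return tokens + ["<pad>"] * pad, token_boxes + [PAD_BOX] * pad


tokens, token_boxes = align_boxes(
    ["Contents", "1"], [[100, 50, 300, 80], [900, 50, 920, 80]], max_length=8)
print(tokens)
print(token_boxes)
```

In practice `LayoutLMv3Processor` performs this alignment itself when called with `words` and `boxes`.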