Commit afe3abb (parent 1b5a04c): Update README

README.md CHANGED
# NabuOCR

NabuOCR is an OCR model for transcribing ancient cuneiform tablets directly from images to Unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

## Overview

NabuOCR processes images of cuneiform tablets and outputs Unicode transcriptions of cuneiform signs. While Assyriologists typically use [ATF (ASCII Transliteration Format)](http://oracc.ub.uni-muenchen.de/doc/help/editinginatf/primer/index.html), ATF's complexity proved too challenging for the 0.9B model within training constraints. Unicode transcription is a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus.
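As a concrete illustration (not from the model card itself), cuneiform signs occupy dedicated Unicode blocks starting at U+12000, so a Unicode transcription is simply a string of such codepoints:

```python
import unicodedata

# Cuneiform signs live in U+12000-U+123FF (plus later extension blocks).
sign_a = "\U00012000"   # CUNEIFORM SIGN A
sign_an = "\U0001202D"  # CUNEIFORM SIGN AN

print(unicodedata.name(sign_a))   # CUNEIFORM SIGN A
print(unicodedata.name(sign_an))  # CUNEIFORM SIGN AN

# A transcription line is then an ordinary string of codepoints:
line = sign_an + sign_a
print(len(line))  # 2 signs, one codepoint each
```

This is what makes Unicode output tractable for a small model: no brackets, damage notation, or determinative markup, just a sequence of sign codepoints.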
|
| 36 |
+
|
| 37 |
+
Built by fine-tuning [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) on cuneiform tablet images, NabuOCR can handle multi-view images of tablets and produce transcriptions of each face using markers like `@obverse`, `@reverse`, `@left`, `@right`, `@top`, and `@bottom`.
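The face markers make it easy to split a transcription back into per-face sections downstream. A minimal sketch (the exact output framing around the markers is an assumption):

```python
import re

FACE_MARKERS = ("@obverse", "@reverse", "@left", "@right", "@top", "@bottom")

def split_faces(transcription: str) -> dict[str, str]:
    """Split a model transcription into {face_marker: text} sections."""
    pattern = "(" + "|".join(re.escape(m) for m in FACE_MARKERS) + ")"
    faces, current = {}, None
    for part in re.split(pattern, transcription):
        if part in FACE_MARKERS:
            current = part
            faces[current] = ""
        elif current is not None:
            faces[current] += part.strip()
    return faces

out = split_faces("@obverse \U00012000\U0001202D @reverse \U00012000")
print(sorted(out))  # ['@obverse', '@reverse']
```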

## Features

NabuOCR is based on the efficient 0.9B parameter [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) model with an **expanded tokenizer** that includes all unique cuneiform signs from the dataset plus special face markers. The model was trained on diverse tablet conditions from multiple periods.
It employs **end-to-end transcription** rather than a multi-stage pipeline, allowing it to leverage full tablet context when making predictions. It handles **multi-view images** containing obverse, reverse, and edge views all at once.

## Example Output

TODO: demo here


### Base Model

NabuOCR is built on [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) with an expanded tokenizer vocabulary to include cuneiform Unicode codepoints and special face markers (`@obverse`, `@reverse`, `@left`, `@right`, `@top`, `@bottom`).
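One plausible way to derive such a vocabulary expansion from the corpus is sketched below; the exact procedure used is not documented here, and the helper is illustrative. With a Hugging Face tokenizer, the resulting list would then go through `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
FACE_MARKERS = ["@obverse", "@reverse", "@left", "@right", "@top", "@bottom"]

def collect_new_tokens(corpus: list[str]) -> list[str]:
    """Gather every unique cuneiform codepoint seen in the corpus,
    plus the special face markers, as candidate tokenizer additions."""
    signs = set()
    for text in corpus:
        for ch in text:
            # Simplified check spanning the cuneiform blocks U+12000-U+1254F
            if 0x12000 <= ord(ch) <= 0x1254F:
                signs.add(ch)
    return FACE_MARKERS + sorted(signs)

tokens = collect_new_tokens(["@obverse \U00012000\U0001202D"])
print(len(tokens))  # 6 markers + 2 unique signs = 8
```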

### Dataset

The training data was built from the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/). Starting from 135,255 ATF transliterations, aggressive filtering removed damaged tablets, tablets outside the Sumerian/Akkadian scope, entries without images, and low-quality photos (black-and-white or with noisy backgrounds). The result was 33,257 high-quality examples, split into 32,257 training samples and 1,000 held-out test samples. The ATF was converted to Unicode for the final targets.
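The filtering pass can be pictured as a simple predicate over CDLI records. The field names below are hypothetical, not CDLI's actual schema:

```python
def keep_record(rec: dict) -> bool:
    """Keep a record only if it meets the filtering criteria above
    (field names are hypothetical)."""
    if rec.get("image_path") is None:               # must have an image
        return False
    if rec.get("language") not in ("Sumerian", "Akkadian"):
        return False
    if rec.get("damaged", False):                    # skip damaged tablets
        return False
    if not rec.get("color", True):                   # drop black-and-white photos
        return False
    return True

records = [
    {"image_path": "p1.jpg", "language": "Sumerian", "color": True},
    {"image_path": None, "language": "Sumerian"},
    {"image_path": "p3.jpg", "language": "Hittite", "color": True},
]
kept = [r for r in records if keep_record(r)]
print(len(kept))  # 1
```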

### SFT

The model was trained using [Unsloth](https://unsloth.ai/)'s FastVisionModel wrapper for full fine-tuning with gradient checkpointing:

- **Epochs:** 2 (~32,000 steps)
- **Batch size:** 2
- **Learning rate:** 2e-5 with linear decay
- **Warmup:** 5% of training steps
- **Optimizer:** AdamW (8-bit)
- **Precision:** BF16
- **Max sequence length:** 16,000 tokens

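The hyperparameters listed above map onto a trainer configuration roughly like the following. This is a plain-Python sketch mirroring the stated values, not the actual training script (which uses Unsloth/TRL objects):

```python
# Sketch of the SFT configuration; values come from the list above,
# the dict shape itself is illustrative.
sft_config = {
    "num_train_epochs": 2,             # ~32,000 steps total
    "per_device_train_batch_size": 2,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.05,              # 5% of training steps
    "optim": "adamw_8bit",
    "bf16": True,
    "max_seq_length": 16_000,
}

total_steps = 32_000
warmup_steps = int(total_steps * sft_config["warmup_ratio"])
print(warmup_steps)  # 1600
```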


### GRPO

Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint using DR-GRPO loss. Unlike SFT which learns from ground truth, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs.
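The group-relative part of the update can be sketched as follows: each sampled completion's reward is baselined against its group's mean, and DR-GRPO drops the per-group standard-deviation normalization used in the original GRPO formulation:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each completion relative to its group's mean reward.
    DR-GRPO style: no division by the group's std deviation."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# One group of 5 sampled completions for a single image:
adv = [round(a, 2) for a in group_relative_advantages([0.9, 0.5, 0.7, 0.3, 0.6])]
print(adv)  # [0.3, -0.1, 0.1, -0.3, 0.0]
```

Completions that beat their group's average get positive advantage and are reinforced; below-average ones are pushed down.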

- **LoRA rank:** 256 (RSLoRA with α=32)
- **Trainable parameters:** 239M of 1.2B (20%)
- **Generations per prompt:** 5
- **Batch size:** 10 × 3 gradient accumulation = 30 effective
- **Learning rate:** 2e-6 with cosine decay
- **Warmup:** 3% of training steps
- **Optimizer:** AdamW (8-bit)

The reward function combined four components: face marker accuracy, cuneiform character ratio, length penalty, and a blended prefix/positional accuracy metric. The adapter was merged back into the base model at 16-bit precision.
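A hedged sketch of how those four components might be combined. The weights and exact metrics here are illustrative assumptions; the real implementation lives in the training code:

```python
def cuneiform_ratio(text: str) -> float:
    """Fraction of non-space characters that are cuneiform codepoints."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(0x12000 <= ord(c) <= 0x1254F for c in chars) / len(chars)

def length_penalty(pred: str, ref: str) -> float:
    """1.0 when lengths match, decaying toward 0 as they diverge."""
    if max(len(pred), len(ref)) == 0:
        return 1.0
    return min(len(pred), len(ref)) / max(len(pred), len(ref))

def reward(pred: str, ref: str, marker_acc: float, prefix_acc: float) -> float:
    """Equal-weight blend of the four components (weighting is an assumption)."""
    parts = [marker_acc, cuneiform_ratio(pred), length_penalty(pred, ref), prefix_acc]
    return sum(parts) / len(parts)

r = reward("\U00012000\U0001202D", "\U00012000\U0001202D",
           marker_acc=1.0, prefix_acc=1.0)
print(r)  # 1.0
```

The cuneiform-ratio and length terms discourage degenerate outputs (Latin text, runaway repetition) that a pure accuracy metric would score ambiguously.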


### Story

For a more detailed account of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md). For the code used in training, see [training/](https://huggingface.co/boatbomber/NabuOCR/blob/main/training).

## Performance