boatbomber committed afe3abb (1 parent: 1b5a04c)

Update README

Files changed (1): README.md (+29 −11)
@@ -28,17 +28,19 @@ NabuOCR: Neural Cuneiform Transliteration

 # NabuOCR

-NabuOCR is a specialized OCR model for transliterating ancient cuneiform tablets directly from images to unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

 ## Overview

-NabuOCR processes images of cuneiform tablets and automatically generates scholarly unicode transliterations inspired by [ASCII Transliteration Format](http://oracc.ub.uni-muenchen.de/doc/help/editinginatf/primer/index.html), the standard used by assyriologists worldwide. Built by fine-tuning [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) on cuneiform tablet images, it can handle multi-view images of tablets and produce transliterations of each face.

 ## Features

-NabuOCR is based on the efficient 0.9B parameter [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) model and trained on diverse tablet conditions from multiple periods.

-It employs **multi-view processing** that handles obverse, reverse, and edge views of tablets all in one image. It generates unicode transcriptions formatted similarly to [other digital cuneiform projects](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.151).

 ## Example Output

@@ -52,29 +54,45 @@ TODO: demo here

 ### Base Model

-NabuOCR is built on [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) with an expanded tokenizer vocabulary to include the cuneiform unicode codepoints as tokens.

 ### Dataset

-The training data consists of 32.2K cuneiform tablet images and transliterations, and the test data consists of 1K cuneiform tablet images and transliterations, all from the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/) (CDLI).
-
-The images are in color with dimensions between 100px and 2048px, inclusive.

 ### SFT

-For SFT pre-training, the model was trained using full parameter fine-tuning for 2 epochs with a batch size of 2.

 ![sft-loss](./assets/sft-loss.png)

 ### GRPO

-For GRPO post-training, the model was trained using Rank Stabilized LoRA (r=256) for 1 epoch with 5 completions per prompt and a batch size of 30, then the adapter was merged back into the base at 16 bit precision.

 ![grpo-reward](./assets/grpo-reward.png)

 ### Story

-For the more detailed story of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md). To read the code used for training with the specific hyperparameters and reward functions, see [training/](https://huggingface.co/boatbomber/NabuOCR/blob/main/training).

 ## Performance
 
 # NabuOCR

+NabuOCR is an OCR model for transcribing ancient cuneiform tablets directly from images to Unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

 ## Overview

+NabuOCR processes images of cuneiform tablets and outputs Unicode transcriptions of cuneiform signs. While Assyriologists typically use [ATF (ASCII Transliteration Format)](http://oracc.ub.uni-muenchen.de/doc/help/editinginatf/primer/index.html), ATF's complexity proved too challenging for the 0.9B model within training constraints. Unicode transcription is a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus.
+
+Built by fine-tuning [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) on cuneiform tablet images, NabuOCR can handle multi-view images of tablets and produce transcriptions of each face using markers like `@obverse`, `@reverse`, `@left`, `@right`, `@top`, and `@bottom`.
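
A minimal sketch of how a downstream consumer might split such a multi-view transcription into per-face lines. The marker names come from the README; the parsing code itself is illustrative, not taken from the repo:

```python
FACE_MARKERS = {"@obverse", "@reverse", "@left", "@right", "@top", "@bottom"}

def split_faces(transcription: str) -> dict:
    """Group transcription lines under the face marker that precedes them."""
    faces: dict = {}
    current = None
    for line in transcription.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in FACE_MARKERS:
            current = line
            faces.setdefault(current, [])
        elif current is not None:
            # Sign lines before any marker are dropped in this sketch.
            faces[current].append(line)
    return faces
```

For example, `split_faces("@obverse\n𒀀\n@reverse\n𒀀𒀀")` returns `{"@obverse": ["𒀀"], "@reverse": ["𒀀𒀀"]}`.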

 ## Features

+NabuOCR is based on the efficient 0.9B parameter [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) model with an **expanded tokenizer** that includes all unique cuneiform signs from the dataset plus special face markers. The model was trained on diverse tablet conditions from multiple periods.

+It employs **end-to-end transcription** rather than a multi-stage pipeline, allowing it to leverage full tablet context when making predictions. It handles **multi-view images** containing obverse, reverse, and edge views all at once.

 ## Example Output


 ### Base Model

+NabuOCR is built on [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) with an expanded tokenizer vocabulary to include cuneiform Unicode codepoints and special face markers (`@obverse`, `@reverse`, `@left`, `@right`, `@top`, `@bottom`).
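
The vocabulary expansion can be sketched as follows. The Unicode ranges are the standard cuneiform blocks, but the collection code is illustrative and may differ from the repo's implementation; with Hugging Face tokenizers, the resulting list would typically be registered via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`:

```python
# Standard Unicode blocks: Cuneiform, Cuneiform Numbers and
# Punctuation, and Early Dynastic Cuneiform.
CUNEIFORM_RANGES = [(0x12000, 0x123FF), (0x12400, 0x1247F), (0x12480, 0x1254F)]

FACE_MARKERS = ["@obverse", "@reverse", "@left", "@right", "@top", "@bottom"]

def is_cuneiform(ch: str) -> bool:
    """True if the character falls in one of the cuneiform Unicode blocks."""
    return any(lo <= ord(ch) <= hi for lo, hi in CUNEIFORM_RANGES)

def new_vocab_tokens(target_texts) -> list:
    """Unique cuneiform signs seen in the training targets, plus face markers."""
    signs = {ch for text in target_texts for ch in text if is_cuneiform(ch)}
    return sorted(signs) + FACE_MARKERS
```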

 ### Dataset

+The training data was built from the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/). Starting from 135,255 ATF transliterations, aggressive filtering removed damaged tablets, those outside Sumerian/Akkadian scope, entries without images, low-quality black-and-white photos, and images with noisy backgrounds. The result was 33,257 high-quality examples split into 32,257 training samples and 1,000 held-out test samples. ATF was converted to Unicode for the final targets.
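
The filtering pass described above could look roughly like this predicate; the metadata field names here are hypothetical, chosen only to mirror the stated criteria:

```python
def keep_example(meta: dict) -> bool:
    """Apply the README's filtering criteria (field names are hypothetical)."""
    if meta.get("language") not in {"Sumerian", "Akkadian"}:
        return False  # outside language scope
    if not meta.get("image_path"):
        return False  # no photograph available
    if meta.get("damaged") or meta.get("grayscale") or meta.get("noisy_background"):
        return False  # damaged tablet or low-quality photo
    return True
```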
 
 

 ### SFT

+The model was trained using [Unsloth](https://unsloth.ai/)'s FastVisionModel wrapper for full fine-tuning with gradient checkpointing:
+
+- **Epochs:** 2 (~32,000 steps)
+- **Batch size:** 2
+- **Learning rate:** 2e-5 with linear decay
+- **Warmup:** 5% of training steps
+- **Optimizer:** AdamW (8-bit)
+- **Precision:** BF16
+- **Max sequence length:** 16,000 tokens
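
The schedule above (5% warmup followed by linear decay) has the standard warmup-then-linear-decay shape; a sketch under those stated hyperparameters:

```python
def lr_at_step(step: int, total_steps: int = 32000,
               peak_lr: float = 2e-5, warmup_frac: float = 0.05) -> float:
    """Linear warmup over the first 5% of steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```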

 ![sft-loss](./assets/sft-loss.png)

 ### GRPO

+Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint using the DR-GRPO loss. Unlike SFT, which learns from ground truth, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs.
+
+- **LoRA rank:** 256 (RSLoRA with α=32)
+- **Trainable parameters:** 239M of 1.2B (20%)
+- **Generations per prompt:** 5
+- **Batch size:** 10 × 3 gradient accumulation = 30 effective
+- **Learning rate:** 2e-6 with cosine decay
+- **Warmup:** 3% of training steps
+- **Optimizer:** AdamW (8-bit)
+
+The reward function combined four components: face marker accuracy, cuneiform character ratio, length penalty, and a blended prefix/positional accuracy metric. The adapter was merged back into the base model at 16-bit precision.
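
The heart of GRPO's update is scoring each completion relative to the rest of its group. A simplified sketch, not the training code: the DR-GRPO variant drops the per-group standard-deviation normalization used in vanilla GRPO, leaving plain mean-centering:

```python
def group_relative_advantages(rewards: list) -> list:
    """Mean-centered advantages for one group of completions (5 per prompt here).

    Vanilla GRPO also divides by the group's reward std; the DR-GRPO variant
    omits that normalization, so this sketch only subtracts the group mean.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Completions scoring above their group's mean get positive advantages and are reinforced; below-mean completions are discouraged.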

 ![grpo-reward](./assets/grpo-reward.png)
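
The four reward components named above might look something like this; the exact formulas and weights live in training/, so treat these definitions, and the equal weighting, as assumptions:

```python
CUNEIFORM_RANGES = [(0x12000, 0x123FF), (0x12400, 0x1247F), (0x12480, 0x1254F)]
FACE_MARKERS = {"@obverse", "@reverse", "@left", "@right", "@top", "@bottom"}

def face_marker_accuracy(pred: str, ref: str) -> float:
    """Overlap between face markers mentioned in prediction and reference."""
    p = {m for m in FACE_MARKERS if m in pred}
    r = {m for m in FACE_MARKERS if m in ref}
    return 1.0 if not (p | r) else len(p & r) / len(p | r)

def cuneiform_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are cuneiform codepoints."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in CUNEIFORM_RANGES) for c in chars)
    return hits / len(chars)

def length_penalty(pred: str, ref: str) -> float:
    """1 when lengths match, falling linearly toward 0 as they diverge."""
    longer = max(len(pred), len(ref), 1)
    return 1.0 - abs(len(pred) - len(ref)) / longer

def position_accuracy(pred: str, ref: str) -> float:
    """Blend of exact positional matches and common-prefix length."""
    longer = max(len(pred), len(ref), 1)
    positional = sum(a == b for a, b in zip(pred, ref)) / longer
    prefix = 0
    for a, b in zip(pred, ref):
        if a != b:
            break
        prefix += 1
    return 0.5 * positional + 0.5 * prefix / longer

def reward(pred: str, ref: str) -> float:
    """Combine the four components; equal weighting is an assumption."""
    components = [
        face_marker_accuracy(pred, ref),
        cuneiform_ratio(pred),
        length_penalty(pred, ref),
        position_accuracy(pred, ref),
    ]
    return sum(components) / len(components)
```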
 
93
  ### Story
94
 
95
+ For the more detailed story of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md). To read the code used for training, see [training/](https://huggingface.co/boatbomber/NabuOCR/blob/main/training).
96
 
97
  ## Performance
98