---
license: apache-2.0
pipeline_tag: image-text-to-text
base_model:
- PaddlePaddle/PaddleOCR-VL
base_model_relation: finetune
tags:
- cuneiform
- transliteration
- ocr
- image-to-text
- Sumerian
- Akkadian
- PaddleOCR
- PaddlePaddle
---

# NabuOCR: Neural Cuneiform Transliteration

NabuOCR is an OCR model for transcribing ancient cuneiform tablets directly from images to Unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

NabuOCR was made for the ERNIE AI Developer Challenge; you can watch the submission video here: https://www.youtube.com/embed/hqmjepRLdfU?si=aJHpWdc12ThgWIxD

## Overview

NabuOCR processes images of cuneiform tablets and outputs Unicode transcriptions of the cuneiform signs. While Assyriologists typically use [ATF (ASCII Transliteration Format)](http://oracc.ub.uni-muenchen.de/doc/help/editinginatf/primer/index.html), ATF's complexity proved too challenging for the 0.9B model within the training constraints. Unicode transcription is still a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus.

Built by fine-tuning [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) on cuneiform tablet images, NabuOCR can handle multi-view images of tablets and produce a transcription of each face using markers like `@obverse`, `@reverse`, `@left`, `@right`, `@top`, and `@bottom`.

## Features

- **Efficient base model:** built on the 0.9B-parameter [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL).
- **Expanded tokenizer:** includes all unique cuneiform signs from the dataset plus special face markers.
- **Diverse training data:** tablets in varied conditions from multiple historical periods.
- **End-to-end transcription:** a single model rather than a multi-stage pipeline, allowing it to leverage full tablet context when making predictions.
- **Multi-view support:** handles images containing obverse, reverse, and edge views all at once.
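Because each face is introduced by a marker, the model's output is easy to post-process into per-face transcriptions. A minimal sketch (the marker names come from this card; the `split_faces` helper is illustrative, not part of the released code):

```python
import re

# Face markers emitted by the model, as listed above.
FACE_MARKERS = ("@obverse", "@reverse", "@left", "@right", "@top", "@bottom")

def split_faces(transcription: str) -> dict[str, str]:
    """Split a NabuOCR transcription into per-face text, keyed by face name."""
    pattern = re.compile(r"@(obverse|reverse|left|right|top|bottom)")
    matches = list(pattern.finditer(transcription))
    faces: dict[str, str] = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcription)
        faces[m.group(1)] = transcription[start:end].strip()
    return faces

output = "@obverse \N{CUNEIFORM SIGN A} \N{CUNEIFORM SIGN AN}\n@reverse \N{CUNEIFORM SIGN BA}"
print(split_faces(output))
```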
## Example Output

![result-demo-1](./assets/result-demo-1.png)
![result-demo-2](./assets/result-demo-2.png)

## Training

### Base Model

NabuOCR is built on [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) with an expanded tokenizer vocabulary that includes cuneiform Unicode codepoints and special face markers (`@obverse`, `@reverse`, `@left`, `@right`, `@top`, `@bottom`).

### Dataset

The training data was built from the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/). Starting from 135,255 ATF transliterations, aggressive filtering removed damaged tablets, tablets outside the Sumerian/Akkadian scope, entries without images, and low-quality photos (black-and-white or with noisy backgrounds). The result was 33,257 high-quality examples, split into 32,257 training samples and 1,000 held-out test samples. The ATF was converted to Unicode to produce the final targets.

### SFT

The model was trained using [Unsloth](https://unsloth.ai/)'s FastVisionModel wrapper for full fine-tuning with gradient checkpointing:

- **Epochs:** 2 (~32,000 steps)
- **Batch size:** 2
- **Learning rate:** 2e-5 with linear decay
- **Warmup:** 5% of training steps
- **Optimizer:** AdamW (8-bit)
- **Precision:** BF16
- **Max sequence length:** 16,000 tokens

![sft-loss](./assets/sft-loss.png)

### GRPO

Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint using the DR-GRPO loss. Unlike SFT, which learns from ground truth, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor the higher-scoring outputs.
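The group-relative part of GRPO can be sketched as follows. This is a toy illustration, not the project's training code, and the reward values are made up; standard GRPO normalizes each completion's reward by the group mean and standard deviation, while the DR-GRPO variant drops the standard-deviation division to reduce bias:

```python
def group_relative_advantages(rewards: list[float], dr_grpo: bool = True) -> list[float]:
    """Advantage of each completion relative to its group's mean reward.

    With dr_grpo=True the std normalization is skipped, as in the DR-GRPO loss.
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if dr_grpo:
        return centered
    std = (sum(c * c for c in centered) / len(rewards)) ** 0.5 or 1.0
    return [c / std for c in centered]

# Four generations per prompt, scored by the reward function (made-up scores).
rewards = [0.2, 0.5, 0.9, 0.4]
print(group_relative_advantages(rewards))
```

Completions scoring above the group mean get a positive advantage and are reinforced; those below the mean are discouraged. The GRPO run used the following configuration: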
- **LoRA rank:** 256 (RSLoRA with α=16)
- **Trainable parameters:** 239M of 1.2B (20%)
- **Generations per prompt:** 4
- **Batch size:** 16
- **Learning rate:** 5e-6 with cosine decay
- **Warmup:** 3% of training steps
- **Optimizer:** AdamW (8-bit)

The reward function combined five components: a Token Error Rate weighted by glyph visual similarity with curriculum learning, a length deviation penalty, a repetition penalty, line structure accuracy, and the ratio of cuneiform characters in the output. The trained adapter was merged back into the base model at 16-bit precision.

![grpo-reward](./assets/grpo-reward.png)

### Story

For the more detailed story of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md). To read the code used for training, see [training/](https://huggingface.co/boatbomber/NabuOCR/blob/main/training).

## Performance

Evaluated on the held-out test set of 1,000 tablets using Token Error Rate (TER). Lower is better; 0% means a perfect transcription.

![performance](./assets/performance.png)

## Usage

### Best Practices

Provide high-resolution images when possible (a minimum of 800x800 is recommended) and include all visible sides of the tablet in a single image. Ensure that the photographs are well lit and have high contrast so that the characters are readable, and crop out excessive background. For more detail on the best image format, see the [CDLI guidelines](https://cdli.earth/docs/images-acquisition-and-processing).

### Limitations

NabuOCR performs best on well-preserved tablets with clear impressions and may struggle with heavily damaged or eroded sections. The model only supports Sumerian and Akkadian, and it has limited support for complex literary texts with unusual sign variants.
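For reference, Token Error Rate, the evaluation metric mentioned above, is the edit (Levenshtein) distance between the predicted and reference token sequences, divided by the reference length. A generic sketch (not necessarily the exact evaluation code used for this model):

```python
def token_error_rate(pred: list[str], ref: list[str]) -> float:
    """Levenshtein distance over tokens, normalized by reference length."""
    # prev[j] holds the edit distance between the processed prefix of pred
    # and ref[:j]; rolled row by row to keep memory at O(len(ref)).
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(ref)] / max(len(ref), 1)

# Signs treated as tokens; one substitution in four reference tokens -> 25% TER.
ref = ["𒀀", "𒀭", "𒁀", "𒀀"]
pred = ["𒀀", "𒀭", "𒁹", "𒀀"]
print(token_error_rate(pred, ref))  # 0.25
```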
## Citation

If you use NabuOCR in your research, please cite:

```bibtex
@software{nabuocr2025,
  title={NabuOCR: Neural Cuneiform Transliteration},
  author={Williams, Zack},
  year={2025},
  url={https://huggingface.co/boatbomber/NabuOCR}
}
```

## Acknowledgments

- Built on [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR)
- Training data courtesy of the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/)
- ATF format specification from [ORACC](http://oracc.museum.upenn.edu/)
- Inspired by [CuneiML: A Cuneiform Dataset for Machine Learning](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.151)