boatbomber committed afe3abb (1 parent: 1b5a04c)

Update README

Files changed (1): README.md (+29 −11)
@@ -28,17 +28,19 @@ NabuOCR: Neural Cuneiform Transliteration

 # NabuOCR

-NabuOCR is a specialized OCR model for transliterating ancient cuneiform tablets directly from images to unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

 ## Overview

-NabuOCR processes images of cuneiform tablets and automatically generates scholarly unicode transliterations inspired by [ASCII Transliteration Format](http://oracc.ub.uni-muenchen.de/doc/help/editinginatf/primer/index.html), the standard used by assyriologists worldwide. Built by fine-tuning [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) on cuneiform tablet images, it can handle multi-view images of tablets and produce transliterations of each face.

 ## Features

-NabuOCR is based on the efficient 0.9B parameter [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) model and trained on diverse tablet conditions from multiple periods.

-It employs **multi-view processing** that handles obverse, reverse, and edge views of tablets all in one image. It generates unicode transcriptions formatted similarly to [other digital cuneiform projects](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.151).

 ## Example Output

@@ -52,29 +54,45 @@ TODO: demo here

 ### Base Model

-NabuOCR is built on [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) with an expanded tokenizer vocabulary to include the cuneiform unicode codepoints as tokens.

 ### Dataset

-The training data consists of 32.2K cuneiform tablet images and transliterations, and the test data consists of 1K cuneiform tablet images and transliterations, all from the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/) (CDLI).
-
-The images are in color with dimensions between 100px and 2048px, inclusive.

 ### SFT

-For SFT pre-training, the model was trained using full parameter fine-tuning for 2 epochs with a batch size of 2.

 ![sft-loss](./assets/sft-loss.png)

 ### GRPO

-For GRPO post-training, the model was trained using Rank Stabilized LoRA (r=256) for 1 epoch with 5 completions per prompt and a batch size of 30, then the adapter was merged back into the base at 16 bit precision.

 ![grpo-reward](./assets/grpo-reward.png)

 ### Story

-For the more detailed story of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md). To read the code used for training with the specific hyperparameters and reward functions, see [training/](https://huggingface.co/boatbomber/NabuOCR/blob/main/training).

 ## Performance
 
 # NabuOCR

+NabuOCR is an OCR model for transcribing ancient cuneiform tablets directly from images to Unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.

 ## Overview

+NabuOCR processes images of cuneiform tablets and outputs Unicode transcriptions of cuneiform signs. While Assyriologists typically use [ATF (ASCII Transliteration Format)](http://oracc.ub.uni-muenchen.de/doc/help/editinginatf/primer/index.html), ATF's complexity proved too challenging for the 0.9B model within training constraints. Unicode transcription is a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus.
+
+Built by fine-tuning [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) on cuneiform tablet images, NabuOCR can handle multi-view images of tablets and produce transcriptions of each face using markers like `@obverse`, `@reverse`, `@left`, `@right`, `@top`, and `@bottom`.
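
A minimal sketch of how a downstream consumer might split such a multi-view transcription into per-face lines. The marker names come from the README; the parsing code itself is illustrative, not taken from the repo:

```python
FACE_MARKERS = {"@obverse", "@reverse", "@left", "@right", "@top", "@bottom"}

def split_faces(transcription: str) -> dict:
    """Group transcription lines under the face marker that precedes them."""
    faces: dict = {}
    current = None
    for line in transcription.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in FACE_MARKERS:
            current = line
            faces.setdefault(current, [])
        elif current is not None:
            # Sign lines before any marker are dropped in this sketch.
            faces[current].append(line)
    return faces
```

For example, `split_faces("@obverse\n𒀀\n@reverse\n𒀀𒀀")` returns `{"@obverse": ["𒀀"], "@reverse": ["𒀀𒀀"]}`.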

 ## Features

+NabuOCR is based on the efficient 0.9B parameter [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) model with an **expanded tokenizer** that includes all unique cuneiform signs from the dataset plus special face markers. The model was trained on diverse tablet conditions from multiple periods.

+It employs **end-to-end transcription** rather than a multi-stage pipeline, allowing it to leverage full tablet context when making predictions. It handles **multi-view images** containing obverse, reverse, and edge views all at once.

 ## Example Output


 ### Base Model

+NabuOCR is built on [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) with an expanded tokenizer vocabulary to include cuneiform Unicode codepoints and special face markers (`@obverse`, `@reverse`, `@left`, `@right`, `@top`, `@bottom`).
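
The vocabulary expansion can be sketched as follows. The Unicode ranges are the standard cuneiform blocks, but the collection code is illustrative and may differ from the repo's implementation; with Hugging Face tokenizers, the resulting list would typically be registered via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`:

```python
# Standard Unicode blocks: Cuneiform, Cuneiform Numbers and
# Punctuation, and Early Dynastic Cuneiform.
CUNEIFORM_RANGES = [(0x12000, 0x123FF), (0x12400, 0x1247F), (0x12480, 0x1254F)]

FACE_MARKERS = ["@obverse", "@reverse", "@left", "@right", "@top", "@bottom"]

def is_cuneiform(ch: str) -> bool:
    """True if the character falls in one of the cuneiform Unicode blocks."""
    return any(lo <= ord(ch) <= hi for lo, hi in CUNEIFORM_RANGES)

def new_vocab_tokens(target_texts) -> list:
    """Unique cuneiform signs seen in the training targets, plus face markers."""
    signs = {ch for text in target_texts for ch in text if is_cuneiform(ch)}
    return sorted(signs) + FACE_MARKERS
```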

 ### Dataset

+The training data was built from the [Cuneiform Digital Library Initiative (CDLI)](https://cdli.ucla.edu/). Starting from 135,255 ATF transliterations, aggressive filtering removed damaged tablets, those outside Sumerian/Akkadian scope, entries without images, low-quality black-and-white photos, and images with noisy backgrounds. The result was 33,257 high-quality examples split into 32,257 training samples and 1,000 held-out test samples. ATF was converted to Unicode for the final targets.
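
The filtering pass described above could look roughly like this predicate; the metadata field names here are hypothetical, chosen only to mirror the stated criteria:

```python
def keep_example(meta: dict) -> bool:
    """Apply the README's filtering criteria (field names are hypothetical)."""
    if meta.get("language") not in {"Sumerian", "Akkadian"}:
        return False  # outside language scope
    if not meta.get("image_path"):
        return False  # no photograph available
    if meta.get("damaged") or meta.get("grayscale") or meta.get("noisy_background"):
        return False  # damaged tablet or low-quality photo
    return True
```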
 
 

 ### SFT

+The model was trained using [Unsloth](https://unsloth.ai/)'s FastVisionModel wrapper for full fine-tuning with gradient checkpointing:
+
+- **Epochs:** 2 (~32,000 steps)
+- **Batch size:** 2
+- **Learning rate:** 2e-5 with linear decay
+- **Warmup:** 5% of training steps
+- **Optimizer:** AdamW (8-bit)
+- **Precision:** BF16
+- **Max sequence length:** 16,000 tokens
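
The schedule above (5% warmup followed by linear decay) has the standard warmup-then-linear-decay shape; a sketch under those stated hyperparameters:

```python
def lr_at_step(step: int, total_steps: int = 32000,
               peak_lr: float = 2e-5, warmup_frac: float = 0.05) -> float:
    """Linear warmup over the first 5% of steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```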

 ![sft-loss](./assets/sft-loss.png)

 ### GRPO

+Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint using the DR-GRPO loss. Unlike SFT, which learns from ground truth, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs.
+
+- **LoRA rank:** 256 (RSLoRA with α=32)
+- **Trainable parameters:** 239M of 1.2B (20%)
+- **Generations per prompt:** 5
+- **Batch size:** 10 × 3 gradient accumulation = 30 effective
+- **Learning rate:** 2e-6 with cosine decay
+- **Warmup:** 3% of training steps
+- **Optimizer:** AdamW (8-bit)
+
+The reward function combined four components: face marker accuracy, cuneiform character ratio, length penalty, and a blended prefix/positional accuracy metric. The adapter was merged back into the base model at 16-bit precision.
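
The heart of GRPO's update is scoring each completion relative to the rest of its group. A simplified sketch, not the training code: the DR-GRPO variant drops the per-group standard-deviation normalization used in vanilla GRPO, leaving plain mean-centering:

```python
def group_relative_advantages(rewards: list) -> list:
    """Mean-centered advantages for one group of completions (5 per prompt here).

    Vanilla GRPO also divides by the group's reward std; the DR-GRPO variant
    omits that normalization, so this sketch only subtracts the group mean.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Completions scoring above their group's mean get positive advantages and are reinforced; below-mean completions are discouraged.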

 ![grpo-reward](./assets/grpo-reward.png)
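
The four reward components named above might look something like this; the exact formulas and weights live in training/, so treat these definitions, and the equal weighting, as assumptions:

```python
CUNEIFORM_RANGES = [(0x12000, 0x123FF), (0x12400, 0x1247F), (0x12480, 0x1254F)]
FACE_MARKERS = {"@obverse", "@reverse", "@left", "@right", "@top", "@bottom"}

def face_marker_accuracy(pred: str, ref: str) -> float:
    """Overlap between face markers mentioned in prediction and reference."""
    p = {m for m in FACE_MARKERS if m in pred}
    r = {m for m in FACE_MARKERS if m in ref}
    return 1.0 if not (p | r) else len(p & r) / len(p | r)

def cuneiform_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are cuneiform codepoints."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in CUNEIFORM_RANGES) for c in chars)
    return hits / len(chars)

def length_penalty(pred: str, ref: str) -> float:
    """1 when lengths match, falling linearly toward 0 as they diverge."""
    longer = max(len(pred), len(ref), 1)
    return 1.0 - abs(len(pred) - len(ref)) / longer

def position_accuracy(pred: str, ref: str) -> float:
    """Blend of exact positional matches and common-prefix length."""
    longer = max(len(pred), len(ref), 1)
    positional = sum(a == b for a, b in zip(pred, ref)) / longer
    prefix = 0
    for a, b in zip(pred, ref):
        if a != b:
            break
        prefix += 1
    return 0.5 * positional + 0.5 * prefix / longer

def reward(pred: str, ref: str) -> float:
    """Combine the four components; equal weighting is an assumption."""
    components = [
        face_marker_accuracy(pred, ref),
        cuneiform_ratio(pred),
        length_penalty(pred, ref),
        position_accuracy(pred, ref),
    ]
    return sum(components) / len(components)
```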
 
93
  ### Story
94
 
95
+ For the more detailed story of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md). To read the code used for training, see [training/](https://huggingface.co/boatbomber/NabuOCR/blob/main/training).
96
 
97
  ## Performance
98