Commit 80757ee (parent: eb1edf1): Next draft of STORY

STORY.md
# NabuOCR: Teaching AI to Read the World's Oldest Writing
Cuneiform is humanity's oldest writing system. Over 5,000 years ago, scribes pressed wedge-shaped marks into clay tablets to record everything from royal decrees to diaries. Hundreds of thousands of these tablets sit in museums worldwide, many still untranslated. When I saw the **Best PaddleOCR-VL Fine-Tune** challenge in the Baidu ERNIE AI hackathon, I knew exactly what I wanted to build: an OCR system for cuneiform.
The problem is both technically demanding and genuinely useful. Cuneiform OCR could lower the barrier for studying some of the earliest written records in human history. But it comes with unique challenges. The script is non-Latin. Artifacts are heavily worn. Images often show multiple tablet faces. Glyph shapes vary wildly between periods and regions. And labeled data is scarce. This is the story of how I tackled it, mistakes and all.
## Building the Dataset
The foundation was data from the Cuneiform Digital Library Initiative (CDLI). Raw CDLI data required substantial cleaning. Starting from 135,255 ATF transliterations, I filtered down to 88,626 viable examples by removing damaged tablets and those outside the Sumerian/Akkadian scope. Of those, only 46,535 had associated images, and after removing low-quality black-and-white photos and noisy backgrounds, I was left with 33,257 high-quality examples. I split these into 32,257 training samples and 1,000 held-out test samples.
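The funnel can be sketched as a sequence of predicate filters. This is an illustrative mock, not the actual CDLI/ATF schema; the field names are hypothetical:

```python
# Hypothetical sketch of the filtering funnel; field names are illustrative,
# not the real CDLI record layout.
def is_viable(record):
    # drop damaged tablets and anything outside the Sumerian/Akkadian scope
    return not record["damaged"] and record["language"] in {"Sumerian", "Akkadian"}

def has_usable_image(record):
    # drop missing images, B/W scans, and noisy backgrounds
    img = record["image"]
    return img is not None and img["color"] and not img["noisy"]

records = [
    {"language": "Sumerian", "damaged": False,
     "image": {"color": True, "noisy": False}},
    {"language": "Elamite", "damaged": False, "image": None},
    {"language": "Akkadian", "damaged": True, "image": None},
]

viable = [r for r in records if is_viable(r)]        # 135,255 -> 88,626 in the real run
usable = [r for r in viable if has_usable_image(r)]  # -> 33,257 in the real run
print(len(viable), len(usable))  # 1 1
```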
The filtering was aggressive but necessary. A vision model can't learn from images it can't see clearly.
The result wasn't perfect, but it was coherent, reproducible, and ready for training.
## Wrestling with Compatibility

Getting training to actually run was its own battle. There were mismatches between the model implementation and the training libraries I wanted to use: positional versus keyword argument conventions, validation decorators throwing assertions, and differences in the format of returned tensors. Doing all of this on Windows probably didn't help either. I kept fixing crash after crash until training finally ran all the way through. The debugging was unglamorous but essential, and I reported the issues upstream so the fixes could benefit the whole community.

## Choosing a Target Format

My original goal was to output ATF (ASCII Transliteration Format), the standard notation Assyriologists use for cuneiform texts. ATF includes diacritics, separators, line markers, broken-sign annotations, and structural markup. The small 0.9B model couldn't reliably learn that complexity within the hackathon's time and data constraints: it kept producing invalid ATF because it never followed the syntax rules strictly enough.

After wrestling with disappointing results, I made a pragmatic pivot: Unicode-based transcriptions of cuneiform signs. This simplified the target space dramatically and aligned with what the model could reasonably handle.
## Supervised Fine-Tuning
SFT required solving a problem most OCR systems don't face: the target script doesn't exist in the model's vocabulary. PaddleOCR-VL's tokenizer knows Latin, Chinese, Arabic—but not cuneiform. I extracted all unique signs from the dataset and added them directly to the tokenizer, along with special tokens for tablet face markers (`@obverse`, `@reverse`, `@left`, `@right`, `@top`, `@bottom`). This expanded the vocabulary and required resizing the model's embedding layer to match.
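The sign-collection step is plain Python: the cuneiform Unicode blocks live at U+12000 through U+1254F. A minimal sketch (the sample transcriptions are invented; with Hugging Face transformers, the resulting list would typically feed into `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`):

```python
# Sketch of building the added-token list; the sample data is illustrative.
FACE_MARKERS = ["@obverse", "@reverse", "@left", "@right", "@top", "@bottom"]

def is_cuneiform(ch):
    # Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform
    return 0x12000 <= ord(ch) <= 0x1254F

def collect_new_tokens(transcriptions):
    signs = {ch for text in transcriptions for ch in text if is_cuneiform(ch)}
    return sorted(signs) + FACE_MARKERS

sample = ["@obverse\n\U00012000\U00012038", "@reverse\n\U00012038"]
tokens = collect_new_tokens(sample)
# two unique signs plus six face markers; a real run would then call
# tokenizer.add_tokens(tokens) and model.resize_token_embeddings(len(tokenizer))
print(len(tokens))  # 8
```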
I used Unsloth's FastVisionModel wrapper for full fine-tuning (not LoRA) with gradient checkpointing to fit training into available VRAM. The training configuration:
- **Epochs:** 2 (~32,000 steps)
- **Batch size:** 2 (limited by sequence length)
- **Learning rate:** 2e-5 with linear decay
- **Warmup:** 5% of training steps
- **Optimizer:** AdamW (8-bit)
- **Precision:** BF16
- **Max sequence length:** 16,000 tokens
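The warmup-plus-linear-decay schedule above can be written down directly. A self-contained sketch (the real run used the trainer's built-in scheduler):

```python
PEAK_LR = 2e-5
TOTAL_STEPS = 32_000
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 5% warmup = 1,600 steps

def lr_at(step):
    # linear ramp up to the peak, then linear decay down to zero
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = TOTAL_STEPS - step
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(WARMUP_STEPS))  # 2e-05 (peak, right at the end of warmup)
print(lr_at(TOTAL_STEPS))   # 0.0
```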
Training loss started around 11, dropped sharply to 1.0 within the first 1,200 steps, then slowly declined to 0.35 over the remaining 30,000+ steps. The rapid initial drop suggests the model quickly learned the basic structure of the task—producing cuneiform characters with appropriate face markers. The long tail of gradual improvement reflects the harder work of learning precise glyph-to-character mappings.
After SFT, the model understood the basic task: given an image of a cuneiform tablet, produce a sequence of Unicode cuneiform characters with face markers. But qualitative inspection revealed persistent issues—systematic substitution errors between visually similar signs, hallucinated characters, and occasional drift into repetitive sequences. SFT gave the model a foundation; reinforcement learning would need to refine it.
## Reinforcement Learning: False Starts and Fixes
To push past SFT's limitations, I applied Group Relative Policy Optimization (GRPO) on top of the SFT checkpoint. Unlike SFT, which learns from ground truth alone, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs. This lets the model learn from its own mistakes in a way that supervised learning can't.
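The "group relative" part fits in a few lines: each completion's reward is normalized against the other completions generated for the same image. This is a simplified sketch of vanilla GRPO advantages (the DR-GRPO loss used in this run drops some of this normalization):

```python
import statistics

def group_advantages(rewards):
    # advantage of each completion relative to its group's mean reward,
    # scaled by the group's standard deviation
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# five completions for one image, scored by the reward functions
print(group_advantages([3.1, 2.4, 3.1, 1.0, 2.4]))
```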
For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure during the multi-generation rollouts. The configuration:
- **LoRA rank:** 256
- **LoRA alpha:** 32 (using RSLoRA scaling: $\alpha = 2\sqrt{r}$)
- **Trainable parameters:** 239M of 1.2B (20%)
- **Generations per prompt:** 5
- **Batch size:** 10 × 3 gradient accumulation = 30 effective
- **Learning rate:** 2e-6 with cosine decay (10× lower than SFT)
- **Warmup:** 3% of training steps
- **Optimizer:** AdamW (8-bit), $\beta_1 = 0.9$, $\beta_2 = 0.99$
- **Weight decay:** 0.03
- **Max gradient norm:** 0.5
- **Loss type:** DR-GRPO
I made three significant mistakes along the way.

**GSPO was the wrong algorithm.** I first tried Group Sequence Policy Optimization (GSPO) because it was new and interesting, without considering whether it fit my task. Cuneiform transcription quality is fundamentally character-level: did you get each glyph right? GSPO optimizes at the sequence level, which made the reward signal noisy and unhelpful here. GRPO was better suited because it performs importance sampling at the token level, letting the model learn which glyph choices lead to better transcriptions.

**I misconfigured RSLoRA.** When I switched to GRPO, I uncovered a subtle but critical error in my LoRA setup. Standard LoRA scales adapter contributions by:
$$\hat{W} = W + \frac{\alpha}{r} \times AB$$
With Rank-Stabilized LoRA (RSLoRA), the scaling changes to:
$$\hat{W}_{\text{rslora}} = W + \frac{\alpha}{\sqrt{r}} \times AB$$
I forgot to adjust alpha. With rank $r = 256$ and my original $\alpha = 2r = 512$, the effective scaling under RSLoRA was:
$$\frac{512}{\sqrt{256}} = \frac{512}{16} = 32$$
I had intended a scaling factor of 2. The fix was setting $\alpha = 2\sqrt{r} = 32$:
$$\frac{32}{\sqrt{256}} = \frac{32}{16} = 2$$
This misconfiguration made LoRA updates an order of magnitude too aggressive. Once corrected, training dynamics improved significantly.
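The two scaling rules are easy to sanity-check in code, a tiny sketch of the arithmetic above:

```python
import math

def lora_scale(alpha, rank, rslora=False):
    # standard LoRA divides alpha by r; RSLoRA divides by sqrt(r)
    return alpha / (math.sqrt(rank) if rslora else rank)

r = 256
print(lora_scale(512, r))               # 2.0  (the scaling I intended)
print(lora_scale(512, r, rslora=True))  # 32.0 (the accidental 16x blowup)
print(lora_scale(32, r, rslora=True))   # 2.0  (the fix: alpha = 2 * sqrt(r))
```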
**My reward function was exploitable.** The initial design tried to encourage correct "shape" (line count, characters per line) along with TER (token error rate). The model learned to exploit this immediately. It would spam the same cuneiform character into the correct shape, ignoring accuracy entirely. Classic reward hacking.
## The Final Reward Design
I simplified to four reward functions that aligned with how we actually judge transcription quality.
**Faces Reward** captures correct identification of @obverse, @reverse, and other face markers:
$$R_{\text{faces}} = \frac{1}{2} \cdot \frac{1 - r}{1 + r}, \quad r = \frac{e}{\max(1, n)}$$
where $e$ is the count of incorrect face markers and $n$ is the expected count.
**Character Reward** encourages using cuneiform script over other characters:
$$R_{\text{char}} = 0.2 \cdot \frac{c}{c + u} - 0.1$$
where $c$ is the count of cuneiform characters and $u$ is unwanted characters (excluding whitespace, punctuation, and face markers).
**Length Reward** penalizes deviation from target length:
$$R_{\text{len}} = 0.2 - 0.4 \cdot \min\left(1, \frac{|L_{\text{target}} - L_{\text{pred}}|}{L_{\text{target}}}\right)$$
**Accuracy Reward** blends prefix and positional correctness:
$$R_{\text{acc}} = 6 \left( 0.15 \cdot \frac{p}{n} + 0.85 \cdot \frac{m}{n} \right) - 3$$
where $p$ is the correct prefix length, $m$ is the count of positionally correct tokens, and $n$ is the ground truth length. Prefix accuracy encourages the model to get the beginning right and extend correctness step by step. Positional accuracy rewards characters that appear at the right indices regardless of what comes before.
**Total Reward**
$$R_{\text{total}} = R_{\text{faces}} + R_{\text{char}} + R_{\text{len}} + R_{\text{acc}} \in [-3.8, 3.8]$$
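The four terms translate directly into Python. This is a transcription of the formulas above; the inputs in the usage line are illustrative counts, not real tablet data:

```python
def faces_reward(errors, expected):
    # errors = incorrect face markers, expected = expected marker count
    r = errors / max(1, expected)
    return 0.5 * (1 - r) / (1 + r)

def char_reward(cuneiform, unwanted):
    # fraction of output characters that are actually cuneiform
    return 0.2 * cuneiform / (cuneiform + unwanted) - 0.1

def length_reward(target_len, pred_len):
    # penalize relative deviation from the target length, capped at 1
    return 0.2 - 0.4 * min(1, abs(target_len - pred_len) / target_len)

def accuracy_reward(prefix_len, matches, truth_len):
    # blend of correct-prefix length and positionally correct tokens
    return 6 * (0.15 * prefix_len / truth_len + 0.85 * matches / truth_len) - 3

# a perfect transcription hits the 3.8 ceiling
total = (faces_reward(0, 2) + char_reward(100, 0)
         + length_reward(100, 100) + accuracy_reward(100, 100, 100))
print(round(total, 6))  # 3.8
```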

The coefficients were chosen through empirical tuning on small test runs (a few hundred steps each), combined with reasoning about how much each reward term should matter relative to the end goal.

With this design, training became stable and outputs became coherent.
## Results
TODO: Add results here!
## Reflection
This project became as much about infrastructure, debugging, and reward design as it was about cuneiform itself. I started with a straightforward goal and discovered a chain of interlocking decisions where each choice constrained the next. Data formats affected what targets were learnable. Model compatibility issues ate up time I'd planned for experimentation. Reward design determined whether the model learned anything useful or just found clever ways to game the metric.
Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.
The tablets have waited five thousand years. With better tools, maybe we won't keep them waiting much longer.