Commit 80757ee (parent: eb1edf1): Next draft of STORY

STORY.md
# NabuOCR: Teaching AI to Read the World's Oldest Writing
Cuneiform is humanity's oldest writing system. Over 5,000 years ago, scribes pressed wedge-shaped marks into clay tablets to record everything from royal decrees to diaries. Hundreds of thousands of these tablets sit in museums worldwide, many still untranslated. When I saw the **Best PaddleOCR-VL Fine-Tune** challenge in the Baidu ERNIE AI hackathon, I knew exactly what I wanted to build: an OCR system for cuneiform.
The problem is both technically demanding and genuinely useful. Cuneiform OCR could lower the barrier for studying some of the earliest written records in human history. But it comes with unique challenges. The script is non-Latin. Artifacts are heavily worn. Images often show multiple tablet faces. Glyph shapes vary wildly between periods and regions. And labeled data is scarce. This is the story of how I tackled it, mistakes and all.
## Building the Dataset
The foundation was data from the Cuneiform Digital Library Initiative (CDLI). Raw CDLI data required substantial cleaning. Starting from 135,255 ATF transliterations, I filtered down to 88,626 viable examples by removing damaged tablets and those outside the Sumerian/Akkadian scope. Of those, only 46,535 had associated images, and after removing low-quality black-and-white photos and noisy backgrounds, I was left with 33,257 high-quality examples. I split these into 32,257 training samples and 1,000 held-out test samples.
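The funnel can be sketched as a sequence of predicate filters. This is an illustrative mock, not the actual CDLI/ATF schema; the field names are hypothetical:

```python
# Hypothetical sketch of the filtering funnel; field names are illustrative,
# not the real CDLI record layout.
def is_viable(record):
    # drop damaged tablets and anything outside the Sumerian/Akkadian scope
    return not record["damaged"] and record["language"] in {"Sumerian", "Akkadian"}

def has_usable_image(record):
    # drop missing images, B/W scans, and noisy backgrounds
    img = record["image"]
    return img is not None and img["color"] and not img["noisy"]

records = [
    {"language": "Sumerian", "damaged": False,
     "image": {"color": True, "noisy": False}},
    {"language": "Elamite", "damaged": False, "image": None},
    {"language": "Akkadian", "damaged": True, "image": None},
]

viable = [r for r in records if is_viable(r)]        # 135,255 -> 88,626 in the real run
usable = [r for r in viable if has_usable_image(r)]  # -> 33,257 in the real run
print(len(viable), len(usable))  # 1 1
```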
The filtering was aggressive but necessary. A vision model can't learn from images it can't see clearly.
The result wasn't perfect, but it was coherent, reproducible, and ready for training.
## Wrestling with Compatibility

Getting training to actually run was its own battle. There were mismatches between the model implementation and the training libraries I wanted to use: positional versus keyword argument conventions, validation decorators throwing assertions, and differences in the format of returned tensors. Doing all of this on Windows probably didn't help either. I kept fixing crash after crash until training finally ran all the way through. The debugging was unglamorous but essential, and I reported the issues upstream so the fixes could benefit the whole community.

## Choosing a Target Format

My original goal was to output ATF (ASCII Transliteration Format), the standard notation Assyriologists use for cuneiform texts. ATF includes diacritics, separators, line markers, broken-sign annotations, and structural markup. The small 0.9B model couldn't reliably learn that complexity within the hackathon's time and data constraints: it kept producing invalid ATF because it never followed the syntax rules strictly enough.

After wrestling with disappointing results, I made a pragmatic pivot: Unicode-based transcriptions of cuneiform signs. This simplified the target space dramatically and aligned with what the model could reasonably handle.
## Supervised Fine-Tuning
SFT required solving a problem most OCR systems don't face: the target script doesn't exist in the model's vocabulary. PaddleOCR-VL's tokenizer knows Latin, Chinese, Arabic—but not cuneiform. I extracted all unique signs from the dataset and added them directly to the tokenizer, along with special tokens for tablet face markers (`@obverse`, `@reverse`, `@left`, `@right`, `@top`, `@bottom`). This expanded the vocabulary and required resizing the model's embedding layer to match.
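The sign-collection step is plain Python: the cuneiform Unicode blocks live at U+12000 through U+1254F. A minimal sketch (the sample transcriptions are invented; with Hugging Face transformers, the resulting list would typically feed into `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`):

```python
# Sketch of building the added-token list; the sample data is illustrative.
FACE_MARKERS = ["@obverse", "@reverse", "@left", "@right", "@top", "@bottom"]

def is_cuneiform(ch):
    # Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform
    return 0x12000 <= ord(ch) <= 0x1254F

def collect_new_tokens(transcriptions):
    signs = {ch for text in transcriptions for ch in text if is_cuneiform(ch)}
    return sorted(signs) + FACE_MARKERS

sample = ["@obverse\n\U00012000\U00012038", "@reverse\n\U00012038"]
tokens = collect_new_tokens(sample)
# two unique signs plus six face markers; a real run would then call
# tokenizer.add_tokens(tokens) and model.resize_token_embeddings(len(tokenizer))
print(len(tokens))  # 8
```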
I used Unsloth's FastVisionModel wrapper for full fine-tuning (not LoRA) with gradient checkpointing to fit training into available VRAM. The training configuration:
- **Epochs:** 2 (~32,000 steps)
- **Batch size:** 2 (limited by sequence length)
- **Learning rate:** 2e-5 with linear decay
- **Warmup:** 5% of training steps
- **Optimizer:** AdamW (8-bit)
- **Precision:** BF16
- **Max sequence length:** 16,000 tokens
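The warmup-plus-linear-decay schedule above can be written down directly. A self-contained sketch (the real run used the trainer's built-in scheduler):

```python
PEAK_LR = 2e-5
TOTAL_STEPS = 32_000
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 5% warmup = 1,600 steps

def lr_at(step):
    # linear ramp up to the peak, then linear decay down to zero
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = TOTAL_STEPS - step
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(WARMUP_STEPS))  # 2e-05 (peak, right at the end of warmup)
print(lr_at(TOTAL_STEPS))   # 0.0
```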
Training loss started around 11, dropped sharply to 1.0 within the first 1,200 steps, then slowly declined to 0.35 over the remaining 30,000+ steps. The rapid initial drop suggests the model quickly learned the basic structure of the task—producing cuneiform characters with appropriate face markers. The long tail of gradual improvement reflects the harder work of learning precise glyph-to-character mappings.
After SFT, the model understood the basic task: given an image of a cuneiform tablet, produce a sequence of Unicode cuneiform characters with face markers. But qualitative inspection revealed persistent issues—systematic substitution errors between visually similar signs, hallucinated characters, and occasional drift into repetitive sequences. SFT gave the model a foundation; reinforcement learning would need to refine it.
## Reinforcement Learning: False Starts and Fixes
To push past SFT's limitations, I applied Group Relative Policy Optimization (GRPO) on top of the SFT checkpoint. Unlike SFT, which learns from ground truth alone, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs. This lets the model learn from its own mistakes in a way that supervised learning can't.
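The "group relative" part fits in a few lines: each completion's reward is normalized against the other completions generated for the same image. This is a simplified sketch of vanilla GRPO advantages (the DR-GRPO loss used in this run drops some of this normalization):

```python
import statistics

def group_advantages(rewards):
    # advantage of each completion relative to its group's mean reward,
    # scaled by the group's standard deviation
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# five completions for one image, scored by the reward functions
print(group_advantages([3.1, 2.4, 3.1, 1.0, 2.4]))
```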
For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure during the multi-generation rollouts. The configuration:
- **LoRA rank:** 256
- **LoRA alpha:** 32 (using RSLoRA scaling: $\alpha = 2\sqrt{r}$)
- **Trainable parameters:** 239M of 1.2B (20%)
- **Generations per prompt:** 5
- **Batch size:** 10 × 3 gradient accumulation = 30 effective
- **Learning rate:** 2e-6 with cosine decay (10× lower than SFT)
- **Warmup:** 3% of training steps
- **Optimizer:** AdamW (8-bit), $\beta_1 = 0.9$, $\beta_2 = 0.99$
- **Weight decay:** 0.03
- **Max gradient norm:** 0.5
- **Loss type:** DR-GRPO
I made three significant mistakes along the way.

**GSPO was the wrong algorithm.** I first tried Group Sequence Policy Optimization (GSPO) because it was new and interesting, without considering whether it fit my task. Cuneiform transcription quality is fundamentally character-level: did you get each glyph right? GSPO optimizes at the sequence level, which made the reward signal noisy and unhelpful here. GRPO was better suited because it performs importance sampling at the token level, letting the model learn which glyph choices lead to better transcriptions.

**I misconfigured RSLoRA.** When I switched to GRPO, I uncovered a subtle but critical error in my LoRA setup. Standard LoRA scales adapter contributions by:
$$\hat{W} = W + \frac{\alpha}{r} \times AB$$
With Rank-Stabilized LoRA (RSLoRA), the scaling changes to:
$$\hat{W}_{\text{rslora}} = W + \frac{\alpha}{\sqrt{r}} \times AB$$
I forgot to adjust alpha. With rank $r = 256$ and my original $\alpha = 2r = 512$, the effective scaling under RSLoRA was:
$$\frac{512}{\sqrt{256}} = \frac{512}{16} = 32$$
I had intended a scaling factor of 2. The fix was setting $\alpha = 2\sqrt{r} = 32$:
$$\frac{32}{\sqrt{256}} = \frac{32}{16} = 2$$
This misconfiguration made LoRA updates an order of magnitude too aggressive. Once corrected, training dynamics improved significantly.
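The two scaling rules are easy to sanity-check in code, a tiny sketch of the arithmetic above:

```python
import math

def lora_scale(alpha, rank, rslora=False):
    # standard LoRA divides alpha by r; RSLoRA divides by sqrt(r)
    return alpha / (math.sqrt(rank) if rslora else rank)

r = 256
print(lora_scale(512, r))               # 2.0  (the scaling I intended)
print(lora_scale(512, r, rslora=True))  # 32.0 (the accidental 16x blowup)
print(lora_scale(32, r, rslora=True))   # 2.0  (the fix: alpha = 2 * sqrt(r))
```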
**My reward function was exploitable.** The initial design tried to encourage correct "shape" (line count, characters per line) along with TER (token error rate). The model learned to exploit this immediately. It would spam the same cuneiform character into the correct shape, ignoring accuracy entirely. Classic reward hacking.
## The Final Reward Design
I simplified to four reward functions that aligned with how we actually judge transcription quality.
**Faces Reward** captures correct identification of @obverse, @reverse, and other face markers:
$$R_{\text{faces}} = \frac{1}{2} \cdot \frac{1 - r}{1 + r}, \quad r = \frac{e}{\max(1, n)}$$
where $e$ is the count of incorrect face markers and $n$ is the expected count.
**Character Reward** encourages using cuneiform script over other characters:
$$R_{\text{char}} = 0.2 \cdot \frac{c}{c + u} - 0.1$$
where $c$ is the count of cuneiform characters and $u$ is unwanted characters (excluding whitespace, punctuation, and face markers).
**Length Reward** penalizes deviation from target length:
$$R_{\text{len}} = 0.2 - 0.4 \cdot \min\left(1, \frac{|L_{\text{target}} - L_{\text{pred}}|}{L_{\text{target}}}\right)$$
**Accuracy Reward** blends prefix and positional correctness:
$$R_{\text{acc}} = 6 \left( 0.15 \cdot \frac{p}{n} + 0.85 \cdot \frac{m}{n} \right) - 3$$
where $p$ is the correct prefix length, $m$ is the count of positionally correct tokens, and $n$ is the ground truth length. Prefix accuracy encourages the model to get the beginning right and extend correctness step by step. Positional accuracy rewards characters that appear at the right indices regardless of what comes before.
**Total Reward**
$$R_{\text{total}} = R_{\text{faces}} + R_{\text{char}} + R_{\text{len}} + R_{\text{acc}} \in [-3.8, 3.8]$$
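The four terms translate directly into Python. This is a transcription of the formulas above; the inputs in the usage line are illustrative counts, not real tablet data:

```python
def faces_reward(errors, expected):
    # errors = incorrect face markers, expected = expected marker count
    r = errors / max(1, expected)
    return 0.5 * (1 - r) / (1 + r)

def char_reward(cuneiform, unwanted):
    # fraction of output characters that are actually cuneiform
    return 0.2 * cuneiform / (cuneiform + unwanted) - 0.1

def length_reward(target_len, pred_len):
    # penalize relative deviation from the target length, capped at 1
    return 0.2 - 0.4 * min(1, abs(target_len - pred_len) / target_len)

def accuracy_reward(prefix_len, matches, truth_len):
    # blend of correct-prefix length and positionally correct tokens
    return 6 * (0.15 * prefix_len / truth_len + 0.85 * matches / truth_len) - 3

# a perfect transcription hits the 3.8 ceiling
total = (faces_reward(0, 2) + char_reward(100, 0)
         + length_reward(100, 100) + accuracy_reward(100, 100, 100))
print(round(total, 6))  # 3.8
```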

The coefficients were chosen through empirical tuning on small test runs (a few hundred steps each), combined with reasoning about how much each reward term should matter relative to the end goal.

With this design, training became stable and outputs became coherent.
## Results
TODO: Add results here!
## Reflection
This project became as much about infrastructure, debugging, and reward design as it was about cuneiform itself. I started with a straightforward goal and discovered a chain of interlocking decisions where each choice constrained the next. Data formats affected what targets were learnable. Model compatibility issues ate up time I'd planned for experimentation. Reward design determined whether the model learned anything useful or just found clever ways to game the metric.
Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.
The tablets have waited five thousand years. With better tools, maybe we won't keep them waiting much longer.