boatbomber
/

NabuOCR

 TODO: details about rslora & rewards
+### Story
+For the more detailed story of how this model was trained, see [STORY.md](https://huggingface.co/boatbomber/NabuOCR/blob/main/STORY.md).
 ## Performance
 TODO: Include benchmarks

STORY.md ADDED Viewed

	@@ -0,0 +1,39 @@

+# Story
+I'd like to share the story of how I created this model so that others may learn from my mistakes and my successes.
+## Picking A Project
+I started this project by joining the Baidu ERNIE AI hackathon on Devpost, aiming specifically for the **Best PaddleOCR-VL Fine-Tune** task: building the best fine-tuned PaddleOCR-VL model optimized for a focused, impactful use case. From the beginning, I knew I didn't just want to demonstrate a generic OCR system; I wanted to tackle a problem that felt meaningful, underexplored, and technically challenging. That combination led me to the oldest writing system in the world: cuneiform!
+Cuneiform OCR is both awesome and useful. It promises to lower the barrier to studying some of the earliest written records in human history, but it comes with a unique set of difficulties: non-Latin scripts, heavily worn artifacts, multiple tablet faces in a single image, non-standard glyph shapes, and a relatively small pool of high-quality labeled data. Despite all of that (or maybe because of it) I decided that building a cuneiform OCR system with PaddleOCR-VL would be the core of my hackathon project.
+## Building The Dataset
+The first major milestone was the dataset. I had to assemble, clean, and normalize data from the Cuneiform Digital Library Initiative. I invested a lot of time in cleaning and formatting the data so that the model could actually learn something useful instead of memorizing noise. This involved aligning images with their transcriptions, removing tablets that were unreadably broken, low quality B&W photos, noisy image backgrounds, etc. I also limited the scope to tablets written in Sumerian and Akkadian. By the end of this phase, I had a dataset that was far from perfect, but at least coherent, reproducible, and ready for experimentation.
+## Model Incompatibilities
+Once the data was ready, the next challenge was getting training to run smoothly. Off-the-shelf, there were a number of incompatibilities between the model and training libraries I wanted to use. Being on Windows instead of Linux probably didn't help either. Some of these were minor mismatches in expected tensor shapes or keyword args; others were deeper issues tied to how certain model components were wired up. I spent a fair amount of time patching these incompatibilities and working around model issues so that training wouldn't crash halfway through an epoch. That debugging work was unglamorous but essential: without it, no amount of clever ideas about rewards or architectures would matter.
+## Choosing A Target
+Originally, I wanted the model to output ATF (the transliteration format commonly used by Assyriologists), because it's the standard way scholars work with cuneiform texts. In practice, this turned out to be too ambitious for the constraints of the hackathon. ATF has a lot of structure to it (diacritics, separators, line markers, broken signs, annotations) and it was simply too hard for the 0.9B model to learn reliably under the time and data limitations I was facing. After wrestling with disappointing results, I made a pragmatic pivot: instead of ATF, I switched to Unicode-based transcriptions of the cuneiform signs. This greatly simplified the target space and aligned better with what the model could reasonably handle.
+## Supervised Finetuning
+With the targets simplified, I started with supervised fine-tuning (SFT). SFT gave the model a general idea of what cuneiform text "should" look like: the shapes of the glyphs, the rough mapping to Unicode code points, and the basic structure of the sequences. The model became capable of producing outputs that were qualitatively in the right ballpark, but the accuracy still wasn't where it needed to be. It made systematic mistakes, hallucinated characters, and sometimes drifted off into repetitive or malformed sequences. SFT alone, while necessary, was not sufficient.
+## Reinforcement Learning
+To push performance further, I experimented with reinforcement learning on top of the SFT checkpoint. My first attempt used GSPO, which in hindsight was a poor fit for this task. The reward signal is defined at the character level, so rewarding sequence-level completions just didn't make sense. I guess I just wanted to try the shiny new thing without really considering if it was right for my use case first.
+I then moved to GRPO, hoping for a more stable and expressive optimization process. However, the initial reward design was unbalanced. I had attempted to reward the model for getting the overall shape (ie: line count, chars per line). The model quickly learned to exploit this by ignoring the overall accuracy reward and simply spamming the same cuneiform character into the correct shape instead of producing a meaningful transcription. The behavior was a clear signal that the reward design needed refinement, not just more training.
+In response, I simplified the reward function. Instead of relying on TER (Token Error Rate) alone, I focused on **prefix accuracy** and **positional accuracy**. Prefix-based rewards encouraged the model to get the beginning of the transcription right and then extend that correctness step by step, while positional accuracy rewarded characters that were correct at the right indices. This structure aligned more closely with how we actually judge transcription quality and provided a smoother, less exploitable learning signal. With this simpler reward setup, training became more stable and the model's outputs became more coherent and accurate.
+Along the way, I also uncovered a subtle but important bug in my LoRA configuration. I had made a mistake with the LoRA alpha parameter—specifically, I forgot to apply the square root scaling that was expected when I switched to Rank Stabilized LoRA. This misconfiguration effectively changed the scale of the LoRA updates, making them too aggressive by a order of magnitude. Once I corrected this and set alpha properly, the fine-tuning dynamics improved, and the model converged more smoothly.
+## Reflection
+In the end, this project became as much about infrastructure, debugging, and reward design as it was about cuneiform itself. I started with a straightforward goal of finetuning PaddleOCR-VL for cuneiform OCR and discovered a long chain of interlocking decisions about data formats, model compatibility, supervision strategies, and reinforcement learning rewards. Some choices turned out to be mistakes, others were small but critical course corrections, and together they shaped the final system. My hope is that by sharing not just the successes but also the missteps, others can build on this work more quickly and push cuneiform OCR, and specialized OCR tasks in general, even further.