boatbomber/NabuOCR

@@ -1,10 +1,10 @@
 # NabuOCR: Teaching AI to Read the World's Oldest Writing
-Cuneiform is humanity's oldest writing system. Over 5,000 years ago, scribes pressed wedge-shaped marks into clay tablets to record everything from royal decrees to diaries. Hundreds of thousands of these tablets sit in museums worldwide, many still untranslated. When I saw the **Best PaddleOCR-VL Fine-Tune** challenge in the Baidu ERNIE AI hackathon, I knew exactly what I wanted to build: an OCR system for cuneiform.
-The name comes from Nabu, the Mesopotamian god of writing and scribes. It felt fitting for a project trying to teach machines to read what his ancient devotees once wrote.
-The problem is both technically demanding and genuinely useful. Cuneiform OCR could lower the barrier for studying some of the earliest written records in human history. But it comes with unique challenges. The script is non-Latin. Artifacts are heavily worn. Images often show multiple tablet faces. Glyph shapes vary wildly between periods and regions. And labeled data is scarce. This is the story of how I tackled it, mistakes and all.
 ## Prior Work
@@ -17,7 +17,7 @@ NabuOCR takes a different approach: a single vision-language model that accepts
 My original goal was to output ATF (ASCII Transliteration Format), the standard notation Assyriologists use for cuneiform texts. ATF includes diacritics, separators, line markers, broken-sign annotations, and structural markup. The small 0.9B model couldn't reliably learn ATF's complexity within the hackathon's time and data constraints, leading to invalid ATF as the syntax rules were not followed strictly enough.
 After wrestling with disappointing results, I made a pragmatic pivot: Unicode-based transcriptions of cuneiform signs instead of full ATF. This simplified the target space dramatically and aligned with what the model could reasonably handle.
-This is a real tradeoff. ATF is what scholars actually use, and it encodes linguistic information that raw Unicode doesn't capture. But Unicode transcription isn't useless. It's a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus. Think of it as moving from "I can't read this at all" to "here are the glyphs." It's not the whole journey, but it's a genuine step forward.
 ## Building the Dataset
@@ -68,7 +68,7 @@ For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure dur
 I made three significant mistakes along the way.
-**GSPO was the wrong algorithm.** I tried it because it was new and interesting, without considering whether it fit my task. Cuneiform transcription quality is fundamentally character-level: did you get each glyph right? GSPO rewards sequence-level completions, which made the reward signal noisy and unhelpful. GRPO was better suited because it performs importance sampling at the token level, letting the model learn which glyph choices lead to better transcriptions.
 *Lesson: Before adopting a new algorithm, verify its inductive biases match your task's structure. Novelty isn't a reason to use something.*
@@ -140,4 +140,4 @@ This project became as much about infrastructure, debugging, and reward design a
 Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.
-The tablets have waited five thousand years. With better tools, maybe we won't keep them waiting much longer.

 # NabuOCR: Teaching AI to Read the World's Oldest Writing
+Cuneiform is humanity's oldest writing system. Over 5,000 years ago, scribes pressed wedge-shaped marks into clay tablets to record everything from royal decrees to diaries. There's a backlog of hundreds of thousands of these tablets waiting for Assyriologists to transcribe them. When I saw the **Best PaddleOCR-VL Fine-Tune** challenge in the Baidu ERNIE AI hackathon, I knew exactly what I wanted to build: an OCR system for cuneiform.
+The name NabuOCR comes from Nabu, the Mesopotamian god of writing and scribes. It felt fitting for a project that teaches machines to read what his ancient devotees once wrote.
+The problem is both technically demanding and genuinely useful. Cuneiform OCR could lower the barrier for studying some of the earliest written records in human history. But it comes with unique challenges. The script is non-Latin. Artifacts are heavily worn. Images show multiple tablet faces at once. Glyph shapes vary wildly between periods and regions. And labeled data is scarce. This is the story of how I tackled it, mistakes and all.
 ## Prior Work
 My original goal was to output ATF (ASCII Transliteration Format), the standard notation Assyriologists use for cuneiform texts. ATF includes diacritics, separators, line markers, broken-sign annotations, and structural markup. The small 0.9B model couldn't reliably learn ATF's complexity within the hackathon's time and data constraints, leading to invalid ATF as the syntax rules were not followed strictly enough.
 After wrestling with disappointing results, I made a pragmatic pivot: Unicode-based transcriptions of cuneiform signs instead of full ATF. This simplified the target space dramatically and aligned with what the model could reasonably handle.
+This is a real tradeoff. ATF is what scholars actually use, and it encodes linguistic information that raw Unicode doesn't capture. But Unicode transcription isn't useless. It's a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus. It's a step in the right direction.
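One practical upside of a Unicode target is that output validity becomes trivial to check, unlike ATF's syntax rules. The sketch below is illustrative only and is not taken from the NabuOCR code: the helper names `is_cuneiform` and `invalid_chars` are hypothetical, and the only facts assumed are the code point ranges of the three cuneiform blocks in the Unicode Standard.

```python
import unicodedata

# Unicode blocks that hold cuneiform signs (ranges from the Unicode Standard).
CUNEIFORM_BLOCKS = (
    (0x12000, 0x123FF),  # Cuneiform
    (0x12400, 0x1247F),  # Cuneiform Numbers and Punctuation
    (0x12480, 0x1254F),  # Early Dynastic Cuneiform
)


def is_cuneiform(ch: str) -> bool:
    """True if the single character falls inside one of the cuneiform blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CUNEIFORM_BLOCKS)


def invalid_chars(transcription: str) -> list:
    """Return characters that are neither cuneiform signs nor whitespace."""
    return [ch for ch in transcription
            if not ch.isspace() and not is_cuneiform(ch)]


# Hypothetical model output: two signs followed by a stray Latin letter.
sample = "\U00012000\U0001202D a"
print(invalid_chars(sample))                # the stray "a" is flagged
print(unicodedata.name("\U00012000", "?"))  # look up the official sign name
```

A check like this says nothing about whether the signs are the right ones, only that the output stays inside the target alphabet.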
 ## Building the Dataset
 I made three significant mistakes along the way.
+**GSPO was the wrong algorithm.** I tried it because it was new and interesting without considering whether it fit my task. Cuneiform transcription quality is fundamentally character-level: did you get each glyph right? GSPO rewards sequence-level completions, which made the reward signal noisy and unhelpful. GRPO was better suited because it performs importance sampling at the token level, letting the model learn which glyph choices lead to better transcriptions.
 *Lesson: Before adopting a new algorithm, verify its inductive biases match your task's structure. Novelty isn't a reason to use something.*
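To make the token-level versus sequence-level distinction concrete, here is a minimal sketch of the two importance ratios as I understand them from the published GRPO and GSPO formulations. It is not the training code used for NabuOCR: the function names and the log-probability values are made up for illustration, and clipping, advantages, and batching are left out.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """GRPO-style: one importance ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """GSPO-style: a single length-normalized ratio for the whole sequence."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical per-token log-probabilities for a three-glyph transcription.
# Only the second glyph's probability changed between old and new policy.
old_logps = [-1.0, -1.0, -1.0]
new_logps = [-1.0, -0.5, -1.0]

print(token_level_ratios(new_logps, old_logps))    # change is isolated to token 2
print(sequence_level_ratio(new_logps, old_logps))  # change is averaged over all tokens
```

With per-token ratios, the update can weight the one glyph whose probability moved. With a single sequence ratio, that change is spread across every token in the transcription.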
 Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.
+The tablets have waited five thousand years to be understood. With better tools, maybe we won't keep them waiting much longer.