boatbomber committed on
Commit
9f5c0fb
·
1 Parent(s): afe3abb

Add WIP results

Files changed (1)
  1. STORY.md +25 -1
STORY.md CHANGED
@@ -138,7 +138,31 @@ With this design, training became stable and outputs became coherent.
 
  ## Results
 
- TODO: Add results here!
+ I evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER), which is the edit distance between predicted and ground truth transcriptions normalized by ground truth length:
+
+ $$\text{TER} = \frac{S + D + I}{N}$$
+
+ where $S$ is the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the number of tokens in the ground truth. Lower is better; 0% means a perfect transcription. TER can exceed 100% when insertions outnumber correct tokens.
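TER is just the token-level Levenshtein distance divided by the reference length. A minimal sketch of how it could be computed (the function name and token-list inputs are illustrative, not the project's actual evaluation code):

```python
def token_error_rate(pred: list[str], truth: list[str]) -> float:
    """(S + D + I) / N: Levenshtein distance between token sequences,
    normalized by ground-truth length (assumes truth is non-empty)."""
    m, n = len(pred), len(truth)
    # dp[i][j] = minimum edits to turn pred[:i] into truth[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete every predicted token
    for j in range(n + 1):
        dp[0][j] = j          # insert every ground-truth token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if pred[i - 1] == truth[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + sub)
    return dp[m][n] / n

# A prediction much longer than the reference pushes TER past 100%:
# token_error_rate(["a", "b", "c", "d", "e"], ["a"]) == 4.0
```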
+
+ **Base model (PaddleOCR-VL 0.9B, no fine-tuning):**
+
+ The base model achieved a mean TER of 99.67%, with a median TER of exactly 100%. It produced zero good predictions (TER < 10%). In practice, it either output a handful of hallucinated characters or refused to engage with the task entirely. This confirms that cuneiform is genuinely out-of-distribution: the model has no prior knowledge of the script.
+
+ **After SFT:**
+
+ Supervised fine-tuning reduced mean TER to 59.83% (median 61.05%), an improvement of roughly 40 percentage points. The model achieved one perfect transcription and two predictions under 10% TER. The 25th percentile reached 52.69% TER, meaning the best quarter of predictions got roughly half the tokens right.
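The aggregate numbers reported here (mean, median, 25th percentile, counts of perfect and good predictions) can be reproduced from per-tablet TER scores with a few lines of stdlib Python. A sketch, with a hypothetical function name and TER stored as fractions (0.5983 = 59.83%):

```python
import statistics

def summarize_ter(ters: list[float]) -> dict:
    """Summary statistics over per-tablet TER scores (fractions, not %)."""
    ranked = sorted(ters)
    return {
        "mean": statistics.mean(ranked),
        "median": statistics.median(ranked),
        "p25": ranked[len(ranked) // 4],           # 25th percentile (nearest rank)
        "perfect": sum(t == 0.0 for t in ranked),  # exact transcriptions
        "good": sum(t < 0.10 for t in ranked),     # "good" = TER < 10%
    }
```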
+
+ Qualitative analysis of the best predictions reveals what the model learned:
+
+ - **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across all top predictions. The model reliably learned tablet organization.
+ - **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
+ - **Visually similar signs cause systematic errors.** Signs that differ by only one or two small wedge marks (e.g., 𒀀 and its near look-alikes) were frequently confused. This is an understandable error given their near-identical appearance.
+ - **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.
+
+ **After GRPO:**
+
+ [Results pending: training still in progress]
+
 
  ## Reflection