boatbomber committed on
Commit
9f5c0fb
·
1 Parent(s): afe3abb

Add WIP results

Files changed (1)
  1. STORY.md +25 -1
STORY.md CHANGED
@@ -138,7 +138,31 @@ With this design, training became stable and outputs became coherent.
 
  ## Results
 
- TODO: Add results here!
+ I evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER), which is the edit distance between predicted and ground truth transcriptions normalized by ground truth length:
+
+ $$\text{TER} = \frac{S + D + I}{N}$$
+
+ where $S$ is the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the number of tokens in the ground truth. Lower is better; 0% means a perfect transcription. TER can exceed 100% when insertions outnumber correct tokens.
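TER is just the token-level Levenshtein distance divided by the reference length. A minimal sketch of how it could be computed (the function name and token-list inputs are illustrative, not the project's actual evaluation code):

```python
def token_error_rate(pred: list[str], truth: list[str]) -> float:
    """(S + D + I) / N: Levenshtein distance between token sequences,
    normalized by ground-truth length (assumes truth is non-empty)."""
    m, n = len(pred), len(truth)
    # dp[i][j] = minimum edits to turn pred[:i] into truth[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete every predicted token
    for j in range(n + 1):
        dp[0][j] = j          # insert every ground-truth token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if pred[i - 1] == truth[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + sub)
    return dp[m][n] / n

# A prediction much longer than the reference pushes TER past 100%:
# token_error_rate(["a", "b", "c", "d", "e"], ["a"]) == 4.0
```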
+
+ **Base model (PaddleOCR-VL 0.9B, no fine-tuning):**
+
+ The base model achieved a mean TER of 99.67%, with a median TER of exactly 100%. It produced zero good predictions (TER < 10%). In practice, it either output a handful of hallucinated characters or refused to engage with the task entirely. This confirms that cuneiform is genuinely out-of-distribution: the model has no prior knowledge of the script.
+
+ **After SFT:**
+
+ Supervised fine-tuning reduced mean TER to 59.83% (median 61.05%), an improvement of roughly 40 percentage points. The model achieved one perfect transcription and two predictions under 10% TER. The 25th percentile reached 52.69% TER, meaning the best quarter of predictions got roughly half the tokens right.
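The aggregate numbers reported here (mean, median, 25th percentile, counts of perfect and good predictions) can be reproduced from per-tablet TER scores with a few lines of stdlib Python. A sketch, with a hypothetical function name and TER stored as fractions (0.5983 = 59.83%):

```python
import statistics

def summarize_ter(ters: list[float]) -> dict:
    """Summary statistics over per-tablet TER scores (fractions, not %)."""
    ranked = sorted(ters)
    return {
        "mean": statistics.mean(ranked),
        "median": statistics.median(ranked),
        "p25": ranked[len(ranked) // 4],           # 25th percentile (nearest rank)
        "perfect": sum(t == 0.0 for t in ranked),  # exact transcriptions
        "good": sum(t < 0.10 for t in ranked),     # "good" = TER < 10%
    }
```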
+
+ Qualitative analysis of the best predictions reveals what the model learned:
+
+ - **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across all top predictions. The model reliably learned tablet organization.
+ - **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
+ - **Visually similar signs cause systematic errors.** Signs that differ by only one or two small wedge marks (e.g., 𒀀 and its near look-alikes) were frequently confused. This is an understandable error given their near-identical appearance.
+ - **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.
+
+ **After GRPO:**
+
+ [Results pending: training still in progress]
+
 
  ## Reflection