boatbomber committed on
Commit a0fa8ae · 1 Parent(s): 699404c

Update story with GRPO results

Files changed (1)
  1. STORY.md +13 -5
STORY.md CHANGED
@@ -104,7 +104,7 @@ The initial design tried to encourage correct "shape" (line count, characters pe

  ### The Final Reward Design

- I simplified to five reward functions that aligned with how we actually judge transcription quality. All use smooth mappings of the form \\((1 - r) / (1 + r)\\) to avoid sharp cliffs and provide continuous gradients.

  #### Visual Similarity for Cuneiform Glyphs

@@ -176,23 +176,29 @@ I evaluated on a held-out test set of 1,000 tablets using TER. Lower is better;

  **Base model (PaddleOCR-VL 0.9B, no fine-tuning):**

- The base model achieved a mean TER of 99.67%, with median TER at exactly 100%. It produced zero good predictions (TER < 10%). In practice, it either output a handful of hallucinated characters or refused to engage with the task entirely. This confirms cuneiform is genuinely out-of-distribution as the model has no prior knowledge of the script.

  **After SFT:**

- Supervised fine-tuning reduced mean TER to 59.83% (median 61.05%), a 40 percentage point improvement. The model achieved one perfect transcription and two predictions under 10% TER. The 25th percentile reached 52.69% TER, meaning a quarter of predictions got nearly half the characters right.

  Qualitative analysis of the best predictions reveals what the model learned:

- - **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across all top predictions. The model reliably learned tablet organization.
  - **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
  - **Visually similar signs cause systematic errors.** Signs differing by one or two small wedge marks (𒐉 vs 𒀀, π’Œ vs π’Š) were frequently confused. This is an understandable error given their near-identical appearance.
  - **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.

  **After GRPO:**

- [Results pending: training still in progress]

  ## Reflection
 
@@ -200,4 +206,6 @@ This project became as much about infrastructure, debugging, and reward design a

  Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.

  The tablets have waited five thousand years to be understood. With better tools, maybe we won't keep them waiting much longer.
 

  ### The Final Reward Design

+ After many experimental runs (and failed runs), I simplified to five reward functions that aligned with how we actually judge transcription quality. All use smooth mappings of the form \\((1 - r) / (1 + r)\\) to avoid sharp cliffs and provide continuous gradients.
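To make the shape of that mapping concrete, here is a minimal sketch of the curve; the function name and the reading of `r` as a normalized error measure are my assumptions, not code from this repo. If `r` stays in [0, 1] the reward stays in [0, 1]; larger errors decay smoothly toward an asymptote at -1, so there is no cliff anywhere for the policy gradient to fall off.

```python
# Minimal sketch of the smooth reward mapping described above.
# Assumption: `error_ratio` is a non-negative error measure (e.g. edit
# distance divided by reference length), where 0 means a perfect match.
# The names below are illustrative, not the project's actual API.

def smooth_reward(error_ratio: float) -> float:
    """Map an error ratio r >= 0 to a reward via (1 - r) / (1 + r).

    r = 0    -> 1.0  (perfect)
    r = 1    -> 0.0
    r -> inf -> -1.0 (smooth asymptotic floor, no sharp cliff)
    """
    return (1.0 - error_ratio) / (1.0 + error_ratio)


if __name__ == "__main__":
    for r in (0.0, 0.25, 0.5, 1.0, 2.0, 5.0):
        print(f"r={r:>4}: reward={smooth_reward(r):+.3f}")
```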

  #### Visual Similarity for Cuneiform Glyphs

  **Base model (PaddleOCR-VL 0.9B, no fine-tuning):**

+ The base model achieved a mean TER of 99.67%, with median TER at exactly 100%. It produced zero good predictions. In practice, it either output a handful of hallucinated characters or refused to engage with the task entirely. This confirms that cuneiform is genuinely out-of-distribution: the model has no prior knowledge of the script.
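For context on the numbers that follow, here is a minimal sketch of a TER-style metric as I read it: character-level edit distance normalized by reference length, capped at 100%. The helper names and the capping are assumptions, not the project's evaluation code.

```python
# Sketch of a character-level TER metric: Levenshtein distance between the
# predicted and reference transcriptions, divided by the reference length.
# Assumption: this approximates the "TER" reported here; names are illustrative.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ter(prediction: str, reference: str) -> float:
    """Transcription error rate as a percentage, capped at 100%."""
    if not reference:
        return 0.0 if not prediction else 100.0
    rate = levenshtein(prediction, reference) / len(reference)
    return min(rate, 1.0) * 100.0
```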

  **After SFT:**

+ Supervised fine-tuning reduced median TER to 60.48%, a ~39 percentage point improvement. The 25th percentile reached 52.81% TER, meaning a quarter of predictions got nearly half the characters right.

  Qualitative analysis of the best predictions reveals what the model learned:

+ - **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across the top predictions. The model had learned tablet organization.
  - **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
  - **Visually similar signs cause systematic errors.** Signs differing by one or two small wedge marks (𒐉 vs 𒀀, π’Œ vs π’Š) were frequently confused. This is an understandable error given their near-identical appearance.
  - **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.
 
  **After GRPO:**

+ Reinforcement learning helped sharpen the model, increasing overall reward by 127% over the course of training. This lowered median TER from 60.48% to 59.70%, and the 98th percentile dropped from 93.78% to 75.40% TER, meaning worst-case performance improved dramatically.
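The distribution figures quoted here (median, 98th percentile, and the standard deviation mentioned below) summarize per-tablet TER scores. Here is a small sketch of how such a summary could be computed, assuming a list of per-tablet TER percentages; the helper and its use of numpy are illustrative rather than the repo's actual evaluation script.

```python
# Sketch: summarizing per-tablet TER scores into the statistics reported
# above (mean, median, tail percentiles, standard deviation).
# Assumption: `ters` is a list of TER percentages, one per held-out tablet.
import numpy as np


def summarize_ter(ters: list[float]) -> dict[str, float]:
    scores = np.asarray(ters, dtype=float)
    return {
        "mean": float(scores.mean()),
        "median": float(np.median(scores)),
        "p25": float(np.percentile(scores, 25)),
        "p98": float(np.percentile(scores, 98)),  # worst-case tail
        "std": float(scores.std()),               # spread in response quality
    }
```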

+ Qualitative analysis of the predictions after GRPO:
+
+ - **More consistent outputs.** The model became more reliable and consistent in its behavior: the standard deviation of response quality was cut in half.
+ - **Visually closer to correct.** TER doesn't tell the whole story: the errors the model makes now are "less wrong," in that they are visually closer to the ground truth than the previous model's errors.
+ - **Long lines are hard.** The model struggled with accuracy on lines with many characters. These are less common, so they are perhaps under-represented in training.
+ - **Focused on obverse.** The obverse face has the highest accuracy, while the reverse and edge faces struggle. It seems the model found the highest rewards in training by matching the primary face closely, even if that meant sacrificing accuracy on the less common faces.

  ## Reflection

  Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.

+ In hindsight, what would I do differently? I wish I had made my reward functions apply per-face, to avoid the model focusing on getting one face correct at the expense of the others. I also could have spent more time ensuring my dataset was as clean as possible; noise from bad data created small grad-norm spikes that potentially harmed model performance. Overall, though, I am quite happy with how things turned out.
+
  The tablets have waited five thousand years to be understood. With better tools, maybe we won't keep them waiting much longer.