Commit a0fa8ae
Parent(s): 699404c
Update story with GRPO results

STORY.md CHANGED
@@ -104,7 +104,7 @@ The initial design tried to encourage correct "shape" (line count, characters pe

### The Final Reward Design

-I simplified to five reward functions that aligned with how we actually judge transcription quality. All use smooth mappings of the form \\((1 - r) / (1 + r)\\) to avoid sharp cliffs and provide continuous gradients.

#### Visual Similarity for Cuneiform Glyphs
@@ -176,23 +176,29 @@ I evaluated on a held-out test set of 1,000 tablets using TER. Lower is better;

**Base model (PaddleOCR-VL 0.9B, no fine-tuning):**

-The base model achieved a mean TER of 99.67%, with median TER at exactly 100%. It produced zero good predictions

**After SFT:**

-Supervised fine-tuning reduced

Qualitative analysis of the best predictions reveals what the model learned:

-- **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across
- **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
- **Visually similar signs cause systematic errors.** Signs differing by one or two small wedge marks (π vs π, π vs π) were frequently confused. This is an understandable error given their near-identical appearance.
- **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.

**After GRPO:**

-

## Reflection
@@ -200,4 +206,6 @@ This project became as much about infrastructure, debugging, and reward design a

Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.

The tablets have waited five thousand years to be understood. With better tools, maybe we won't keep them waiting much longer.

### The Final Reward Design

+After many experimental runs (and failed runs), I simplified to five reward functions that aligned with how we actually judge transcription quality. All use smooth mappings of the form \\((1 - r) / (1 + r)\\) to avoid sharp cliffs and provide continuous gradients.

#### Visual Similarity for Cuneiform Glyphs

**Base model (PaddleOCR-VL 0.9B, no fine-tuning):**

+The base model achieved a mean TER of 99.67%, with median TER at exactly 100%. It produced zero good predictions. In practice, it either output a handful of hallucinated characters or refused to engage with the task entirely. This confirms that cuneiform is genuinely out-of-distribution: the model has no prior knowledge of the script.

**After SFT:**

+Supervised fine-tuning reduced median TER to 60.48%, a ~39 percentage point improvement. The 25th percentile reached 52.81% TER, meaning a quarter of predictions got nearly half the characters right.

Qualitative analysis of the best predictions reveals what the model learned:

+- **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across the top predictions. The model had learned tablet organization.
- **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
- **Visually similar signs cause systematic errors.** Signs differing by one or two small wedge marks (π vs π, π vs π) were frequently confused. This is an understandable error given their near-identical appearance.
- **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.

**After GRPO:**

+Reinforcement learning helped sharpen the model, increasing overall reward by 127% over the course of training. This lowered median TER from 60.48% to 59.70%, and the 98th percentile fell from 93.78% to 75.40% TER, meaning worst-case performance improved dramatically.

+Qualitative analysis of the predictions after GRPO:

+- **More consistent outputs.** The model became more reliable and consistent in its behavior: the standard deviation of response quality was cut in half.
+- **Visually closer to correct.** TER doesn't tell the whole story; the errors the model makes now are "less wrong," visually closer to the ground truth than the previous model's errors.
+- **Long lines are hard.** The model struggled on lines with many characters. These are less common, so they are perhaps under-represented in training.
+- **Focused on the obverse.** The obverse face has the highest accuracy, while reverse and edge faces lag behind. The model seems to have found the highest reward during training by matching the primary face closely, even at the cost of accuracy on the less common faces.

## Reflection

Some of those choices were mistakes. Others were small but critical course corrections. By sharing both the successes and the missteps, I hope others can build on this work more quickly and push cuneiform OCR even further.

+In hindsight, what would I do differently? I wish I had made my reward functions apply per face, to avoid the model optimizing one face at the expense of the others. I could also have spent more time ensuring my dataset was as clean as possible; noise from bad data created small grad-norm spikes that potentially harmed model performance. Overall, though, I am quite happy with how things turned out.

The tablets have waited five thousand years to be understood. With better tools, maybe we won't keep them waiting much longer.
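
As an aside, the smooth reward mapping \\((1 - r) / (1 + r)\\) mentioned in the diff above can be sketched as follows. This is a minimal illustration, not the author's actual code; the function name and the reading of `r` as a non-negative error ratio are assumptions.

```python
def smooth_reward(r: float) -> float:
    """Map a non-negative error ratio r to a reward in (-1.0, 1.0].

    r = 0 (no error)  -> reward  1.0
    r = 1             -> reward  0.0
    r -> infinity     -> reward -> -1.0

    The mapping is smooth and monotonically decreasing, so any
    improvement in r yields a gradient, with no sharp reward cliffs.
    """
    if r < 0:
        raise ValueError("error ratio r must be non-negative")
    return (1.0 - r) / (1.0 + r)
```

Because the function is bounded and has no discontinuities, a policy-gradient method like GRPO always sees a usable learning signal, even for very wrong outputs.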
|