Commit 9f5c0fb (parent: afe3abb): Add WIP results

STORY.md
With this design, training became stable and outputs became coherent.

## Results
I evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER): the edit distance between the predicted and ground-truth transcriptions, normalized by the length of the ground truth:

$$\text{TER} = \frac{S + D + I}{N}$$

where $S$ is the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the number of tokens in the ground truth. Lower is better: 0% means a perfect transcription, and TER can exceed 100% when insertions outnumber correct tokens.
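
The metric itself fits in a few lines. A minimal sketch of token-level TER via standard edit distance (function and variable names are illustrative, not the project's evaluation code):

```python
def token_error_rate(predicted: str, reference: str) -> float:
    """TER = (S + D + I) / N over whitespace-separated tokens."""
    hyp = predicted.split()
    ref = reference.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # Wagner-Fischer dynamic programming: edit distance between the
    # reference and hypothesis token sequences, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1] / len(ref)

print(token_error_rate("a b c", "a b c"))      # 0.0: perfect transcription
print(token_error_rate("a x c", "a b c"))      # 0.333...: one substitution over three tokens
print(token_error_rate("a b c d e", "a"))      # 4.0: insertions push TER to 400%
```

The last call shows the property noted above: when insertions outnumber correct tokens, TER exceeds 100%.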

**Base model (PaddleOCR-VL 0.9B, no fine-tuning):**

The base model achieved a mean TER of 99.67%, with a median of exactly 100%. It produced zero good predictions (TER < 10%): in practice, it either output a handful of hallucinated characters or refused to engage with the task entirely. This confirms that cuneiform is genuinely out of distribution; the model has no prior knowledge of the script.

**After SFT:**

Supervised fine-tuning reduced mean TER to 59.83% (median 61.05%), an improvement of roughly 40 percentage points. The model produced one perfect transcription and two predictions under 10% TER. The 25th percentile reached 52.69% TER, meaning the best quarter of predictions got nearly half of their tokens right.
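
The aggregate figures above (mean, median, 25th percentile, counts under a threshold) follow from the per-tablet TER scores with standard summary statistics. A hedged sketch, where the `scores` list and the 10% "good" threshold are illustrative:

```python
import statistics

def summarize(scores: list[float]) -> dict:
    """Summary statistics over a list of per-tablet TER scores (0.0 = perfect)."""
    ordered = sorted(scores)
    # quantiles(n=4) returns the three quartile cut points; the first
    # is the 25th percentile.
    q1, _, _ = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p25": q1,
        "perfect": sum(s == 0.0 for s in ordered),
        "good": sum(s < 0.10 for s in ordered),  # TER under 10%
    }

print(summarize([0.0, 0.08, 0.45, 0.55, 0.62, 0.70, 1.0]))
```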

Qualitative analysis of the best predictions reveals what the model learned:

- **Structural understanding is strong.** Face markers (@obverse, @reverse, @left) were correct across all top predictions; the model reliably learned tablet organization.
- **Formulaic phrases transfer well.** Common administrative phrases were transcribed accurately across multiple tablets.
- **Visually similar signs cause systematic errors.** Signs differing by only one or two small wedge marks were frequently confused, an understandable error given their near-identical appearance.
- **Difficult lines trigger fallback behavior.** When the model couldn't read a line clearly, it sometimes substituted a memorized common phrase rather than attempting the actual content.
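
The structural claim in the first bullet is easy to check mechanically: extract the face markers from each transcription and compare the sequences. A small sketch, where the regex and the exact set of markers are assumptions based on ATF-style annotations:

```python
import re

# ATF-style face markers; the full set here is an assumption for
# illustration (the text above mentions @obverse, @reverse, @left).
FACE_RE = re.compile(r"^@(obverse|reverse|left|right|top|bottom)\b", re.MULTILINE)

def face_markers(transcription: str) -> list[str]:
    """Return the face markers in the order they appear."""
    return FACE_RE.findall(transcription)

pred = "@obverse\n1. udu\n@reverse\n1. mu\n@left\n1. iti"
gold = "@obverse\n1. udu\n@reverse\n1. mu\n@left\n1. iti"
print(face_markers(pred) == face_markers(gold))  # True: tablet structure matches
```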

**After GRPO:**

[Results pending: training still in progress]

## Reflection