boatbomber committed
Commit f118dd7 · 1 Parent(s): 3e1a4b0

Update docs

Files changed (2):
  1. README.md +3 -3
  2. STORY.md +39 -17
README.md CHANGED
@@ -81,12 +81,12 @@ Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoi
  - **LoRA rank:** 256 (RSLoRA with α=16)
  - **Trainable parameters:** 239M of 1.2B (20%)
  - **Generations per prompt:** 4
- - **Batch size:** 24
- - **Learning rate:** 2e-6 with cosine decay
+ - **Batch size:** 16
+ - **Learning rate:** 5e-6 with cosine decay
  - **Warmup:** 3% of training steps
  - **Optimizer:** AdamW (8-bit)
 
- The reward function combined five components: Token Error Rate (TER), repetition penalty, face marker accuracy, line structure accuracy, and cuneiform character ratio. The adapter was merged back into the base model at 16-bit precision.
+ The reward function combined five components: weighted Token Error Rate using glyph visual similarity and curriculum learning, length deviation penalty, repetition penalty, line structure accuracy, and cuneiform character ratio. The adapter was merged back into the base model at 16-bit precision.
 
 ![grpo-reward](./assets/grpo-reward.png)
 
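The "glyph visual similarity" referenced in the updated reward description is, per STORY.md below, a Dice coefficient over rendered binary glyph masks. As a minimal pure-Python sketch (representing each glyph as a set of (row, col) "on"-pixel coordinates; the actual pipeline renders 64×64 bitmaps with NotoSansCuneiform, and the function name here is illustrative, not the repository's code):

```python
def dice_coefficient(a, b):
    """Dice similarity 2|A∩B| / (|A| + |B|) between two binary glyph
    masks, each given as a set of (row, col) "on"-pixel coordinates."""
    if not a and not b:
        return 1.0  # two empty renders count as identical
    return 2.0 * len(a & b) / (len(a) + len(b))

# Toy glyphs: two "on" pixels each, overlapping in exactly one pixel.
glyph_x = {(0, 0), (0, 1)}
glyph_y = {(0, 0), (1, 1)}
print(dice_coefficient(glyph_x, glyph_x))  # 1.0
print(dice_coefficient(glyph_x, glyph_y))  # 2*1 / (2+2) = 0.5
```

Because the measure depends only on pixel overlap, precomputing it for every token pair (as the commit describes) reduces each lookup at training time to a table read.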
STORY.md CHANGED
@@ -58,12 +58,12 @@ For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure dur
 - **LoRA alpha:** 16 (using RSLoRA scaling: \\(\alpha = \sqrt{r}\\))
 - **Trainable parameters:** 239M of 1.2B (20%)
 - **Generations per prompt:** 4
- - **Batch size:** 24
- - **Learning rate:** 2e-6 with cosine decay (10× lower than SFT)
+ - **Batch size:** 16
+ - **Learning rate:** 5e-6 with cosine decay
 - **Warmup:** 3% of training steps
 - **Optimizer:** AdamW (8-bit), \\(\beta_1 = 0.9\\), \\(\beta_2 = 0.99\\)
 - **Weight decay:** 0.03
- - **Max gradient norm:** 0.8
+ - **Max gradient norm:** 1.5
 - **Loss type:** DR-GRPO
 
 I made three significant mistakes along the way that taught me some lessons. I've included them here in hopes that others can avoid making these same mistakes.
@@ -106,27 +106,49 @@ The initial design tried to encourage correct "shape" (line count, characters pe
 
 I simplified to five reward functions that aligned with how we actually judge transcription quality. All use smooth mappings of the form \\((1 - r) / (1 + r)\\) to avoid sharp cliffs and provide continuous gradients.
 
- **TER Reward** is the primary accuracy signal, based on Token Error Rate (TER), which measures edit distance normalized by ground truth length:
-
- $$\text{TER} = \frac{S + D + I}{N}$$
-
- where \\(S\\) is substitutions, \\(D\\) is deletions, \\(I\\) is insertions, and \\(N\\) is the number of tokens in the ground truth. The reward transforms this into:
-
- $$R_{\text{ter}} = 3 \cdot \frac{1 - 2 \cdot \text{TER}}{1 + 2 \cdot \text{TER}} \in [-3, 3]$$
+ #### Visual Similarity for Cuneiform Glyphs
+
+ A major challenge in cuneiform OCR is that many signs are visually similar, differing by only one or two small wedge marks. Standard edit distance treats all substitution errors equally, which doesn't reflect the reality that confusing 𒐉 with 𒀀 (nearly identical shapes) is more forgivable than confusing 𒐉 with 𒀭 (completely different).
+
+ To address this, I built a visual similarity matrix using the Dice coefficient. For each cuneiform token in the vocabulary, I rendered it as a 64×64 binary image using the NotoSansCuneiform font, then computed pairwise similarities:
+
+ $$\text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}$$
+
+ where \\(A\\) and \\(B\\) are binary masks of the rendered glyphs, and \\(|\cdot|\\) denotes pixel count. The Dice coefficient ranges from 0 (completely different) to 1 (identical). This precomputed matrix of 424,452 pairwise similarities is cached to disk.
+
+ These similarities are used in a weighted Levenshtein distance where substitution costs are reduced for visually similar tokens:
+
+ $$\text{cost}_{\text{sub}}(a, b) = 1.0 - (\text{Dice}(a, b) \times 0.7) \in [0.3, 1.0]$$
+
+ Insertion and deletion costs remain 1.0. This gives partial credit for "close" mistakes, which is crucial for stable reinforcement learning when the model is still learning to distinguish similar signs.
+
+ **Annealed Weighted TER Reward** is the primary accuracy signal. It uses the weighted Levenshtein distance described above, normalized by ground truth length to produce Token Error Rate (TER):
+
+ $$\text{TER}_{\text{weighted}} = \frac{\text{weighted\_distance}(\text{pred}, \text{truth})}{N}$$
+
+ where \\(N\\) is the number of tokens in the ground truth. To implement curriculum learning, the similarity scaling is annealed over training:
+
+ $$\text{sharpness} = 0.3 + 0.7 \times \frac{\text{step}}{\text{total\_steps}}$$
+
+ $$\text{similarity}_{\text{scaled}}(a, b) = \text{Dice}(a, b) \times (1.0 - \text{sharpness})$$
+
+ Early in training (sharpness = 0.3), the model receives generous partial credit for visually similar substitutions. Late in training (sharpness = 1.0), the similarity scaling approaches zero, and the reward converges to standard TER. The reward transforms TER into:
+
+ $$R_{\text{ter}} = 3 \cdot \frac{1 - 2 \cdot \text{TER}_{\text{weighted}}}{1 + 2 \cdot \text{TER}_{\text{weighted}}} \in [-3, 3]$$
 
 At 0% error this gives +3, at 50% error it gives 0, and it approaches -3 as errors increase. This dominates the reward signal because accurate transcription is the ultimate goal.
 
- **Repetition Reward** penalizes excess token repetitions beyond what the ground truth contains:
-
- $$R_{\text{rep}} = \frac{-r}{1 + r} \in [-1, 0], \quad r = \frac{\max(0, \text{pred\_reps} - \text{truth\_reps})}{N}$$
-
- where \\(N\\) is the ground truth length. This specifically targets the repetition loops that plagued earlier training runs without penalizing legitimate repeated characters that appear in the ground truth.
-
- **Faces Reward** captures correct identification of tablet face markers (@obverse, @reverse, etc.):
-
- $$R_{\text{faces}} = 0.5 \cdot \frac{1 - r}{1 + r} \in [-0.5, 0.5], \quad r = \frac{e}{\max(1, n)}$$
-
- where \\(e\\) is the count of incorrect face markers (missing, duplicated, or unexpected) and \\(n\\) is the expected count.
+ **Length Reward** penalizes length deviation from the expected output:
+
+ $$R_{\text{len}} = \frac{-0.5 \cdot d}{1 + d} \in [-0.5, 0], \quad d = \left| 1 - \frac{|\text{pred}|}{N} \right|$$
+
+ where \\(N\\) is the ground truth length. This works with the weighted TER reward to ensure symmetric penalties for insertions and deletions, since the weighted edit distance alone is normalized by truth length.
+
+ **Repetition Reward** penalizes excess token repetitions beyond what the ground truth contains:
+
+ $$R_{\text{rep}} = \frac{-r}{1 + r} \in [-1, 0], \quad r = \frac{\max(0, \text{pred\_reps} - \text{truth\_reps})}{N}$$
+
+ where \\(N\\) is the ground truth length. This specifically targets the repetition loops that plagued earlier training runs without penalizing legitimate repeated characters that appear in the ground truth.
 
 **Lines Reward** captures structural correctness by measuring deviation from expected line count and line lengths:
 
@@ -136,15 +158,15 @@ where \\(d\\) is the average deviation across line count and per-line character
 
 **Character Reward** encourages using cuneiform script over other characters:
 
- $$R_{\text{char}} = 0.2 \cdot \frac{c}{c + u} - 0.1 \in [-0.1, 0.1]$$
+ $$R_{\text{char}} = \frac{c}{c + u} - 0.5 \in [-0.5, 0.5]$$
 
- where \\(c\\) is the count of cuneiform characters and \\(u\\) is unwanted characters (excluding whitespace, punctuation, and face markers).
+ where \\(c\\) is the count of cuneiform characters (U+12000–U+1254F) and \\(u\\) is unwanted characters (excluding whitespace, punctuation, structural markers, and face markers).
 
 **Total Reward**
 
- $$R_{\text{total}} = R_{\text{ter}} + R_{\text{rep}} + R_{\text{faces}} + R_{\text{lines}} + R_{\text{char}} \in [-4.85, 3.85]$$
+ $$R_{\text{total}} = R_{\text{ter}} + R_{\text{len}} + R_{\text{rep}} + R_{\text{lines}} + R_{\text{char}} \in [-5.25, 3.75]$$
 
- The TER reward dominates by design, with the other components providing shaping signals that help the model avoid common failure modes during early training.
+ The weighted TER reward dominates by design, with curriculum learning gradually increasing its strictness. The other components provide shaping signals that help the model avoid common failure modes during training.
 
 With this design, training became stable and outputs became coherent.
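To make the annealed weighted TER described in the STORY.md changes concrete, here is a minimal sketch of a substitution-weighted Levenshtein distance and the reward transform. Function names and signatures are illustrative assumptions, not the repository's actual code; `sim(a, b)` stands in for a lookup into the precomputed Dice similarity matrix.

```python
def weighted_levenshtein(pred, truth, sim, sub_scale=0.7):
    """Edit distance where substituting visually similar tokens costs less.

    Substitution cost is 1.0 - sim(a, b) * sub_scale (so in [0.3, 1.0]
    for sub_scale=0.7 and sim in [0, 1]); insert/delete both cost 1.0.
    """
    m, n = len(pred), len(truth)
    # dp[i][j] = cheapest way to turn pred[:i] into truth[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)
    for j in range(1, n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == truth[j - 1]:
                sub = 0.0
            else:
                sub = 1.0 - sim(pred[i - 1], truth[j - 1]) * sub_scale
            dp[i][j] = min(dp[i - 1][j] + 1.0,       # deletion
                           dp[i][j - 1] + 1.0,       # insertion
                           dp[i - 1][j - 1] + sub)   # substitution
    return dp[m][n]

def annealed_ter_reward(pred, truth, sim, step, total_steps):
    """R_ter = 3 * (1 - 2*TER) / (1 + 2*TER), with similarity credit
    annealed away as training progresses (curriculum learning)."""
    sharpness = 0.3 + 0.7 * step / total_steps
    scaled = lambda a, b: sim(a, b) * (1.0 - sharpness)
    ter = weighted_levenshtein(pred, truth, scaled, sub_scale=1.0) / max(1, len(truth))
    return 3.0 * (1.0 - 2.0 * ter) / (1.0 + 2.0 * ter)
```

With a similarity that is always zero this reduces to standard Levenshtein distance, and at step 0 the scaled similarity factor is 0.7, matching the [0.3, 1.0] substitution-cost range given in the commit.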