Commit f118dd7
Parent: 3e1a4b0
Update docs

README.md CHANGED
@@ -81,12 +81,12 @@ Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint
- **LoRA rank:** 256 (RSLoRA with α=16)
- **Trainable parameters:** 239M of 1.2B (20%)
- **Generations per prompt:** 4
- **Batch size:** 16
- **Learning rate:** 5e-6 with cosine decay
- **Warmup:** 3% of training steps
- **Optimizer:** AdamW (8-bit)

The reward function combined five components: weighted Token Error Rate using glyph visual similarity and curriculum learning, length deviation penalty, repetition penalty, line structure accuracy, and cuneiform character ratio. The adapter was merged back into the base model at 16-bit precision.
STORY.md CHANGED
@@ -58,12 +58,12 @@ For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure during training
@@ -106,27 +106,49 @@ The initial design tried to encourage correct "shape" (line count, characters per line, …)
@@ -136,15 +158,15 @@ where \\(d\\) is the average deviation across line count and per-line character …
- **LoRA alpha:** 16 (using RSLoRA scaling: \\(\alpha = \sqrt{r}\\))
- **Trainable parameters:** 239M of 1.2B (20%)
- **Generations per prompt:** 4
- **Batch size:** 16
- **Learning rate:** 5e-6 with cosine decay
- **Warmup:** 3% of training steps
- **Optimizer:** AdamW (8-bit), \\(\beta_1 = 0.9\\), \\(\beta_2 = 0.99\\)
- **Weight decay:** 0.03
- **Max gradient norm:** 1.5
- **Loss type:** DR-GRPO
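If the run used Hugging Face TRL's GRPO trainer (this diff does not show the training script, so this is a hypothetical mapping), the hyperparameters above would translate to a config roughly like the following; argument names follow recent TRL versions and may differ in yours:

```python
from trl import GRPOConfig

# Hypothetical mapping of the listed hyperparameters onto TRL's GRPOConfig;
# the repo's actual training code is not part of this diff.
config = GRPOConfig(
    num_generations=4,               # generations per prompt
    per_device_train_batch_size=16,  # batch size
    learning_rate=5e-6,
    lr_scheduler_type="cosine",      # cosine decay
    warmup_ratio=0.03,               # 3% of training steps
    optim="adamw_8bit",              # 8-bit AdamW via bitsandbytes
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.03,
    max_grad_norm=1.5,
    loss_type="dr_grpo",             # DR-GRPO loss variant
)
```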

I made three significant mistakes along the way, each of which taught me a lesson. I've included them here in the hope that others can avoid them.

I simplified to five reward functions that aligned with how we actually judge transcription quality. All use smooth mappings of the form \\((1 - r) / (1 + r)\\) to avoid sharp cliffs and provide continuous gradients.
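As a quick illustration of that mapping's shape (a minimal sketch, not code from the repo):

```python
def smooth_map(r: float) -> float:
    """The (1 - r) / (1 + r) mapping: 1 at r = 0, 0 at r = 1, approaching -1 as r grows."""
    return (1.0 - r) / (1.0 + r)

print(smooth_map(0.0))  # 1.0
print(smooth_map(1.0))  # 0.0
print(smooth_map(3.0))  # -0.5
```

The derivative is finite for all \\(r \ge 0\\), so the reward never jumps as the error measure crosses a threshold.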

#### Visual Similarity for Cuneiform Glyphs

A major challenge in cuneiform OCR is that many signs are visually similar, differing by only one or two small wedge marks. Standard edit distance treats all substitution errors equally, which doesn't reflect the reality that confusing 𒐉 with 𒀀 (nearly identical shapes) is more forgivable than confusing 𒐉 with 𒀭 (completely different).

To address this, I built a visual similarity matrix using the Dice coefficient. For each cuneiform token in the vocabulary, I rendered it as a 64×64 binary image using the NotoSansCuneiform font, then computed pairwise similarities:

$$\text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}$$

where \\(A\\) and \\(B\\) are binary masks of the rendered glyphs, and \\(|\cdot|\\) denotes pixel count. The Dice coefficient ranges from 0 (completely different) to 1 (identical). This precomputed matrix of 424,452 pairwise similarities is cached to disk.
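The Dice computation itself is simple once the masks exist. Here is a self-contained sketch with synthetic 64×64 masks standing in for glyphs rasterized with NotoSansCuneiform (the repo's rendering code isn't shown in this diff):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient of two boolean masks: 2|A ∩ B| / (|A| + |B|)."""
    total = int(a.sum()) + int(b.sum())
    if total == 0:
        return 1.0  # two empty masks are trivially identical
    return 2.0 * int(np.logical_and(a, b).sum()) / total

# Synthetic "glyphs": two overlapping 32x32 squares on a 64x64 canvas.
a = np.zeros((64, 64), dtype=bool)
a[16:48, 16:48] = True
b = np.zeros((64, 64), dtype=bool)
b[16:48, 24:56] = True

print(dice(a, a))  # 1.0
print(dice(a, b))  # 0.75
```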

These similarities are used in a weighted Levenshtein distance where substitution costs are reduced for visually similar tokens:

$$\text{cost}_{\text{sub}}(a, b) = 1.0 - (\text{Dice}(a, b) \times 0.7) \in [0.3, 1.0]$$

Insertion and deletion costs remain 1.0. This gives partial credit for "close" mistakes, which is crucial for stable reinforcement learning when the model is still learning to distinguish similar signs.
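A minimal dynamic-programming sketch of this weighted distance (the repo's actual implementation isn't shown; `sim` stands in for a lookup into the precomputed Dice matrix):

```python
def weighted_levenshtein(pred, truth, sim, sub_scale=0.7):
    """Edit distance where substituting visually similar tokens is cheaper.

    sim(a, b) returns a Dice similarity in [0, 1]; substitution then costs
    1.0 - sub_scale * sim(a, b), while insertions and deletions cost 1.0.
    """
    m, n = len(pred), len(truth)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == truth[j - 1]:
                sub = 0.0
            else:
                sub = 1.0 - sub_scale * sim(pred[i - 1], truth[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # (weighted) substitution
    return d[m][n]

# With zero similarity everywhere this reduces to plain Levenshtein distance:
print(weighted_levenshtein("kitten", "sitting", lambda a, b: 0.0))  # 3.0
```

A maximally similar pair costs only 0.3 to substitute, matching the \\([0.3, 1.0]\\) range above.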

**Annealed Weighted TER Reward** is the primary accuracy signal. It uses the weighted Levenshtein distance described above, normalized by ground truth length to produce Token Error Rate (TER):

$$\text{TER}_{\text{weighted}} = \frac{\text{weighted\_distance}(\text{pred}, \text{truth})}{N}$$

where \\(N\\) is the number of tokens in the ground truth. To implement curriculum learning, the similarity scaling is annealed over training:

$$\text{sharpness} = 0.3 + 0.7 \times \frac{\text{step}}{\text{total\_steps}}$$

$$\text{similarity}_{\text{scaled}}(a, b) = \text{Dice}(a, b) \times (1.0 - \text{sharpness})$$

Early in training (sharpness = 0.3), the model receives generous partial credit for visually similar substitutions. Late in training (sharpness = 1.0), the similarity scaling approaches zero, and the reward converges to standard TER. The reward transforms TER into:

$$R_{\text{ter}} = 3 \cdot \frac{1 - 2 \cdot \text{TER}_{\text{weighted}}}{1 + 2 \cdot \text{TER}_{\text{weighted}}} \in [-3, 3]$$

At 0% error this gives +3, at 50% error it gives 0, and it approaches -3 as errors increase. This dominates the reward signal because accurate transcription is the ultimate goal.
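These pieces follow directly from the formulas above (a sketch, not the repo's code):

```python
def sharpness(step: int, total_steps: int) -> float:
    """Curriculum schedule: anneals from 0.3 at step 0 to 1.0 at the final step."""
    return 0.3 + 0.7 * step / total_steps

def scaled_similarity(dice: float, step: int, total_steps: int) -> float:
    """Similarity credit shrinks to zero as sharpness reaches 1.0."""
    return dice * (1.0 - sharpness(step, total_steps))

def ter_reward(weighted_ter: float) -> float:
    """Maps weighted TER to [-3, 3]: +3 at 0% error, 0 at 50%, approaching -3 as errors grow."""
    return 3.0 * (1.0 - 2.0 * weighted_ter) / (1.0 + 2.0 * weighted_ter)

print(sharpness(0, 1000))  # 0.3
print(ter_reward(0.0))     # 3.0
print(ter_reward(0.5))     # 0.0
```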

**Length Reward** penalizes length deviation from the expected output:

$$R_{\text{len}} = \frac{-0.5 \cdot d}{1 + d} \in [-0.5, 0], \quad d = \left| 1 - \frac{|\text{pred}|}{N} \right|$$

where \\(N\\) is the ground truth length. This works with the weighted TER reward to ensure symmetric penalties for insertions and deletions, since the weighted edit distance alone is normalized by truth length.

**Repetition Reward** penalizes excess token repetitions beyond what the ground truth contains:

$$R_{\text{rep}} = \frac{-r}{1 + r} \in [-1, 0], \quad r = \frac{\max(0, \text{pred\_reps} - \text{truth\_reps})}{N}$$

where \\(N\\) is the ground truth length. This specifically targets the repetition loops that plagued earlier training runs without penalizing legitimate repeated characters that appear in the ground truth.
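Both shaping terms can be sketched directly from the definitions. How `pred_reps` and `truth_reps` are counted isn't specified in this excerpt, so the counts are taken as inputs:

```python
def length_reward(pred_len: int, truth_len: int) -> float:
    """R_len = -0.5*d / (1 + d), in [-0.5, 0], with d = |1 - pred_len / N|."""
    d = abs(1.0 - pred_len / truth_len)
    return -0.5 * d / (1.0 + d)

def repetition_reward(pred_reps: int, truth_reps: int, truth_len: int) -> float:
    """R_rep = -r / (1 + r), in [-1, 0], with r = max(0, pred_reps - truth_reps) / N."""
    r = max(0, pred_reps - truth_reps) / truth_len
    return -r / (1.0 + r)
```

An exact length match gives 0; doubling the length gives \\(d = 1\\) and a penalty of \\(-0.25\\); matching the ground truth's own repetition count incurs no repetition penalty.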

**Lines Reward** captures structural correctness by measuring deviation from expected line count and line lengths:

[…]

**Character Reward** encourages using cuneiform script over other characters:

$$R_{\text{char}} = \frac{c}{c + u} - 0.5 \in [-0.5, 0.5]$$

where \\(c\\) is the count of cuneiform characters (U+12000–U+1254F) and \\(u\\) is unwanted characters (excluding whitespace, punctuation, structural markers, and face markers).
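A sketch of this ratio. The exact sets of ignored punctuation and structural/face markers aren't listed in this excerpt, so plain whitespace and ASCII punctuation stand in for them:

```python
import string

CUNEIFORM = range(0x12000, 0x12550)  # U+12000-U+1254F

def char_reward(text: str) -> float:
    """R_char = c / (c + u) - 0.5, in [-0.5, 0.5]."""
    c = u = 0
    for ch in text:
        if ord(ch) in CUNEIFORM:
            c += 1                                      # cuneiform character
        elif ch.isspace() or ch in string.punctuation:
            continue                                    # ignored, counts as neither
        else:
            u += 1                                      # unwanted character
    total = c + u
    return c / total - 0.5 if total else 0.0

print(char_reward("\U00012000\U00012000"))  # 0.5 (all cuneiform)
print(char_reward("abc"))                   # -0.5 (no cuneiform)
```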

**Total Reward**

$$R_{\text{total}} = R_{\text{ter}} + R_{\text{len}} + R_{\text{rep}} + R_{\text{lines}} + R_{\text{char}} \in [-4.25, 3.75]$$

The weighted TER reward dominates by design, with curriculum learning gradually increasing its strictness. The other components provide shaping signals that help the model avoid common failure modes during training.

With this design, training became stable and outputs became coherent.