boatbomber
/

NabuOCR

@@ -66,13 +66,17 @@ For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure dur
 - **Max gradient norm:** 0.5
 - **Loss type:** DR-GRPO
-I made three significant mistakes along the way.
-**GSPO was the wrong algorithm.** I tried it because it was new and interesting without considering whether it fit my task. Cuneiform transcription quality is fundamentally character-level: did you get each glyph right? GSPO rewards sequence-level completions, which made the reward signal noisy and unhelpful. GRPO was better suited because it performs importance sampling at the token level, letting the model learn which glyph choices lead to better transcriptions.
 *Lesson: Before adopting a new algorithm, verify its inductive biases match your task's structure. Novelty isn't a reason to use something.*
-**I misconfigured RSLoRA.** When I switched to GRPO, I uncovered a subtle but critical error in my LoRA setup. Standard LoRA scales adapter contributions by:
 $$\hat{W} = W + \frac{\alpha}{r} \times AB$$
@@ -92,7 +96,9 @@ This misconfiguration made LoRA updates an order of magnitude too aggressive. On
 *Lesson: When switching algorithms, trace the math end-to-end. Defaults that work in one might not translate well to the other.*
-**My reward function was exploitable.** The initial design tried to encourage correct "shape" (line count, characters per line) along with TER (token error rate). The model learned to exploit this immediately. It would spam the same cuneiform character into the correct shape, ignoring accuracy entirely. Classic reward hacking.
 *Lesson: If a reward can be gamed, it will be. Design rewards that only improve when the actual goal improves.*

 - **Max gradient norm:** 0.5
 - **Loss type:** DR-GRPO
+I made three significant mistakes along the way that taught me some lessons. I've included them here in hopes that others can avoid making these same mistakes.
+1. GSPO was the wrong algorithm.
+I tried it because it was new and interesting without considering whether it fit my task. Cuneiform transcription quality is fundamentally character-level: did you get each glyph right? GSPO rewards sequence-level completions, which made the reward signal noisy and unhelpful. GRPO was better suited because it performs importance sampling at the token level, letting the model learn which glyph choices lead to better transcriptions.
 *Lesson: Before adopting a new algorithm, verify its inductive biases match your task's structure. Novelty isn't a reason to use something.*
+2. I misconfigured RSLoRA.
+When I switched to GRPO, I uncovered a subtle but critical error in my LoRA setup. Standard LoRA scales adapter contributions by:
 $$\hat{W} = W + \frac{\alpha}{r} \times AB$$
 *Lesson: When switching algorithms, trace the math end-to-end. Defaults that work in one might not translate well to the other.*
+3. My reward function was exploitable.
+The initial design tried to encourage correct "shape" (line count, characters per line) along with TER (token error rate). The model learned to exploit this immediately. It would spam the same cuneiform character into the correct shape, ignoring accuracy entirely. Classic reward hacking.
 *Lesson: If a reward can be gamed, it will be. Design rewards that only improve when the actual goal improves.*