Commit e0a9b97 (parent: 9f5c0fb)
Use KaTeX inline syntax

STORY.md (changed)
```diff
@@ -55,13 +55,13 @@ To push past SFT's limitations, I applied Group Relative Policy Optimization (GR
 For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure during the multi-generation rollouts. The configuration:
 
 - **LoRA rank:** 256
-- **LoRA alpha:** 32 (using RSLoRA scaling: …
+- **LoRA alpha:** 32 (using RSLoRA scaling: \\(\alpha = 2\sqrt{r}\\))
 - **Trainable parameters:** 239M of 1.2B (20%)
 - **Generations per prompt:** 5
 - **Batch size:** 10 × 3 gradient accumulation = 30 effective
 - **Learning rate:** 2e-6 with cosine decay (10× lower than SFT)
 - **Warmup:** 3% of training steps
-- **Optimizer:** AdamW (8-bit), …
+- **Optimizer:** AdamW (8-bit), \\(\beta_1 = 0.9\\), \\(\beta_2 = 0.99\\)
 - **Weight decay:** 0.03
 - **Max gradient norm:** 0.5
 - **Loss type:** DR-GRPO
@@ -84,11 +84,11 @@ With Rank-Stabilized LoRA (RSLoRA), the scaling changes to:
 
 $$\hat{W}_{\text{rslora}} = W + \frac{\alpha}{\sqrt{r}} \times AB$$
 
-I forgot to adjust alpha. With rank …
+I forgot to adjust alpha. With rank \\(r = 256\\) and my original \\(\alpha = 2r = 512\\), the effective scaling under RSLoRA was:
 
 $$\frac{512}{\sqrt{256}} = \frac{512}{16} = 32$$
 
-I had intended a scaling factor of 2. The fix was setting …
+I had intended a scaling factor of 2. The fix was setting \\(\alpha = 2\sqrt{r} = 32\\):
 
 $$\frac{32}{\sqrt{256}} = \frac{32}{16} = 2$$
@@ -110,13 +110,13 @@ I simplified to four reward functions that aligned with how we actually judge tr
 
 $$R_{\text{faces}} = \frac{1}{2} \cdot \frac{1 - r}{1 + r}, \quad r = \frac{e}{\max(1, n)}$$
 
-where …
+where \\(e\\) is the count of incorrect face markers and \\(n\\) is the expected count.
 
 **Character Reward** encourages using cuneiform script over other characters:
 
 $$R_{\text{char}} = 0.2 \cdot \frac{c}{c + u} - 0.1$$
 
-where …
+where \\(c\\) is the count of cuneiform characters and \\(u\\) is unwanted characters (excluding whitespace, punctuation, and face markers).
 
 **Length Reward** penalizes deviation from target length:
 
@@ -126,7 +126,7 @@ $$R_{\text{len}} = 0.2 - 0.4 \cdot \min\left(1, \frac{|L_{\text{target}} - L_{\t
 
 $$R_{\text{acc}} = 6 \left( 0.15 \cdot \frac{p}{n} + 0.85 \cdot \frac{m}{n} \right) - 3$$
 
-where …
+where \\(p\\) is the correct prefix length, \\(m\\) is the count of positionally correct tokens, and \\(n\\) is the ground truth length. Prefix accuracy encourages the model to get the beginning right and extend correctness step by step. Positional accuracy rewards characters that appear at the right indices regardless of what comes before. The 15/85 weighting reflects a deliberate choice: positional accuracy matters more because cuneiform tablets often have damaged or ambiguous openings, so we don't want to over-penalize models that struggle with the first few signs but recover well. Prefix accuracy still gets weight because it encourages the model to "lock in" correct sequences rather than scattering correct guesses randomly.
 
 **Total Reward**
 
@@ -142,7 +142,7 @@ I evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER)
 
 $$\text{TER} = \frac{S + D + I}{N}$$
 
-where …
+where \\(S\\) is substitutions, \\(D\\) is deletions, \\(I\\) is insertions, and \\(N\\) is the number of tokens in the ground truth. Lower is better; 0% means perfect transcription. TER can exceed 100% when insertions outnumber correct tokens.
 
 **Base model (PaddleOCR-VL 0.9B, no fine-tuning):**
 
```
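The RSLoRA alpha bug described in the diff is easy to verify numerically. A minimal sketch (the function names are illustrative, not from STORY.md):

```python
import math

def lora_scale(alpha: float, r: int) -> float:
    """Classic LoRA scaling factor: alpha / r."""
    return alpha / r

def rslora_scale(alpha: float, r: int) -> float:
    """Rank-Stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)

# The bug: alpha = 2r = 512 was chosen for classic LoRA (scale 2),
# but under RSLoRA it yields 512 / sqrt(256) = 32.
print(rslora_scale(512, 256))  # 32.0

# The fix: alpha = 2 * sqrt(r) = 32 restores the intended scale of 2.
print(rslora_scale(32, 256))   # 2.0
```

With rank 256, the same alpha that gives scale 2 under classic LoRA gives scale 32 under RSLoRA, which matches the 16× blowup worked through in the second hunk.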
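The reward formulas in the third and fourth hunks can be sketched directly. This is an illustrative transcription of the equations, not the author's code; the `max(1, c + u)` guard against empty outputs is my assumption, and the length reward is omitted because its denominator is truncated in the hunk header:

```python
def reward_faces(e: int, n: int) -> float:
    """Face reward: 0.5 * (1 - r) / (1 + r), with r = e / max(1, n),
    where e counts incorrect face markers and n is the expected count."""
    r = e / max(1, n)
    return 0.5 * (1 - r) / (1 + r)

def reward_char(c: int, u: int) -> float:
    """Character reward: 0.2 * c / (c + u) - 0.1, where c counts cuneiform
    characters and u counts unwanted ones. Guarding c + u == 0 with
    max(1, ...) is an assumption not stated in the diff."""
    return 0.2 * c / max(1, c + u) - 0.1

def reward_acc(p: int, m: int, n: int) -> float:
    """Accuracy reward: 6 * (0.15 * p/n + 0.85 * m/n) - 3, where p is the
    correct prefix length, m the count of positionally correct tokens,
    and n the ground truth length."""
    return 6 * (0.15 * p / n + 0.85 * m / n) - 3
```

The ranges fall out of the constants: the face reward peaks at 0.5 with zero errors, the character reward spans [-0.1, 0.1], and the accuracy reward spans [-3, 3], so accuracy dominates the total.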
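The TER metric from the last hunk is the token-level Levenshtein distance normalized by reference length. A generic sketch, assuming pre-tokenized sequences (the diff does not specify the author's tokenization):

```python
def token_error_rate(ref: list, hyp: list) -> float:
    """TER = (S + D + I) / N: minimum edit distance between the
    hypothesis and reference token sequences, divided by N = len(ref)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[n][m] / max(1, n)
```

As the diff's where-clause notes, TER can exceed 100%: a one-token reference against a three-token hypothesis costs two insertions, giving TER = 2.0.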