boatbomber committed · Commit e0a9b97 · 1 Parent(s): 9f5c0fb

Use KaTeX inline syntax

Files changed (1):
  1. STORY.md +8 -8

STORY.md CHANGED
@@ -55,13 +55,13 @@ To push past SFT's limitations, I applied Group Relative Policy Optimization (GR
  For GRPO, I switched from full fine-tuning to LoRA to reduce memory pressure during the multi-generation rollouts. The configuration:
 
  - **LoRA rank:** 256
- - **LoRA alpha:** 32 (using RSLoRA scaling: $\alpha = 2\sqrt{r}$)
+ - **LoRA alpha:** 32 (using RSLoRA scaling: \\(\alpha = 2\sqrt{r}\\))
  - **Trainable parameters:** 239M of 1.2B (20%)
  - **Generations per prompt:** 5
  - **Batch size:** 10 × 3 gradient accumulation = 30 effective
  - **Learning rate:** 2e-6 with cosine decay (10× lower than SFT)
  - **Warmup:** 3% of training steps
- - **Optimizer:** AdamW (8-bit), $\beta_1 = 0.9$, $\beta_2 = 0.99$
+ - **Optimizer:** AdamW (8-bit), \\(\beta_1 = 0.9\\), \\(\beta_2 = 0.99\\)
  - **Weight decay:** 0.03
  - **Max gradient norm:** 0.5
  - **Loss type:** DR-GRPO
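For anyone reproducing the setup, the configuration list above maps onto a LoRA adapter config roughly as follows. This is a sketch under the assumption that the Hugging Face `peft` library is used; `target_modules` is a placeholder, since the diff does not name the adapted modules:

```python
from peft import LoraConfig  # assumes the Hugging Face `peft` library

lora_config = LoraConfig(
    r=256,            # LoRA rank from the list above
    lora_alpha=32,    # alpha = 2 * sqrt(r) under RSLoRA scaling
    use_rslora=True,  # scale updates by alpha / sqrt(r) instead of alpha / r
    target_modules="all-linear",  # placeholder; match the actual model
)
```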
@@ -84,11 +84,11 @@ With Rank-Stabilized LoRA (RSLoRA), the scaling changes to:
 
  $$\hat{W}_{\text{rslora}} = W + \frac{\alpha}{\sqrt{r}} \times AB$$
 
- I forgot to adjust alpha. With rank $r = 256$ and my original $\alpha = 2r = 512$, the effective scaling under RSLoRA was:
+ I forgot to adjust alpha. With rank \\(r = 256\\) and my original \\(\alpha = 2r = 512\\), the effective scaling under RSLoRA was:
 
  $$\frac{512}{\sqrt{256}} = \frac{512}{16} = 32$$
 
- I had intended a scaling factor of 2. The fix was setting $\alpha = 2\sqrt{r} = 32$:
+ I had intended a scaling factor of 2. The fix was setting \\(\alpha = 2\sqrt{r} = 32\\):
 
  $$\frac{32}{\sqrt{256}} = \frac{32}{16} = 2$$
 
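The scaling bug described in the hunk above is easy to verify by computing the effective RSLoRA multiplier directly. A minimal sketch of the arithmetic (not the training code):

```python
import math

def rslora_scaling(alpha: float, rank: int) -> float:
    """Effective multiplier applied to the AB update under RSLoRA."""
    return alpha / math.sqrt(rank)

# The buggy configuration: alpha left at the classic-LoRA value 2r = 512.
buggy = rslora_scaling(alpha=512, rank=256)  # 512 / 16 = 32.0

# The fix: alpha = 2 * sqrt(r) = 32 restores the intended factor of 2.
fixed = rslora_scaling(alpha=32, rank=256)   # 32 / 16 = 2.0
```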
@@ -110,13 +110,13 @@ I simplified to four reward functions that aligned with how we actually judge tr
 
  $$R_{\text{faces}} = \frac{1}{2} \cdot \frac{1 - r}{1 + r}, \quad r = \frac{e}{\max(1, n)}$$
 
- where $e$ is the count of incorrect face markers and $n$ is the expected count.
+ where \\(e\\) is the count of incorrect face markers and \\(n\\) is the expected count.
 
  **Character Reward** encourages using cuneiform script over other characters:
 
  $$R_{\text{char}} = 0.2 \cdot \frac{c}{c + u} - 0.1$$
 
- where $c$ is the count of cuneiform characters and $u$ is unwanted characters (excluding whitespace, punctuation, and face markers).
+ where \\(c\\) is the count of cuneiform characters and \\(u\\) is unwanted characters (excluding whitespace, punctuation, and face markers).
 
  **Length Reward** penalizes deviation from target length:
 
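The two reward formulas above are simple enough to sanity-check directly. A minimal sketch, mine rather than code from the post; the `max(1, c + u)` guard for empty output is an added assumption:

```python
def faces_reward(e: int, n: int) -> float:
    """Faces reward: 0.5 when there are no face-marker errors (e = 0),
    0 when e = n, and negative once errors exceed the expected count."""
    r = e / max(1, n)
    return 0.5 * (1 - r) / (1 + r)

def char_reward(c: int, u: int) -> float:
    """Character reward: +0.1 for all-cuneiform output, -0.1 for none.
    The max(1, ...) guard for c + u = 0 is an assumption, not in the formula."""
    return 0.2 * c / max(1, c + u) - 0.1
```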
@@ -126,7 +126,7 @@ $$R_{\text{len}} = 0.2 - 0.4 \cdot \min\left(1, \frac{|L_{\text{target}} - L_{\t
 
  $$R_{\text{acc}} = 6 \left( 0.15 \cdot \frac{p}{n} + 0.85 \cdot \frac{m}{n} \right) - 3$$
 
- where $p$ is the correct prefix length, $m$ is the count of positionally correct tokens, and $n$ is the ground truth length. Prefix accuracy encourages the model to get the beginning right and extend correctness step by step. Positional accuracy rewards characters that appear at the right indices regardless of what comes before. The 15/85 weighting reflects a deliberate choice: positional accuracy matters more because cuneiform tablets often have damaged or ambiguous openings, so we don't want to over-penalize models that struggle with the first few signs but recover well. Prefix accuracy still gets weight because it encourages the model to "lock in" correct sequences rather than scattering correct guesses randomly.
+ where \\(p\\) is the correct prefix length, \\(m\\) is the count of positionally correct tokens, and \\(n\\) is the ground truth length. Prefix accuracy encourages the model to get the beginning right and extend correctness step by step. Positional accuracy rewards characters that appear at the right indices regardless of what comes before. The 15/85 weighting reflects a deliberate choice: positional accuracy matters more because cuneiform tablets often have damaged or ambiguous openings, so we don't want to over-penalize models that struggle with the first few signs but recover well. Prefix accuracy still gets weight because it encourages the model to "lock in" correct sequences rather than scattering correct guesses randomly.
 
  **Total Reward**
 
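The accuracy reward above combines a prefix term and a positional term. How p and m are counted here (per-character, aligned by index) is a reading of the post's description, not its actual evaluation code:

```python
def accuracy_reward(pred: str, truth: str) -> float:
    """R_acc = 6 * (0.15 * p/n + 0.85 * m/n) - 3, in [-3, 3]."""
    n = max(1, len(truth))
    # p: length of the longest correct prefix.
    p = 0
    for a, b in zip(pred, truth):
        if a != b:
            break
        p += 1
    # m: characters that are correct at their index, regardless of prefix.
    m = sum(a == b for a, b in zip(pred, truth))
    return 6 * (0.15 * p / n + 0.85 * m / n) - 3
```

A single wrong character early on costs the prefix term most of its value but leaves the positional term nearly intact, which is exactly the asymmetry the 15/85 weighting is after.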
@@ -142,7 +142,7 @@ I evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER)
 
  $$\text{TER} = \frac{S + D + I}{N}$$
 
- where $S$ is substitutions, $D$ is deletions, $I$ is insertions, and $N$ is the number of tokens in the ground truth. Lower is better; 0% means perfect transcription. TER can exceed 100% when insertions outnumber correct tokens.
+ where \\(S\\) is substitutions, \\(D\\) is deletions, \\(I\\) is insertions, and \\(N\\) is the number of tokens in the ground truth. Lower is better; 0% means perfect transcription. TER can exceed 100% when insertions outnumber correct tokens.
 
  **Base model (PaddleOCR-VL 0.9B, no fine-tuning):**
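The TER definition above is the Levenshtein edit distance normalized by the reference length. A standard dynamic-programming sketch, not the post's evaluation code:

```python
def token_error_rate(pred: list[str], truth: list[str]) -> float:
    """TER = (S + D + I) / N: minimal edits turning pred into truth,
    divided by N = len(truth). Can exceed 1.0 for long, wrong outputs."""
    m, n = len(pred), len(truth)
    # dist[i][j] = edits to turn pred[:i] into truth[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # i deletions
    for j in range(n + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dist[i - 1][j - 1] + (pred[i - 1] != truth[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[m][n] / max(1, n)
```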
 
 