amirali1985 committed on
Commit
d1a4946
·
1 Parent(s): bc93f38

LaTeX tab: add training details, results table, appendix; fix figure legend overlap

Files changed (1)
  1. app.py +153 -3
app.py CHANGED
@@ -92,14 +92,28 @@ Borrow cascades (UD) are the analogous structure in subtraction.
92
  \label{tab:quirke-subtasks}
93
  \end{table}
94

95
  \paragraph{Evaluation splits.}
96
  We evaluate on \emph{C-splits} (C1--C6) grouping problems by the length
97
  of the longest consecutive hot-carry chain with varied answer digits,
98
  following~\citet{quirke_2024_addsub_preprint}.
99
  C6 (six consecutive carries) is the hardest split.
100
- Across three undersized architectures and data sizes from 10K to 100K,
101
- \sorl{} wins on C6 in all 13 tested configurations, with gains as large
102
- as $+50$\,pp (44\% $\to$ 94\% on the smallest model at 50K examples;
103
  Table~\ref{tab:undersized-wins}).
104
  """
105
 
@@ -249,6 +263,133 @@ LATEX_FIGURE_EXAMPLE = r"""% fig_arithmetic_example.tex
249
  \end{figure}
250
  """
251

252
  HARD_SPLITS = ["add_C4", "add_C5", "add_C6", "sub_M4", "sub_M5"]
253
  ALL_SPLITS = [
254
  "add_S0", "add_S1", "add_S2", "add_S3", "add_S4", "add_S5", "add_S6", "add_random",
@@ -866,11 +1007,20 @@ hidden activations — but here it is readable directly from the token sequence.
866
  gr.Code(value=LATEX_ARITHMETIC_SETUP, label="arithmetic_setup.tex",
867
  language=None, interactive=False)
868

869
  gr.Markdown("#### § Carry-cascade example figure (TikZ)")
870
  gr.Markdown("Requires: `\\usepackage{tikz}`, `\\usetikzlibrary{matrix}`, `\\usepackage{xcolor}`, and `\\providecommand{\\sorl}{\\textsc{DLR}}`.")
871
  gr.Code(value=LATEX_FIGURE_EXAMPLE, label="fig_arithmetic_example.tex",
872
  language=None, interactive=False)
873

874
  # ── Tab 4: About ──
875
  with gr.TabItem("About"):
876
  eval_info_md = gr.Markdown("")
 
92
  \label{tab:quirke-subtasks}
93
  \end{table}
94
 
95
+ \paragraph{Models and training.}
96
+ We evaluate three undersized architectures:
97
+ \texttt{1L/2H/256d} (1 transformer layer, 2 attention heads, hidden size 256),
98
+ \texttt{1L/3H/510d} (1 layer, 3 heads, hidden size 510),
99
+ and \texttt{2L/1H/128d} (2 layers, 1 head, hidden size 128).
100
+ All models are trained from scratch with AdamW
101
+ ($\eta = 8{\times}10^{-5}$, $\beta_1{=}0.9$, $\beta_2{=}0.999$,
102
+ weight decay $0.01$, 3\% linear warmup),
103
+ batch size 64, for 20 epochs on fixed datasets of
104
+ 10K--100K six-digit addition/subtraction problems.
105
+ The abstraction codebook has $|\mathcal{A}|{=}30$ tokens with $K{=}1$
106
+ (one routing token per answer-digit position).
107
+
108
  \paragraph{Evaluation splits.}
109
  We evaluate on \emph{C-splits} (C1--C6) grouping problems by the length
110
  of the longest consecutive hot-carry chain with varied answer digits,
111
  following~\citet{quirke_2024_addsub_preprint}.
112
  C6 (six consecutive carries) is the hardest split.
113
+ Across the three architectures and data sizes from 10K to 100K,
114
+ \sorl{} wins in 12 of 13 tested configurations overall, and on C6 in
115
+ all 13, with gains as large as $+50$\,pp
116
+ (44\% $\to$ 94\% on the smallest model at 50K examples;
117
  Table~\ref{tab:undersized-wins}).
118
  """
119
 
 
263
  \end{figure}
264
  """
265
 
266
+ LATEX_TABLE_UNDERSIZED = r"""% tab:undersized-wins — SoRL vs SFT on undersized architectures
267
+ % Generated by arithmetic/paper/results/result_low_data_wins/run.py
268
+ % Requires: \usepackage{booktabs}, \usepackage{xcolor}, and the \sorl and \sft macros
269
+
270
+ \begin{table}[t]
271
+ \centering
272
+ \small
273
+ \begin{tabular}{llrrrr}
274
+ \toprule
275
+ Architecture & Data & Baseline & \sorl{} & Gap & C6 gap \\
276
+ \midrule
277
+ \texttt{1L/2H/256d} & 10K & 10\% & \textbf{19\%} & \textcolor{green!50!black}{\textbf{+9\%}} & \textcolor{green!50!black}{\textbf{+18\%}} \\
278
+ & 25K & 32\% & 26\% & $-7\%$ & \textcolor{green!50!black}{\textbf{+10\%}} \\
279
+ & 50K & 44\% & \textbf{65\%} & \textcolor{green!50!black}{\textbf{+21\%}} & \textcolor{green!50!black}{\textbf{+34\%}} \\
280
+ & 100K & 49\% & \textbf{65\%} & \textcolor{green!50!black}{\textbf{+16\%}} & \textcolor{green!50!black}{\textbf{+31\%}} \\
281
+ \midrule
282
+ \texttt{1L/3H/510d} & 10K & 36\% & \textbf{52\%} & \textcolor{green!50!black}{\textbf{+16\%}} & \textcolor{green!50!black}{\textbf{+30\%}} \\
283
+ & 25K & 46\% & \textbf{60\%} & \textcolor{green!50!black}{\textbf{+14\%}} & \textcolor{green!50!black}{\textbf{+22\%}} \\
284
+ & 50K & 53\% & \textbf{72\%} & \textcolor{green!50!black}{\textbf{+19\%}} & \textcolor{green!50!black}{\textbf{+38\%}} \\
285
+ & 100K & 67\% & \textbf{83\%} & \textcolor{green!50!black}{\textbf{+16\%}} & \textcolor{green!50!black}{\textbf{+26\%}} \\
286
+ \midrule
287
+ \texttt{2L/1H/128d} & 10K & 16\% & \textbf{36\%} & \textcolor{green!50!black}{\textbf{+21\%}} & \textcolor{green!50!black}{\textbf{+39\%}} \\
288
+ & 25K & 40\% & \textbf{55\%} & \textcolor{green!50!black}{\textbf{+15\%}} & \textcolor{green!50!black}{\textbf{+23\%}} \\
289
+ & 50K & 59\% & \textbf{87\%} & \textcolor{green!50!black}{\textbf{+28\%}} & \textcolor{green!50!black}{\textbf{+50\%}} \\
290
+ & 75K & 75\% & \textbf{87\%} & \textcolor{green!50!black}{\textbf{+12\%}} & \textcolor{green!50!black}{\textbf{+5\%}} \\
291
+ & 100K & 73\% & \textbf{95\%} & \textcolor{green!50!black}{\textbf{+22\%}} & \textcolor{green!50!black}{\textbf{+33\%}} \\
292
+ \bottomrule
293
+ \end{tabular}
294
+ \caption{\sorl{} ($K{=}1$, $|\mathcal{A}|{=}30$) vs.\ \sft{} baseline on
295
+ undersized architectures across data sizes.
296
+ \textbf{Gap} = overall accuracy gain in percentage points; \textbf{C6 gap} = the same gain on
297
+ 6-deep carry cascades (the hardest split).
298
+ \sorl{} wins in \textbf{12 of 13} (architecture, data-size) pairs;
299
+ the single exception is \texttt{1L/2H/256d} at 25K, where the model
300
+ is undertrained (accuracy still rising at epoch 20).
301
+ \sorl{} wins on C6 in \textbf{all 13} configurations.}
302
+ \label{tab:undersized-wins}
303
+ \end{table}
304
+ """
305
+
306
+ LATEX_APPENDIX = r"""% ── Appendix: Arithmetic interpretability details ──────────────────────────
307
+
308
+ \section{Arithmetic case study: experimental details}
309
+ \label{app:arithmetic}
310
+
311
+ \subsection{Task and data}
312
+
313
+ Six-digit addition and subtraction problems are formatted as:
314
+ \[
315
+ \underbrace{a_1 a_2 a_3 a_4 a_5 a_6}_{\text{operand A}}
316
+ \; \mathtt{+/-} \;
317
+ \underbrace{b_1 b_2 b_3 b_4 b_5 b_6}_{\text{operand B}}
318
+ \; \mathtt{=} \;
319
+ \underbrace{d_0 d_1 d_2 d_3 d_4 d_5 d_6}_{\text{answer (7 digits)}}
320
+ \]
321
+ where all operands are zero-padded to 6 digits and answers to 7 digits
322
+ (the leading $d_0$ captures the overflow carry or borrow).
323
+ Each symbol is mapped to a unique token via a fixed 13-symbol vocabulary
324
+ ($\mathtt{0}$--$\mathtt{9}$, $\mathtt{+}$, $\mathtt{-}$, $\mathtt{=}$),
325
+ giving sequences of exactly 21 tokens (14 prompt, 7 answer).
326
+
327
+ Training data is drawn uniformly at random from all valid 6-digit
328
+ addition/subtraction problems; subtraction problems are enriched to
329
+ over-represent borrow cascades (40\% of digit positions forced equal,
330
+ giving MB: 3\%, MB$_3$: 0.8\%, vs.\ 0.7\% and 0.04\% without enrichment).
331
+ Datasets are fixed (seed 42) and hosted on Hugging Face
332
+ (\texttt{thoughtworks/arithmetic-sorl-data}).
333
+
334
+ \subsection{Model architectures}
335
+
336
+ All models are decoder-only transformers trained from scratch using a
337
+ Qwen3-0.6B tokenizer (digit-level; each symbol = 1 token).
338
+ The three undersized configurations evaluated in Table~\ref{tab:undersized-wins}:
339
+
340
+ \begin{table}[h]
341
+ \centering
342
+ \small
343
+ \begin{tabular}{lrrrrr}
344
+ \toprule
345
+ Name & Layers & Heads & Hidden & FFN & Parameters \\
346
+ \midrule
347
+ \texttt{1L/2H/256d} & 1 & 2 & 256 & 1024 & $\sim$0.3M \\
348
+ \texttt{1L/3H/510d} & 1 & 3 & 510 & 2040 & $\sim$2.0M \\
349
+ \texttt{2L/1H/128d} & 2 & 1 & 128 & 512 & $\sim$0.1M \\
350
+ \bottomrule
351
+ \end{tabular}
352
+ \caption{Undersized architectures. All use pre-norm, GELU activation,
353
+ and the Qwen3 tokenizer.}
354
+ \end{table}
355
+
356
+ \subsection{Training hyperparameters}
357
+
358
+ \begin{table}[h]
359
+ \centering
360
+ \small
361
+ \begin{tabular}{ll}
362
+ \toprule
363
+ Hyperparameter & Value \\
364
+ \midrule
365
+ Optimizer & AdamW \\
366
+ Learning rate & $8 \times 10^{-5}$ \\
367
+ $\beta_1, \beta_2$ & $0.9,\; 0.999$ \\
368
+ Weight decay & $0.01$ \\
369
+ LR schedule & Linear warmup (3\%) then constant \\
370
+ Batch size & 64 \\
371
+ Epochs & 20 \\
372
+ \sorl{} codebook & $|\mathcal{A}|=30$, $K=1$ \\
373
+ \sorl{} loss weights & $\alpha_{\text{info-gain}}=10$, $\alpha_{\text{abs}}=0.1$, $\alpha_{\text{zipf}}=1.0$ \\
374
+ \bottomrule
375
+ \end{tabular}
376
+ \caption{Training hyperparameters shared across all undersized-architecture runs.}
377
+ \end{table}
378
+
379
+ \subsection{Evaluation}
380
+
381
+ We evaluate using fixed-length autoregressive decoding: the model generates
382
+ answer digits one at a time (left-to-right, $d_0 \to d_6$) using its own
383
+ predictions. Abstraction tokens are inserted via the \sorl{} recursion
384
+ (search-then-recurse), not sampled autoregressively, matching the training
385
+ procedure. Teacher forcing is never used at evaluation time.
386
+
387
+ Accuracy is measured on 100 held-out examples per evaluation split
388
+ (seed 42, hosted on Hugging Face). The C-splits (C1--C6) group problems
389
+ by the length of the longest consecutive carry chain with varied answer digits,
390
+ following~\citet{quirke_2024_addsub_preprint}.
391
+ """
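Note on the appendix's task format: the layout it describes (six-digit operands, a seven-digit answer, a 13-symbol vocabulary, 21 tokens per sequence) can be checked with a short Python sketch. The names `VOCAB`, `TOKEN_ID`, and `encode_addition` are hypothetical and do not come from app.py, and the symbol-to-id order is an assumption. Only addition is covered, because the appendix does not say how a negative difference is encoded.

```python
from __future__ import annotations

# Assumed symbol order; the appendix only fixes the set of 13 symbols.
VOCAB = list("0123456789+-=")
TOKEN_ID = {symbol: index for index, symbol in enumerate(VOCAB)}


def encode_addition(a: int, b: int) -> list[int]:
    """Encode a + b as 14 prompt tokens followed by 7 answer tokens."""
    if not (0 <= a < 10**6 and 0 <= b < 10**6):
        raise ValueError("operands must fit in six digits")
    # Operands are zero-padded to 6 digits, the answer to 7 digits.
    text = f"{a:06d}+{b:06d}={a + b:07d}"
    return [TOKEN_ID[ch] for ch in text]


tokens = encode_addition(999999, 1)
print(len(tokens))  # 21
```

For 999999 + 1 the encoded string is `999999+000001=1000000`, where the leading answer digit holds the overflow carry.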
392
+
393
  HARD_SPLITS = ["add_C4", "add_C5", "add_C6", "sub_M4", "sub_M5"]
394
  ALL_SPLITS = [
395
  "add_S0", "add_S1", "add_S2", "add_S3", "add_S4", "add_S5", "add_S6", "add_random",
 
1007
  gr.Code(value=LATEX_ARITHMETIC_SETUP, label="arithmetic_setup.tex",
1008
  language=None, interactive=False)
1009
 
1010
+ gr.Markdown("#### Β§ Results table β€” SoRL vs baseline on undersized architectures")
1011
+ gr.Markdown("Requires: `\\usepackage{booktabs}`, `\\usepackage{xcolor}`.")
1012
+ gr.Code(value=LATEX_TABLE_UNDERSIZED, label="tab_undersized_wins.tex",
1013
+ language=None, interactive=False)
1014
+
1015
  gr.Markdown("#### Β§ Carry-cascade example figure (TikZ)")
1016
  gr.Markdown("Requires: `\\usepackage{tikz}`, `\\usetikzlibrary{matrix}`, `\\usepackage{xcolor}`, and `\\providecommand{\\sorl}{\\textsc{DLR}}`.")
1017
  gr.Code(value=LATEX_FIGURE_EXAMPLE, label="fig_arithmetic_example.tex",
1018
  language=None, interactive=False)
1019
 
1020
+ gr.Markdown("#### Β§ Appendix β€” full experimental details")
1021
+ gr.Code(value=LATEX_APPENDIX, label="appendix_arithmetic.tex",
1022
+ language=None, interactive=False)
1023
+
1024
  # ── Tab 4: About ──
1025
  with gr.TabItem("About"):
1026
  eval_info_md = gr.Markdown("")
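
Note on the training hyperparameters: the schedule listed in the appendix table (3% linear warmup, then constant at 8e-5, batch size 64, 20 epochs) could be written as below. This is a sketch under assumptions, not the training code. The function names are hypothetical, warmup is assumed to ramp linearly from near zero, and the step count assumes the final partial batch of each epoch is kept.

```python
import math


def total_training_steps(num_examples, batch_size=64, epochs=20):
    """Optimizer steps for a fixed dataset, keeping the last partial batch (assumption)."""
    return math.ceil(num_examples / batch_size) * epochs


def learning_rate(step, total_steps, peak_lr=8e-5, warmup_frac=0.03):
    """Linear warmup over the first warmup_frac of steps, then constant at peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


steps = total_training_steps(10_000)
print(steps)  # 3140
```

With the smallest dataset (10K examples) this gives 157 steps per epoch, so warmup covers roughly the first 94 of 3140 steps.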