Commit · eaf9f36
1 Parent(s): ba15dfe
replace \paragraph with \textbf
app.py
CHANGED
@@ -47,7 +47,7 @@ trivial positions (SC/SA) --- the carry structure is readable off the token
 sequence with no probing or patching required.
 Full training and architecture details are in Appendix~\ref{app:training}.
 
-\
+\textbf{Abstraction tokens recover known circuits without supervision.}
 Analysis of \texttt{2L/1H/128d} shows that \sorl{}'s codebook
 spontaneously partitions into subtask-specialist tokens:
 each of the 23 active tokens concentrates on a narrow slice of the
@@ -60,7 +60,7 @@ Tokens are also causally necessary: knocking out all tokens collapses
 accuracy from 95.5\% to 0.1\%, confirming they carry the computation
 rather than merely annotating it.
 
-\
+\textbf{Named tokens enable targeted intervention and better performance.}
 Because the routing codes are discrete and named, surgical model
 edits are possible that have no analog in standard transformers:
 swapping a single token at one answer position fixes wrong predictions
@@ -324,7 +324,7 @@ LATEX_APPENDIX = r"""% ───────────────────
 \label{tab:quirke-subtasks}
 \end{table}
 
-\
+\textbf{Setup.}
 All interpretability analyses use model
 \texttt{add\_sub\_sorl\_v1\_abs30\_K1\_100K\_2L1H128d}
 (\texttt{2L/1H/128d}, 2 layers, 1 head, hidden size 128; trained on 100K examples),
@@ -384,7 +384,7 @@ Table~\ref{tab:ablation-splits} shows per-split accuracy under each condition.
 \label{tab:ablation-splits}
 \end{table}
 
-\
+\textbf{Commentary.}
 Knockout reduces accuracy to $\leq$2\% on every split, confirming that
 the model has offloaded computation into the routing tokens.
 Three patterns are notable:
@@ -483,7 +483,7 @@ every other token in the codebook (29 candidates $\times$ 5 positions = 145
 interventions per example) and measure how many wrong predictions become
 correct --- and how many previously-correct predictions break.
 
-\
+\textbf{Results.}
 At positions $d_0$--$d_2$ (the carry-heavy positions), a fixing swap exists
 for 27--31\% of mispredicted examples.
 The best single swap is replacing \texttt{t16} with \texttt{t25} at $d_1$: