---
license: apache-2.0
tags:
- sorl
- arithmetic
- interpretability
- mechanistic-interpretability
- qwen3
---

# Arithmetic SoRL Models

Model checkpoints for the **SoRL Arithmetic Interpretability Study**.

Small Qwen3 transformers (3 layers, 4 attention heads, hidden size 512; ~168M parameters) trained from scratch on integer addition and subtraction, with and without SoRL abstraction tokens.

## Goal

Show that SoRL externalizes the mechanisms behind arithmetic reasoning (the carry and borrow circuits) as explicit abstraction tokens, so that they can be observed and intervened on directly, without activation-level tooling.
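
To make "carry" and "borrow" concrete: in column-wise addition a carry passes into the next digit whenever a column sum reaches 10, and in subtraction a borrow is needed whenever a column's top digit (after any incoming borrow) is smaller than the bottom digit. The sketch below computes these per-column flags for non-negative integers. It only illustrates the quantities such circuits are expected to track; it is not the study's labeling scheme or tokenization, and `carry_flags` / `borrow_flags` are hypothetical helpers.

```python
def carry_flags(a: int, b: int) -> list[bool]:
    """Per-column carry-out flags for a + b, least significant digit first.

    Hypothetical illustration; assumes a, b >= 0.
    """
    if a < 0 or b < 0:
        raise ValueError("expects non-negative integers")
    flags = []
    carry = 0
    while a or b:
        column = a % 10 + b % 10 + carry
        carry = 1 if column >= 10 else 0  # carry into the next column
        flags.append(bool(carry))
        a //= 10
        b //= 10
    return flags


def borrow_flags(a: int, b: int) -> list[bool]:
    """Per-column borrow flags for a - b, least significant digit first.

    Hypothetical illustration; assumes a >= b >= 0 (non-negative result).
    """
    if not (a >= b >= 0):
        raise ValueError("expects a >= b >= 0")
    flags = []
    borrow = 0
    while a or b:
        column = a % 10 - b % 10 - borrow
        borrow = 1 if column < 0 else 0  # borrow from the next column
        flags.append(bool(borrow))
        a //= 10
        b //= 10
    return flags


print(carry_flags(57, 68))    # [True, True]         (57 + 68 = 125)
print(borrow_flags(100, 1))   # [True, True, False]  (100 - 1 = 99)
```

A long run of `True` flags (as in `100 - 1`) is exactly the case where a model must propagate information across several digit positions.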

## Architecture

A tiny Qwen3 model, randomly initialized via `SorlModelWrapper.from_scratch`, with the following configuration:

```
hidden_size=512, num_hidden_layers=3, num_attention_heads=4
intermediate_size=2048, vocab_size=151936
```
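
As a sanity check on the ~168M figure, the parameter count can be worked out from the configuration above. The sketch assumes standard Qwen3 layer shapes with `num_key_value_heads = num_attention_heads = 4`, `head_dim = 128`, no attention biases, per-head RMSNorm on queries and keys, and untied input and output embeddings; these are assumptions, not values read from the checkpoints, and `qwen3_param_count` is a hypothetical helper.

```python
def qwen3_param_count(
    hidden_size: int = 512,
    num_hidden_layers: int = 3,
    num_attention_heads: int = 4,
    num_key_value_heads: int = 4,       # assumption: no grouped-query sharing
    head_dim: int = 128,                # assumption: hidden_size // num_attention_heads
    intermediate_size: int = 2048,
    vocab_size: int = 151936,
    tie_word_embeddings: bool = False,  # assumption: untied LM head
) -> dict[str, int]:
    """Rough parameter count for a Qwen3-style decoder without biases."""
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size

    q_out = num_attention_heads * head_dim
    kv_out = num_key_value_heads * head_dim
    attention = (
        hidden_size * q_out            # q_proj
        + 2 * hidden_size * kv_out     # k_proj, v_proj
        + q_out * hidden_size          # o_proj
        + 2 * head_dim                 # q_norm, k_norm (per-head RMSNorm)
    )
    mlp = 3 * hidden_size * intermediate_size  # gate_proj, up_proj, down_proj
    layer_norms = 2 * hidden_size              # input and post-attention RMSNorm
    per_layer = attention + mlp + layer_norms

    layers = num_hidden_layers * per_layer
    final_norm = hidden_size
    return {
        "embeddings": embed + lm_head,
        "layers": layers,
        "total": embed + lm_head + layers + final_norm,
    }


print(f"{qwen3_param_count()['total'] / 1e6:.1f}M")  # 168.2M
```

Under these assumptions about 155.6M of the total sits in the two 151,936 × 512 embedding matrices, while the three transformer layers hold only about 12.6M. With tied embeddings the same configuration would come to roughly 90.4M, so the ~168M figure is consistent with an untied LM head.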

## Experiment subfolders

Each subfolder contains a trained model checkpoint together with its `train_config.json` and `metrics.json`.

| Subfolder | Task | Mode | Abstract vocab size |
|---|---|---|---|
| `add_baseline` | addition | SFT baseline | 0 |
| `add_sorl_abs4` | addition | SoRL v6 | 4 |
| `add_sorl_abs8` | addition | SoRL v6 | 8 |
| ... | | | |
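
With a local copy of this repository, the per-experiment files can be gathered into one place for comparison. The sketch below assumes each experiment is a direct subfolder containing both `train_config.json` and `metrics.json`, and parses folder names by the `<task>_baseline` / `<task>_sorl_abs<N>` pattern inferred from the table rows; `parse_experiment_name` and `load_experiments` are hypothetical helpers, and the JSON contents are passed through unchanged because their schema is not documented here.

```python
from __future__ import annotations

import json
import re
from pathlib import Path

# Hypothetical naming pattern inferred from the table, e.g. "add_baseline", "add_sorl_abs8".
NAME_PATTERN = re.compile(r"^(?P<task>[a-z0-9]+)_(?:baseline|sorl_abs(?P<abs>\d+))$")


def parse_experiment_name(name: str) -> dict:
    """Split a subfolder name into task, mode, and abstract vocab size."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised experiment name: {name!r}")
    abs_str = match.group("abs")
    is_sorl = abs_str is not None
    return {
        "task": match.group("task"),
        "mode": "sorl" if is_sorl else "baseline",
        "abstract_vocab": int(abs_str) if is_sorl else 0,
    }


def load_experiments(root: str | Path) -> dict[str, dict]:
    """Collect train_config.json and metrics.json from every experiment subfolder."""
    experiments = {}
    for folder in sorted(Path(root).iterdir()):
        config_path = folder / "train_config.json"
        metrics_path = folder / "metrics.json"
        if not (folder.is_dir() and config_path.is_file() and metrics_path.is_file()):
            continue  # skip top-level files and folders lacking either JSON file
        experiments[folder.name] = {
            **parse_experiment_name(folder.name),
            "config": json.loads(config_path.read_text()),
            "metrics": json.loads(metrics_path.read_text()),
        }
    return experiments
```

For example, `load_experiments("path/to/local/clone")` would return one entry per experiment keyed by subfolder name. A folder whose name does not follow the assumed pattern raises `ValueError` rather than being silently mislabeled.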

## Related

- Training data: [thoughtworks/arithmetic-sorl-data](https://huggingface.co/datasets/thoughtworks/arithmetic-sorl-data)
- Code: [mod_gpt/arithmetic/](https://github.com/fangyuan-ksgk/mod_gpt/tree/amir/arithmetic/arithmetic)
- SoRL paper: Yu & Abdullah, "Intention-Level Alignment with Weak Supervision" (2025)