thoughtworks/arithmetic-sorl


amirali1985 committed on Apr 6

Commit 25329e0 · verified · 1 parent: cabe4b4

Add model card

Files changed (1)

README.md +46 -0


	@@ -0,0 +1,46 @@

+---
+license: apache-2.0
+tags:
+  - sorl
+  - arithmetic
+  - interpretability
+  - mechanistic-interpretability
+  - qwen3
+---
+# Arithmetic SoRL Models
+Model checkpoints for the **SoRL Arithmetic Interpretability Study**.
+Small Qwen3 transformers (3L/4H/512d, ~168M params) trained from scratch on integer
+addition and subtraction, with and without SoRL abstraction tokens.
+## Goal
+Show that SoRL externalizes arithmetic reasoning mechanisms (carry and borrow circuits)
+as explicit abstraction tokens that can be observed and intervened on without activation-level tooling.
+## Architecture
+A tiny Qwen3 model, randomly initialized via `SorlModelWrapper.from_scratch`:
+```
+hidden_size=512, num_hidden_layers=3, num_attention_heads=4
+intermediate_size=2048, vocab_size=151936
+```
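As a rough sanity check on the ~168M figure quoted in the model card, the parameter count can be estimated from the hyperparameters above. This sketch is not taken from the repository. It assumes untied input and output embeddings, a head dimension of `hidden_size / num_attention_heads`, as many key/value heads as query heads, and no bias terms, and it ignores the small normalization weights.

```python
# Back-of-the-envelope parameter count for the config in the model card.
# Assumptions (NOT stated in the model card): untied embeddings,
# head_dim = hidden_size // num_attention_heads, key/value heads equal to
# query heads, no biases, normalization weights ignored.

hidden_size = 512
num_hidden_layers = 3
num_attention_heads = 4
intermediate_size = 2048
vocab_size = 151936

head_dim = hidden_size // num_attention_heads
attn_width = num_attention_heads * head_dim

# q, k, v and output projections, each hidden_size x attn_width
attention = 4 * hidden_size * attn_width
# gate, up and down projections of the gated MLP
mlp = 3 * hidden_size * intermediate_size
per_layer = attention + mlp

# token embedding plus a separate output head (untied, by assumption)
embeddings = 2 * vocab_size * hidden_size

total = num_hidden_layers * per_layer + embeddings
print(f"{total / 1e6:.1f}M")  # prints 168.2M
```

Under these assumptions the two embedding matrices account for about 155.6M parameters and the three transformer blocks for about 12.6M, so the total is dominated by the 151936-entry vocabulary.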
+## Experiment subfolders
+Each subfolder contains a trained model, `train_config.json`, and `metrics.json`.
+| Subfolder | Task | Mode | Abstract Vocab |
+|---|---|---|---|
+| `add_baseline` | addition | SFT baseline | 0 |
+| `add_sorl_abs4` | addition | SoRL v6 | 4 |
+| `add_sorl_abs8` | addition | SoRL v6 | 8 |
+| ... | | | |
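Given the layout described above, the two JSON files of one experiment can be read from a local copy of the repository roughly as follows. This is a minimal sketch: `load_experiment` is a hypothetical helper, not part of the SoRL code, and it makes no assumption about which keys the files contain.

```python
import json
from pathlib import Path


def load_experiment(root, subfolder):
    """Return (train_config, metrics) for one experiment subfolder.

    `root` is the path to a local copy of the model repository
    (hypothetical; adjust to wherever the files were downloaded).
    """
    folder = Path(root) / subfolder
    with open(folder / "train_config.json", encoding="utf-8") as f:
        train_config = json.load(f)
    with open(folder / "metrics.json", encoding="utf-8") as f:
        metrics = json.load(f)
    return train_config, metrics
```

For example, `load_experiment("arithmetic-sorl", "add_sorl_abs4")` would return the training configuration and metrics for the 4-token SoRL addition run, assuming the repository was cloned into a directory named `arithmetic-sorl`.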
+## Related
+- Training data: [thoughtworks/arithmetic-sorl-data](https://huggingface.co/datasets/thoughtworks/arithmetic-sorl-data)
+- Code: [mod_gpt/arithmetic/](https://github.com/fangyuan-ksgk/mod_gpt/tree/amir/arithmetic/arithmetic)
+- SoRL paper: Yu & Abdullah, "Intention-Level Alignment with Weak Supervision" (2025)