Upload folder using huggingface_hub
app.py
CHANGED
|
@@ -531,46 +531,65 @@ t2 = carry cascade. t7 = borrow cascade (subtraction only). t3 = no cascade need
|
|
| 531 |
The model has learned a **vocabulary for arithmetic reasoning** that maps directly to
|
| 532 |
Quirke's circuit definitions, without any supervision about carry logic.
|
| 533 |
|
| 534 |
-
### 4. Surgical
|
| 535 |
|
| 536 |
-
|
| 537 |
-
|
| 538 |
-
position), and the error is reduced.
|
| 539 |
|
| 540 |
-
| 541 |
|
| 542 |
```
|
| 543 |
-
|
| 544 |
-
|
| 545 |
-
|
| 546 |
-
|
| 547 |
-
|
| 548 |
-
Example 2: 109221 + 326780 = 0436001
|
| 549 |
-
Wrong: 0435901 ← t21 at d3 (Use Carry position)
|
| 550 |
-
Fix: transplant t18 from a correct UC example
|
| 551 |
-
Result: 0435001 ← d3 fixed (9→0), 2→1 errors ✓
|
| 552 |
-
|
| 553 |
-
Example 3: 332200 + 868010 = 1200210
|
| 554 |
-
Wrong: 1190210 ← t9 at d1 (Use Carry position)
|
| 555 |
-
Fix: transplant t3 from a correct UC example
|
| 556 |
-
Result: 1100210 ← d1 fixed (9→0), 2→1 errors ✓
|
| 557 |
```
|
| 558 |
|
| 559 |
-
| 560 |
|
| 561 |
-
|
| 562 |
-
- **t9** encodes "maybe carry, maybe not" (mixed: 56% sum=9, only 22% carry)
|
| 563 |
-
- **t18** encodes "definite carry, not from a sum-9" (Quirke's 1 state: 46% carry, only 2% sum=9)
|
| 564 |
|
| 565 |
-
|
| 566 |
-
|
| 567 |
-
|
|
|
|
| 568 |
|
| 569 |
-
|
| 570 |
-
|
| 571 |
-
|
|
|
|
| 572 |
|
| 573 |
-
|
|
|
|
|
|
|
| 574 |
""")
|
| 575 |
|
| 576 |
gr.Markdown("""### 3. Tokens spread across digit positions
|
|
|
|
| 531 |
The model has learned a **vocabulary for arithmetic reasoning** that maps directly to
|
| 532 |
Quirke's circuit definitions, without any supervision about carry logic.
|
| 533 |
|
| 534 |
+
### 4. Surgical intervention: the right token fixes hard-case failures
|
| 535 |
|
| 536 |
+
*All experiments below use the 1L/3H/510d model with K=1 abs30, trained on 100K examples.
|
| 537 |
+
This model reaches ~70% accuracy on C3-C6: enough correct examples for comparison, enough errors to fix.*
|
|
|
|
| 538 |
|
| 539 |
+
**The two carry tokens: t9 vs t21**
|
| 540 |
+
|
| 541 |
+
The model learned two tokens that both appear at carry-related positions, but with different specializations:
|
| 542 |
+
|
| 543 |
+
| | **t9** (n=573) | **t21** (n=719) |
|
| 544 |
+
|---|---|---|
|
| 545 |
+
| Position | d2-d4 (spread) | d3 only (100%) |
|
| 546 |
+
| Sum = 9 | 56% | **95%** |
|
| 547 |
+
| Carry rate | 22% | 3% |
|
| 548 |
+
| Difficulty | **easy** (S0=23%, S2=28%) | **hard** (S5=37%, S6=37%) |
|
| 549 |
+
| Role | "shallow carry, maybe sum-9" | "deep cascade, definitely sum-9" |
|
| 550 |
+
|
| 551 |
+
t9 is a **shallow/ambiguous** token: it appears in easy problems where carry state doesn't matter much.
|
| 552 |
+
t21 is a **deep cascade specialist**: it appears specifically in 5-6 carry cascades where
|
| 553 |
+
every digit's sum is exactly 9 (Quirke's uncertain U state, eq. 2).
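The "Sum = 9" and "Carry rate" rows can be made concrete with a small sketch that computes, for each digit column of an addition problem, the raw digit sum and the carry flowing in and out. The function name `column_states` and the left-to-right indexing (index 0 is the most significant operand digit) are assumptions for illustration; they are not taken from `app.py`.

```python
from typing import List, Tuple


def column_states(a: str, b: str) -> List[Tuple[int, int, int]]:
    """Return (digit_sum, carry_in, carry_out) per column, most significant first.

    Hypothetical helper: a and b are equal-length decimal strings.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of digits")
    states = []
    carry = 0
    # Walk from the least significant digit, propagating the carry.
    for da, db in zip(reversed(a), reversed(b)):
        digit_sum = int(da) + int(db)
        carry_out = 1 if digit_sum + carry >= 10 else 0
        states.append((digit_sum, carry, carry_out))
        carry = carry_out
    return states[::-1]


states = column_states("332200", "868010")
# Columns whose digit sum is exactly 9: their output depends entirely on carry-in.
sum9_columns = [i for i, (s, _, _) in enumerate(states) if s == 9]
print(sum9_columns)
```

A column with digit sum 9 produces a carry only if one arrives from the right, which is why such columns are the uncertain case.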
|
| 554 |
+
|
| 555 |
+
**The failure mode: using t9 at hard-cascade positions**
|
| 556 |
+
|
| 557 |
+
When the model encounters a hard cascade (C5/C6), it sometimes assigns t9 (the shallow token)
|
| 558 |
+
instead of t21 (the cascade specialist). This is like using the wrong circuit: the model
|
| 559 |
+
treats a deep cascade as a shallow one and gets the carry propagation wrong.
|
| 560 |
+
|
| 561 |
+
**The fix: globally replacing t9 with t21**
|
| 562 |
+
|
| 563 |
+
We test what happens when we force the model to always use t21 (the cascade specialist)
|
| 564 |
+
instead of t9:
|
| 565 |
|
| 566 |
```
|
| 567 |
+
                         Normal    All t9→t21    Effect
|
| 568 |
+
C3-C6 (hard carries):    70%       93%           +23pp ← hard cases dramatically improve
|
| 569 |
+
S5 (5 cascades):         27%       92%           +65pp ← nearly fixes the hardest split
|
| 570 |
+
S0 (no carries):         99%       74%           -25pp ← easy cases get worse
|
| 571 |
```
|
| 572 |
|
| 573 |
+
Forcing t21 everywhere is like telling the model "always assume deep cascade": it fixes
|
| 574 |
+
hard problems (+65pp on S5!) but hurts easy ones where no cascade exists. The model needs
|
| 575 |
+
to *correctly choose* between t9 and t21 based on the input, and its main failure mode on
|
| 576 |
+
hard cases is choosing too conservatively (t9 instead of t21).
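As a rough illustration of the global replacement, the sketch below swaps one discrete token id for another in a token sequence and recomputes the percentage-point effects from the accuracies reported above. The integer ids `T9` and `T21`, the list-of-ints representation, and the helper names are assumptions; the actual intervention code in `app.py` is not shown in this diff.

```python
from typing import Dict, List, Tuple

# Hypothetical integer ids for the tokens called t9 and t21 in the text.
T9, T21 = 9, 21


def replace_token(tokens: List[int], src: int = T9, dst: int = T21) -> List[int]:
    """Return a copy of tokens with every occurrence of src replaced by dst."""
    return [dst if t == src else t for t in tokens]


def pp_delta(before_pct: int, after_pct: int) -> int:
    """Change in accuracy, in percentage points."""
    return after_pct - before_pct


# (normal accuracy %, accuracy % with all t9 -> t21), as reported in the table.
results: Dict[str, Tuple[int, int]] = {
    "C3-C6": (70, 93),
    "S5": (27, 92),
    "S0": (99, 74),
}
deltas = {split: pp_delta(before, after) for split, (before, after) in results.items()}

print(replace_token([3, 9, 21, 9]))
print(deltas)
```

The opposite signs for S5 and S0 are the point: a blanket replacement trades easy-case accuracy for hard-case accuracy, so the choice has to be input-dependent.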
|
| 577 |
|
| 578 |
+
**Individual transplants confirm this:**
|
| 579 |
|
| 580 |
+
```
|
| 581 |
+
014560 + 125450 = 0140010
|
| 582 |
+
Wrong: 0139010 ← t9 at d2 (shallow token at carry position)
|
| 583 |
+
Fixed: 0130010 ← transplant correct token → d2 fixed ✓
|
| 584 |
|
| 585 |
+
332200 + 868010 = 1200210
|
| 586 |
+
Wrong: 1190210 ← t9 at d1 (shallow token at carry position)
|
| 587 |
+
Fixed: 1100210 ← transplant correct token → d1 fixed ✓
|
| 588 |
+
```
|
| 589 |
|
| 590 |
+
Each transplant fixes the specific digit where the wrong carry token was assigned.
|
| 591 |
+
This is evidence that **token identity causally determines carry computation**: the wrong
|
| 592 |
+
token = wrong carry state = wrong answer digit.
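The transplant examples can be checked by comparing each model output with the true sum, character by character. The sketch below is a minimal checker under stated assumptions: the helper names are hypothetical, and positions are plain 0-based string indices from the left of the 7-character answer, which is not necessarily the same as the `d`-indexing used in the examples.

```python
from typing import List


def expected_sum(a: str, b: str) -> str:
    """True sum of two decimal strings, zero-padded to one digit more than the operands."""
    width = max(len(a), len(b)) + 1
    return str(int(a) + int(b)).zfill(width)


def wrong_positions(predicted: str, expected: str) -> List[int]:
    """0-based string indices (from the left) where predicted differs from expected."""
    if len(predicted) != len(expected):
        raise ValueError("answers must have the same length")
    return [i for i, (p, e) in enumerate(zip(predicted, expected)) if p != e]


target = expected_sum("014560", "125450")
before = wrong_positions("0139010", target)  # model output before the transplant
after = wrong_positions("0130010", target)   # model output after the transplant
print(target, before, after)
```

On both examples the count of wrong digits drops from two to one: the digit holding the erroneous 9 is repaired, while the neighbouring digit to its left remains wrong.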
|
| 593 |
""")
|
| 594 |
|
| 595 |
gr.Markdown("""### 3. Tokens spread across digit positions
|