Upload folder using huggingface_hub
app.py
CHANGED
@@ -324,16 +324,16 @@ on **hard carry/borrow cascades** - problems requiring multi-digit propagation
 | C4 (4 hot carries) | 88% | **100%** | **+12pp** |
 | C5 (5 hot carries) | 76% | **100%** | **+24pp** |
 | C6 (6 hot carries) | 92% | **100%** | **+8pp** |
-| sub_M4 (4 borrows) |
+| sub_M4 (4 borrows) | 10% | **100%** | **+90pp** |

 **Undersized model (2L/1H/128d) at 100K - where both plateau below 100%:**

 | Split | Baseline | SoRL K=1 abs30 | Gap |
 |-------|----------|----------------|-----|
-| C3 (3 hot carries) |
+| C3 (3 hot carries) | 28% | **98%** | **+70pp** |
 | C4 (4 hot carries) | 38% | **94%** | **+56pp** |
-| C5 (5 hot carries) |
-| C6 (6 hot carries) |
+| C5 (5 hot carries) | 48% | **86%** | **+38pp** |
+| C6 (6 hot carries) | 32% | **94%** | **+62pp** |

 Even when the model is too small to reach 100%, SoRL's abstraction tokens provide
 external scratch-pad memory that doubles or triples accuracy on hard cascades.
@@ -356,38 +356,18 @@ See the **Results** and **Interpretability** tabs for figures and analysis.
                 interactive=False,
             )

+            with gr.Accordion("Data Efficiency & Undersized Models", open=False):
+                gr.Image("static_figures/fig_data_efficiency.png")
+                gr.Markdown("At 10K, SoRL K=1 abs30 reaches **96.7%** vs baseline **76.6%** (+20pp). By 50K both hit 100%.")
+                gr.Image("static_figures/fig_undersized.png")
+                gr.Markdown("Undersized 2L/1H/128d: baseline 50% → SoRL **85%** (+35pp). Abstraction tokens compensate for limited capacity.")
+
             with gr.Accordion("Per-Split Detail", open=False):
                 model_selector = gr.Dropdown(label="Model", choices=[], allow_custom_value=True)
                 detail_btn = gr.Button("Show splits")
                 detail_table = gr.Dataframe(headers=["Split", "Accuracy", "N"], interactive=False)

-        # ── Tab 2:
-        with gr.TabItem("Results"):
-            gr.Markdown("""## SoRL K=1 abs30: never loses to baseline
-
-Our best config - **K=1 (abstraction at every position), vocab size 30** - matches or beats
-the SFT baseline on every data size and every architecture tested. No exceptions.
-""")
-            gr.Image("static_figures/fig_data_efficiency.png")
-
-            gr.Markdown("""**At 10K training examples**, SoRL K=1 abs30 reaches **96.1%** while the baseline
-reaches only 72.4% - a **+24 percentage point** improvement. At 25K, SoRL hits 100% while the
-baseline is at 91.6%. By 50K both reach 100%.
-
-K=4 (abstraction every 4th position) fails at 10K data - it doesn't have enough examples to learn
-useful abstractions through search. K=1 is more data-efficient because every position gets a
-scratchpad token.
-""")
-
-            gr.Markdown("### SoRL helps undersized models the most")
-            gr.Image("static_figures/fig_undersized.png")
-
-            gr.Markdown("""The biggest gains are on **capacity-limited architectures**. A 2L/1H/128d model
-goes from 50% (baseline) to **85%** (SoRL K=1 abs30) - a +35pp improvement. The abstraction tokens
-effectively give the model external memory that compensates for its limited hidden dimensions.
-""")
-
-        # ── Tab 3: Interpretability ──
+        # ── Tab 2: Interpretability ──
         with gr.TabItem("Interpretability"):
             gr.Markdown("""## SoRL tokens externalize arithmetic circuits
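
The gap column and the "doubles or triples accuracy" wording in the undersized-model table can be sanity-checked with a few lines of Python. This is only a sketch: the `Row` type and the helper names `gap_pp` and `ratio` are illustrative and are not part of `app.py`. The accuracy figures are copied from the table rows shown in the diff.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One table row; accuracies are whole percentages as shown in the diff."""
    split: str
    baseline: int
    sorl: int


# Undersized model (2L/1H/128d) at 100K, values taken from the diff.
UNDERSIZED = [
    Row("C3", 28, 98),
    Row("C4", 38, 94),
    Row("C5", 48, 86),
    Row("C6", 32, 94),
]


def gap_pp(row: Row) -> int:
    """Absolute improvement in percentage points (the 'Gap' column)."""
    return row.sorl - row.baseline


def ratio(row: Row) -> float:
    """Relative improvement: SoRL accuracy divided by baseline accuracy."""
    return row.sorl / row.baseline


if __name__ == "__main__":
    for row in UNDERSIZED:
        print(f"{row.split}: +{gap_pp(row)}pp, {ratio(row):.2f}x baseline")
```

On these figures the gaps reproduce the table (+70, +56, +38, +62pp), and the ratio runs from about 1.8x on C5 to 3.5x on C3, so "doubles or triples" is a rounded summary rather than a bound that holds for every split.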