Upload README.md
README.md
CHANGED
@@ -46,21 +46,34 @@ This enables **60 total refinement steps** (30 layers × 2 steps each) throughou

 Evaluated on LM-Evaluation-Harness:

-| Task | Metric | Asterisk-Pi | Asterisk (
-
-| **ARC-Challenge** | acc_norm | **0.3038** | 0.2884 | +0.0154 |
-| **ARC-Easy** | acc_norm | **0.5412** | 0.5450 | -0.0038 |
-| **HellaSwag** | acc_norm |
-| **PIQA** | acc_norm |
-| **WinoGrande** | acc | **0.5391** | 0.5210 | +0.0181 |

 ### Analysis

-
 - **ARC-Challenge** (+1.54%): More challenging reasoning benefits from iterative refinement
 - **WinoGrande** (+1.81%): Multi-step resolution helps with pronoun disambiguation

-

 ## Architecture


 Evaluated on LM-Evaluation-Harness:

+| Task | Metric | Asterisk-Pi<br>(173.7M) | Asterisk<br>(171.2M) | SmolLM2-135M<br>(135.6M) | Gemma-3-270m-it<br>(270M) | Δ vs Asterisk | Δ vs SmolLM2 | Δ vs Gemma-3 |
+|------|--------|-------------|-----------------|--------------|----------------|---------------|--------------|--------------|
+| **ARC-Challenge** | acc_norm | **0.3038** | 0.2884 | 0.2773 | 0.2730 | +0.0154 | **+0.0265** | **+0.0308** |
+| **ARC-Easy** | acc_norm | 0.5412 | **0.5450** | 0.4899 | 0.5059 | -0.0038 | **+0.0513** | **+0.0353** |
+| **HellaSwag** | acc_norm | 0.4207 | **0.4430** | 0.4293 | 0.3937 | -0.0223 | -0.0086 | **+0.0270** |
+| **PIQA** | acc_norm | 0.6703 | **0.6770** | 0.6632 | 0.6692 | -0.0067 | **+0.0071** | +0.0011 |
+| **WinoGrande** | acc | **0.5391** | 0.5210 | 0.5154 | 0.5257 | +0.0181 | **+0.0237** | +0.0134 |

 ### Analysis

+**π-Flow improvements over base Asterisk:**
 - **ARC-Challenge** (+1.54%): More challenging reasoning benefits from iterative refinement
 - **WinoGrande** (+1.81%): Multi-step resolution helps with pronoun disambiguation

+**Improvements over SmolLM2-135M base:**
+- **ARC-Challenge** (+2.65%): Hybrid architecture + π-flow improves complex reasoning
+- **ARC-Easy** (+5.13%): Strong gains on elementary science questions
+- **WinoGrande** (+2.37%): Better pronoun disambiguation through iterative refinement
+- **PIQA** (+0.71%): Modest gains on physical commonsense
+
+**Outperforming Gemma-3-270m-it (with 96M fewer parameters):**
+- **ARC-Challenge** (+3.08%): Higher accuracy despite being about 36% smaller
+- **ARC-Easy** (+3.53%): Clear advantage on elementary science
+- **HellaSwag** (+2.70%): Stronger commonsense reasoning
+- **WinoGrande** (+1.34%): Better coreference resolution
+- **PIQA** (+0.11%): Comparable physical reasoning
+
+**Key insight**: Asterisk-Pi (173.7M params) outperforms the larger Gemma-3-270m-it (270M params) on all five tasks, which suggests that the hybrid ASPP-Attention architecture with π-flow refinement is more parameter-efficient. The margins over Gemma-3 are largest on ARC-Easy (+3.53%) and ARC-Challenge (+3.08%) and smallest on PIQA (+0.11%).

 ## Architecture

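The delta columns and the parameter-gap figures in the added lines are simple arithmetic, so they can be re-derived from the reported scores. The sketch below is not part of the commit: the scores and parameter counts are transcribed from the table above, while the names `SCORES`, `PARAMS_M`, `deltas`, `wins`, and `size_gap` are hypothetical helpers introduced only for this check.

```python
# Accuracy values transcribed from the added table
# (acc_norm for every task except WinoGrande, which reports acc).
SCORES = {
    "ARC-Challenge": {"Asterisk-Pi": 0.3038, "Asterisk": 0.2884,
                      "SmolLM2-135M": 0.2773, "Gemma-3-270m-it": 0.2730},
    "ARC-Easy":      {"Asterisk-Pi": 0.5412, "Asterisk": 0.5450,
                      "SmolLM2-135M": 0.4899, "Gemma-3-270m-it": 0.5059},
    "HellaSwag":     {"Asterisk-Pi": 0.4207, "Asterisk": 0.4430,
                      "SmolLM2-135M": 0.4293, "Gemma-3-270m-it": 0.3937},
    "PIQA":          {"Asterisk-Pi": 0.6703, "Asterisk": 0.6770,
                      "SmolLM2-135M": 0.6632, "Gemma-3-270m-it": 0.6692},
    "WinoGrande":    {"Asterisk-Pi": 0.5391, "Asterisk": 0.5210,
                      "SmolLM2-135M": 0.5154, "Gemma-3-270m-it": 0.5257},
}

# Parameter counts in millions, as stated in the table header.
PARAMS_M = {"Asterisk-Pi": 173.7, "Gemma-3-270m-it": 270.0}


def deltas(scores, model="Asterisk-Pi"):
    """Return {task: {baseline: model score minus baseline score}}, rounded to 4 places."""
    return {
        task: {
            name: round(row[model] - value, 4)
            for name, value in row.items()
            if name != model
        }
        for task, row in scores.items()
    }


def wins(scores, model="Asterisk-Pi"):
    """Count, per baseline, the tasks on which `model` scores strictly higher."""
    counts = {}
    for row in deltas(scores, model).values():
        for name, delta in row.items():
            counts[name] = counts.get(name, 0) + (1 if delta > 0 else 0)
    return counts


def size_gap(params_m, small, large):
    """Return (absolute gap in millions, gap as a percentage of the larger model)."""
    gap = params_m[large] - params_m[small]
    return round(gap, 1), round(100 * gap / params_m[large], 1)


if __name__ == "__main__":
    for task, row in deltas(SCORES).items():
        print(task, row)
    print(wins(SCORES))
    print(size_gap(PARAMS_M, "Asterisk-Pi", "Gemma-3-270m-it"))
```

Running it reproduces the three Δ columns, shows Asterisk-Pi ahead of Gemma-3-270m-it on all five tasks, ahead of SmolLM2-135M on four, and ahead of base Asterisk on two, and gives a parameter gap of 96.3M (35.7% of the larger model).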