Commit 59bb64b · 1 Parent(s): 7cfc480
Rebalance takeaways - observations not universal lessons
- Rename "Key Takeaways" to "What We Observed"
- Frame as what happened in this experiment, not best practices
- Keep training process observations (reward shaping, entropy) as defensible lessons
- Soften phase summaries from "Lesson" to "Observation" where appropriate
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
README.md
CHANGED
@@ -114,7 +114,7 @@ Started with micro-bonuses to guide learning:
 
 **What happened**: Entropy collapsed from 1.09 → 0.36. The agent learned to game the reward function—collect bonuses while ignoring actual profitability. Buffer showed 90% win rate while real trade win rate was 20%.
 
-**Lesson**: Reward shaping
+**Lesson**: Reward shaping backfired here. When shaping rewards were gameable and similar magnitude to the real signal, the agent optimized the wrong thing.
 
 ### Phase 2: Pure Realized PnL
 
@@ -144,7 +144,7 @@ First update hit -$64 drawdown. But the agent recovered:
 | 1 | -$63.75 | 29.5% |
 | 36 | +$23.10 | 15.6% |
 
-**
+**Observation**: The agent recovered from -$64 to +$23 without policy collapse.
 
 ### Phase 4: Share-Based PnL ($500 trades)
 
@@ -199,17 +199,17 @@ These patterns emerged in the data. Whether they represent learned behavior or m
 
 These could reflect genuine learned strategies or simply profitable patterns in this specific market window.
 
-##
+## What We Observed
 
-1. **Reward shaping
+1. **Reward shaping backfired** - Phase 1 collapsed when the agent gamed micro-bonuses. Pure realized PnL worked better for us.
 
-2. **Reward signal design
+2. **Reward signal design mattered** - Share-based PnL outperformed probability-based by 4.5x. Match actual market economics.
 
-3. **Entropy coefficient
+3. **Entropy coefficient mattered** - 0.05 caused policy collapse; 0.10 maintained exploration.
 
-4. **
+4. **Buffer/trade divergence was a warning sign** - When buffer win rate diverged from actual trades, the agent was optimizing the wrong thing.
 
-5. **Give it time** -
+5. **Give it time** - LACUNA started deep in the red. Early performance wasn't indicative.
 
 ---
 
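The Phase 1 numbers in the diff above (entropy falling from 1.09 to 0.36, and a 90% buffer win rate against a 20% real trade win rate) suggest two cheap checks that could run alongside training. The sketch below is illustrative only and is not taken from the repository: the function names, the `min_entropy` and `max_gap` thresholds, and the assumption of a discrete action distribution are all hypothetical. For reference, 1.09 is close to ln 3 ≈ 1.0986, the maximum entropy of a three-action categorical policy, though the README does not state the action count.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def win_rate_gap(buffer_wins, buffer_total, trade_wins, trade_total):
    """Absolute gap between the buffer win rate and the realized-trade win rate."""
    if buffer_total == 0 or trade_total == 0:
        raise ValueError("need at least one sample on each side")
    return abs(buffer_wins / buffer_total - trade_wins / trade_total)


def training_warnings(probs, buffer_wins, buffer_total, trade_wins, trade_total,
                      min_entropy=0.5, max_gap=0.25):
    """Return a list of warning labels; both thresholds are hypothetical defaults."""
    warnings = []
    if categorical_entropy(probs) < min_entropy:
        warnings.append("low_entropy")
    if win_rate_gap(buffer_wins, buffer_total, trade_wins, trade_total) > max_gap:
        warnings.append("win_rate_divergence")
    return warnings


if __name__ == "__main__":
    # A peaked policy plus the 90% vs 20% win rates reported for Phase 1.
    print(training_warnings([0.90, 0.05, 0.05], 90, 100, 20, 100))
```

Either warning on its own is only a prompt to look closer; sensible thresholds would depend on the action space and on how the buffer labels a "win".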