Rebalance takeaways - observations not universal lessons

- Rename "Key Takeaways" to "What We Observed"
- Frame as what happened in this experiment, not best practices
- Keep training process observations (reward shaping, entropy) as defensible lessons
- Soften phase summaries from "Lesson" to "Observation" where appropriate

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

README.md +8 -8

@@ -114,7 +114,7 @@ Started with micro-bonuses to guide learning:
 **What happened**: Entropy collapsed from 1.09 → 0.36. The agent learned to game the reward function—collect bonuses while ignoring actual profitability. Buffer showed 90% win rate while real trade win rate was 20%.
-**Lesson**: Reward shaping is risky. When shaping rewards are gameable and similar magnitude to the real signal, agents optimize the wrong thing.
 ### Phase 2: Pure Realized PnL
@@ -144,7 +144,7 @@ First update hit -$64 drawdown. But the agent recovered:
 | 1 | -$63.75 | 29.5% |
 | 36 | +$23.10 | 15.6% |
-**Lesson**: The agent can recover from large adverse moves without policy collapse.
 ### Phase 4: Share-Based PnL ($500 trades)
@@ -199,17 +199,17 @@ These patterns emerged in the data. Whether they represent learned behavior or m
 These could reflect genuine learned strategies or simply profitable patterns in this specific market window.
-## Key Takeaways
-1. **Reward shaping is dangerous** - When shaping rewards are gameable and similar magnitude to real signal, agents optimize the wrong thing. Sparse but honest > dense but noisy.
-2. **Reward signal design matters** - Share-based PnL outperformed probability-based by 4.5x ROI. Match actual market economics.
-3. **Entropy coefficient matters** - 0.05 caused policy collapse; 0.10 maintained healthy exploration.
-4. **Watch buffer/trade win rate divergence** - When these diverge, the agent is optimizing the wrong objective.
-5. **Give it time** - Learning and recovery from drawdowns take time. Early performance is not indicative of final results.
 ---

 **What happened**: Entropy collapsed from 1.09 → 0.36. The agent learned to game the reward function—collect bonuses while ignoring actual profitability. Buffer showed 90% win rate while real trade win rate was 20%.
+**Lesson**: Reward shaping backfired here. When shaping rewards were gameable and similar magnitude to the real signal, the agent optimized the wrong thing.
 ### Phase 2: Pure Realized PnL
 | 1 | -$63.75 | 29.5% |
 | 36 | +$23.10 | 15.6% |
+**Observation**: The agent recovered from -$64 to +$23 without policy collapse.
 ### Phase 4: Share-Based PnL ($500 trades)
 These could reflect genuine learned strategies or simply profitable patterns in this specific market window.
+## What We Observed
+1. **Reward shaping backfired** - Phase 1 collapsed when the agent gamed micro-bonuses. Pure realized PnL worked better for us.
+2. **Reward signal design mattered** - Share-based PnL outperformed probability-based by 4.5x. Match actual market economics.
+3. **Entropy coefficient mattered** - 0.05 caused policy collapse; 0.10 maintained exploration.
+4. **Buffer/trade divergence was a warning sign** - When buffer win rate diverged from actual trades, the agent was optimizing the wrong thing.
+5. **Give it time** - LACUNA started deep in the red. Early performance wasn't indicative.
 ---
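
The buffer/trade divergence noted in item 4 can be sketched as a simple check. This is a minimal illustration, not the project's actual monitoring code: the function names `win_rate` and `divergence_warning` and the `max_gap` threshold of 0.25 are hypothetical, and the inputs are assumed to be plain sequences of per-step buffer rewards and per-trade realized PnL values.

```python
def win_rate(outcomes):
    """Fraction of strictly positive outcomes; 0.0 for an empty sequence."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(1 for x in outcomes if x > 0) / len(outcomes)


def divergence_warning(buffer_rewards, trade_pnls, max_gap=0.25):
    """Compare buffer win rate with realized trade win rate.

    Returns (gap, flagged). `max_gap` is an arbitrary illustrative
    threshold, not a value taken from the experiment.
    """
    gap = win_rate(buffer_rewards) - win_rate(trade_pnls)
    return gap, abs(gap) > max_gap


if __name__ == "__main__":
    # Synthetic data shaped like the Phase 1 numbers:
    # 90% of buffer rewards positive, 20% of real trades profitable.
    buffer_rewards = [1.0] * 9 + [-1.0]
    trade_pnls = [1.0] * 2 + [-1.0] * 8
    gap, flagged = divergence_warning(buffer_rewards, trade_pnls)
    print(f"gap={gap:.2f} flagged={flagged}")
```

With the Phase 1 figures (90% buffer win rate, 20% trade win rate) the gap is about 0.70, well past any reasonable threshold. A check like this would only surface the symptom; deciding whether the reward is being gamed still needs a look at the reward components.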