flashvenom Claude Opus 4.5 committed on
Commit 59bb64b · 1 Parent(s): 7cfc480

Rebalance takeaways - observations not universal lessons

- Rename "Key Takeaways" to "What We Observed"
- Frame as what happened in this experiment, not best practices
- Keep training process observations (reward shaping, entropy) as defensible lessons
- Soften phase summaries from "Lesson" to "Observation" where appropriate

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -114,7 +114,7 @@ Started with micro-bonuses to guide learning:

  **What happened**: Entropy collapsed from 1.09 → 0.36. The agent learned to game the reward function—collect bonuses while ignoring actual profitability. Buffer showed 90% win rate while real trade win rate was 20%.

- **Lesson**: Reward shaping is risky. When shaping rewards are gameable and similar magnitude to the real signal, agents optimize the wrong thing.
+ **Lesson**: Reward shaping backfired here. When shaping rewards were gameable and similar magnitude to the real signal, the agent optimized the wrong thing.

  ### Phase 2: Pure Realized PnL

@@ -144,7 +144,7 @@ First update hit -$64 drawdown. But the agent recovered:
  | 1 | -$63.75 | 29.5% |
  | 36 | +$23.10 | 15.6% |

- **Lesson**: The agent can recover from large adverse moves without policy collapse.
+ **Observation**: The agent recovered from -$64 to +$23 without policy collapse.

  ### Phase 4: Share-Based PnL ($500 trades)

@@ -199,17 +199,17 @@ These patterns emerged in the data. Whether they represent learned behavior or m

  These could reflect genuine learned strategies or simply profitable patterns in this specific market window.

- ## Key Takeaways
+ ## What We Observed

- 1. **Reward shaping is dangerous** - When shaping rewards are gameable and similar magnitude to real signal, agents optimize the wrong thing. Sparse but honest > dense but noisy.
+ 1. **Reward shaping backfired** - Phase 1 collapsed when the agent gamed micro-bonuses. Pure realized PnL worked better for us.

- 2. **Reward signal design matters** - Share-based PnL outperformed probability-based by 4.5x ROI. Match actual market economics.
+ 2. **Reward signal design mattered** - Share-based PnL outperformed probability-based by 4.5x. Match actual market economics.

- 3. **Entropy coefficient matters** - 0.05 caused policy collapse; 0.10 maintained healthy exploration.
+ 3. **Entropy coefficient mattered** - 0.05 caused policy collapse; 0.10 maintained exploration.

- 4. **Watch buffer/trade win rate divergence** - When these diverge, the agent is optimizing the wrong objective.
+ 4. **Buffer/trade divergence was a warning sign** - When buffer win rate diverged from actual trades, the agent was optimizing the wrong thing.

- 5. **Give it time** - Learning and recovery from drawdowns take time. Early performance is not indicative of final results.
+ 5. **Give it time** - LACUNA started deep in the red. Early performance wasn't indicative.

  ---
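
Reviewer note: the buffer/trade divergence described in observation 4 could be monitored with a check along these lines. This is only a minimal sketch. The function names and the 0.25 gap threshold are assumptions made for illustration and are not part of the repository; the 90% and 20% win rates are the Phase 1 figures quoted in the diff.

```python
def win_rate_gap(buffer_win_rate: float, trade_win_rate: float) -> float:
    """Return buffer win rate minus realized trade win rate (both as fractions)."""
    return buffer_win_rate - trade_win_rate


def is_diverging(buffer_win_rate: float, trade_win_rate: float,
                 max_gap: float = 0.25) -> bool:
    """Flag a run when the two win rates differ by more than max_gap.

    max_gap is a hypothetical threshold chosen for this sketch, not a tuned value.
    """
    return abs(win_rate_gap(buffer_win_rate, trade_win_rate)) > max_gap


# Phase 1 figures from the README: buffer 90% vs. real trades 20%.
phase1_flagged = is_diverging(0.90, 0.20)
```

With the Phase 1 numbers the gap is roughly 0.70, so this check would have flagged the run well before the entropy collapse was diagnosed by hand.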