---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
- reinforcement-learning
- lora
- trading
- coding-agent
- verifiers
- prime-intellect
- poolside-hackathon
library_name: peft
---
# TradePool: a self-improving trading coding-agent (Laguna XS.2 LoRA)
**Poolside × Prime Intellect Research Hackathon, Foundations track.**
A LoRA adapter for `poolside/Laguna-XS.2`, trained with reinforcement learning so the
model becomes a **coding agent that writes causal crypto trading-strategy functions**,
scored by a leak-proof out-of-sample backtest.
## The idea in one line
> Trading discipline that normally lives as *prompt text* (a memory file of rules) is
> turned into **adapter weights** by rewarding disciplined, profitable behaviour on
> held-out market data. The verifier *is* the backtest.
## How it works
1. **Environment** (`verifiers`, v0 `SingleTurnEnv`, pushed to `stimulir/trade-pool`):
   the agent is given a Base-chain token's in-sample price history + a library of causal
   indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write
   `def strategy(features, position) -> target_position`.
2. **Verifier / reward**: the strategy runs bar-by-bar over a **held-out** window
   (lookahead is structurally impossible; the function never sees future bars), scored by
   a weighted rubric:
   - OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) ·
     sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
   - Hard gates → reward 0: invalid code, lookahead, NaN equity, **do-nothing strategies**.
3. **Training**: Prime Hosted RL (GRPO), `poolside/Laguna-XS.2`, 50 steps, batch 128,
   `rollouts_per_example=8`, `enable_thinking=false`. Free hosted Laguna run.
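To make steps 1 and 2 concrete, here is a minimal sketch of a strategy function and a bar-by-bar backtest loop. It is not the `trade_pool` implementation: the feature names (`close`, `ma_fast`, `ma_slow`), the helper `sma`, the function `run_backtest`, the long-only clamp to `[0, 1]` and the flat per-turnover cost are all assumptions made for illustration. Only the `strategy(features, position) -> target_position` signature comes from the environment description above.

```python
from typing import Callable, Dict, List


def sma(values: List[float], window: int) -> float:
    """Simple moving average over the last `window` values (past data only)."""
    tail = values[-window:]
    return sum(tail) / len(tail)


def strategy(features: Dict[str, float], position: float) -> float:
    """Toy trend follower: fully long when the fast MA is above the slow MA."""
    return 1.0 if features["ma_fast"] > features["ma_slow"] else 0.0


def run_backtest(
    prices: List[float],
    strat: Callable[[Dict[str, float], float], float],
    cost: float = 0.001,
) -> List[float]:
    """Walk the window bar by bar and return the equity curve.

    At bar t the strategy only receives features built from prices[: t + 1];
    the position it picks then earns the return from bar t to bar t + 1.
    """
    equity = [1.0]
    position = 0.0
    for t in range(len(prices) - 1):
        history = prices[: t + 1]  # future bars are never passed in
        features = {
            "close": history[-1],
            "ma_fast": sma(history, 3),
            "ma_slow": sma(history, 8),
        }
        target = float(strat(features, position))
        target = max(0.0, min(1.0, target))  # assumption: long-only exposure
        turnover = abs(target - position)
        position = target
        bar_return = prices[t + 1] / prices[t] - 1.0
        equity.append(equity[-1] * (1.0 + position * bar_return - cost * turnover))
    return equity


if __name__ == "__main__":
    demo_prices = [100.0 + i for i in range(12)]
    print(run_backtest(demo_prices, strategy))
```

Because the loop slices `prices[: t + 1]` before building features, a strategy cannot read a future bar even if it tries, which is the sense in which lookahead is ruled out by construction rather than by inspection.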
## Results
RL produced a clear reward climb on the training environment, peaking around step 13 and holding most of the gain through step 50:
| Stage | Total reward |
|---|---|
| step ~0 (baseline) | ~0.15 |
| step ~8 | 0.19 |
| step ~11 | 0.28 |
| step ~13 (peak) | ~0.42 |
| step ~50 (final) | ~0.34–0.41 |
Every rubric component improved together (not single-metric gaming):
`reward_valid` 0.30 → ~0.70 (writes valid trading code far more often),
`reward_sharpe` 0.10 → 0.33, drawdown/exposure/cost all up. Held-out-symbol eval on base
Laguna scored `reward_valid` 0.75 / `reward_sharpe` 0.45, confirming the env is in the
healthy trainable band before training.
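For orientation, the total reward can be read as a weighted sum of the rubric components, zeroed out when a hard gate fails. The sketch below assumes each component is already normalised to `[0, 1]`; the dictionary keys and the `total_reward` function are hypothetical names, and only the weights are taken from the rubric described under "How it works".

```python
from typing import Dict

# Weights from the rubric; the component names are illustrative only.
WEIGHTS: Dict[str, float] = {
    "sharpe": 0.40,
    "beats_buy_and_hold": 0.20,
    "drawdown": 0.15,
    "exposure": 0.10,
    "cost": 0.05,
    "valid": 0.10,
}


def total_reward(components: Dict[str, float], gate_failed: bool = False) -> float:
    """Weighted sum of component scores, or 0.0 if any hard gate failed."""
    if gate_failed:
        # Invalid code, lookahead, NaN equity or a do-nothing strategy.
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = components.get(name, 0.0)
        score = min(1.0, max(0.0, score))  # assumption: scores live in [0, 1]
        total += weight * score
    return total
```

Since the weights sum to 1.0, a rollout that maxes out every component scores 1.0, and a gated rollout scores 0.0 regardless of how good its other components look.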
## The novel contribution: closing the self-improvement loop
- **Weights channel:** each RL iteration warm-starts from the prior adapter
(`checkpoint_id`), a genuine parametric continuation.
- **Curriculum channel:** a reflection step reads the prior adapter's out-of-sample eval
and shifts the next run's objective (sharpe → min-drawdown → balanced) and focuses on the
weakest symbols, so the agent's own results drive its next curriculum.
- **Falsifiable proof ("memory is the adapter"):** the discipline block (distilled from
618 real prior trading decisions) can be **stripped from the prompt**
(`use_seed_principles=false`); if the trained adapter stays disciplined, the rules now
live in the weights, not the prompt.
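The curriculum channel could look roughly like the sketch below. The rule for when to change objective is not specified here, so this version simply advances one step along sharpe → min-drawdown → balanced per iteration and selects the `k` lowest-scoring symbols; that rule, the `reflect` function and the per-symbol score dictionary are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Objective order as described in the curriculum channel.
OBJECTIVES: List[str] = ["sharpe", "min-drawdown", "balanced"]


def reflect(
    eval_by_symbol: Dict[str, float],
    current_objective: str,
    k: int = 2,
) -> Tuple[str, List[str]]:
    """Pick the next run's objective and the symbols to focus on.

    eval_by_symbol maps a symbol to its out-of-sample reward from the
    prior adapter's eval (hypothetical format).
    """
    idx = OBJECTIVES.index(current_objective)
    # Assumption: advance one stage per iteration, then stay on the last one.
    next_objective = OBJECTIVES[min(idx + 1, len(OBJECTIVES) - 1)]
    # The weakest symbols are those with the lowest out-of-sample reward.
    weakest = sorted(eval_by_symbol, key=lambda s: eval_by_symbol[s])[:k]
    return next_objective, weakest
```

For example, `reflect({"A": 0.5, "B": 0.1, "C": 0.3}, "sharpe")` returns `("min-drawdown", ["B", "C"])`, which would then be written into the next iteration's training config alongside the prior `checkpoint_id`.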
## Files
- `trade_pool/`: the full `verifiers` environment (features, causal backtester, executor,
rubric, data); installable, builds to a wheel, bundles its own OHLCV tape.
- `adapter/`: the trained LoRA adapter weights for `poolside/Laguna-XS.2`.
- `configs/`: the RL training config(s).
- `reward_curve.txt`, `eval_*.json`: training + eval metrics.
## Reproduce
```bash
prime env push --path ./trade_pool --visibility PRIVATE # -> <you>/trade-pool
prime eval run <you>/trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
prime train run configs/iter_1.toml # FREE hosted Laguna RL
prime deployments create <adapter_id> # serve the adapter
```
Built at the Poolside London hackathon, 29–30 May 2026. Team: **TradePool** (Tosin Dairo).