---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
- reinforcement-learning
- lora
- trading
- coding-agent
- verifiers
- prime-intellect
- poolside-hackathon
library_name: peft
---

# TradePool: a self-improving trading coding-agent (Laguna XS.2 LoRA)

**Poolside × Prime Intellect Research Hackathon, Foundations track.**

A LoRA adapter for `poolside/Laguna-XS.2`, trained with reinforcement learning so the model becomes a **coding agent that writes causal crypto trading-strategy functions**, scored by a leak-proof out-of-sample backtest.

## The idea in one line

> Trading discipline that normally lives as *prompt text* (a memory file of rules) is
> turned into **adapter weights** by rewarding disciplined, profitable behaviour on
> held-out market data. The verifier *is* the backtest.

## How it works

1. **Environment** (`verifiers`, v0 `SingleTurnEnv`, pushed to `stimulir/trade-pool`): the agent is given a Base-chain token's in-sample price history plus a library of causal indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write `def strategy(features, position) -> target_position`.
2. **Verifier / reward**: the strategy runs bar-by-bar over a **held-out** window (lookahead is structurally impossible; the function never sees future bars), scored by a weighted rubric:
   - OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) · sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
   - Hard gates → reward 0: invalid code, lookahead, NaN equity, **do-nothing strategies**.
3. **Training**: Prime Hosted RL (GRPO), `poolside/Laguna-XS.2`, 50 steps, batch 128, `rollouts_per_example=8`, `enable_thinking=false`. Free hosted Laguna run.
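The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in `trade_pool/`: the `"rsi"` feature key, the bar dictionary layout, the long-only clamp to `[0, 1]`, the helper names `run_causal_backtest` and `total_reward`, and the assumption that each rubric component is already normalised to `[0, 1]` are all hypothetical. Only the `strategy(features, position)` signature, the weights, and the gates come from the description above; the invalid-code and lookahead gates are not modelled here.

```python
import math

# Rubric weights as listed in "How it works"; they sum to 1.0.
WEIGHTS = {
    "sharpe": 0.40,
    "beats_hold": 0.20,
    "drawdown": 0.15,
    "exposure": 0.10,
    "cost": 0.05,
    "valid": 0.10,
}


def strategy(features, position):
    """Example of what the agent might write: a simple RSI rule."""
    rsi = features["rsi"]  # hypothetical feature key
    if rsi < 30:
        return 1.0  # oversold: go fully long
    if rsi > 70:
        return 0.0  # overbought: go flat
    return position  # otherwise keep the current position


def run_causal_backtest(strategy_fn, bars):
    """Walk the held-out bars in order.

    The decision made at bar t only sees bar t's features and earns the
    return from bar t to bar t+1, so the strategy never sees future bars.
    """
    position = 0.0
    equity = [1.0]
    trades = 0
    for t in range(len(bars) - 1):
        target = float(strategy_fn(bars[t]["features"], position))
        target = min(1.0, max(0.0, target))  # assumption: long-only exposure
        if target != position:
            trades += 1
        position = target
        ret = bars[t + 1]["close"] / bars[t]["close"] - 1.0
        equity.append(equity[-1] * (1.0 + position * ret))
    return equity, trades


def total_reward(components, equity, trades):
    """Apply two of the hard gates, then the weighted sum of components."""
    if trades == 0:  # do-nothing strategy
        return 0.0
    if any(math.isnan(x) for x in equity):  # NaN equity
        return 0.0
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Keeping the gates ahead of the weighted sum means a strategy that never trades cannot collect partial credit from the drawdown or cost terms, which is the behaviour the "do-nothing" gate is described as preventing.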
## Results

RL produced a clear reward climb on the training environment, peaking near step 13 and settling slightly below that peak by step 50:

| Stage | Total reward |
|---|---|
| step ~0 (baseline) | ~0.15 |
| step ~8 | 0.19 |
| step ~11 | 0.28 |
| step ~13 (peak) | ~0.42 |
| step ~50 (final) | ~0.34–0.41 |

Every rubric component improved together (not single-metric gaming): `reward_valid` 0.30 → ~0.70 (writes valid trading code far more often), `reward_sharpe` 0.10 → 0.33, drawdown/exposure/cost all up. Held-out-symbol eval on base Laguna scored `reward_valid` 0.75 / `reward_sharpe` 0.45, confirming the env is in the healthy trainable band before training.

## The novel contribution: closing the self-improvement loop

- **Weights channel:** each RL iteration warm-starts from the prior adapter (`checkpoint_id`), a genuine parametric continuation.
- **Curriculum channel:** a reflection step reads the prior adapter's out-of-sample eval, shifts the next run's objective (sharpe → min-drawdown → balanced), and focuses on the weakest symbols, so the agent's own results drive its next curriculum.
- **Falsifiable proof ("memory is the adapter"):** the discipline block (distilled from 618 real prior trading decisions) can be **stripped from the prompt** (`use_seed_principles=false`); if the trained adapter stays disciplined, the rules now live in the weights, not the prompt.

## Files

- `trade_pool/`: the full `verifiers` environment (features, causal backtester, executor, rubric, data). It is installable, builds to a wheel, and bundles its own OHLCV tape.
- `adapter/`: the trained LoRA adapter weights for `poolside/Laguna-XS.2`.
- `configs/`: the RL training config(s).
- `reward_curve.txt`, `eval_*.json`: training and eval metrics.
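As a rough illustration of the weights and curriculum channels described under "closing the self-improvement loop", the reflection step might look something like the sketch below. Everything here is an assumption made for illustration: the function name `plan_next_run`, the per-symbol score dictionary, the rule that the objective always advances one stage, and the keys of the returned plan are hypothetical and are not the actual training config schema. Only the objective order and the `checkpoint_id` warm start come from the text.

```python
# Objective order taken from the description: sharpe -> min-drawdown -> balanced.
OBJECTIVES = ["sharpe", "min_drawdown", "balanced"]


def plan_next_run(eval_by_symbol, current_objective, prior_checkpoint_id, n_focus=2):
    """Build a plan for the next RL iteration from the prior adapter's OOS eval.

    eval_by_symbol maps a symbol name to its out-of-sample reward (hypothetical
    layout). The lowest-scoring symbols become the focus of the next run.
    """
    stage = OBJECTIVES.index(current_objective)
    # Assumption: advance one stage per iteration and stay at the last stage.
    next_objective = OBJECTIVES[min(stage + 1, len(OBJECTIVES) - 1)]
    weakest = sorted(eval_by_symbol, key=eval_by_symbol.get)[:n_focus]
    return {
        "checkpoint_id": prior_checkpoint_id,  # weights channel: warm start
        "objective": next_objective,           # curriculum channel: new objective
        "focus_symbols": weakest,              # curriculum channel: weakest symbols
    }


if __name__ == "__main__":
    # Placeholder symbols and scores, not real eval results.
    scores = {"TOKEN_A": 0.40, "TOKEN_B": 0.10, "TOKEN_C": 0.25}
    print(plan_next_run(scores, "sharpe", "ckpt-placeholder"))
```

Carrying the checkpoint identifier and the curriculum choices in one plan keeps the two channels in step: the adapter that produced the eval is the one the next run continues from.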
## Reproduce

```bash
prime env push --path ./trade_pool --visibility PRIVATE   # -> /trade-pool
prime eval run /trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
prime train run configs/iter_1.toml                        # FREE hosted Laguna RL
prime deployments create                                   # serve the adapter
```

Built at the Poolside London hackathon, 29–30 May 2026. Team: **TradePool** (Tosin Dairo).