---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
  - reinforcement-learning
  - lora
  - trading
  - coding-agent
  - verifiers
  - prime-intellect
  - poolside-hackathon
library_name: peft
---

# TradePool — a self-improving trading coding-agent (Laguna XS.2 LoRA)

**Poolside × Prime Intellect Research Hackathon — Foundations track.**

A LoRA adapter for `poolside/Laguna-XS.2`, trained with reinforcement learning so the
model becomes a **coding agent that writes causal crypto trading-strategy functions**,
scored by a leak-proof out-of-sample backtest.

## The idea in one line
> Trading discipline that normally lives as *prompt text* (a memory file of rules) is
> turned into **adapter weights** by rewarding disciplined, profitable behaviour on
> held-out market data. The verifier *is* the backtest.

## How it works
1. **Environment** (`verifiers`, v0 `SingleTurnEnv`, pushed to `stimulir/trade-pool`):
   the agent is given a Base-chain token's in-sample price history + a library of causal
   indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write
   `def strategy(features, position) -> target_position`.
2. **Verifier / reward** — the strategy runs bar-by-bar over a **held-out** window
   (lookahead is structurally impossible; the function never sees future bars), scored by
   a weighted rubric:
   - OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) ·
     sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
   - Hard gates → reward 0: invalid code, lookahead, NaN equity, **do-nothing strategies**.
3. **Training** — Prime Hosted RL (GRPO), `poolside/Laguna-XS.2`, 50 steps, batch 128,
   `rollouts_per_example=8`, `enable_thinking=false`. The run used free hosted Laguna RL.
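
To make the environment and reward concrete, here is a minimal Python sketch of the pieces described above: an example `strategy` function, a bar-by-bar executor that only ever hands the function the current bar's features, and the weighted rubric with its hard gates. It is an illustration under stated assumptions, not the code in `trade_pool/`: the `zscore` feature key, the long-only clamp, the `fee` value, the component names in `WEIGHTS`, and the helpers `run_backtest`, `gate_failed`, and `total_reward` are all hypothetical. Only the weights and the gate conditions come from the list above.

```python
import math
from typing import Callable, Dict, List, Tuple

# Rubric weights from the list above; the key names are illustrative.
WEIGHTS = {
    "sharpe": 0.40, "beats_hold": 0.20, "drawdown": 0.15,
    "exposure": 0.10, "cost": 0.05, "valid": 0.10,
}


def strategy(features: Dict[str, float], position: float) -> float:
    """Example agent output: simple mean reversion on a z-score feature."""
    z = features["zscore"]  # assumed feature name
    if z < -1.0:
        return 1.0       # enter when price is stretched below its mean
    if z > 1.0:
        return 0.0       # exit when price is stretched above its mean
    return position      # otherwise keep the current position


def run_backtest(
    strategy_fn: Callable[[Dict[str, float], float], float],
    feature_rows: List[Dict[str, float]],
    next_returns: List[float],
    fee: float = 0.001,
) -> Tuple[List[float], int]:
    """Walk the held-out bars in order.

    feature_rows[t] is computed from bars up to t only, and next_returns[t]
    is the return from bar t to bar t+1, so the strategy never receives a
    value from the future.
    """
    position, equity, trades = 0.0, 1.0, 0
    curve = [equity]
    for row, ret in zip(feature_rows, next_returns):
        target = float(strategy_fn(dict(row), position))
        target = min(max(target, 0.0), 1.0)  # long-only clamp (assumption)
        if target != position:
            equity *= 1.0 - fee * abs(target - position)  # pay cost on turnover
            trades += 1
        position = target
        equity *= 1.0 + position * ret
        curve.append(equity)
    return curve, trades


def gate_failed(curve: List[float], trades: int) -> bool:
    """Hard gates: a do-nothing strategy or a non-finite equity value."""
    return trades == 0 or any(not math.isfinite(e) for e in curve)


def total_reward(components: Dict[str, float], curve: List[float], trades: int) -> float:
    """Weighted sum of rubric components in [0, 1], zeroed by any hard gate."""
    if gate_failed(curve, trades):
        return 0.0
    return sum(w * components[name] for name, w in WEIGHTS.items())
```

In this sketch a strategy that never changes position makes zero trades, so `total_reward` returns 0 regardless of its component scores, which mirrors the do-nothing gate. Invalid code and lookahead checks are left out because they depend on how the real executor sandboxes the generated function.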

## Results
RL raised total reward on the training environment from about 0.15 to a peak of about 0.42
around step 13, after which it settled in the 0.34–0.41 range:

| Stage | Total reward |
|---|---|
| step ~0 (baseline) | ~0.15 |
| step ~8  | 0.19 |
| step ~11 | 0.28 |
| step ~13 (peak) | ~0.42 |
| step ~50 (final) | ~0.34–0.41 |

Every rubric component improved together, rather than one metric being gamed at the
expense of the others: `reward_valid` rose 0.30 → ~0.70 (the model writes valid trading
code far more often) and `reward_sharpe` rose 0.10 → 0.33, with the drawdown, exposure,
and cost components also up. A held-out-symbol eval of the base Laguna model, run before
training, scored `reward_valid` 0.75 / `reward_sharpe` 0.45, which placed the environment
in a trainable band: neither already solved nor out of reach for the base model.

## The novel contribution: closing the self-improvement loop
- **Weights channel:** each RL iteration warm-starts from the prior adapter
  (`checkpoint_id`) — genuine parametric continuation.
- **Curriculum channel:** a reflection step reads the prior adapter's out-of-sample eval,
  shifts the next run's objective (sharpe → min-drawdown → balanced), and focuses on the
  weakest symbols, so the agent's own results drive its next curriculum.
- **Falsifiable test ("memory is the adapter"):** the discipline block (distilled from
  618 real prior trading decisions) can be **stripped from the prompt**
  (`use_seed_principles=false`). If the trained adapter stays disciplined without it, the
  rules live in the weights, not the prompt; if discipline collapses, the claim is refuted.
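
The loop in the list above can be sketched as a small reflection function that turns the prior adapter's out-of-sample eval into the next run's settings. This is a hedged illustration, not the project's implementation: the function name `reflect`, the `focus_symbols` key, the per-symbol score dictionary, the choice of `k`, and the rule of stepping one stage along the objective sequence per iteration are assumptions. Only `checkpoint_id`, `use_seed_principles`, and the sharpe → min-drawdown → balanced ordering come from the text.

```python
from typing import Dict, List

# Objective progression named in the text above.
OBJECTIVE_SEQUENCE: List[str] = ["sharpe", "min-drawdown", "balanced"]


def reflect(
    prev_objective: str,
    oos_score_by_symbol: Dict[str, float],
    prev_checkpoint_id: str,
    k: int = 2,
    use_seed_principles: bool = True,
) -> dict:
    """Build the next iteration's settings from the prior adapter's OOS eval."""
    # Curriculum channel: advance one stage, staying on the last stage once reached.
    idx = OBJECTIVE_SEQUENCE.index(prev_objective)
    next_objective = OBJECTIVE_SEQUENCE[min(idx + 1, len(OBJECTIVE_SEQUENCE) - 1)]

    # Focus on the k symbols with the lowest out-of-sample score.
    weakest = sorted(oos_score_by_symbol, key=oos_score_by_symbol.get)[:k]

    return {
        "checkpoint_id": prev_checkpoint_id,         # weights channel: warm start
        "objective": next_objective,
        "focus_symbols": weakest,                    # hypothetical config key
        "use_seed_principles": use_seed_principles,  # False strips the discipline block
    }


# Hypothetical scores and checkpoint name, for illustration only.
next_run = reflect("sharpe", {"AAA": 0.40, "BBB": 0.10, "CCC": 0.30}, "ckpt-iter-1")
```

Calling the same function with `use_seed_principles=False` gives the configuration for the prompt-stripping test: the warm-started adapter is evaluated without the discipline block in its prompt.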

## Files
- `trade_pool/` — the full `verifiers` environment (features, causal backtester, executor,
  rubric, data) — installable, builds to a wheel, bundles its own OHLCV tape.
- `adapter/` — the trained LoRA adapter weights for `poolside/Laguna-XS.2`.
- `configs/` — the RL training config(s).
- `reward_curve.txt`, `eval_*.json` — training + eval metrics.

## Reproduce
```bash
prime env push --path ./trade_pool --visibility PRIVATE     # -> <you>/trade-pool
prime eval run <you>/trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
prime train run configs/iter_1.toml                          # FREE hosted Laguna RL
prime deployments create <adapter_id>                        # serve the adapter
```

Built at the Poolside London hackathon, 29–30 May 2026. Team: **TradePool** (Tosin Dairo).