	---
	license: apache-2.0
	base_model: poolside/Laguna-XS.2
	tags:
	- reinforcement-learning
	- lora
	- trading
	- coding-agent
	- verifiers
	- prime-intellect
	- poolside-hackathon
	library_name: peft
	---

	# TradePool — a self-improving trading coding-agent (Laguna XS.2 LoRA)

	Poolside × Prime Intellect Research Hackathon — Foundations track.

	A LoRA adapter for `poolside/Laguna-XS.2`, trained with reinforcement learning so the
	model becomes a coding agent that writes causal crypto trading-strategy functions,
	scored by a leak-proof out-of-sample backtest.

	## The idea in one line
	> Trading discipline that normally lives as prompt text (a memory file of rules) is
	> turned into adapter weights by rewarding disciplined, profitable behaviour on
	> held-out market data. The verifier is the backtest.

	## How it works
	1. Environment (`verifiers`, v0 `SingleTurnEnv`, pushed to `stimulir/trade-pool`):
	the agent is given a Base-chain token's in-sample price history plus a library of causal
	indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write
	`def strategy(features, position) -> target_position`.
	2. Verifier / reward — the strategy runs bar-by-bar over a held-out window
	(lookahead is structurally impossible; the function never sees future bars), scored by
	a weighted rubric:
	- OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) ·
	sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
	- Hard gates → reward 0: invalid code, lookahead, NaN equity, do-nothing strategies.
	3. Training — Prime Hosted RL (GRPO), `poolside/Laguna-XS.2`, 50 steps, batch 128,
	`rollouts_per_example=8`, `enable_thinking=false`. The hosted Laguna run was free.
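
The bar-by-bar evaluation in step 2 can be illustrated with a minimal sketch. This is not the `trade_pool` backtester: the z-score window, the long-only clamp to [0, 1], the 0.001 fee, and the helper names `causal_features` and `backtest` are all assumptions made for illustration. Only the `strategy(features, position) -> target_position` signature comes from the environment description above.

```python
import math
from typing import Callable, Dict, List

Features = Dict[str, float]
Strategy = Callable[[Features, float], float]


def strategy(features: Features, position: float) -> float:
    """Toy mean-reversion rule (illustrative only): long on a low z-score, flat on a high one."""
    z = features["zscore"]
    if z < -1.0:
        return 1.0
    if z > 1.0:
        return 0.0
    return position  # otherwise keep the current position


def causal_features(closes: List[float], t: int, window: int = 5) -> Features:
    """Features for bar t, computed from closes[: t + 1] only, so no future bar is read."""
    hist = closes[max(0, t - window + 1) : t + 1]
    mean = sum(hist) / len(hist)
    std = math.sqrt(sum((c - mean) ** 2 for c in hist) / len(hist))
    z = (closes[t] - mean) / std if std > 0 else 0.0
    return {"close": closes[t], "zscore": z}


def backtest(closes: List[float], strat: Strategy, fee: float = 0.001) -> List[float]:
    """Walk forward one bar at a time; the position chosen at bar t earns the t -> t+1 return."""
    equity = [1.0]
    position = 0.0
    for t in range(len(closes) - 1):
        target = strat(causal_features(closes, t), position)
        target = min(1.0, max(0.0, target))      # assumed long-only exposure in [0, 1]
        cost = fee * abs(target - position)      # assumed proportional transaction cost
        ret = closes[t + 1] / closes[t] - 1.0
        equity.append(equity[-1] * (1.0 - cost) * (1.0 + target * ret))
        position = target
    return equity
```

Because the strategy only ever receives features built from bars up to `t`, changing later prices cannot change earlier equity values. That is the property the "lookahead is structurally impossible" claim relies on.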

	## Results
	On the training environment, total reward climbed from about 0.15 to a peak of about 0.42 near step 13, then settled at roughly 0.34–0.41 by step 50:

	| Stage | Total reward |
	|---|---|
	| step ~0 (baseline) | ~0.15 |
	| step ~8 | 0.19 |
	| step ~11 | 0.28 |
	| step ~13 (peak) | ~0.42 |
	| step ~50 (final) | ~0.34–0.41 |

	Every rubric component improved together, which argues against single-metric gaming:
	`reward_valid` rose from 0.30 to about 0.70 (the model writes valid trading code far more
	often), `reward_sharpe` rose from 0.10 to 0.33, and the drawdown, exposure, and cost
	components also increased. A held-out-symbol eval of base Laguna before training scored
	`reward_valid` 0.75 and `reward_sharpe` 0.45, which suggests the environment sits in a
	healthy trainable band.
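
To make the relationship between the components and the total reward concrete, here is a minimal sketch of a weighted rubric with hard gates. The weights are the ones listed under "How it works". Only `reward_valid` and `reward_sharpe` are names that appear in this README; the other keys, the assumption that each component is already scaled to [0, 1], and the `total_reward` helper are hypothetical.

```python
from typing import Dict

# Weights from the rubric in "How it works"; they sum to 1.0.
# Keys other than reward_sharpe and reward_valid are placeholder names.
WEIGHTS: Dict[str, float] = {
    "reward_sharpe": 0.40,
    "reward_beats_hold": 0.20,
    "reward_drawdown": 0.15,
    "reward_exposure": 0.10,
    "reward_cost": 0.05,
    "reward_valid": 0.10,
}


def total_reward(components: Dict[str, float], gate_failed: bool = False) -> float:
    """Weighted sum of component scores; any hard-gate failure forces the reward to 0."""
    if gate_failed:  # e.g. invalid code, NaN equity, or a do-nothing strategy
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, components.get(name, 0.0)))  # assumed [0, 1] scale
        total += weight * score
    return total
```

With this shape, a rollout that trips a gate contributes nothing regardless of its other scores, while a valid rollout can earn at most 1.0.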

	## The novel contribution: closing the self-improvement loop
	- Weights channel: each RL iteration warm-starts from the prior adapter
	(`checkpoint_id`) — genuine parametric continuation.
	- Curriculum channel: a reflection step reads the prior adapter's out-of-sample eval
	and shifts the next run's objective (sharpe → min-drawdown → balanced) and focuses on the
	weakest symbols, so the agent's own results drive its next curriculum.
	- Falsifiable proof ("memory is the adapter"): the discipline block (distilled from
	618 real prior trading decisions) can be stripped from the prompt
	(`use_seed_principles=false`); if the trained adapter stays disciplined, the rules now
	live in the weights, not the prompt.
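
The curriculum channel can be sketched as a small reflection function. The README does not specify how the objective shift is decided, so this sketch simply assumes the objective advances one step along the listed sequence and that "weakest symbols" means the lowest out-of-sample reward. The `reflect` function, the per-symbol `reward` key, and the symbol names in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

# Objective sequence as listed above.
OBJECTIVES = ["sharpe", "min-drawdown", "balanced"]


def reflect(
    eval_by_symbol: Dict[str, Dict[str, float]],
    current_objective: str,
    n_focus: int = 2,
) -> Tuple[str, List[str]]:
    """Choose the next run's objective and focus symbols from the prior adapter's OOS eval."""
    # Assumption: move one step along the sequence, staying at the last entry once reached.
    idx = OBJECTIVES.index(current_objective)
    next_objective = OBJECTIVES[min(idx + 1, len(OBJECTIVES) - 1)]
    # Assumption: the weakest symbols are those with the lowest out-of-sample reward.
    ranked = sorted(eval_by_symbol, key=lambda s: eval_by_symbol[s]["reward"])
    return next_objective, ranked[:n_focus]
```

For example, calling `reflect({"AAA": {"reward": 0.4}, "BBB": {"reward": 0.1}}, "sharpe", n_focus=1)` selects the `min-drawdown` objective and focuses the next run on `BBB`.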

	## Files
	- `trade_pool/` — the full `verifiers` environment (features, causal backtester, executor,
	rubric, data) — installable, builds to a wheel, bundles its own OHLCV tape.
	- `adapter/` — the trained LoRA adapter weights for `poolside/Laguna-XS.2`.
	- `configs/` — the RL training config(s).
	- `reward_curve.txt`, `eval_*.json` — training + eval metrics.

	## Reproduce
	```bash
	prime env push --path ./trade_pool --visibility PRIVATE # -> <you>/trade-pool
	prime eval run <you>/trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
	prime train run configs/iter_1.toml # FREE hosted Laguna RL
	prime deployments create <adapter_id> # serve the adapter
	```

	Built at the Poolside London hackathon, 29–30 May 2026. Team: TradePool (Tosin Dairo).