poolside-laguna-hackathon
/

trade-pool

Reinforcement Learning

prime-intellect

poolside-hackathon

Model card Files Files and versions

trade-pool / LOOP.md

tosi-n's picture

Upload folder using huggingface_hub

ce6b50a verified about 1 month ago

|

History Blame Contribute Delete

4.01 kB

	# The Recursive Self-Improving Loop

	How tradewatch's soft reflection (events → MEMORY.md prompt text) becomes a real
	gradient loop on Laguna XS.2, where improvement compounds across iterations through
	both adapter weights and curriculum.

	## The two improvement channels

	1. Weights (parametric continuation): each hosted RL run warm-starts from the prior
	iteration's adapter via `checkpoint_id`. The model is never reset — discipline learned
	in iter N carries into iter N+1. This is the thing tradewatch never had.
	2. Curriculum (reflection-driven): between runs, `recursive_loop.py reflect` reads the
	prior adapter's OOS eval and shifts the next run's objective (sharpe → min_drawdown
	→ balanced) and focus symbols (the weakest performers). This is tradewatch's
	`summarize_session_events` reflection — repurposed to steer RL instead of prompt notes.

	## One iteration

	```
	┌──────────────────────────────────────────────┐
	│ configs/rl/iter_N.toml │
	│ model=poolside/Laguna-XS.2 │
	│ checkpoint_id=<iter N-1 adapter> ← weights │
	│ [[env]] objective=..., symbols=[weak...] │
	└───────────────────┬──────────────────────────┘
	prime train run iter_N.toml │ (FREE hosted RL, GRPO, 128 rollouts/step)
	▼
	LoRA adapter ──► prime deployments create
	│ base:adapter_id, OpenAI-compatible
	▼
	python scripts/laguna_eval.py --model base:adapter_id --split oos_symbols
	(writes strategy per HELD-OUT symbol, scores via rubric)
	│ logs/eval_*.json
	▼
	python scripts/recursive_loop.py reflect <eval.json> --checkpoint-id <adapter>
	(curriculum policy → objective + weak-symbol focus)
	│
	▼
	configs/rl/iter_{N+1}.toml ──► loop repeats
	```

	Note Prime enforces max 1 concurrent run/user, so iterations are sequential — which
	is exactly what warm-starting requires anyway (iter N+1 needs iter N's adapter to exist).

	## Curriculum policy (`_choose_objective`, inspectable & deterministic)
	- valid≥0.8 but mean_total<0.5 → `min_drawdown` (strategies run but lose → control risk)
	- pct_wrote_code<0.7 → `sharpe` + more steps (model still learning to code)
	- otherwise → `balanced` (competent → broaden)
	- always: next run focuses the 3 weakest OOS symbols, rotates `seed` for fresh task mixes,
	lengthens to 75 steps if learning stalled (<0.5).

	## Closing to tradewatch (the demo)
	The deployed adapter is OpenAI-compatible, so tradewatch's existing `HybrieClient` runs it
	live with one config change:
	```
	base_url = https://api.pinference.ai/api/v1
	model = poolside/Laguna-XS.2:<adapter_id>
	```
	Ablation money-shot: run the adapter with MEMORY.md stripped from the prompt. If the
	discipline holds, it's provably in the weights — the memory became the adapter.

	## Run it
	```bash
	# bootstrap iteration 1
	python scripts/recursive_loop.py init --env-id <you>/stock-strategy-env --model poolside/Laguna-XS.2
	prime train run configs/rl/iter_1.toml
	prime deployments create <adapter_id>
	export PRIME_API_KEY=...
	python scripts/laguna_eval.py --model poolside/Laguna-XS.2:<adapter_id> --split oos_symbols
	python scripts/recursive_loop.py reflect logs/eval_*.json --checkpoint-id <adapter_id>
	# -> configs/rl/iter_2.toml ready; repeat
	```