trade-pool / LOOP.md
tosi-n's picture
Upload folder using huggingface_hub
ce6b50a verified
|
Raw
History Blame Contribute Delete
4.01 kB

The Recursive Self-Improving Loop

How tradewatch's soft reflection (events β†’ MEMORY.md prompt text) becomes a real gradient loop on Laguna XS.2, where improvement compounds across iterations through both adapter weights and curriculum.

The two improvement channels

  1. Weights (parametric continuation): each hosted RL run warm-starts from the prior iteration's adapter via checkpoint_id. The model is never reset β€” discipline learned in iter N carries into iter N+1. This is the thing tradewatch never had.
  2. Curriculum (reflection-driven): between runs, recursive_loop.py reflect reads the prior adapter's OOS eval and shifts the next run's objective (sharpe β†’ min_drawdown β†’ balanced) and focus symbols (the weakest performers). This is tradewatch's summarize_session_events reflection β€” repurposed to steer RL instead of prompt notes.

One iteration

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  configs/rl/iter_N.toml                       β”‚
                    β”‚  model=poolside/Laguna-XS.2                    β”‚
                    β”‚  checkpoint_id=<iter N-1 adapter>  ← weights   β”‚
                    β”‚  [[env]] objective=..., symbols=[weak...]      β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   prime train run iter_N.toml          β”‚  (FREE hosted RL, GRPO, 128 rollouts/step)
                                        β–Ό
                    LoRA adapter  ──►  prime deployments create
                                        β”‚   base:adapter_id, OpenAI-compatible
                                        β–Ό
   python scripts/laguna_eval.py --model base:adapter_id --split oos_symbols
                    (writes strategy per HELD-OUT symbol, scores via rubric)
                                        β”‚  logs/eval_*.json
                                        β–Ό
   python scripts/recursive_loop.py reflect <eval.json> --checkpoint-id <adapter>
                    (curriculum policy β†’ objective + weak-symbol focus)
                                        β”‚
                                        β–Ό
                    configs/rl/iter_{N+1}.toml   ──►  loop repeats

Note Prime enforces max 1 concurrent run/user, so iterations are sequential β€” which is exactly what warm-starting requires anyway (iter N+1 needs iter N's adapter to exist).

Curriculum policy (_choose_objective, inspectable & deterministic)

  • validβ‰₯0.8 but mean_total<0.5 β†’ min_drawdown (strategies run but lose β†’ control risk)
  • pct_wrote_code<0.7 β†’ sharpe + more steps (model still learning to code)
  • otherwise β†’ balanced (competent β†’ broaden)
  • always: next run focuses the 3 weakest OOS symbols, rotates seed for fresh task mixes, lengthens to 75 steps if learning stalled (<0.5).

Closing to tradewatch (the demo)

The deployed adapter is OpenAI-compatible, so tradewatch's existing HybrieClient runs it live with one config change:

base_url = https://api.pinference.ai/api/v1
model    = poolside/Laguna-XS.2:<adapter_id>

Ablation money-shot: run the adapter with MEMORY.md stripped from the prompt. If the discipline holds, it's provably in the weights β€” the memory became the adapter.

Run it

# bootstrap iteration 1
python scripts/recursive_loop.py init --env-id <you>/stock-strategy-env --model poolside/Laguna-XS.2
prime train run configs/rl/iter_1.toml
prime deployments create <adapter_id>
export PRIME_API_KEY=...
python scripts/laguna_eval.py --model poolside/Laguna-XS.2:<adapter_id> --split oos_symbols
python scripts/recursive_loop.py reflect logs/eval_*.json --checkpoint-id <adapter_id>
# -> configs/rl/iter_2.toml ready; repeat