File size: 7,974 Bytes
ce6b50a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
# hack-x β€” Environment & Training Spec

A self-improving crypto-trading **coding agent** trained with RL on Prime Lab.
tradewatch's soft reflection loop (events β†’ MEMORY.md prompt rules) becomes a real
gradient loop: rollout β†’ backtest verifier β†’ GRPO β†’ **LoRA adapter** β†’ deploy β†’ live demo.

---

## 1. The one-line thesis
> Today tradewatch's trading discipline lives as *prompt text* (MEMORY.md). We make it
> live as *adapter weights* by rewarding disciplined, profitable behavior on
> out-of-sample replayed market history. The proof: the trained adapter stays
> disciplined with MEMORY.md **removed from the prompt**.

## 1b. tradewatch as seed (pull all of it β€” but know its limits)
Mined both DB (4,875 events) and journal (708 rows). What's there:
- **618 `analysis_decision` rows** with 18 market features each (price_usd, liquidity_usd,
  volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours,
  fdv, market_cap) β€” this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- **9 confirmed pool addresses** β†’ guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips) β†’ rubric calibration.
- **NOT present:** any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted.
  Every market row is a point-in-time snapshot, not a replayable tape.
- **Therefore:** tradewatch = seed of *decisions/features/pools* (the brains + universe);
  GeckoTerminal fetch = the *replay price tape*. Both required; neither substitutes.

## 1c. The three loops are independent (demo β‰  batch_size)
- **Training loop:** batch_size=128 parallel replay-rollouts/step β€” a GRPO requirement only.
- **Demo loop:** 1 (or few) LIVE paper session in tradewatch, trained adapter via
  base:adapter_id. As small as we want. 128 was never the demo.
- **Recursive loop (stretch):** reflect β†’ checkpoint_id warm-start β†’ retrain.

## 2. Data (replay substrate) β€” crypto
- **Source:** GeckoTerminal free OHLCV β€” `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L`
  (already used in `scanner/base_scanner.py:843`). No key.
- **Universe:** start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus,
  GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
- **Bars & DEPTH CEILING (probed live 2026-05-30):** free OHLCV caps at **~1000 bars/pool,
  `before_timestamp` pagination does NOT walk back further.** So:
  - hourly (`tf=hour`): ~1000 bars β‰ˆ **41 days**
  - daily (`tf=day`): ~181 bars β‰ˆ **6 months**
  - **Implication:** shallow history β†’ depth can't supply many decorrelated time windows.
    **Breadth carries GRPO variance: tasks = symbol Γ— window, universe = 30–50 pools.**
- **Storage:** `data/ohlcv/<symbol>.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- **Splits (anti-leakage, structural):**
  - per-symbol time split: train window | OOS window (chronological, no overlap)
  - held-out OOS **symbols** (never seen in training) for generalization check
  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- **Crypto-hardening (because reward is gameable here):**
  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
  - survivorship caveat noted; mitigate by including tokens that later died

## 3. Action surface β€” support BOTH (decide after baseline rollouts)
The backtester accepts a uniform `target_positions` dict; two agent framings feed it:
- **A. Strategy-code:** agent writes `strategy(features_window) -> target_positions`,
  executed causally over the OOS window. Strongest "coding agent" framing.
- **B. Structured decisions:** agent emits per-bar JSON (verdict/entry/target/stop),
  reusing tradewatch's schema. Closer to existing code.
Both reduce to the same backtest call β†’ same rubric. We prototype both, pick via eval.

## 4. Backtester (port `agents/paper_ledger.py`)
- `ReplayFeed`: serves bars **only up to t** β€” lookahead impossible by construction
  (the agent/strategy literally never receives `bars[t+1:]`).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market,
  track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: **buy-and-hold**, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed β†’ identical run).

## 5. verifiers environment (v0 stable API)
Class: **`StatefulToolEnv`** (per-rollout state + stateful tools).

```python
def load_environment(
    symbols: list[str] | str = "train",   # universe or split name
    split: str = "train",                  # train | oos | oos_symbols
    objective: str = "sharpe",             # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                     # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...
```

- `setup_state(state)`: pick task (symbol(s) Γ— window Γ— objective), build fresh ReplayFeed + ledger.
- **Tools** (stateful, NOT per-rollout sandbox β€” global/in-proc exec to save credits):
  - `get_features(lookback)` β†’ indicators up to current bar (RSI, MACD, MAs, z-score,
    BB, vol, buy/sell ratio, liquidity) β€” reuses scanner feature logic
  - `run_backtest(strategy_or_decisions)` β†’ in-sample metrics (the agent's feedback loop)
  - `read_metrics()` β†’ current equity/DD/Sharpe
- `@vf.stop`: agent submits final strategy, or max_turns.
- **Reward (Rubric, weighted) β€” computed on OOS, aggregated over n_windows Γ— basket:**

  | reward fn | weight | source |
  |---|---|---|
  | `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
  | `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
  | `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
  | `r_rr_discipline` (R:Rβ‰₯2 compliance) | 0.10 | trading_agent._validate |
  | `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
  | `r_cost` (turnover/fee penalty) | 0.05 | realism |
  | **HARD GATES β†’ reward 0** | β€” | invalid code, lookahead, NaN equity, illiquid trades |

  Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake
  discipline while losing money.

## 6. Training (Prime Hosted, FREE for Laguna)
```toml
model = "poolside/Laguna-XS.2"   # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50                    # small validation; scale after curve moves
batch_size = 128                  # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8          # GRPO group β€” needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false           # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true            # base-vs-trained comparison built in
```
Pipeline: `prime eval run` (sanity, baseline 10–80%) β†’ small Qwen RL β†’ Laguna RL β†’
download LoRA β†’ `prime deployments create` β†’ `base:adapter_id`.

## 7. Closing the loop
- Deploy adapter β†’ point tradewatch `HybrieClient` at `api.pinference.ai` β†’ **live paper demo**.
- **Ablation (money shot):** trained adapter, MEMORY.md stripped from prompt β†’ discipline holds.
- **Recursive (stretch):** `checkpoint_id` warm-start β†’ reflect on failures β†’
  adjust rubric weights / add tasks / `[buffer]` difficulty filtering β†’ retrain.

## 8. Non-negotiable risks
1. Leakage β†’ structural causal feed (not a detector).
2. Reward variance β†’ aggregate over basket Γ— windows (or GRPO has no signal).
3. Reward hacking β†’ OOS + must-beat-benchmark + hard lookahead gate.
4. Crypto noise β†’ slippage/fee/liquidity model on every fill.