# hack-x — Environment & Training Spec

A self-improving crypto-trading **coding agent** trained with RL on Prime Lab.
tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real
gradient loop: rollout → backtest verifier → GRPO → **LoRA adapter** → deploy → live demo.

---

## 1. The one-line thesis
> Today tradewatch's trading discipline lives as *prompt text* (MEMORY.md). We make it
> live as *adapter weights* by rewarding disciplined, profitable behavior on
> out-of-sample replayed market history. The proof: the trained adapter stays
> disciplined with MEMORY.md **removed from the prompt**.

## 1b. tradewatch as seed (pull all of it — but know its limits)
We mined both the DB (4,875 events) and the journal (708 rows). What's there:
- **618 `analysis_decision` rows** with 18 market features each (price_usd, liquidity_usd,
  volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours,
  fdv, market_cap) — this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- **9 confirmed pool addresses** → guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips) → rubric calibration.
- **NOT present:** any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted.
  Every market row is a point-in-time snapshot, not a replayable tape.
- **Therefore:** tradewatch = seed of *decisions/features/pools* (the brains + universe);
  GeckoTerminal fetch = the *replay price tape*. Both required; neither substitutes.

## 1c. The three loops are independent (demo ≠ batch_size)
- **Training loop:** batch_size=128 parallel replay-rollouts/step — a GRPO requirement only.
- **Demo loop:** one (or a few) LIVE paper sessions in tradewatch, served by the trained adapter via
  `base:adapter_id`. As small as we want. 128 was never the demo.
- **Recursive loop (stretch):** reflect → `checkpoint_id` warm-start → retrain.

## 2. Data (replay substrate) — crypto
- **Source:** GeckoTerminal free OHLCV — `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L`
  (already used in `scanner/base_scanner.py:843`). No key.
- **Universe:** start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus,
  GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
- **Bars & DEPTH CEILING (probed live 2026-05-30):** free OHLCV caps at **~1000 bars/pool,
  `before_timestamp` pagination does NOT walk back further.** So:
  - hourly (`tf=hour`): ~1000 bars ≈ **41 days**
  - daily (`tf=day`): ~181 bars ≈ **6 months**
  - **Implication:** history is shallow, so depth alone can't supply many decorrelated time windows.
    **Breadth must supply the reward variance GRPO needs: tasks = symbol × window, universe = 30–50 pools.**
- **Storage:** `data/ohlcv/<symbol>.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- **Splits (anti-leakage, structural):**
  - per-symbol time split: train window | OOS window (chronological, no overlap)
  - held-out OOS **symbols** (never seen in training) for generalization check
  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- **Crypto-hardening (because reward is gameable here):**
  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
  - survivorship caveat noted; mitigate by including tokens that later died
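
The split rules above can be sketched as a small task builder. This is only an illustration under stated assumptions: `Task`, `build_tasks`, the 70% train share, and the 100-bar window are hypothetical names and values, not part of the existing code. It shows the two structural guarantees: train windows end before OOS windows begin for every symbol, and held-out symbols never receive a train label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    symbol: str
    split: str   # "train" | "oos" | "oos_symbols"
    start: int   # first bar index (inclusive)
    end: int     # last bar index (exclusive)


def build_tasks(bars_per_symbol, holdout_symbols, train_pct=70, window=100):
    """Cut each symbol's tape into non-overlapping windows and label the split.

    train_pct and window are illustrative defaults (assumptions), not tuned values.
    """
    tasks = []
    for symbol in sorted(bars_per_symbol):
        n_bars = bars_per_symbol[symbol]
        cutoff = n_bars * train_pct // 100  # chronological boundary: train | OOS
        for start in range(0, n_bars - window + 1, window):
            end = start + window
            if symbol in holdout_symbols:
                split = "oos_symbols"   # symbol is never seen in training
            elif end <= cutoff:
                split = "train"
            elif start >= cutoff:
                split = "oos"
            else:
                continue                # window straddles the boundary: drop it
            tasks.append(Task(symbol, split, start, end))
    return tasks


# ~1000 hourly bars per pool, one held-out symbol
tasks = build_tasks(
    {"VIRTUAL": 1000, "DEGEN": 1000, "PITCH": 1000},
    holdout_symbols={"PITCH"},
)
```

With only about ten windows per pool at this window size, most of the task count comes from adding pools, which is the breadth argument made above.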

## 3. Action surface — support BOTH (decide after baseline rollouts)
The backtester accepts a uniform `target_positions` dict; two agent framings feed it:
- **A. Strategy-code:** agent writes `strategy(features_window) -> target_positions`,
  executed causally over the OOS window. Strongest "coding agent" framing.
- **B. Structured decisions:** agent emits per-bar JSON (verdict/entry/target/stop),
  reusing tradewatch's schema. Closer to existing code.

Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.
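
A minimal sketch of how both framings could reduce to one `target_positions` dict. Everything here is an assumption for illustration: the helper names `from_strategy` and `from_decision`, the `"symbol"` key, the `"buy"` verdict value, the long-only convention, and the fixed 10% size are hypothetical and not taken from tradewatch.

```python
from typing import Callable

TargetPositions = dict[str, float]  # symbol -> target fraction of equity (assumed long-only)


def from_strategy(strategy: Callable[[list], TargetPositions], window: list) -> TargetPositions:
    """Framing A: call agent-written code on the causal feature window."""
    return strategy(window)


def from_decision(decision: dict, size: float = 0.1) -> TargetPositions:
    """Framing B: map one per-bar JSON decision to a target position.

    Only the verdict sets the position here; entry/target/stop would feed
    the ledger's order handling, which this sketch leaves out.
    """
    wants_long = decision["verdict"] == "buy"   # "buy" is a hypothetical verdict value
    return {decision["symbol"]: size if wants_long else 0.0}


def momentum(window: list) -> TargetPositions:
    """Toy strategy: hold 10% while the last close is above the window mean."""
    closes = [bar["c"] for bar in window]
    above_mean = closes[-1] > sum(closes) / len(closes)
    return {"DEGEN": 0.1 if above_mean else 0.0}


window = [{"c": 1.0}, {"c": 2.0}, {"c": 3.0}]
decision = {"symbol": "DEGEN", "verdict": "buy", "entry": 3.0, "target": 3.9, "stop": 2.7}

positions_a = from_strategy(momentum, window)
positions_b = from_decision(decision)
```

Because both paths produce the same dict shape, the backtester and rubric need no knowledge of which framing generated it.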

## 4. Backtester (port `agents/paper_ledger.py`)
- `ReplayFeed`: serves bars **only up to t** — lookahead impossible by construction
  (the agent/strategy literally never receives `bars[t+1:]`).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market,
  track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: **buy-and-hold**, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
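
The causal-feed idea can be sketched in a few lines. This is an assumed shape and not the port of `agents/paper_ledger.py`: the method names `history` and `advance` and the `run` loop are hypothetical. The point is that the strategy is handed a slice ending at `t`, never the feed object or the full tape.

```python
class ReplayFeed:
    """Serves bars only up to the current index t."""

    def __init__(self, bars):
        self._bars = tuple(bars)  # immutable copy so callers cannot rewrite history
        self._t = 0

    def history(self, lookback=None):
        """Return bars[0..t] inclusive, or only the last `lookback` of them."""
        seen = self._bars[: self._t + 1]
        if lookback is None:
            return seen
        return seen[max(0, len(seen) - lookback):]

    def advance(self):
        """Move to the next bar; return False once the tape is exhausted."""
        if self._t + 1 >= len(self._bars):
            return False
        self._t += 1
        return True


def run(feed, strategy):
    """Step bar by bar; the strategy sees a history slice, not the feed itself."""
    decisions = []
    while True:
        decisions.append(strategy(feed.history()))
        if not feed.advance():
            return decisions


# Causality check: changing a future bar must not change earlier decisions.
original = run(ReplayFeed([1, 2, 3, 4]), sum)
perturbed = run(ReplayFeed([1, 2, 3, 999]), sum)
```

A unit test for the causality invariant can follow the same pattern: perturb bars after index `t` and assert that every decision up to `t` is unchanged.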

## 5. verifiers environment (v0 stable API)
Class: **`StatefulToolEnv`** (per-rollout state + stateful tools).

```python
import verifiers as vf  # provides vf.Environment used in the return annotation


def load_environment(
    symbols: list[str] | str = "train",   # universe or split name
    split: str = "train",                  # train | oos | oos_symbols
    objective: str = "sharpe",             # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                     # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...
```

- `setup_state(state)`: pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.
- **Tools** (stateful, NOT per-rollout sandbox — global/in-proc exec to save credits):
  - `get_features(lookback)` → indicators up to current bar (RSI, MACD, MAs, z-score,
    BB, vol, buy/sell ratio, liquidity) — reuses scanner feature logic
  - `run_backtest(strategy_or_decisions)` → in-sample metrics (the agent's feedback loop)
  - `read_metrics()` → current equity/DD/Sharpe
- `@vf.stop`: the rollout ends when the agent submits its final strategy or `max_turns` is reached.
- **Reward (Rubric, weighted) — computed on OOS, aggregated over n_windows × basket:**

  | reward fn | weight | source |
  |---|---|---|
  | `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
  | `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
  | `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
  | `r_rr_discipline` (R:R≥2 compliance) | 0.10 | trading_agent._validate |
  | `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
  | `r_cost` (turnover/fee penalty) | 0.05 | realism |
  | **HARD GATES → reward 0** | — | invalid code, lookahead, NaN equity, illiquid trades |

  Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake
  discipline while losing money.
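
  A plain-Python sketch of how the weights and hard gates in the table combine. It assumes each reward function returns a value in [0, 1] and uses hypothetical names (`WEIGHTS`, `total_reward`, `gate_violations`); the real implementation would be a verifiers rubric, which this does not reproduce.

```python
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}
OUTCOME_TERMS = ("r_sharpe", "r_beats_benchmark")


def total_reward(components, gate_violations=()):
    """Weighted sum of components clipped to [0, 1]; any hard-gate violation zeroes it."""
    if gate_violations:  # e.g. ("lookahead",) or ("nan_equity",)
        return 0.0
    return sum(
        weight * min(1.0, max(0.0, components[name]))
        for name, weight in WEIGHTS.items()
    )


# A perfectly "disciplined" rollout that makes no money...
disciplined_loser = {name: 0.0 if name in OUTCOME_TERMS else 1.0 for name in WEIGHTS}
# ...versus a profitable rollout that ignores every discipline term.
sloppy_winner = {name: 1.0 if name in OUTCOME_TERMS else 0.0 for name in WEIGHTS}

loser_score = total_reward(disciplined_loser)    # 0.15 + 0.10 + 0.10 + 0.05
winner_score = total_reward(sloppy_winner)       # 0.40 + 0.20
gated_score = total_reward(sloppy_winner, gate_violations=("lookahead",))
```

  With these weights the outcome terms carry 0.60 of the total, so maxing out every discipline term while losing money tops out at 0.40.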

## 6. Training (Prime Hosted, FREE for Laguna)
```toml
model = "poolside/Laguna-XS.2"   # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50                    # small validation; scale after curve moves
batch_size = 128                  # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8          # GRPO group — needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false           # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true            # base-vs-trained comparison built in
```
Pipeline: `prime eval run` (sanity check; baseline reward in the 10–80% range) → small Qwen RL → Laguna RL →
download LoRA → `prime deployments create` → `base:adapter_id`.
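
Why the GRPO group "needs decorrelated windows" can be shown with the group-relative advantage as it is commonly described: each rollout's reward minus the group mean, divided by the group standard deviation. This is a sketch of that textbook form; the hosted trainer's exact normalization is not confirmed by this spec, and `group_advantages` is a hypothetical helper. Taking `batch_size` as a rollout count, as the config comment states, 128 rollouts at 8 per example gives 16 groups per step.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


groups_per_step = 128 // 8  # batch_size // rollouts_per_example

# Eight rollouts that all score the same: no rollout is better than another.
flat_group = group_advantages([0.5] * 8)

# Eight rollouts with spread: above-mean rollouts get positive advantage.
mixed_group = group_advantages([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
```

A group whose rewards are all equal yields zero advantage for every rollout and therefore no gradient, which is why each task must produce some spread in outcomes across its 8 rollouts.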

## 7. Closing the loop
- Deploy adapter → point tradewatch `HybrieClient` at `api.pinference.ai` → **live paper demo**.
- **Ablation (money shot):** trained adapter, MEMORY.md stripped from prompt → discipline holds.
- **Recursive (stretch):** `checkpoint_id` warm-start → reflect on failures →
  adjust rubric weights / add tasks / `[buffer]` difficulty filtering → retrain.

## 8. Non-negotiable risks
1. Leakage → structural causal feed (not a detector).
2. Reward variance → aggregate over basket × windows (or GRPO has no signal).
3. Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
4. Crypto noise → slippage/fee/liquidity model on every fill.
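
For risk 4, the fill model described in section 2 (fixed bps plus size-vs-liquidity impact, with a liquidity gate) might look like the sketch below. All numbers and names are placeholders chosen for illustration: `fill_price`, the 30 bps fee, the linear impact term, and the 50,000 USD liquidity floor are assumptions, not calibrated values.

```python
def fill_price(mid, notional_usd, liquidity_usd, side,
               fee_bps=30.0, impact_coeff=1.0, min_liquidity_usd=50_000.0):
    """Return the effective fill price, or None if the bar fails the liquidity gate.

    Cost = fixed fee (in bps) + linear impact proportional to order size / pool liquidity.
    Buys fill above mid, sells fill below it.
    """
    if liquidity_usd < min_liquidity_usd:
        return None  # illiquid bar: the trade is rejected
    impact = impact_coeff * notional_usd / liquidity_usd
    cost = fee_bps / 10_000 + impact
    sign = 1.0 if side == "buy" else -1.0
    return mid * (1.0 + sign * cost)


# A 1,000 USD order against 100,000 USD of liquidity: 0.3% fee + 1% impact.
buy_fill = fill_price(1.0, 1_000, 100_000, "buy")
sell_fill = fill_price(1.0, 1_000, 100_000, "sell")
rejected = fill_price(1.0, 1_000, 10_000, "buy")
```

Because the impact term grows with order size relative to pool liquidity, a strategy that sizes up in thin pools pays for it on every fill, which is the behavior the `r_cost` and `r_exposure` terms are meant to discourage.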