# hack-x: Environment & Training Spec

A self-improving crypto-trading **coding agent** trained with RL on Prime Lab. tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real gradient loop: rollout → backtest verifier → GRPO → **LoRA adapter** → deploy → live demo.

---

## 1. The one-line thesis

> Today tradewatch's trading discipline lives as *prompt text* (MEMORY.md). We make it
> live as *adapter weights* by rewarding disciplined, profitable behavior on
> out-of-sample replayed market history. The proof: the trained adapter stays
> disciplined with MEMORY.md **removed from the prompt**.

## 1b. tradewatch as seed (pull all of it, but know its limits)

Mined both DB (4,875 events) and journal (708 rows). What's there:

- **618 `analysis_decision` rows** with 18 market features each (price_usd, liquidity_usd, volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours, fdv, market_cap): this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- **9 confirmed pool addresses** → guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips) → rubric calibration.
- **NOT present:** any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted. Every market row is a point-in-time snapshot, not a replayable tape.
- **Therefore:** tradewatch = seed of *decisions/features/pools* (the brains + universe); GeckoTerminal fetch = the *replay price tape*. Both required; neither substitutes.

## 1c. The three loops are independent (demo ≠ batch_size)

- **Training loop:** batch_size=128 parallel replay-rollouts/step; a GRPO requirement only.
- **Demo loop:** 1 (or few) LIVE paper session in tradewatch, trained adapter via base:adapter_id. As small as we want. 128 was never the demo.
- **Recursive loop (stretch):** reflect → checkpoint_id warm-start → retrain.

## 2. Data (replay substrate): crypto

- **Source:** GeckoTerminal free OHLCV: `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L` (already used in `scanner/base_scanner.py:843`). No key.
- **Universe:** start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus, GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
- **Bars & DEPTH CEILING (probed live 2026-05-30):** free OHLCV caps at **~1000 bars/pool, `before_timestamp` pagination does NOT walk back further.** So:
  - hourly (`tf=hour`): ~1000 bars ≈ **41 days**
  - daily (`tf=day`): ~181 bars ≈ **6 months**
- **Implication:** shallow history → depth can't supply many decorrelated time windows. **Breadth carries GRPO variance: tasks = symbol × window, universe = 30–50 pools.**
- **Storage:** `data/ohlcv/.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- **Splits (anti-leakage, structural):**
  - per-symbol time split: train window | OOS window (chronological, no overlap)
  - held-out OOS **symbols** (never seen in training) for generalization check
  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- **Crypto-hardening (because reward is gameable here):**
  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
  - survivorship caveat noted; mitigate by including tokens that later died

## 3. Action surface: support BOTH (decide after baseline rollouts)

The backtester accepts a uniform `target_positions` dict; two agent framings feed it:

- **A. Strategy-code:** agent writes `strategy(features_window) -> target_positions`, executed causally over the OOS window. Strongest "coding agent" framing.
- **B. Structured decisions:** agent emits per-bar JSON (verdict/entry/target/stop), reusing tradewatch's schema. Closer to existing code.

Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.

## 4. Backtester (port `agents/paper_ledger.py`)

- `ReplayFeed`: serves bars **only up to t**; lookahead impossible by construction (the agent/strategy literally never receives `bars[t+1:]`).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market, track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: **buy-and-hold**, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).

## 5. verifiers environment (v0 stable API)

Class: **`StatefulToolEnv`** (per-rollout state + stateful tools).

```python
import verifiers as vf


def load_environment(
    symbols: list[str] | str = "train",  # universe or split name
    split: str = "train",                # train | oos | oos_symbols
    objective: str = "sharpe",           # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                  # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...
```

- `setup_state(state)`: pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.
- **Tools** (stateful, NOT per-rollout sandbox; global/in-proc exec to save credits):
  - `get_features(lookback)` → indicators up to current bar (RSI, MACD, MAs, z-score, BB, vol, buy/sell ratio, liquidity); reuses scanner feature logic
  - `run_backtest(strategy_or_decisions)` → in-sample metrics (the agent's feedback loop)
  - `read_metrics()` → current equity/DD/Sharpe
- `@vf.stop`: agent submits final strategy, or max_turns.
- **Reward (Rubric, weighted), computed on OOS, aggregated over n_windows × basket:**

| reward fn | weight | source |
|---|---|---|
| `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
| `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
| `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
| `r_rr_discipline` (R:R≥2 compliance) | 0.10 | trading_agent._validate |
| `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
| `r_cost` (turnover/fee penalty) | 0.05 | realism |
| **HARD GATES → reward 0** | n/a | invalid code, lookahead, NaN equity, illiquid trades |

Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake discipline while losing money.

## 6. Training (Prime Hosted, FREE for Laguna)

```toml
model = "poolside/Laguna-XS.2"   # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50                   # small validation; scale after curve moves
batch_size = 128                 # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8         # GRPO group; needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16

[sampling]
max_tokens = 512
enable_thinking = false          # docs: start non-reasoning for agentic tasks

[[env]]
id = "/stock-strategy-env"

[eval]
eval_base_model = true           # base-vs-trained comparison built in
```

Pipeline: `prime eval run` (sanity, baseline 10–80%) → small Qwen RL → Laguna RL → download LoRA → `prime deployments create` → `base:adapter_id`.

## 7. Closing the loop

- Deploy adapter → point tradewatch `HybrieClient` at `api.pinference.ai` → **live paper demo**.
- **Ablation (money shot):** trained adapter, MEMORY.md stripped from prompt → discipline holds.
- **Recursive (stretch):** `checkpoint_id` warm-start → reflect on failures → adjust rubric weights / add tasks / `[buffer]` difficulty filtering → retrain.

## 8. Non-negotiable risks

1. Leakage → structural causal feed (not a detector).
2. Reward variance → aggregate over basket × windows (or GRPO has no signal).
3. Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
4. Crypto noise → slippage/fee/liquidity model on every fill.
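
The first and third mitigations above (a structurally causal feed, and hard gates in front of the weighted rubric) can be sketched in a few lines. This is a minimal illustration, not the ported `agents/paper_ledger.py` code: the `ReplayFeed` method names, the `RolloutResult` container, and the assumption that every rubric term is already normalized to [0, 1] are all hypothetical. Only the weights and the list of gates come from the section 5 table.

```python
import math
from dataclasses import dataclass

# Weights copied from the section 5 rubric table; they sum to 1.0.
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}


class ReplayFeed:
    """Serves bars only up to the current index t (hypothetical interface)."""

    def __init__(self, bars):
        self._bars = list(bars)
        self._t = 0

    def window(self, lookback):
        # The slice ends at t inclusive, so bars[t+1:] are never handed out.
        start = max(0, self._t + 1 - lookback)
        return self._bars[start:self._t + 1]

    def advance(self):
        # Move to the next bar; returns False once the tape is exhausted.
        if self._t + 1 >= len(self._bars):
            return False
        self._t += 1
        return True


@dataclass
class RolloutResult:
    # Hypothetical container for one rollout's backtest outcome.
    components: dict              # per-term scores, assumed to lie in [0, 1]
    final_equity: float
    code_valid: bool = True
    lookahead_detected: bool = False
    illiquid_trades: int = 0


def total_reward(result):
    """Weighted rubric sum, forced to 0.0 when any hard gate trips."""
    gated = (
        not result.code_valid
        or result.lookahead_detected
        or result.illiquid_trades > 0
        # isfinite rejects NaN equity (and, as an extra assumption, infinities).
        or not math.isfinite(result.final_equity)
    )
    if gated:
        return 0.0
    # Missing terms count as 0.0 rather than raising.
    return sum(w * result.components.get(name, 0.0) for name, w in WEIGHTS.items())
```

In this sketch the strategy only ever sees `feed.window(lookback)`, so causality is a property of the interface rather than something a detector has to catch. The gates are checked before any weighting, so a rollout with invalid code or NaN equity cannot recover reward through the discipline terms.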