hack-x — Environment & Training Spec

A self-improving crypto-trading coding agent trained with RL on Prime Lab. tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real gradient loop: rollout → backtest verifier → GRPO → LoRA adapter → deploy → live demo.

1. The one-line thesis

Today tradewatch's trading discipline lives as prompt text (MEMORY.md). We make it live as adapter weights by rewarding disciplined, profitable behavior on out-of-sample replayed market history. The proof: the trained adapter stays disciplined with MEMORY.md removed from the prompt.

1b. tradewatch as seed (pull all of it — but know its limits)

Mined both the DB (4,875 events) and the journal (708 rows). What's there:

618 analysis_decision rows with 18 market features each (price_usd, liquidity_usd, volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours, fdv, market_cap) — this IS the agent's feature schema + seed/format examples.
Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
9 confirmed pool addresses → guaranteed-good fetch universe seed.
Trade outcomes (exits/PnL/skips) → rubric calibration.
NOT present: any stored OHLCV time series. ohlcv_1h was fetched live, never persisted. Every market row is a point-in-time snapshot, not a replayable tape.
Therefore: tradewatch = seed of decisions/features/pools (the brains + universe); GeckoTerminal fetch = the replay price tape. Both required; neither substitutes.

1c. The three loops are independent (demo ≠ batch_size)

Training loop: batch_size=128 parallel replay-rollouts/step — a GRPO requirement only.
Demo loop: one (or a few) LIVE paper sessions in tradewatch, served by the trained adapter via base:adapter_id. It can be as small as we want; batch_size=128 was never the demo size.
Recursive loop (stretch): reflect → checkpoint_id warm-start → retrain.

2. Data (replay substrate) — crypto

Source: GeckoTerminal free OHLCV — /networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L (already used in scanner/base_scanner.py:843). No key.
Universe: start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus, GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
Bars & DEPTH CEILING (probed live 2026-05-30): free OHLCV caps at ~1000 bars/pool, and before_timestamp pagination does NOT walk back further. So:
- hourly (tf=hour): ~1000 bars ≈ 41 days
- daily (tf=day): ~181 bars ≈ 6 months
- Implication: shallow history → depth can't supply many decorrelated time windows. Breadth carries GRPO variance: tasks = symbol × window, universe = 30–50 pools.
Storage: data/ohlcv/<symbol>.parquet (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
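A minimal sketch of the fetch-side plumbing that sits between the Source and Storage items above. It only builds the request URL and normalizes a response into (ts, o, h, l, c, v) rows; the HTTP call and the parquet write are left out. The API root `https://api.geckoterminal.com/api/v2` and the `data.attributes.ohlcv_list` response shape are assumptions not stated in this spec, and `ohlcv_url` / `parse_ohlcv` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumption: public GeckoTerminal API root; the spec only gives the path.
BASE_URL = "https://api.geckoterminal.com/api/v2"


def ohlcv_url(pool: str, tf: str = "hour", aggregate: int = 1, limit: int = 1000) -> str:
    """Build the OHLCV URL from the path template quoted in the spec."""
    query = urlencode({"aggregate": aggregate, "limit": limit})
    return f"{BASE_URL}/networks/base/pools/{pool}/ohlcv/{tf}?{query}"


def parse_ohlcv(payload: dict) -> list[dict]:
    """Normalize a response into rows sorted oldest-first, deduplicated by ts."""
    # Assumption: bars arrive as [ts, open, high, low, close, volume] lists.
    raw = payload["data"]["attributes"]["ohlcv_list"]
    by_ts = {}
    for ts, o, h, l, c, v in raw:
        by_ts[int(ts)] = {
            "ts": int(ts),
            "o": float(o),
            "h": float(h),
            "l": float(l),
            "c": float(c),
            "v": float(v),
        }
    return [by_ts[ts] for ts in sorted(by_ts)]
```

Sorting oldest-first at ingest time means the replay feed can slice by index without re-sorting, and deduplicating by timestamp keeps repeated fetches idempotent.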
Splits (anti-leakage, structural):
- per-symbol time split: train window | OOS window (chronological, no overlap)
- held-out OOS symbols (never seen in training) for generalization check
- with shallow per-symbol history, the symbol-holdout split does the heavy lifting
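The two structural splits above can be sketched as follows. This is an illustration under stated assumptions: the 20% symbol holdout and the 70/30 chronological cut are placeholder fractions, not values from this spec, and `split_symbols` / `split_time` are hypothetical names.

```python
import random


def split_symbols(symbols, holdout_frac=0.2, seed=0):
    """Reserve a seeded random subset of symbols that training never sees."""
    shuffled = sorted(symbols)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def split_time(bars, train_frac=0.7):
    """Chronological cut: bars before the cut are train, the rest are OOS."""
    ordered = sorted(bars, key=lambda b: b["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Because both functions are pure and seeded, the same universe always yields the same train/OOS assignment, which keeps eval numbers comparable across runs.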
Crypto-hardening (because reward is gameable here):
- slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
- min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
- survivorship caveat noted; mitigate by including tokens that later died
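One possible shape for the fill model and liquidity gate described above. All numbers (fee, base slippage, impact coefficient, gate thresholds) are illustrative assumptions rather than calibrated values, and folding the fee into the executed price is a simplification; `CostModel` and `fill_price` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    fee_bps: float = 30.0               # assumed fixed fee per fill
    base_slippage_bps: float = 10.0     # assumed fixed slippage per fill
    impact_bps_per_unit: float = 1000.0  # extra bps per unit of notional/liquidity
    min_liquidity_usd: float = 50_000.0
    min_volume_usd: float = 1_000.0


def fill_price(mid, notional_usd, liquidity_usd, volume_usd, side, model=CostModel()):
    """Return the executed price, or None when the bar is too illiquid to trade."""
    if side not in ("buy", "sell"):
        raise ValueError(f"unknown side: {side!r}")
    # Gate: illiquid or dead bars cannot be traded at all.
    if liquidity_usd < model.min_liquidity_usd or volume_usd < model.min_volume_usd:
        return None
    # Fixed bps plus an impact term that grows with size relative to liquidity.
    impact_bps = model.impact_bps_per_unit * (notional_usd / liquidity_usd)
    total_bps = model.fee_bps + model.base_slippage_bps + impact_bps
    sign = 1 if side == "buy" else -1  # buys pay up, sells receive less
    return mid * (1 + sign * total_bps / 10_000)
```

With these defaults, a 1,000 USD buy against 100,000 USD of liquidity pays 50 bps in total (30 fee, 10 slippage, 10 impact), so costs scale up quickly for oversized orders.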

3. Action surface — support BOTH (decide after baseline rollouts)

The backtester accepts a uniform target_positions dict; two agent framings feed it:

A. Strategy-code: agent writes strategy(features_window) -> target_positions, executed causally over the OOS window. Strongest "coding agent" framing.
B. Structured decisions: agent emits per-bar JSON (verdict/entry/target/stop), reusing tradewatch's schema. Closer to existing code.

Both reduce to the same backtest call → same rubric. We prototype both, then pick via eval.
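To make "both reduce to the same backtest call" concrete, here is a hedged sketch of two thin adapters that produce the shared target_positions dict. The verdict values ("buy" vs anything else), the fixed 10% position size, and the helper names are assumptions for illustration; tradewatch's real schema may differ.

```python
import json


def strategy_to_targets(strategy, features_window):
    """Framing A: the agent-written strategy already returns target positions."""
    return dict(strategy(features_window))


def decision_to_targets(symbol, decision_json, position_size=0.1):
    """Framing B: map one per-bar JSON decision onto the same dict shape."""
    decision = json.loads(decision_json)
    # Assumption: any verdict other than "buy" means stay flat.
    if decision.get("verdict") != "buy":
        return {symbol: 0.0}
    return {symbol: position_size}


def risk_reward(entry, target, stop):
    """Reward-to-risk ratio of a long decision; 0.0 when the stop is not below entry."""
    risk = entry - stop
    if risk <= 0:
        return 0.0
    return (target - entry) / risk
```

Keeping the adapters this thin means the engine, cost model, and rubric never need to know which framing produced the positions.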

4. Backtester (port `agents/paper_ledger.py`)

ReplayFeed: serves bars only up to t — lookahead impossible by construction (the agent/strategy literally never receives bars[t+1:]).
Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market, track equity, cash, exposure, per-trade R:R, drawdown.
Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
Benchmarks computed on same window: buy-and-hold, MA-crossover, z-score mean-reversion.
Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
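The causality invariant can be sketched with a feed that only ever slices up to the current index. This is a simplified stand-in for the ReplayFeed described above, not the ported ledger code; the `window` / `advance` method names, the `run` driver, and the lookback of 3 are assumptions.

```python
class ReplayFeed:
    """Serves bars up to the current index only; later bars are never returned."""

    def __init__(self, bars):
        if not bars:
            raise ValueError("ReplayFeed needs at least one bar")
        self._bars = tuple(bars)  # immutable copy so callers cannot mutate history
        self._t = 0

    @property
    def t(self):
        return self._t

    def window(self, lookback):
        """Return at most `lookback` bars ending at the current bar (inclusive)."""
        start = max(0, self._t + 1 - lookback)
        return self._bars[start:self._t + 1]

    def advance(self):
        """Move to the next bar; return False once the tape is exhausted."""
        if self._t + 1 >= len(self._bars):
            return False
        self._t += 1
        return True


def run(strategy, bars, lookback=3):
    """Call the strategy once per bar, feeding it only causal windows."""
    feed = ReplayFeed(bars)
    targets = []
    while True:
        targets.append(strategy(feed.window(lookback)))
        if not feed.advance():
            break
    return targets
```

A unit test for the invariant can record the newest timestamp the strategy sees at each step and assert it never exceeds the current bar; running the same strategy twice and comparing outputs covers determinism.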

5. verifiers environment (v0 stable API)

Class: StatefulToolEnv (per-rollout state + stateful tools).

import verifiers as vf

def load_environment(
    symbols: list[str] | str = "train",   # universe or split name
    split: str = "train",                  # train | oos | oos_symbols
    objective: str = "sharpe",             # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                     # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...

setup_state(state): pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.
Tools (stateful, NOT per-rollout sandbox — global/in-proc exec to save credits):
- get_features(lookback) → indicators up to current bar (RSI, MACD, MAs, z-score, BB, vol, buy/sell ratio, liquidity) — reuses scanner feature logic
- run_backtest(strategy_or_decisions) → in-sample metrics (the agent's feedback loop)
- read_metrics() → current equity/DD/Sharpe
@vf.stop: the rollout ends when the agent submits its final strategy or max_turns is reached.

Reward (Rubric, weighted) — computed on OOS, aggregated over n_windows × basket:

| reward fn | weight | source |
|---|---|---|
| `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
| `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
| `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
| `r_rr_discipline` (R:R≥2 compliance) | 0.10 | trading_agent._validate |
| `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
| `r_cost` (turnover/fee penalty) | 0.05 | realism |
| HARD GATES → reward 0 | n/a | invalid code, lookahead, NaN equity, illiquid trades |

Outcome terms (sharpe + beats_benchmark, together 0.60 of the total weight) must dominate so the model can't fake discipline while losing money.
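A plain-Python sketch of how the weighted rubric and hard gates could combine, independent of the verifiers Rubric class. It assumes each component score is already normalized to [0, 1] and that a gate failure zeroes the affected window; both are assumptions, as are the function names. The weights are the ones listed in the rubric.

```python
import math

WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}


def window_reward(scores, gate_failed):
    """Weighted sum of clamped component scores; any hard-gate failure gives 0."""
    if gate_failed:
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = scores[name]
        if math.isnan(score):
            return 0.0  # treat NaN like a hard-gate failure
        total += weight * min(1.0, max(0.0, score))
    return total


def rollout_reward(windows):
    """Average the per-window rewards; `windows` is a list of (scores, gate_failed)."""
    return sum(window_reward(s, g) for s, g in windows) / len(windows)
```

Under these assumptions, a rollout with perfect discipline scores but zero on both outcome terms earns at most 0.40, while a rollout that only nails the outcome terms earns 0.60.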

6. Training (Prime Hosted, FREE for Laguna)

model = "poolside/Laguna-XS.2"   # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50                    # small validation; scale after curve moves
batch_size = 128                  # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8          # GRPO group — needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false           # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true            # base-vs-trained comparison built in

Pipeline: prime eval run (sanity check; baseline reward should land in 10–80%) → small Qwen RL → Laguna RL → download LoRA → prime deployments create → base:adapter_id.

7. Closing the loop

Deploy adapter → point tradewatch HybrieClient at api.pinference.ai → live paper demo.
Ablation (money shot): trained adapter, MEMORY.md stripped from prompt → discipline holds.
Recursive (stretch): checkpoint_id warm-start → reflect on failures → adjust rubric weights / add tasks / [buffer] difficulty filtering → retrain.

8. Non-negotiable risks

Leakage → structural causal feed (not a detector).
Reward variance → aggregate over basket × windows (or GRPO has no signal).
Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
Crypto noise → slippage/fee/liquidity model on every fill.
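The reward-variance risk is easier to see with the group-relative advantage that GRPO is commonly described as using: each rollout's reward is compared with the mean of its group. The sketch below uses the textbook mean/std normalization; the hosted trainer's exact formula is not stated in this spec and may differ, and `group_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout in the group ties.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If all rollouts in a group earn the same reward, every advantage is zero and that group contributes no gradient. If the differences within a group come mostly from single-window noise, the advantages rank luck rather than skill, which is the reason for averaging each reward over basket × windows first.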