hack-x: Environment & Training Spec

A self-improving crypto-trading coding agent trained with RL on Prime Lab. tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real gradient loop: rollout → backtest verifier → GRPO → LoRA adapter → deploy → live demo.


1. The one-line thesis

Today tradewatch's trading discipline lives as prompt text (MEMORY.md). We make it live as adapter weights by rewarding disciplined, profitable behavior on out-of-sample replayed market history. The proof: the trained adapter stays disciplined with MEMORY.md removed from the prompt.

1b. tradewatch as seed (pull all of it, but know its limits)

Mined both DB (4,875 events) and journal (708 rows). What's there:

  • 618 analysis_decision rows with 18 market features each (price_usd, liquidity_usd, volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours, fdv, market_cap); this IS the agent's feature schema + seed/format examples.
  • Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
  • 9 confirmed pool addresses → guaranteed-good fetch universe seed.
  • Trade outcomes (exits/PnL/skips) → rubric calibration.
  • NOT present: any stored OHLCV time series. ohlcv_1h was fetched live, never persisted. Every market row is a point-in-time snapshot, not a replayable tape.
  • Therefore: tradewatch = seed of decisions/features/pools (the brains + universe); GeckoTerminal fetch = the replay price tape. Both required; neither substitutes.

1c. The three loops are independent (demo ≠ batch_size)

  • Training loop: batch_size=128 parallel replay-rollouts/step, a GRPO requirement only.
  • Demo loop: 1 (or few) LIVE paper session in tradewatch, trained adapter via base:adapter_id. As small as we want. 128 was never the demo.
  • Recursive loop (stretch): reflect → checkpoint_id warm-start → retrain.

2. Data (replay substrate): crypto

  • Source: GeckoTerminal free OHLCV, /networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L (already used in scanner/base_scanner.py:843). No key.
  • Universe: start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus, GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
  • Bars & DEPTH CEILING (probed live 2026-05-30): free OHLCV caps at ~1000 bars/pool, before_timestamp pagination does NOT walk back further. So:
    • hourly (tf=hour): ~1000 bars ≈ 41 days
    • daily (tf=day): ~181 bars ≈ 6 months
    • Implication: shallow history → depth can't supply many decorrelated time windows. Breadth carries GRPO variance: tasks = symbol × window, universe = 30–50 pools.
  • Storage: data/ohlcv/<symbol>.parquet (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
  • Splits (anti-leakage, structural):
    • per-symbol time split: train window | OOS window (chronological, no overlap)
    • held-out OOS symbols (never seen in training) for generalization check
    • with shallow per-symbol history, the symbol-holdout split does the heavy lifting
  • Crypto-hardening (because reward is gameable here):
    • slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
    • min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
    • survivorship caveat noted; mitigate by including tokens that later died
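
The split rules above could be sketched roughly as follows. This is a minimal illustration, not the project's code: `Split`, `make_splits`, the 70% train cut and the choice of held-out symbols are all hypothetical, and windows are expressed as half-open bar-index ranges per symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    train: dict[str, tuple[int, int]]        # symbol -> [start, end) bar indices
    oos: dict[str, tuple[int, int]]          # same symbols, strictly later window
    oos_symbols: dict[str, tuple[int, int]]  # symbols never shown during training


def make_splits(bar_counts: dict[str, int], holdout: set[str],
                train_pct: int = 70) -> Split:
    """Chronological per-symbol split plus a symbol holdout (train_pct is assumed)."""
    train, oos, oos_symbols = {}, {}, {}
    for symbol, n_bars in sorted(bar_counts.items()):
        if symbol in holdout:
            # Held-out symbols contribute their whole history to evaluation only.
            oos_symbols[symbol] = (0, n_bars)
            continue
        cut = n_bars * train_pct // 100  # integer math keeps the cut reproducible
        train[symbol] = (0, cut)         # earlier bars
        oos[symbol] = (cut, n_bars)      # later bars, no overlap with train
    return Split(train, oos, oos_symbols)
```

Because the OOS window starts exactly where the train window ends, no bar can appear in both, and the holdout set never enters `train` at all.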

3. Action surface: support BOTH (decide after baseline rollouts)

The backtester accepts a uniform target_positions dict; two agent framings feed it:

  • A. Strategy-code: agent writes strategy(features_window) -> target_positions, executed causally over the OOS window. Strongest "coding agent" framing.
  • B. Structured decisions: agent emits per-bar JSON (verdict/entry/target/stop), reusing tradewatch's schema. Closer to existing code.

Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.

4. Backtester (port agents/paper_ledger.py)

  • ReplayFeed: serves bars only up to t, so lookahead is impossible by construction (the agent/strategy literally never receives bars[t+1:]).
  • Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market, track equity, cash, exposure, per-trade R:R, drawdown.
  • Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
  • Benchmarks computed on same window: buy-and-hold, MA-crossover, z-score mean-reversion.
  • Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
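
A minimal sketch of the causal feed and engine loop described above, under stated assumptions. It is not a port of agents/paper_ledger.py: the `Bar` fields, `fill_price`, the 30 bps fee, the linear impact term and the liquidity threshold are all hypothetical placeholders, the strategy is reduced to a single-asset target weight, and the bar list is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Bar:
    ts: int
    close: float
    liquidity_usd: float


class ReplayFeed:
    """Hands out bars[0..t] only; the future slice is never exposed."""

    def __init__(self, bars: list[Bar]) -> None:
        self._bars = list(bars)
        self.t = 0

    def history(self) -> list[Bar]:
        return self._bars[: self.t + 1]  # a copy that ends at the current bar

    def advance(self) -> bool:
        if self.t + 1 >= len(self._bars):
            return False
        self.t += 1
        return True


def fill_price(mid: float, notional_usd: float, liquidity_usd: float,
               fee_bps: float = 30.0, impact_coeff: float = 1.0) -> float:
    """Fixed bps plus a size-vs-liquidity impact term; buys pay up, sells give up."""
    cost = fee_bps / 1e4 + impact_coeff * abs(notional_usd) / liquidity_usd
    return mid * (1 + cost) if notional_usd > 0 else mid * (1 - cost)


def run(bars: list[Bar], strategy: Callable[[list[Bar]], float],
        cash: float = 10_000.0, min_liquidity_usd: float = 50_000.0) -> list[float]:
    """Step bar by bar; `strategy` returns the target fraction of equity to hold."""
    feed, units, curve = ReplayFeed(bars), 0.0, []
    while True:
        hist = feed.history()
        bar = hist[-1]
        equity = cash + units * bar.close
        delta_usd = strategy(hist) * equity - units * bar.close
        # Liquidity gate: bars below the threshold cannot be traded.
        if delta_usd != 0.0 and bar.liquidity_usd >= min_liquidity_usd:
            d_units = delta_usd / bar.close
            cash -= d_units * fill_price(bar.close, delta_usd, bar.liquidity_usd)
            units += d_units
        curve.append(cash + units * bar.close)  # mark-to-market after the fill
        if not feed.advance():
            return curve


def max_drawdown(curve: list[float]) -> float:
    """Largest peak-to-trough loss as a fraction of the running peak."""
    peak, worst = curve[0], 0.0
    for equity in curve:
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

The causality invariant falls out of `history()`: a strategy can only ever inspect a list whose last element is the current bar, so a unit test can simply record the length of each slice it receives.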

5. verifiers environment (v0 stable API)

Class: StatefulToolEnv (per-rollout state + stateful tools).

def load_environment(
    symbols: list[str] | str = "train",   # universe or split name
    split: str = "train",                  # train | oos | oos_symbols
    objective: str = "sharpe",             # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                     # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...
  • setup_state(state): pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.

  • Tools (stateful, NOT a per-rollout sandbox; global/in-proc exec to save credits):

    • get_features(lookback) → indicators up to current bar (RSI, MACD, MAs, z-score, BB, vol, buy/sell ratio, liquidity); reuses scanner feature logic
    • run_backtest(strategy_or_decisions) → in-sample metrics (the agent's feedback loop)
    • read_metrics() → current equity/DD/Sharpe
  • @vf.stop: agent submits final strategy, or max_turns.

  • Reward (Rubric, weighted), computed on OOS, aggregated over n_windows × basket:

    | reward fn | weight | source |
    |---|---|---|
    | r_sharpe (normalized OOS Sharpe) | 0.40 | spine |
    | r_beats_benchmark (vs buy-and-hold) | 0.20 | anti-overfit |
    | r_drawdown (penalty for deep DD) | 0.15 | MEMORY: protect capital |
    | r_rr_discipline (R:R≥2 compliance) | 0.10 | trading_agent._validate |
    | r_exposure (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
    | r_cost (turnover/fee penalty) | 0.05 | realism |

    HARD GATES → reward 0: invalid code, lookahead, NaN equity, illiquid trades

    Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake discipline while losing money.
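
The weighting can be sanity-checked with a small sketch. The weights are the ones listed above; everything else is an assumption for illustration: `combine_reward` is a hypothetical helper, each component is assumed to be pre-normalized to [0, 1], and this is plain Python rather than the verifiers Rubric API.

```python
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}


def combine_reward(components: dict[str, float], gate_failures: list[str]) -> float:
    """Weighted sum of component scores; any hard-gate failure zeroes the reward."""
    if gate_failures:  # e.g. ["lookahead"], ["invalid_code"], ["nan_equity"]
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, components[name]))  # clamp to the assumed range
        total += weight * score
    return total


# Perfect discipline while losing money vs. profitable but undisciplined.
disciplined_loser = dict.fromkeys(WEIGHTS, 1.0) | {"r_sharpe": 0.0, "r_beats_benchmark": 0.0}
sloppy_winner = dict.fromkeys(WEIGHTS, 0.0) | {"r_sharpe": 1.0, "r_beats_benchmark": 1.0}
```

With these weights the disciplined loser can reach at most about 0.40 while the sloppy winner reaches about 0.60, which is the sense in which the outcome terms dominate.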

6. Training (Prime Hosted, FREE for Laguna)

model = "poolside/Laguna-XS.2"   # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50                    # small validation; scale after curve moves
batch_size = 128                  # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8          # GRPO group; needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false           # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true            # base-vs-trained comparison built in

Pipeline: prime eval run (sanity, baseline 10–80%) → small Qwen RL → Laguna RL → download LoRA → prime deployments create → base:adapter_id.
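
To make the role of rollouts_per_example concrete, here is a sketch of one commonly used form of the GRPO group-relative advantage, combined with averaging the reward over several windows. It is an assumption for illustration only: the hosted trainer may normalize differently, and `rollout_reward` and `group_advantages` are hypothetical names.

```python
from statistics import mean, pstdev


def rollout_reward(window_rewards: list[float]) -> float:
    """Average the per-window rewards so one noisy window does not set the score."""
    return mean(window_rewards)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each rollout relative to its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group receives the same reward, every advantage is zero and that group contributes no gradient. This is one reason the baseline eval is expected to land strictly between 0% and 100% before any RL run starts.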

7. Closing the loop

  • Deploy adapter → point tradewatch HybrieClient at api.pinference.ai → live paper demo.
  • Ablation (money shot): trained adapter, MEMORY.md stripped from prompt → discipline holds.
  • Recursive (stretch): checkpoint_id warm-start → reflect on failures → adjust rubric weights / add tasks / [buffer] difficulty filtering → retrain.

8. Non-negotiable risks

  1. Leakage → structural causal feed (not a detector).
  2. Reward variance → aggregate over basket × windows (or GRPO has no signal).
  3. Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
  4. Crypto noise → slippage/fee/liquidity model on every fill.