trade-pool / SPEC.md
tosi-n's picture
Upload folder using huggingface_hub
ce6b50a verified
|
Raw
History Blame Contribute Delete
7.97 kB
# hack-x β€” Environment & Training Spec
A self-improving crypto-trading **coding agent** trained with RL on Prime Lab.
tradewatch's soft reflection loop (events β†’ MEMORY.md prompt rules) becomes a real
gradient loop: rollout β†’ backtest verifier β†’ GRPO β†’ **LoRA adapter** β†’ deploy β†’ live demo.
---
## 1. The one-line thesis
> Today tradewatch's trading discipline lives as *prompt text* (MEMORY.md). We make it
> live as *adapter weights* by rewarding disciplined, profitable behavior on
> out-of-sample replayed market history. The proof: the trained adapter stays
> disciplined with MEMORY.md **removed from the prompt**.
## 1b. tradewatch as seed (pull all of it β€” but know its limits)
Mined both DB (4,875 events) and journal (708 rows). What's there:
- **618 `analysis_decision` rows** with 18 market features each (price_usd, liquidity_usd,
volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours,
fdv, market_cap) β€” this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- **9 confirmed pool addresses** β†’ guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips) β†’ rubric calibration.
- **NOT present:** any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted.
Every market row is a point-in-time snapshot, not a replayable tape.
- **Therefore:** tradewatch = seed of *decisions/features/pools* (the brains + universe);
GeckoTerminal fetch = the *replay price tape*. Both required; neither substitutes.
## 1c. The three loops are independent (demo β‰  batch_size)
- **Training loop:** batch_size=128 parallel replay-rollouts/step β€” a GRPO requirement only.
- **Demo loop:** 1 (or few) LIVE paper session in tradewatch, trained adapter via
base:adapter_id. As small as we want. 128 was never the demo.
- **Recursive loop (stretch):** reflect β†’ checkpoint_id warm-start β†’ retrain.
## 2. Data (replay substrate) β€” crypto
- **Source:** GeckoTerminal free OHLCV β€” `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L`
(already used in `scanner/base_scanner.py:843`). No key.
- **Universe:** start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus,
GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
- **Bars & DEPTH CEILING (probed live 2026-05-30):** free OHLCV caps at **~1000 bars/pool,
`before_timestamp` pagination does NOT walk back further.** So:
- hourly (`tf=hour`): ~1000 bars β‰ˆ **41 days**
- daily (`tf=day`): ~181 bars β‰ˆ **6 months**
- **Implication:** shallow history β†’ depth can't supply many decorrelated time windows.
**Breadth carries GRPO variance: tasks = symbol Γ— window, universe = 30–50 pools.**
- **Storage:** `data/ohlcv/<symbol>.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- **Splits (anti-leakage, structural):**
- per-symbol time split: train window | OOS window (chronological, no overlap)
- held-out OOS **symbols** (never seen in training) for generalization check
- with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- **Crypto-hardening (because reward is gameable here):**
- slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
- min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
- survivorship caveat noted; mitigate by including tokens that later died
## 3. Action surface β€” support BOTH (decide after baseline rollouts)
The backtester accepts a uniform `target_positions` dict; two agent framings feed it:
- **A. Strategy-code:** agent writes `strategy(features_window) -> target_positions`,
executed causally over the OOS window. Strongest "coding agent" framing.
- **B. Structured decisions:** agent emits per-bar JSON (verdict/entry/target/stop),
reusing tradewatch's schema. Closer to existing code.
Both reduce to the same backtest call β†’ same rubric. We prototype both, pick via eval.
## 4. Backtester (port `agents/paper_ledger.py`)
- `ReplayFeed`: serves bars **only up to t** β€” lookahead impossible by construction
(the agent/strategy literally never receives `bars[t+1:]`).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market,
track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: **buy-and-hold**, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed β†’ identical run).
## 5. verifiers environment (v0 stable API)
Class: **`StatefulToolEnv`** (per-rollout state + stateful tools).
```python
def load_environment(
symbols: list[str] | str = "train", # universe or split name
split: str = "train", # train | oos | oos_symbols
objective: str = "sharpe", # sharpe | return | min_drawdown
max_turns: int = 8,
n_windows: int = 4, # OOS windows averaged per reward
seed: int = 0,
) -> vf.Environment: ...
```
- `setup_state(state)`: pick task (symbol(s) Γ— window Γ— objective), build fresh ReplayFeed + ledger.
- **Tools** (stateful, NOT per-rollout sandbox β€” global/in-proc exec to save credits):
- `get_features(lookback)` β†’ indicators up to current bar (RSI, MACD, MAs, z-score,
BB, vol, buy/sell ratio, liquidity) β€” reuses scanner feature logic
- `run_backtest(strategy_or_decisions)` β†’ in-sample metrics (the agent's feedback loop)
- `read_metrics()` β†’ current equity/DD/Sharpe
- `@vf.stop`: agent submits final strategy, or max_turns.
- **Reward (Rubric, weighted) β€” computed on OOS, aggregated over n_windows Γ— basket:**
| reward fn | weight | source |
|---|---|---|
| `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
| `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
| `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
| `r_rr_discipline` (R:Rβ‰₯2 compliance) | 0.10 | trading_agent._validate |
| `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
| `r_cost` (turnover/fee penalty) | 0.05 | realism |
| **HARD GATES β†’ reward 0** | β€” | invalid code, lookahead, NaN equity, illiquid trades |
Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake
discipline while losing money.
## 6. Training (Prime Hosted, FREE for Laguna)
```toml
model = "poolside/Laguna-XS.2" # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50 # small validation; scale after curve moves
batch_size = 128 # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8 # GRPO group β€” needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true # base-vs-trained comparison built in
```
Pipeline: `prime eval run` (sanity, baseline 10–80%) β†’ small Qwen RL β†’ Laguna RL β†’
download LoRA β†’ `prime deployments create` β†’ `base:adapter_id`.
## 7. Closing the loop
- Deploy adapter β†’ point tradewatch `HybrieClient` at `api.pinference.ai` β†’ **live paper demo**.
- **Ablation (money shot):** trained adapter, MEMORY.md stripped from prompt β†’ discipline holds.
- **Recursive (stretch):** `checkpoint_id` warm-start β†’ reflect on failures β†’
adjust rubric weights / add tasks / `[buffer]` difficulty filtering β†’ retrain.
## 8. Non-negotiable risks
1. Leakage β†’ structural causal feed (not a detector).
2. Reward variance β†’ aggregate over basket Γ— windows (or GRPO has no signal).
3. Reward hacking β†’ OOS + must-beat-benchmark + hard lookahead gate.
4. Crypto noise β†’ slippage/fee/liquidity model on every fill.