	# hack-x — Environment & Training Spec

	A self-improving crypto-trading coding agent trained with RL on Prime Lab.
	tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real
	gradient loop: rollout → backtest verifier → GRPO → LoRA adapter → deploy → live demo.

	---

	## 1. The one-line thesis
	> Today tradewatch's trading discipline lives as prompt text (MEMORY.md). We make it
	> live as adapter weights by rewarding disciplined, profitable behavior on
	> out-of-sample replayed market history. The proof: the trained adapter stays
	> disciplined with MEMORY.md removed from the prompt.

	## 1b. tradewatch as seed (pull all of it — but know its limits)
	Mined both DB (4,875 events) and journal (708 rows). What's there:
	- 618 `analysis_decision` rows with 18 market features each (price_usd, liquidity_usd,
	volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m→24h, age_hours,
	fdv, market_cap) — this IS the agent's feature schema + seed/format examples.
	- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
	- 9 confirmed pool addresses → guaranteed-good fetch universe seed.
	- Trade outcomes (exits/PnL/skips) → rubric calibration.
	- NOT present: any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted.
	Every market row is a point-in-time snapshot, not a replayable tape.
	- Therefore: tradewatch = seed of decisions/features/pools (the brains + universe);
	GeckoTerminal fetch = the replay price tape. Both required; neither substitutes.

	## 1c. The three loops are independent (demo ≠ batch_size)
	- Training loop: batch_size=128 parallel replay-rollouts/step — a GRPO requirement only.
	- Demo loop: 1 (or few) LIVE paper session in tradewatch, trained adapter via
	base:adapter_id. As small as we want. 128 was never the demo.
	- Recursive loop (stretch): reflect → checkpoint_id warm-start → retrain.

	## 2. Data (replay substrate) — crypto
	- Source: GeckoTerminal free OHLCV — `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L`
	(already used in `scanner/base_scanner.py:843`). No key.
	- Universe: start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus,
	GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
	- Bars & DEPTH CEILING (probed live 2026-05-30): free OHLCV caps at **~1000 bars/pool,
	`before_timestamp` pagination does NOT walk back further.** So:
	  - hourly (`tf=hour`): ~1000 bars ≈ 41 days
	  - daily (`tf=day`): ~181 bars ≈ 6 months
	- Implication: shallow history → depth can't supply many decorrelated time windows.
	Breadth carries GRPO variance: tasks = symbol × window, universe = 30–50 pools.
	- Storage: `data/ohlcv/<symbol>.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
	- Splits (anti-leakage, structural):
	  - per-symbol time split: train window | OOS window (chronological, no overlap)
	  - held-out OOS symbols (never seen in training) for generalization check
	  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
	- Crypto-hardening (because reward is gameable here):
	  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
	  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
	  - survivorship caveat noted; mitigate by including tokens that later died
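
The two anti-leakage splits above can be sketched as follows. This is a minimal illustration, not the project's code: the names `Split`, `time_split`, and `holdout_symbols`, the `train_frac` parameter, and the choice to hold out symbols by sorted name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    train: list[int]  # bar timestamps available for training
    oos: list[int]    # strictly later bar timestamps, never trained on


def time_split(timestamps: list[int], train_frac: float = 0.7) -> Split:
    """Chronological split: every train bar precedes every OOS bar."""
    ordered = sorted(timestamps)
    cut = int(len(ordered) * train_frac)
    return Split(train=ordered[:cut], oos=ordered[cut:])


def holdout_symbols(symbols: list[str], n_holdout: int) -> tuple[list[str], list[str]]:
    """Reserve n_holdout symbols that never appear in training.

    Hypothetical rule: hold out the last symbols in sorted order so the
    split is deterministic. Any fixed, seed-driven rule would work as well.
    """
    ordered = sorted(symbols)
    if n_holdout <= 0:
        return ordered, []
    return ordered[:-n_holdout], ordered[-n_holdout:]
```

Because the cut is positional on sorted timestamps, the train and OOS windows cannot overlap, which is the structural property the spec asks for.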

	## 3. Action surface — support BOTH (decide after baseline rollouts)
	The backtester accepts a uniform `target_positions` dict; two agent framings feed it:
	- A. Strategy-code: agent writes `strategy(features_window) -> target_positions`,
	executed causally over the OOS window. Strongest "coding agent" framing.
	- B. Structured decisions: agent emits per-bar JSON (verdict/entry/target/stop),
	reusing tradewatch's schema. Closer to existing code.
	Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.

	## 4. Backtester (port `agents/paper_ledger.py`)
	- `ReplayFeed`: serves bars only up to t — lookahead impossible by construction
	(the agent/strategy literally never receives `bars[t+1:]`).
	- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market,
	track equity, cash, exposure, per-trade R:R, drawdown.
	- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
	- Benchmarks computed on same window: buy-and-hold, MA-crossover, z-score mean-reversion.
	- Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
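
A minimal sketch of the causal feed described above, assuming bars are stored with the `(ts, o, h, l, c, v)` columns from section 2. The `Bar` type and the `window`/`advance`/`done` method names are hypothetical; the only point is that the caller gets a slice ending at the current bar and nothing later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    ts: int
    o: float
    h: float
    l: float
    c: float
    v: float


class ReplayFeed:
    """Serves history up to the current bar only."""

    def __init__(self, bars: list[Bar]) -> None:
        self._bars = tuple(bars)  # immutable copy of the full tape
        self._t = 0               # index of the current bar

    @property
    def done(self) -> bool:
        """True once the current bar is the last bar on the tape."""
        return self._t >= len(self._bars) - 1

    def window(self, lookback: int) -> tuple[Bar, ...]:
        """Return at most `lookback` bars ending at the current bar (inclusive)."""
        start = max(0, self._t + 1 - lookback)
        return self._bars[start : self._t + 1]

    def advance(self) -> None:
        """Move to the next bar; refuse to step past the end of the tape."""
        if self.done:
            raise StopIteration("end of replay tape")
        self._t += 1
```

In this sketch the strategy should be handed only the tuple returned by `window`, never the feed object itself, so there is no reference through which later bars could be reached.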

	## 5. verifiers environment (v0 stable API)
	Class: `StatefulToolEnv` (per-rollout state + stateful tools).

	```python
	def load_environment(
	symbols: list[str] | str = "train", # universe or split name
	split: str = "train", # train | oos | oos_symbols
	objective: str = "sharpe", # sharpe \| return \| min_drawdown
	max_turns: int = 8,
	n_windows: int = 4, # OOS windows averaged per reward
	seed: int = 0,
	) -> vf.Environment: ...
	```

	- `setup_state(state)`: pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.
	- Tools (stateful, NOT per-rollout sandbox — global/in-proc exec to save credits):
	  - `get_features(lookback)` → indicators up to current bar (RSI, MACD, MAs, z-score,
	    BB, vol, buy/sell ratio, liquidity), reusing the scanner feature logic
	  - `run_backtest(strategy_or_decisions)` → in-sample metrics (the agent's feedback loop)
	  - `read_metrics()` → current equity/DD/Sharpe
	- `@vf.stop`: agent submits final strategy, or max_turns.
	- Reward (Rubric, weighted) — computed on OOS, aggregated over n_windows × basket:

	| reward fn | weight | source |
	|---|---|---|
	| `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
	| `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
	| `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
	| `r_rr_discipline` (R:R≥2 compliance) | 0.10 | trading_agent._validate |
	| `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
	| `r_cost` (turnover/fee penalty) | 0.05 | realism |
	| HARD GATES → reward 0 | n/a | invalid code, lookahead, NaN equity, illiquid trades |

	Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake
	discipline while losing money.
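
The weighted combination and the hard gates can be sketched in plain Python. The weights are taken from the table; everything else is an assumption for illustration: the `combine` function, the convention that each term is already normalized to [0, 1] with 1 meaning "best" (so penalty terms score 1 when there is nothing to penalize), and the representation of gate failures as a list of strings.

```python
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}


def combine(scores: dict[str, float], gate_violations: list[str]) -> float:
    """Weighted sum of per-term scores; any hard-gate violation zeroes the reward."""
    if gate_violations:
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        clipped = min(1.0, max(0.0, scores[name]))  # assumed [0, 1] scale
        total += weight * clipped
    return total


# Perfect discipline with no outcome vs. perfect outcome with no discipline.
discipline_only = {name: 1.0 for name in WEIGHTS} | {"r_sharpe": 0.0, "r_beats_benchmark": 0.0}
outcome_only = {name: 0.0 for name in WEIGHTS} | {"r_sharpe": 1.0, "r_beats_benchmark": 1.0}
```

Under these assumptions `combine(discipline_only, [])` is about 0.40 and `combine(outcome_only, [])` is about 0.60, so a rollout that only looks disciplined cannot outscore one that actually performs out of sample.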

	## 6. Training (Prime Hosted, FREE for Laguna)
	```toml
	model = "poolside/Laguna-XS.2" # validate first on Qwen/Qwen3-4B-Instruct-2507
	max_steps = 50 # small validation; scale after curve moves
	batch_size = 128 # = parallel rollouts (the "128 sessions")
	rollouts_per_example = 8 # GRPO group — needs decorrelated windows
	learning_rate = 1e-4
	lora_alpha = 16
	[sampling]
	max_tokens = 512
	enable_thinking = false # docs: start non-reasoning for agentic tasks
	[[env]]
	id = "<user>/stock-strategy-env"
	[eval]
	eval_base_model = true # base-vs-trained comparison built in
	```
	Pipeline: `prime eval run` (sanity, baseline 10–80%) → small Qwen RL → Laguna RL →
	download LoRA → `prime deployments create` → `base:adapter_id`.
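
Why the GRPO group needs decorrelated rewards can be seen from the standard group-relative advantage, where each rollout's reward is centered and scaled by its own group's statistics. The sketch below uses that textbook form; whether the hosted trainer normalizes in exactly this way is an assumption, and `group_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 8 rollouts (rollouts_per_example = 8) that all score the same
# produces all-zero advantages, so that example contributes no gradient.
flat_group = group_advantages([0.5] * 8)

# A group with spread produces positive and negative advantages.
mixed_group = group_advantages([0.1, 0.9, 0.4, 0.6, 0.2, 0.8, 0.5, 0.5])
```

If every rollout in a group lands on the same reward, the group carries no learning signal, which is the reason the spec aggregates rewards over a basket and several windows.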

	## 7. Closing the loop
	- Deploy adapter → point tradewatch `HybrieClient` at `api.pinference.ai` → live paper demo.
	- Ablation (money shot): trained adapter, MEMORY.md stripped from prompt → discipline holds.
	- Recursive (stretch): `checkpoint_id` warm-start → reflect on failures →
	adjust rubric weights / add tasks / `[buffer]` difficulty filtering → retrain.

	## 8. Non-negotiable risks
	1. Leakage → structural causal feed (not a detector).
	2. Reward variance → aggregate over basket × windows (or GRPO has no signal).
	3. Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
	4. Crypto noise → slippage/fee/liquidity model on every fill.
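
For risk 4, the fill model from section 2 (fixed bps plus size-vs-liquidity impact, with a per-bar tradability gate) could look like the sketch below. The linear impact form, the function names, and every default value (`fixed_bps`, `impact_coeff`, the two minimum thresholds) are placeholders chosen for the example, not calibrated numbers from the project.

```python
def fill_price(
    mid: float,
    side: int,
    notional_usd: float,
    liquidity_usd: float,
    fixed_bps: float = 30.0,
    impact_coeff: float = 1.0,
) -> float:
    """Execution price after costs; side is +1 for a buy, -1 for a sell.

    Cost fraction = fixed_bps / 10_000 + impact_coeff * notional / liquidity,
    applied against the trader (buys fill higher, sells fill lower).
    """
    if side not in (1, -1):
        raise ValueError("side must be +1 or -1")
    if liquidity_usd <= 0:
        raise ValueError("liquidity must be positive")
    cost_frac = fixed_bps / 10_000 + impact_coeff * notional_usd / liquidity_usd
    return mid * (1 + side * cost_frac)


def tradable(
    liquidity_usd: float,
    volume_usd: float,
    min_liquidity_usd: float = 50_000.0,
    min_volume_usd: float = 1_000.0,
) -> bool:
    """Per-bar gate: illiquid or dead bars cannot be traded at all."""
    return liquidity_usd >= min_liquidity_usd and volume_usd >= min_volume_usd
```

With these placeholder defaults, a $1,000 buy into a $100,000 pool at a mid of 1.0 fills near 1.013 (30 bps fixed plus 1% impact), so the cost grows with order size relative to pool depth.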