# hack-x: Environment & Training Spec
A self-improving crypto-trading **coding agent** trained with RL on Prime Lab.
tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real
gradient loop: rollout → backtest verifier → GRPO → **LoRA adapter** → deploy → live demo.
---
## 1. The one-line thesis
> Today tradewatch's trading discipline lives as *prompt text* (MEMORY.md). We make it
> live as *adapter weights* by rewarding disciplined, profitable behavior on
> out-of-sample replayed market history. The proof: the trained adapter stays
> disciplined with MEMORY.md **removed from the prompt**.
## 1b. tradewatch as seed (pull all of it, but know its limits)
Mined both DB (4,875 events) and journal (708 rows). What's there:
- **618 `analysis_decision` rows** with 18 market features each (price_usd, liquidity_usd,
  volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m–24h, age_hours,
  fdv, market_cap): this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- **9 confirmed pool addresses** → guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips) → rubric calibration.
- **NOT present:** any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted.
Every market row is a point-in-time snapshot, not a replayable tape.
- **Therefore:** tradewatch = seed of *decisions/features/pools* (the brains + universe);
GeckoTerminal fetch = the *replay price tape*. Both required; neither substitutes.
## 1c. The three loops are independent (demo ≠ batch_size)
- **Training loop:** batch_size=128 parallel replay-rollouts/step; a GRPO requirement only.
- **Demo loop:** 1 (or few) LIVE paper session in tradewatch, trained adapter via
  base:adapter_id. As small as we want. 128 was never the demo.
- **Recursive loop (stretch):** reflect → checkpoint_id warm-start → retrain.
## 2. Data (replay substrate): crypto
- **Source:** GeckoTerminal free OHLCV: `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L`
  (already used in `scanner/base_scanner.py:843`). No key.
- **Universe:** start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus,
  GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30–50 Base pools.
- **Bars & DEPTH CEILING (probed live 2026-05-30):** free OHLCV caps at **~1000 bars/pool,
  `before_timestamp` pagination does NOT walk back further.** So:
  - hourly (`tf=hour`): ~1000 bars ≈ **41 days**
  - daily (`tf=day`): ~181 bars ≈ **6 months**
- **Implication:** shallow history → depth can't supply many decorrelated time windows.
  **Breadth carries GRPO variance: tasks = symbol × window, universe = 30–50 pools.**
- **Storage:** `data/ohlcv/<symbol>.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- **Splits (anti-leakage, structural):**
  - per-symbol time split: train window | OOS window (chronological, no overlap)
  - held-out OOS **symbols** (never seen in training) for generalization check
  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- **Crypto-hardening (because reward is gameable here):**
  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
  - survivorship caveat noted; mitigate by including tokens that later died
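
The two split rules above can be sketched as follows. This is a minimal illustration, not the project's loader: `Splits`, `make_splits`, and `train_frac` are hypothetical names, the 0.75 ratio is an assumption (the spec does not fix one), and the input is assumed to be a dict mapping each symbol to its bar timestamps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Splits:
    train: dict[str, list[int]]        # symbol -> earlier timestamps
    oos: dict[str, list[int]]          # symbol -> later timestamps, no overlap
    oos_symbols: dict[str, list[int]]  # held-out symbols, never used in training


def make_splits(
    ts_by_symbol: dict[str, list[int]],
    holdout_symbols: set[str],
    train_frac: float = 0.75,  # assumed ratio; not specified in this document
) -> Splits:
    train, oos, held = {}, {}, {}
    for symbol, timestamps in ts_by_symbol.items():
        ordered = sorted(timestamps)
        if symbol in holdout_symbols:
            # Whole history goes to the symbol-holdout set.
            held[symbol] = ordered
            continue
        # Chronological cut: everything before `cut` trains, the rest is OOS.
        cut = int(len(ordered) * train_frac)
        train[symbol] = ordered[:cut]
        oos[symbol] = ordered[cut:]
    return Splits(train, oos, held)


bars = {"NOCK": list(range(8)), "DEGEN": list(range(8)), "PITCH": list(range(8))}
splits = make_splits(bars, holdout_symbols={"PITCH"})
```

Because the cut is positional on sorted timestamps, every training bar for a symbol precedes every OOS bar for that symbol, and a held-out symbol never appears in either per-symbol window.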
## 3. Action surface: support BOTH (decide after baseline rollouts)
The backtester accepts a uniform `target_positions` dict; two agent framings feed it:
- **A. Strategy-code:** agent writes `strategy(features_window) -> target_positions`,
executed causally over the OOS window. Strongest "coding agent" framing.
- **B. Structured decisions:** agent emits per-bar JSON (verdict/entry/target/stop),
reusing tradewatch's schema. Closer to existing code.
Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.
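
One way the two framings could be reduced to the same `target_positions` dict is sketched below. Everything here is an assumption for illustration: the adapter names `from_strategy` and `from_decision`, the symbol-to-weight convention, the `"BUY"` verdict value, and the fixed 0.25 weight are hypothetical and are not taken from tradewatch's schema.

```python
import json
from typing import Callable

# Assumed convention: symbol -> target portfolio weight.
TargetPositions = dict[str, float]


def from_strategy(
    strategy: Callable[[list[dict]], TargetPositions],
    features_window: list[dict],
) -> TargetPositions:
    """Framing A: call agent-written code on the causal feature window."""
    return {symbol: float(weight) for symbol, weight in strategy(features_window).items()}


def from_decision(decision_json: str, symbol: str, weight: float = 0.25) -> TargetPositions:
    """Framing B: map one per-bar JSON decision to a target weight."""
    decision = json.loads(decision_json)
    # Assumed verdict vocabulary: anything other than BUY means flat.
    is_buy = str(decision.get("verdict", "")).upper() == "BUY"
    return {symbol: weight if is_buy else 0.0}


a = from_strategy(lambda window: {"DEGEN": 0.1}, features_window=[{"price_usd": 1.0}])
b = from_decision('{"verdict": "BUY", "entry": 1.0, "target": 1.3, "stop": 0.9}', "DEGEN")
c = from_decision('{"verdict": "SKIP"}', "DEGEN")
```

Both adapters return the same type, so the backtester and the rubric do not need to know which framing produced the positions.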
## 4. Backtester (port `agents/paper_ledger.py`)
- `ReplayFeed`: serves bars **only up to t** → lookahead impossible by construction
  (the agent/strategy literally never receives `bars[t+1:]`).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market,
track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: **buy-and-hold**, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
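
A minimal sketch of the causal feed and the fill-cost model described above, under stated assumptions: the method names `visible` and `advance`, the `fill_price` helper, the 30 bps fee, and the linear impact term are illustrative choices, not the ported `paper_ledger.py` logic.

```python
class ReplayFeed:
    """Serves bars up to and including the current index t, never beyond."""

    def __init__(self, bars: list[dict]):
        self._bars = list(bars)
        self._t = 0

    def visible(self) -> list[dict]:
        # Slice copy: the returned list never contains bars[t+1:].
        return self._bars[: self._t + 1]

    def advance(self) -> bool:
        """Move to the next bar; return False once the tape is exhausted."""
        if self._t + 1 >= len(self._bars):
            return False
        self._t += 1
        return True


def fill_price(mid: float, side: int, notional_usd: float, liquidity_usd: float,
               fee_bps: float = 30.0, impact_coeff: float = 1.0) -> float:
    """Fixed-bps fee plus linear size-vs-liquidity impact. side: +1 buy, -1 sell."""
    cost = fee_bps / 10_000 + impact_coeff * notional_usd / liquidity_usd
    return mid * (1 + side * cost)


feed = ReplayFeed([{"c": 1.0}, {"c": 1.1}, {"c": 0.9}])
seen = [len(feed.visible())]
while feed.advance():
    seen.append(len(feed.visible()))

buy = fill_price(mid=100.0, side=+1, notional_usd=1_000, liquidity_usd=100_000)
sell = fill_price(mid=100.0, side=-1, notional_usd=1_000, liquidity_usd=100_000)
```

The causality invariant then becomes a plain unit test: at step t the strategy's input has exactly t+1 bars. Buys fill above mid and sells below it, so every round trip pays the modeled cost.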
## 5. verifiers environment (v0 stable API)
Class: **`StatefulToolEnv`** (per-rollout state + stateful tools).
```python
import verifiers as vf


def load_environment(
    symbols: list[str] | str = "train",   # universe or split name
    split: str = "train",                 # train | oos | oos_symbols
    objective: str = "sharpe",            # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                   # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...
```
- `setup_state(state)`: pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.
- **Tools** (stateful, NOT per-rollout sandbox; global/in-proc exec to save credits):
  - `get_features(lookback)` → indicators up to current bar (RSI, MACD, MAs, z-score,
    BB, vol, buy/sell ratio, liquidity); reuses scanner feature logic
  - `run_backtest(strategy_or_decisions)` → in-sample metrics (the agent's feedback loop)
  - `read_metrics()` → current equity/DD/Sharpe
- `@vf.stop`: agent submits final strategy, or max_turns.
- **Reward (Rubric, weighted), computed on OOS, aggregated over n_windows × basket:**
| reward fn | weight | source |
|---|---|---|
| `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
| `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
| `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
| `r_rr_discipline` (R:R ≥ 2 compliance) | 0.10 | trading_agent._validate |
| `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
| `r_cost` (turnover/fee penalty) | 0.05 | realism |
| **HARD GATES → reward 0** | n/a | invalid code, lookahead, NaN equity, illiquid trades |
Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake
discipline while losing money.
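
The weighting argument can be checked with a small sketch. It uses plain Python rather than the verifiers `Rubric` API, and it assumes each component is already normalized to [0, 1]; `total_reward`, `OUTCOME_TERMS`, and the clamping are illustrative choices.

```python
# Weights copied from the rubric table above.
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}
OUTCOME_TERMS = ("r_sharpe", "r_beats_benchmark")


def total_reward(components: dict[str, float], gate_failed: bool) -> float:
    """Weighted sum of clamped components; any hard-gate failure zeroes it."""
    if gate_failed:
        return 0.0
    return sum(
        weight * min(1.0, max(0.0, components[name]))
        for name, weight in WEIGHTS.items()
    )


# Perfect discipline scores but no outcome, and the reverse.
disciplined_but_losing = {n: (0.0 if n in OUTCOME_TERMS else 1.0) for n in WEIGHTS}
profitable_but_sloppy = {n: (1.0 if n in OUTCOME_TERMS else 0.0) for n in WEIGHTS}

low = total_reward(disciplined_but_losing, gate_failed=False)
high = total_reward(profitable_but_sloppy, gate_failed=False)
gated = total_reward(profitable_but_sloppy, gate_failed=True)
```

With these weights the outcome terms sum to 0.60 and the discipline terms to 0.40, so a rollout that is perfectly disciplined but unprofitable tops out below one that is profitable but undisciplined.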
## 6. Training (Prime Hosted, FREE for Laguna)
```toml
model = "poolside/Laguna-XS.2" # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50 # small validation; scale after curve moves
batch_size = 128 # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8 # GRPO group; needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true # base-vs-trained comparison built in
```
Pipeline: `prime eval run` (sanity, baseline 10–80%) → small Qwen RL → Laguna RL →
download LoRA → `prime deployments create` → `base:adapter_id`.
## 7. Closing the loop
- Deploy adapter → point tradewatch `HybrieClient` at `api.pinference.ai` → **live paper demo**.
- **Ablation (money shot):** trained adapter, MEMORY.md stripped from prompt → discipline holds.
- **Recursive (stretch):** `checkpoint_id` warm-start → reflect on failures →
  adjust rubric weights / add tasks / `[buffer]` difficulty filtering → retrain.
## 8. Non-negotiable risks
1. Leakage → structural causal feed (not a detector).
2. Reward variance → aggregate over basket × windows (or GRPO has no signal).
3. Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
4. Crypto noise → slippage/fee/liquidity model on every fill.
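
Risk 2 can be illustrated with the group-relative normalization commonly described for GRPO. This is a sketch of that idea, not the Prime trainer's actual implementation; `group_advantages` and `eps` are hypothetical names.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts of one example that all earn the same reward.
flat = group_advantages([0.3] * 8)

# A group whose rollouts landed on different outcomes.
spread = group_advantages([0.0, 0.2, 0.4, 0.6])
```

When every rollout in a group scores the same, all advantages are zero and the step contributes no gradient. Spreading reward across basket and windows is meant to keep groups from collapsing this way.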