# hack-x: Environment & Training Spec

A self-improving crypto-trading **coding agent** trained with RL on Prime Lab.
tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real
gradient loop: rollout → backtest verifier → GRPO → **LoRA adapter** → deploy → live demo.

---
## 1. The one-line thesis

> Today tradewatch's trading discipline lives as *prompt text* (MEMORY.md). We make it
> live as *adapter weights* by rewarding disciplined, profitable behavior on
> out-of-sample replayed market history. The proof: the trained adapter stays
> disciplined with MEMORY.md **removed from the prompt**.
## 1b. tradewatch as seed (pull all of it, but know its limits)

We mined both the DB (4,875 events) and the journal (708 rows). What's there:

- **618 `analysis_decision` rows** with 18 market features each (price_usd, liquidity_usd,
  volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m-24h, age_hours,
  fdv, market_cap): this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- **9 confirmed pool addresses** → guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips) → rubric calibration.
- **NOT present:** any stored OHLCV time series. `ohlcv_1h` was fetched live, never persisted.
  Every market row is a point-in-time snapshot, not a replayable tape.
- **Therefore:** tradewatch = seed of *decisions/features/pools* (the brains + universe);
  GeckoTerminal fetch = the *replay price tape*. Both are required; neither substitutes for the other.
## 1c. The three loops are independent (demo ≠ batch_size)

- **Training loop:** batch_size=128 parallel replay-rollouts/step, a GRPO requirement only.
- **Demo loop:** 1 (or a few) LIVE paper sessions in tradewatch, trained adapter via
  base:adapter_id. As small as we want. 128 was never the demo.
- **Recursive loop (stretch):** reflect → checkpoint_id warm-start → retrain.
## 2. Data (replay substrate): crypto

- **Source:** GeckoTerminal free OHLCV: `/networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L`
  (already used in `scanner/base_scanner.py:843`). No key.
- **Universe:** start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus,
  GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30-50 Base pools.
- **Bars & DEPTH CEILING (probed live 2026-05-30):** free OHLCV caps at **~1000 bars/pool;
  `before_timestamp` pagination does NOT walk back further.** So:
  - hourly (`tf=hour`): ~1000 bars ≈ **41 days**
  - daily (`tf=day`): ~181 bars ≈ **6 months**
- **Implication:** shallow history → depth can't supply many decorrelated time windows.
  **Breadth carries GRPO variance: tasks = symbol × window, universe = 30-50 pools.**
- **Storage:** `data/ohlcv/<symbol>.parquet` (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- **Splits (anti-leakage, structural):**
  - per-symbol time split: train window | OOS window (chronological, no overlap)
  - held-out OOS **symbols** (never seen in training) for generalization check
  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- **Crypto-hardening (because reward is gameable here):**
  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
  - survivorship caveat noted; mitigate by including tokens that later died
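
The two structural splits above can be sketched as follows. This is a minimal illustration, not the project's loader: `Split`, `make_splits`, the 70% train share, and the bar counts in the example are all hypothetical. The point it shows is that the OOS range of a symbol starts exactly where its train range ends, and that held-out symbols never enter the train map at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    train: dict[str, tuple[int, int]]  # symbol -> [start, end) bar indices used for training
    oos: dict[str, tuple[int, int]]    # symbol -> [start, end) bar indices, strictly later
    oos_symbols: list[str]             # symbols never shown during training


def make_splits(bar_counts: dict[str, int], holdout: set[str], train_pct: int = 70) -> Split:
    """Chronological per-symbol split plus a symbol holdout (train_pct is an assumption)."""
    train: dict[str, tuple[int, int]] = {}
    oos: dict[str, tuple[int, int]] = {}
    for symbol, n_bars in sorted(bar_counts.items()):
        if symbol in holdout:
            continue  # held-out symbols contribute nothing to training
        cut = n_bars * train_pct // 100  # integer arithmetic keeps the cut deterministic
        train[symbol] = (0, cut)
        oos[symbol] = (cut, n_bars)
    return Split(train=train, oos=oos, oos_symbols=sorted(holdout & bar_counts.keys()))


# Example with made-up bar counts; DEGEN is held out entirely.
splits = make_splits({"NOCK": 1000, "VIRTUAL": 1000, "DEGEN": 800}, holdout={"DEGEN"})
print(splits.train["NOCK"], splits.oos["NOCK"], splits.oos_symbols)
```

Because the ranges are half-open and share a single cut index, overlap between train and OOS is impossible by construction rather than checked after the fact.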
## 3. Action surface: support BOTH (decide after baseline rollouts)

The backtester accepts a uniform `target_positions` dict; two agent framings feed it:

- **A. Strategy-code:** agent writes `strategy(features_window) -> target_positions`,
  executed causally over the OOS window. Strongest "coding agent" framing.
- **B. Structured decisions:** agent emits per-bar JSON (verdict/entry/target/stop),
  reusing tradewatch's schema. Closer to existing code.

Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.
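
To make "both reduce to the same backtest call" concrete, here is a small sketch of two adapters that produce the same `target_positions` dict. Everything except the names `strategy`, `features_window`, `target_positions`, and the verdict/entry/target/stop fields is an assumption: the verdict values (`"buy"`, `"sell"`), the `symbol` key, and the fixed 0.25 position size are placeholders, not tradewatch's actual schema.

```python
import json
from typing import Callable, Mapping

TargetPositions = dict[str, float]  # symbol -> target portfolio weight in [0, 1]


def from_strategy(
    strategy: Callable[[list[dict]], TargetPositions],
    features_window: list[dict],
) -> TargetPositions:
    """Framing A: run agent-written code on the causal feature window."""
    return dict(strategy(features_window))


def from_decision(
    decision_json: str,
    current: Mapping[str, float],
    size: float = 0.25,  # hypothetical fixed sizing
) -> TargetPositions:
    """Framing B: map one per-bar JSON decision onto the same dict."""
    decision = json.loads(decision_json)
    targets = dict(current)
    if decision["verdict"] == "buy":
        targets[decision["symbol"]] = size
    elif decision["verdict"] == "sell":
        targets[decision["symbol"]] = 0.0
    return targets  # any other verdict leaves positions unchanged


def momentum(window: list[dict]) -> TargetPositions:
    """Toy agent-written strategy: hold NOCK only after a positive 1h move."""
    return {"NOCK": 0.25 if window[-1]["price_change_1h"] > 0 else 0.0}


a = from_strategy(momentum, [{"price_change_1h": 1.2}])
b = from_decision(
    '{"symbol": "NOCK", "verdict": "buy", "entry": 1.0, "target": 1.3, "stop": 0.9}',
    current={},
)
print(a == b)
```

Since both paths end in the same dict type, the backtester and rubric do not need to know which framing produced it.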
## 4. Backtester (port `agents/paper_ledger.py`)

- `ReplayFeed`: serves bars **only up to t** → lookahead impossible by construction
  (the agent/strategy literally never receives `bars[t+1:]`).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market,
  track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: **buy-and-hold**, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
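
The causal feed and the bar-by-bar engine can be sketched in a few lines. This is a simplified, single-asset illustration under stated assumptions, not a port of `agents/paper_ledger.py`: the method names on `ReplayFeed`, the flat `cost_bps` charge on turnover (standing in for the fee + slippage model), and the per-bar, unannualized Sharpe are all choices made for the example. Weight drift between bars is ignored.

```python
import math
from typing import Callable, Sequence


class ReplayFeed:
    """Serves closes only up to the current bar t; bars[t+1:] are never exposed."""

    def __init__(self, closes: Sequence[float]):
        self._closes = list(closes)
        self.t = 0

    def window(self) -> list[float]:
        return self._closes[: self.t + 1]

    def has_next(self) -> bool:
        return self.t + 1 < len(self._closes)

    def advance(self) -> None:
        self.t += 1


def backtest(closes: Sequence[float], strategy: Callable[[list[float]], float],
             cost_bps: float = 30.0) -> list[float]:
    """Return the equity curve for a strategy mapping seen closes to a weight in [0, 1]."""
    feed = ReplayFeed(closes)
    equity, weight = 1.0, 0.0
    curve = [equity]
    while feed.has_next():
        seen = feed.window()
        target = min(max(strategy(seen), 0.0), 1.0)
        equity *= 1.0 - abs(target - weight) * cost_bps / 1e4  # cost charged on turnover
        weight = target
        feed.advance()
        equity *= 1.0 + weight * (feed.window()[-1] / seen[-1] - 1.0)  # mark to market
        curve.append(equity)
    return curve


def sharpe(curve: Sequence[float]) -> float:
    """Per-bar Sharpe (mean / sample std of returns), not annualized."""
    rets = [b / a - 1.0 for a, b in zip(curve, curve[1:])]
    if len(rets) < 2:
        return 0.0
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return 0.0 if var == 0 else mean / math.sqrt(var)


def max_drawdown(curve: Sequence[float]) -> float:
    """Largest peak-to-trough loss as a fraction of the running peak."""
    peak, worst = curve[0], 0.0
    for e in curve:
        peak = max(peak, e)
        worst = max(worst, 1.0 - e / peak)
    return worst


# Buy-and-hold benchmark on a toy tape, with costs switched off.
curve = backtest([100.0, 110.0, 99.0, 105.0], lambda seen: 1.0, cost_bps=0.0)
print(curve[-1], sharpe(curve), max_drawdown(curve))
```

The causality invariant is easy to unit-test with this shape: pass a strategy that records `len(seen)` on each call and check that it only ever grows by one bar at a time and never reaches the full tape length.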
## 5. verifiers environment (v0 stable API)

Class: **`StatefulToolEnv`** (per-rollout state + stateful tools).

```python
import verifiers as vf


def load_environment(
    symbols: list[str] | str = "train",  # universe or split name
    split: str = "train",                # train | oos | oos_symbols
    objective: str = "sharpe",           # sharpe | return | min_drawdown
    max_turns: int = 8,
    n_windows: int = 4,                  # OOS windows averaged per reward
    seed: int = 0,
) -> vf.Environment: ...
```

- `setup_state(state)`: pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.
- **Tools** (stateful, NOT per-rollout sandbox; global/in-proc exec to save credits):
  - `get_features(lookback)` → indicators up to current bar (RSI, MACD, MAs, z-score,
    BB, vol, buy/sell ratio, liquidity); reuses scanner feature logic
  - `run_backtest(strategy_or_decisions)` → in-sample metrics (the agent's feedback loop)
  - `read_metrics()` → current equity/DD/Sharpe
- `@vf.stop`: the agent submits its final strategy, or max_turns is reached.
- **Reward (Rubric, weighted), computed on OOS, aggregated over n_windows × basket:**

| reward fn | weight | source |
|---|---|---|
| `r_sharpe` (normalized OOS Sharpe) | 0.40 | spine |
| `r_beats_benchmark` (vs buy-and-hold) | 0.20 | anti-overfit |
| `r_drawdown` (penalty for deep DD) | 0.15 | MEMORY: protect capital |
| `r_rr_discipline` (R:R≥2 compliance) | 0.10 | trading_agent._validate |
| `r_exposure` (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
| `r_cost` (turnover/fee penalty) | 0.05 | realism |
| **HARD GATES → reward 0** | n/a | invalid code, lookahead, NaN equity, illiquid trades |

Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake
discipline while losing money.
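
The weighting and gating logic in the table can be checked with plain Python. The sketch below deliberately avoids the verifiers `Rubric` API and only reproduces the arithmetic; it assumes each component has already been normalized to [0, 1] (the spec does not fix the normalization), and `total_reward` and the gate names are hypothetical.

```python
import math

# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}
OUTCOME_TERMS = ("r_sharpe", "r_beats_benchmark")


def total_reward(components: dict[str, float], gates: dict[str, bool]) -> float:
    """Weighted sum of components in [0, 1]; any tripped hard gate forces reward 0."""
    if any(gates.values()):
        return 0.0
    if any(math.isnan(components[name]) for name in WEIGHTS):
        return 0.0  # treat NaN like a tripped gate
    return sum(w * min(max(components[name], 0.0), 1.0) for name, w in WEIGHTS.items())


# A "disciplined loser": perfect on every discipline term, zero on outcome terms.
disciplined_loser = {name: (0.0 if name in OUTCOME_TERMS else 1.0) for name in WEIGHTS}
no_gates = {"invalid_code": False, "lookahead": False, "nan_equity": False, "illiquid_trades": False}

outcome_weight = sum(WEIGHTS[name] for name in OUTCOME_TERMS)
print(outcome_weight, total_reward(disciplined_loser, no_gates))
```

With these weights the outcome terms carry 0.60 of the total, so a rollout that is perfectly disciplined but unprofitable tops out at 0.40, below anything that scores fully on outcome alone.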
## 6. Training (Prime Hosted, FREE for Laguna)

```toml
model = "poolside/Laguna-XS.2"    # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50                    # small validation; scale after curve moves
batch_size = 128                  # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8          # GRPO group; needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16

[sampling]
max_tokens = 512
enable_thinking = false           # docs: start non-reasoning for agentic tasks

[[env]]
id = "<user>/stock-strategy-env"

[eval]
eval_base_model = true            # base-vs-trained comparison built in
```

Pipeline: `prime eval run` (sanity, baseline 10-80%) → small Qwen RL → Laguna RL →
download LoRA → `prime deployments create` → `base:adapter_id`.
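
Why the GRPO group needs decorrelated windows can be seen from the group-relative advantage as it is commonly described: each rollout's reward is centered and scaled within its group. The sketch below assumes that standard form (the hosted trainer's exact formula may differ) and assumes `batch_size` counts rollouts, as the config comment suggests; `group_advantages` is a hypothetical helper.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center and scale rewards within one GRPO group (assumed standard form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


batch_size, rollouts_per_example = 128, 8
tasks_per_step = batch_size // rollouts_per_example  # distinct tasks per step under the assumption above

# If all 8 rollouts of a task earn the same reward, every advantage is zero: no gradient signal.
flat = group_advantages([0.3] * rollouts_per_example)

# Spread-out rewards give non-zero advantages that sum to roughly zero.
spread = group_advantages([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

print(tasks_per_step, flat, spread)
```

This is the mechanism behind the baseline target in the pipeline: a task the base model always solves or always fails yields a flat group and contributes nothing to the update.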
## 7. Closing the loop

- Deploy adapter → point tradewatch `HybrieClient` at `api.pinference.ai` → **live paper demo**.
- **Ablation (money shot):** trained adapter, MEMORY.md stripped from prompt → discipline holds.
- **Recursive (stretch):** `checkpoint_id` warm-start → reflect on failures →
  adjust rubric weights / add tasks / `[buffer]` difficulty filtering → retrain.
## 8. Non-negotiable risks

1. Leakage → structural causal feed (not a detector).
2. Reward variance → aggregate over basket × windows (or GRPO has no signal).
3. Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
4. Crypto noise → slippage/fee/liquidity model on every fill.