hack-x: Environment & Training Spec
A self-improving crypto-trading coding agent trained with RL on Prime Lab. tradewatch's soft reflection loop (events → MEMORY.md prompt rules) becomes a real gradient loop: rollout → backtest verifier → GRPO → LoRA adapter → deploy → live demo.
1. The one-line thesis
Today tradewatch's trading discipline lives as prompt text (MEMORY.md). We make it live as adapter weights by rewarding disciplined, profitable behavior on out-of-sample replayed market history. The proof: the trained adapter stays disciplined with MEMORY.md removed from the prompt.
1b. tradewatch as seed (pull all of it, but know its limits)
Mined both the DB (4,875 events) and the journal (708 rows). What's there:
- 618 analysis_decision rows with 18 market features each (price_usd, liquidity_usd, volume_24h/1h, buys/sells 1h+24h, buy_sell_24h, vol/liq, price_change 5m-24h, age_hours, fdv, market_cap): this IS the agent's feature schema + seed/format examples.
- Real verdicts + entry/target/stop + reasoning over 9 tokens (action-surface-B examples).
- 9 confirmed pool addresses: guaranteed-good fetch universe seed.
- Trade outcomes (exits/PnL/skips): rubric calibration.
- NOT present: any stored OHLCV time series. ohlcv_1h was fetched live, never persisted. Every market row is a point-in-time snapshot, not a replayable tape.
- Therefore: tradewatch = seed of decisions/features/pools (the brains + universe); GeckoTerminal fetch = the replay price tape. Both required; neither substitutes.
1c. The three loops are independent (demo ≠ batch_size)
- Training loop: batch_size=128 parallel replay-rollouts/step (a GRPO requirement only).
- Demo loop: 1 (or a few) LIVE paper session in tradewatch, trained adapter via base:adapter_id. As small as we want. 128 was never the demo.
- Recursive loop (stretch): reflect → checkpoint_id warm-start → retrain.
2. Data (replay substrate): crypto
- Source: GeckoTerminal free OHLCV: /networks/base/pools/{pool}/ohlcv/{tf}?aggregate=N&limit=L (already used in scanner/base_scanner.py:843). No key.
- Universe: start from the 9 journal tokens (NOCK, VIRTUAL, DEUS, CTR, Surplus, GITLAWB, VVV, DEGEN, PITCH) + expand via trending-pool discovery to ~30-50 Base pools.
- Bars & DEPTH CEILING (probed live 2026-05-30): free OHLCV caps at ~1000 bars/pool, before_timestamp pagination does NOT walk back further. So:
  - hourly (tf=hour): ~1000 bars ≈ 41 days
  - daily (tf=day): ~181 bars ≈ 6 months
  - Implication: shallow history → depth can't supply many decorrelated time windows. Breadth carries GRPO variance: tasks = symbol × window, universe = 30-50 pools.
- Storage: data/ohlcv/<symbol>.parquet (ts, o, h, l, c, v). Reproducible, no rate limits at train time.
- Splits (anti-leakage, structural):
  - per-symbol time split: train window | OOS window (chronological, no overlap)
  - held-out OOS symbols (never seen in training) for generalization check
  - with shallow per-symbol history, the symbol-holdout split does the heavy lifting
- Crypto-hardening (because reward is gameable here):
  - slippage + fee model on every fill (fixed bps + size-vs-liquidity impact)
  - min-liquidity / min-volume gate per bar (illiquid bars can't be traded)
  - survivorship caveat noted; mitigate by including tokens that later died
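A minimal sketch of the crypto-hardening items, assuming a linear size-vs-liquidity impact term. The class name CostModel, the helper names, and every default value (30 bps fee, impact coefficient, liquidity and volume floors) are hypothetical placeholders, not numbers from tradewatch or this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical parameters; none of these values come from the spec."""
    fee_bps: float = 30.0                 # fixed cost per fill, in basis points
    impact_coeff: float = 0.5             # scales the size-vs-liquidity impact
    min_liquidity_usd: float = 50_000.0   # liquidity floor for the per-bar gate
    min_volume_usd: float = 1_000.0       # volume floor for the per-bar gate


def is_tradeable(liquidity_usd: float, volume_usd: float, model: CostModel) -> bool:
    """Per-bar gate: bars below either floor cannot be traded."""
    return (liquidity_usd >= model.min_liquidity_usd
            and volume_usd >= model.min_volume_usd)


def fill_price(mid: float, notional_usd: float, liquidity_usd: float,
               model: CostModel) -> float:
    """Price paid (buy, notional > 0) or received (sell, notional < 0) after costs."""
    side = 1.0 if notional_usd >= 0 else -1.0
    # Assumed linear impact: larger orders relative to pool liquidity cost more.
    impact = model.impact_coeff * abs(notional_usd) / liquidity_usd
    cost_fraction = model.fee_bps / 10_000.0 + impact
    return mid * (1.0 + side * cost_fraction)
```

Buys fill above mid and sells below it, so every round trip loses the cost fraction twice; that is what makes high-turnover strategies pay for themselves in the reward.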
3. Action surface: support BOTH (decide after baseline rollouts)
The backtester accepts a uniform target_positions dict; two agent framings feed it:
- A. Strategy-code: agent writes strategy(features_window) -> target_positions, executed causally over the OOS window. Strongest "coding agent" framing.
- B. Structured decisions: agent emits per-bar JSON (verdict/entry/target/stop), reusing tradewatch's schema. Closer to existing code.
Both reduce to the same backtest call → same rubric. We prototype both, pick via eval.
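One way the two framings could reduce to a single target_positions dict is sketched below. The verdict vocabulary ("buy" / "skip" / "exit"), the fixed 10% weight, and the helper names are assumptions for illustration; tradewatch's real schema may differ.

```python
import json
from typing import Callable, Mapping, Sequence

TargetPositions = dict[str, float]   # symbol -> target portfolio weight
Bar = Mapping[str, float]


def positions_from_strategy(
    strategy: Callable[[Sequence[Bar]], TargetPositions],
    features_window: Sequence[Bar],
) -> TargetPositions:
    """Framing A: the agent-written function already returns target positions."""
    return dict(strategy(features_window))


def positions_from_decision(symbol: str, decision_json: str,
                            weight: float = 0.1) -> TargetPositions:
    """Framing B: map one per-bar JSON decision to a target weight.

    Assumes a hypothetical verdict vocabulary of "buy" / "skip" / "exit".
    """
    decision = json.loads(decision_json)
    verdict = str(decision["verdict"]).lower()
    if verdict == "buy":
        return {symbol: weight}
    if verdict == "exit":
        return {symbol: 0.0}
    return {}   # "skip" or unrecognized: no change requested


def example_strategy(window: Sequence[Bar]) -> TargetPositions:
    """Toy framing-A strategy: hold VVV only if the window closed higher."""
    return {"VVV": 0.2 if window[-1]["c"] > window[0]["c"] else 0.0}


bars = [{"c": 1.00}, {"c": 1.05}]
from_a = positions_from_strategy(example_strategy, bars)
from_b = positions_from_decision("VVV", '{"verdict": "BUY", "entry": 1.05}')
```

Both from_a and from_b are plain dicts of the same shape, so the backtester and rubric do not need to know which framing produced them.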
4. Backtester (port agents/paper_ledger.py)
- ReplayFeed: serves bars only up to t → lookahead impossible by construction (the agent/strategy literally never receives bars[t+1:]).
- Engine: step bar-by-bar, apply target_positions via slippage/fee model, mark-to-market, track equity, cash, exposure, per-trade R:R, drawdown.
- Metrics: Sharpe (primary), CAGR/return, max drawdown, equity/cash ratio, turnover, win rate.
- Benchmarks computed on same window: buy-and-hold, MA-crossover, z-score mean-reversion.
- Invariants (unit-tested): causality (no future leak), determinism (seed → identical run).
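A minimal single-asset sketch of the causal feed, the bar-by-bar engine, and two of the metrics. It is not the port of agents/paper_ledger.py: the flat fee on turnover, the rebalance-to-target simplification, and the hourly annualization factor are assumptions, and only the ReplayFeed name comes from the spec.

```python
import math
from typing import Callable, Sequence


class ReplayFeed:
    """Serves closes up to and including index t; later bars are never exposed."""

    def __init__(self, closes: Sequence[float]) -> None:
        self._closes = tuple(closes)
        self.t = 0

    def window(self) -> tuple[float, ...]:
        return self._closes[: self.t + 1]

    def advance(self) -> bool:
        """Move to the next bar; returns False once the tape is exhausted."""
        if self.t + 1 >= len(self._closes):
            return False
        self.t += 1
        return True


def run_backtest(closes: Sequence[float],
                 strategy: Callable[[tuple[float, ...]], float],
                 fee_bps: float = 30.0) -> list[float]:
    """Decide at bar t from feed.window() only, then hold the weight over t -> t+1."""
    feed = ReplayFeed(closes)
    equity, weight = 1.0, 0.0
    curve = [equity]
    while True:
        decision_price = feed.window()[-1]
        target = min(max(strategy(feed.window()), 0.0), 1.0)
        if not feed.advance():
            break
        equity *= 1.0 - abs(target - weight) * fee_bps / 10_000.0  # fee on turnover
        weight = target
        equity *= 1.0 + weight * (feed.window()[-1] / decision_price - 1.0)
        curve.append(equity)
    return curve


def sharpe(curve: Sequence[float], periods_per_year: float = 24 * 365) -> float:
    """Annualized Sharpe of per-bar returns (risk-free rate assumed zero)."""
    rets = [b / a - 1.0 for a, b in zip(curve, curve[1:])]
    if len(rets) < 2:
        return 0.0
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return 0.0 if var == 0 else mean / math.sqrt(var) * math.sqrt(periods_per_year)


def max_drawdown(curve: Sequence[float]) -> float:
    """Largest peak-to-trough loss, as a fraction of the peak."""
    peak, worst = curve[0], 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst
```

The causality invariant can be unit-tested by recording the window lengths a strategy receives: at step t it should see exactly t + 1 bars, never more.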
5. verifiers environment (v0 stable API)
Class: StatefulToolEnv (per-rollout state + stateful tools).
import verifiers as vf

def load_environment(
symbols: list[str] | str = "train", # universe or split name
split: str = "train", # train | oos | oos_symbols
objective: str = "sharpe", # sharpe | return | min_drawdown
max_turns: int = 8,
n_windows: int = 4, # OOS windows averaged per reward
seed: int = 0,
) -> vf.Environment: ...
setup_state(state): pick task (symbol(s) × window × objective), build fresh ReplayFeed + ledger.

Tools (stateful, NOT per-rollout sandbox; global/in-proc exec to save credits):
- get_features(lookback) → indicators up to current bar (RSI, MACD, MAs, z-score, BB, vol, buy/sell ratio, liquidity); reuses scanner feature logic
- run_backtest(strategy_or_decisions) → in-sample metrics (the agent's feedback loop)
- read_metrics() → current equity/DD/Sharpe

@vf.stop: agent submits final strategy, or max_turns.

Reward (Rubric, weighted), computed on OOS, aggregated over n_windows × basket:

| reward fn | weight | source |
| --- | --- | --- |
| r_sharpe (normalized OOS Sharpe) | 0.40 | spine |
| r_beats_benchmark (vs buy-and-hold) | 0.20 | anti-overfit |
| r_drawdown (penalty for deep DD) | 0.15 | MEMORY: protect capital |
| r_rr_discipline (R:R ≥ 2 compliance) | 0.10 | trading_agent._validate |
| r_exposure (sane equity/cash, no all-in) | 0.10 | MEMORY: sizing |
| r_cost (turnover/fee penalty) | 0.05 | realism |
| HARD GATES | reward 0 | invalid code, lookahead, NaN equity, illiquid trades |

Outcome terms (sharpe + beats_benchmark) must dominate so the model can't fake discipline while losing money.
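The weighted rubric with hard gates can be sketched as plain Python, independent of the verifiers Rubric API. The weights are copied from the table above; the assumptions that each term is pre-normalized to [0, 1] and that gate violations arrive as a list of strings are hypothetical.

```python
import math

# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "r_sharpe": 0.40,
    "r_beats_benchmark": 0.20,
    "r_drawdown": 0.15,
    "r_rr_discipline": 0.10,
    "r_exposure": 0.10,
    "r_cost": 0.05,
}


def combine_reward(terms: dict[str, float], gate_violations: list[str]) -> float:
    """Weighted sum of per-term scores; any hard-gate violation forces 0."""
    if gate_violations:
        return 0.0
    if any(math.isnan(value) for value in terms.values()):
        return 0.0   # treated like the NaN-equity gate
    # Assumption: each term is already normalized, so clamp to [0, 1].
    return sum(w * min(max(terms[name], 0.0), 1.0) for name, w in WEIGHTS.items())


def aggregate(rewards: list[float]) -> float:
    """Mean over the n_windows x basket rewards for one rollout."""
    return sum(rewards) / len(rewards) if rewards else 0.0


# Perfect discipline but no profit vs. profit with no discipline.
disciplined_loser = {**{name: 1.0 for name in WEIGHTS},
                     "r_sharpe": 0.0, "r_beats_benchmark": 0.0}
undisciplined_winner = {**{name: 0.0 for name in WEIGHTS},
                        "r_sharpe": 1.0, "r_beats_benchmark": 1.0}
```

With these weights the disciplined loser tops out at 0.40 while the undisciplined winner reaches 0.60, which is the sense in which the outcome terms dominate.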
6. Training (Prime Hosted, FREE for Laguna)
model = "poolside/Laguna-XS.2" # validate first on Qwen/Qwen3-4B-Instruct-2507
max_steps = 50 # small validation; scale after curve moves
batch_size = 128 # = parallel rollouts (the "128 sessions")
rollouts_per_example = 8 # GRPO group β needs decorrelated windows
learning_rate = 1e-4
lora_alpha = 16
[sampling]
max_tokens = 512
enable_thinking = false # docs: start non-reasoning for agentic tasks
[[env]]
id = "<user>/stock-strategy-env"
[eval]
eval_base_model = true # base-vs-trained comparison built in
Pipeline: prime eval run (sanity, baseline 10-80%) → small Qwen RL → Laguna RL → download LoRA → prime deployments create → base:adapter_id.
7. Closing the loop
- Deploy adapter → point tradewatch HybrieClient at api.pinference.ai → live paper demo.
- Ablation (money shot): trained adapter, MEMORY.md stripped from prompt → discipline holds.
- Recursive (stretch): checkpoint_id warm-start → reflect on failures → adjust rubric weights / add tasks / [buffer] difficulty filtering → retrain.
8. Non-negotiable risks
- Leakage → structural causal feed (not a detector).
- Reward variance → aggregate over basket × windows (or GRPO has no signal).
- Reward hacking → OOS + must-beat-benchmark + hard lookahead gate.
- Crypto noise → slippage/fee/liquidity model on every fill.
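Why reward variance is non-negotiable: a sketch of the textbook GRPO group-relative advantage, assuming mean/std normalization within each group of rollouts_per_example rollouts. The hosted trainer may normalize differently; the function name and eps value are illustrative.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own group (same prompt/task)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group of 8 rollouts that all score the same carries no learning signal.
no_signal = group_advantages([0.5] * 8)

# A group with spread rewards separates good rollouts from bad ones.
with_signal = group_advantages([0.1, 0.2, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9])
```

If every rollout in a group lands on the same reward (for example, all hit a hard gate and score 0), every advantage is zero and that task contributes nothing to the gradient step.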