# rl_btc_v4: Offline Implicit Q-Learning for Bitcoin Trading

- Status: v2 (fixed + improved). Trainer runs and produces diagnostics.
- Based on: IQL (Kostrikov et al., 2021)
- Data: 5m Binance OHLCV + derivatives data, 2.2M+ transitions
## Architecture Overview

```text
rl_btc_v4 Pipeline

Parquet Data (5m bars)
        │
        ▼
Dataset Builder (behavioral policy + counterfactuals)
        │
        ▼
(s, a, r, s', d) Transitions
        │
        ▼
IQL Trainer
  DiscreteQ(s)  → [Q(s,a₁), Q(s,a₂), ..., Q(s,aₙ)]
  ValueNet(s)   → V(s)
  PolicyNet(s)  → [π(a₁|s), ..., π(aₙ|s)]

  1. V-update: expectile regression on Q(s, a_data)
  2. Q-update: TD backup using V(s')
  3. Policy:   advantage-weighted BC (clipped weights)
```
## Project Structure

```text
rl_btc_v4/
├── __init__.py         # Package exports
├── __main__.py         # CLI entry point
├── constants.py        # Action space, features, defaults
├── env.py              # Gym-compatible BTC trading environment
├── dataset.py          # Offline RL dataset builder
├── iql_trainer.py      # IQL trainer (v2, fixed)
├── train.py            # Local training script
├── train_cloud.py      # Cloud training (v1, legacy)
├── train_cloud_v2.py   # Cloud training (v2, current)
├── train_hf_job.py     # HF Jobs training script
└── README.md           # This file
```
## Action Space

| ID | Name | Type | Side | Fraction | Description |
|---|---|---|---|---|---|
| 0 | HOLD | hold | - | 0.00 | Keep current position |
| 1 | FLAT | target | - | 0.00 | Liquidate all positions |
| 2 | YES_10 | target | YES | 0.10 | Buy YES at 10% equity |
| 3 | YES_25 | target | YES | 0.25 | Buy YES at 25% equity |
| 4 | YES_50 | target | YES | 0.50 | Buy YES at 50% equity |
| 5 | NO_10 | target | NO | 0.10 | Buy NO at 10% equity |
| 6 | NO_25 | target | NO | 0.25 | Buy NO at 25% equity |
| 7 | NO_50 | target | NO | 0.50 | Buy NO at 50% equity |
8 discrete actions, designed for a prediction market microstructure where:

- YES shares pay $1 if the market resolves UP, $0 otherwise
- NO shares pay $1 if the market resolves DOWN, $0 otherwise
- Positions settle at `obs_pos == 4` (the end of the 5m observation window). `obs_pos == 4` is settlement-only: the simulator and replay env do not allow opening or changing positions on the same row that resolves the market.
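For reference, a minimal sketch of the action table as a Python mapping. The canonical definitions live in `rl_btc_v4/constants.py`; the class and field names below are illustrative assumptions, not the actual code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ActionSpec:
    name: str
    kind: str                # "hold" or "target"
    side: Optional[str]      # "YES", "NO", or None for HOLD/FLAT
    fraction: float          # target position size as a fraction of equity

# Hypothetical mirror of the 8-action table above
ACTIONS = {
    0: ActionSpec("HOLD",   "hold",   None,  0.00),
    1: ActionSpec("FLAT",   "target", None,  0.00),
    2: ActionSpec("YES_10", "target", "YES", 0.10),
    3: ActionSpec("YES_25", "target", "YES", 0.25),
    4: ActionSpec("YES_50", "target", "YES", 0.50),
    5: ActionSpec("NO_10",  "target", "NO",  0.10),
    6: ActionSpec("NO_25",  "target", "NO",  0.25),
    7: ActionSpec("NO_50",  "target", "NO",  0.50),
}
```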
## Feature Engineering

### Market Features (38 dimensions)

| Category | Features | Purpose |
|---|---|---|
| Time | obs_pos, seconds_since_open, seconds_to_close, market_progress, hour_sin, hour_cos | Intra-window and daily seasonality |
| Prices | yes_bid/ask, no_bid/ask, yes/no_mid, yes/no_spread | Order book state |
| Arbitrage | microprice_bias, abs_yes_mid_distance_from_even | Cross-market mispricing signals |
| Returns | price_return_from_open, mark/index_return_from_open_filled, abs_mark_return | Price momentum |
| Volatility | rolling_vol_15m, rolling_vol_60m | Risk estimation |
| Flow | taker_buy_ratio, buy_sell_imbalance, imbalance_x_vol | Order flow pressure |
| Volume | volume, num_trades, rolling_volume_15m, rolling_num_trades_15m | Liquidity signals |
| Risk-adjusted | return_over_vol_15m/60m, signed_move_x_time_remaining | Sharpe-like signals |
| Derivatives | funding_rate, funding_rate_prev, oi_delta_5m/15m/60m, long_short_ratio | Market sentiment |
### Portfolio Features (9 dimensions)

| Feature | Description |
|---|---|
| `cash_fraction` | Cash / starting_cash |
| `equity_fraction` | Total equity / starting_cash |
| `drawdown_fraction` | Peak-to-trough drawdown |
| `position_side` | +1 (YES), -1 (NO), 0 (flat) |
| `position_fraction` | Position value / equity |
| `position_shares` | Raw share count |
| `avg_entry_price` | Cost basis per share |
| `unrealized_pnl_fraction` | PnL / starting_cash |
| `steps_held_fraction` | How long the position has been held |
### State Vector

```text
state = [history_ordered (30 × 38), portfolio_vec (9)]
      = 1140 + 9 = 1149 dimensions (with history_length=30)
```

Market features are normalized using a StandardScaler fit only on the training-period data. The history is a rolling window of the last 30 observations (chronologically ordered).
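A minimal sketch of how such a state vector can be assembled, assuming the scaled history window and portfolio vector are already available as NumPy arrays (the function name and signature are illustrative, not the actual dataset.py code):

```python
import numpy as np

def build_state(history: np.ndarray, portfolio_vec: np.ndarray,
                history_length: int = 30, n_features: int = 38) -> np.ndarray:
    """Concatenate the rolling market-feature window with the portfolio vector.

    history:       (history_length, n_features), oldest row first, already passed
                   through the train-fit StandardScaler.
    portfolio_vec: (9,) portfolio features.
    Returns a flat state of history_length * n_features + 9 dims (1149 by default).
    """
    assert history.shape == (history_length, n_features)
    assert portfolio_vec.shape == (9,)
    return np.concatenate([history.reshape(-1), portfolio_vec]).astype(np.float32)
```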
## Reward Function

Risk-sensitive reward with multiple penalty terms:

```text
pnl_reward   = clip(log(max(equity_after, floor) / max(equity_before, floor)), -2, 2)
dd_penalty   = 0.50 × max(0, current_dd - prev_dd)
risk_penalty = 1.0 × max(0, current_dd - prev_dd)²
cvar_penalty = 1.0 × |pnl_reward| × (1 + current_dd)    [only if pnl_reward < 0]

reward = clip(pnl_reward - dd_penalty - risk_penalty - cvar_penalty, -4, 4)
```

Design rationale:

- PnL reward: bounded log-equity return with a 1% starting-cash floor, so rare binary-market jackpot payouts do not dominate the critic target
- Drawdown penalty: penalizes increasing drawdown (soft: linear + quadratic)
- CVaR penalty: amplifies losses during drawdown periods (tail risk); subtracted from the reward so that negative PnL is penalized more severely when the account is already underwater
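Read literally, the formulas above translate to roughly the following. This is a sketch only; the function and argument names, and the exact placement in env.py/dataset.py, are assumptions.

```python
import numpy as np

def compute_reward(equity_before: float, equity_after: float,
                   prev_dd: float, current_dd: float, starting_cash: float) -> float:
    """Illustrative re-implementation of the reward formula above (names are assumptions)."""
    floor = 0.01 * starting_cash                      # 1% starting-cash floor on the log return
    pnl_reward = float(np.clip(
        np.log(max(equity_after, floor) / max(equity_before, floor)), -2.0, 2.0))

    dd_increase = max(0.0, current_dd - prev_dd)
    dd_penalty = 0.50 * dd_increase                   # linear drawdown penalty
    risk_penalty = 1.0 * dd_increase ** 2             # quadratic drawdown penalty
    cvar_penalty = 1.0 * abs(pnl_reward) * (1.0 + current_dd) if pnl_reward < 0 else 0.0

    return float(np.clip(pnl_reward - dd_penalty - risk_penalty - cvar_penalty, -4.0, 4.0))
```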
## Fee Model

All trading operations (buy, liquidate, reduce) charge a taker fee based on the Crypto/BTC prediction market schedule:

```text
fee_usdc = shares × fee_rate × price × (1 - price)
```

| Parameter | Value | Source |
|---|---|---|
| `taker_fee_rate` | 0.072 | Crypto category, BTC markets taker fee |
| `maker_fee` | 0 | Not modelled (all orders assumed market/taker) |
| `maker_rebate` | 20% | Not applied |

Quadratic fee shape: the fee is highest at price ≈ 0.50 (maximum uncertainty) and zero at price = 0 or 1. This matches how prediction market fees scale with outcome uncertainty.

The fee is applied consistently in both:

- `BTCTradingEnv` (live policy replay): deducts the fee from cash on buys (added to cost) and from proceeds on sells/liquidation
- `dataset.py` counterfactual simulator: offline training rewards include fee drag, so the learned Q-values and policy account for trading costs

To disable fees (e.g. for an ablation): `--taker-fee-rate 0.0` or `TAKER_FEE_RATE = 0.0` in constants.py.
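The fee formula itself is a one-liner; the helper below (an illustrative sketch, not the env.py implementation) makes the quadratic shape explicit:

```python
def taker_fee_usdc(shares: float, price: float, fee_rate: float = 0.072) -> float:
    """Quadratic taker fee: highest near price = 0.50, zero at price = 0 or 1."""
    return shares * fee_rate * price * (1.0 - price)

# Example: 100 shares at a 0.50 price cost 100 * 0.072 * 0.25 = 1.80 USDC in fees,
# while the same size at 0.90 costs only 100 * 0.072 * 0.09 = 0.648 USDC.
```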
## IQL Algorithm (v2, fixed)

### What was wrong in v1

| Component | v1 (buggy) | v2 (fixed) |
|---|---|---|
| Q-network | `(s, one_hot(a)) → scalar` | `s → [Q(s,a₁), ..., Q(s,aₙ)]` |
| V-update | `Q_target(next_s, one_hot(a))` | `min(Q₁_target(s), Q₂_target(s))` gathered at dataset actions |
| Evaluation | `return np.mean(rewards)` (no policy) | Action agreement + entropy diagnostics |
| Dataset | Greedy best-action (98.8% HOLD) / random softmax churn | Conservative edge-gated behavioral policy with epsilon exploration |
| LR | Fixed | CosineAnnealingLR |
| Stopping | None | Early stopping with best-model restoration |
### Current Algorithm (per batch)

```text
1. V-update:
   q_target = min(Q₁_target(s), Q₂_target(s)) gathered at dataset actions
   v_loss   = expectile_loss(q_target - V(s), τ=0.7)

2. Q-update:
   q_target = r + γ × (1 - done) × V(s')
   q_loss   = MSE(Q₁(s,a), q_target) + MSE(Q₂(s,a), q_target)
   soft_update(Q_target ← Q)

3. Policy update (every 2 critic steps):
   advantage   = Q(s, a_dataset) - V(s)
   weights     = clamp(exp(advantage / temperature), max=100)
   policy_loss = -mean(weights × log π(a_dataset | s))
```
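For concreteness, here is a hedged PyTorch sketch of one such batch update. It assumes discrete Q-heads `q1`/`q2` with matching targets, a `value_net`, a `policy_net` that outputs logits, and a dict of optimizers; all names are illustrative, and the delayed policy update (every 2 critic steps) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # Asymmetric L2: weight tau on positive residuals, (1 - tau) on negative ones
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def iql_batch_update(q1, q2, q1_target, q2_target, value_net, policy_net,
                     optimizers, batch, gamma=0.99, tau_expectile=0.7,
                     temperature=3.0, polyak=0.005, max_weight=100.0):
    s, a, r, s_next, done = batch          # a: (B,) long tensor of dataset actions
    a_idx = a.unsqueeze(1)

    # 1. V-update: expectile regression toward min of the target Q-heads at dataset actions
    with torch.no_grad():
        q_t = torch.min(q1_target(s), q2_target(s)).gather(1, a_idx).squeeze(1)
    v = value_net(s).squeeze(-1)
    v_loss = expectile_loss(q_t - v, tau_expectile)
    optimizers["value"].zero_grad(); v_loss.backward(); optimizers["value"].step()

    # 2. Q-update: one-step TD backup through V(s')
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * value_net(s_next).squeeze(-1)
    q_loss = (F.mse_loss(q1(s).gather(1, a_idx).squeeze(1), td_target) +
              F.mse_loss(q2(s).gather(1, a_idx).squeeze(1), td_target))
    optimizers["q"].zero_grad(); q_loss.backward(); optimizers["q"].step()

    # Polyak-averaged target networks (soft_update(Q_target <- Q))
    for net, tgt in ((q1, q1_target), (q2, q2_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - polyak).add_(polyak * p.data)

    # 3. Policy update: advantage-weighted behavior cloning with clipped weights
    with torch.no_grad():
        adv = (torch.min(q1(s), q2(s)).gather(1, a_idx).squeeze(1)
               - value_net(s).squeeze(-1))
        weights = torch.clamp(torch.exp(adv / temperature), max=max_weight)
    log_probs = torch.log_softmax(policy_net(s), dim=-1).gather(1, a_idx).squeeze(1)
    policy_loss = -(weights * log_probs).mean()
    optimizers["policy"].zero_grad(); policy_loss.backward(); optimizers["policy"].step()

    return {"v_loss": v_loss.item(), "q_loss": q_loss.item(), "policy_loss": policy_loss.item()}
```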
### Key hyperparameters

| Parameter | Cloud Default | Local Default | Rationale |
|---|---|---|---|
| `expectile` | 0.7 | 0.7 | Moderate upper expectile: balances optimism with robustness |
| `temperature` | 3.0 | 3.0 | Low β (= 1/temperature): policy stays close to the behavioral policy (safer for financial data) |
| `gamma` | 0.99 | 0.99 | Standard discount for episodic tasks |
| `tau` | 0.005 | 0.005 | Slow target-network update for stability |
| `lr` | 3e-4 | 3e-4 | Standard Adam LR |
| `batch_size` | 512 | 256 | Large batches for stable gradients (cloud); smaller for local memory |
| `hidden_dim` | 256 | 256 | 2-layer MLP |
| `dropout` | 0.1 | 0.0 | Light regularization (cloud only) |
| `policy_update_freq` | 2 | 2 | TD3-style delayed policy updates |
| `early_stopping_patience` | 20 evals | 20 evals | Stops if eval reward doesn't improve |
| `behavioral_policy_mode` | conservative | conservative | Avoids fee-heavy uniform random trading |
| `min_trade_edge` | 0.005 | 0.005 | Directional action must beat HOLD/FLAT by this reward edge |
| `behavioral_epsilon` | 0.03 | 0.03 | Keeps limited action support for IQL without dominating the dataset |
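As a rough illustration of how these defaults fit together, the dicts below mirror the table; the key names are assumptions, and the authoritative values live in constants.py and the training scripts.

```python
# Hypothetical config dicts mirroring the table above; key names are illustrative.
CLOUD_CONFIG = dict(
    expectile=0.7, temperature=3.0, gamma=0.99, tau=0.005, lr=3e-4,
    batch_size=512, hidden_dim=256, dropout=0.1,
    policy_update_freq=2, early_stopping_patience=20,
    behavioral_policy_mode="conservative",
    min_trade_edge=0.005, behavioral_epsilon=0.03,
)

# Local runs differ only in batch size and dropout.
LOCAL_CONFIG = {**CLOUD_CONFIG, "batch_size": 256, "dropout": 0.0}
```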
## Dataset Building

### Process

- Load parquet → filter to `obs_pos ∈ {0,1,2,3,4}`, sort by time
- Fill NaNs → per-column defaults (0.0 for most, 1.0 for long_short_ratio)
- Temporal split → last 20% of calendar days held out for test, with a purge/embargo gap of `episode_span_days` on each side to prevent overlapping windows from leaking information
- Fit StandardScaler → normalize market features using train-period data only
- Build episodes → sliding windows (span=30d, stride=15d) within each split
- Counterfactual simulation → for each step, simulate ALL 8 actions and compute rewards
- Settlement guard → `obs_pos == 4` can settle existing inventory but cannot open new exposure, preventing same-row outcome leakage
- Behavioral policy → pick the best of HOLD/FLAT unless the best directional trade clears `min_trade_edge`; use small epsilon exploration for action support (see the sketch after this list)
- Shard metadata → stores the sampled action distribution and, by default, `all_action_rewards` for counterfactual supervised training
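A minimal sketch of the edge-gated behavioral policy described above, assuming the 8 counterfactual rewards for a step are already computed (the function and argument names are illustrative, not the dataset.py API):

```python
import numpy as np

def behavioral_action(action_rewards: np.ndarray, rng: np.random.Generator,
                      min_trade_edge: float = 0.005, epsilon: float = 0.03) -> int:
    """Conservative edge-gated behavioral policy over the 8 counterfactual rewards.

    action_rewards: (8,) rewards for HOLD, FLAT, YES_10 ... NO_50 at this step.
    """
    if rng.random() < epsilon:                        # epsilon exploration for action support
        return int(rng.integers(len(action_rewards)))
    baseline = action_rewards[:2].max()               # best of HOLD (id 0) / FLAT (id 1)
    best_dir = int(action_rewards[2:].argmax()) + 2   # best directional action (ids 2..7)
    if action_rewards[best_dir] > baseline + min_trade_edge:
        return best_dir
    return int(action_rewards[:2].argmax())           # otherwise stay with HOLD/FLAT
```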
### Data Leakage Prevention

The dataset builder enforces a clean temporal separation:

```text
[--- train days ---][<-- embargo -->][<-- embargo -->][--- test days ---]
  scaler fit here         gap              gap             held out
```

- The `StandardScaler` is fit only on training-period rows.
- An embargo buffer (default: one full `episode_span_days`, i.e. 30 days) is removed from both sides of the split boundary so that no 30-day sliding window can straddle the train/test line.
- Episodes are built independently within each split's day list.
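The split logic can be sketched as follows (illustrative only; the real builder works on the parquet's day column and its own parameter names):

```python
import numpy as np

def temporal_split(days: np.ndarray, test_frac: float = 0.2, embargo_days: int = 30):
    """Purged train/test split over sorted, unique calendar days.

    The last `test_frac` of days is held out, and `embargo_days` are dropped on each
    side of the boundary so no `embargo_days`-long episode window can straddle it.
    """
    n_test = int(len(days) * test_frac)
    boundary = len(days) - n_test
    train_days = days[: max(0, boundary - embargo_days)]
    test_days = days[min(len(days), boundary + embargo_days):]
    return train_days, test_days
```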
### Scale (full config)

- ~750 episodes from 2263 days of 5m data (before embargo removal)
- State dim: 1149 (30 × 38 + 9)
- Action distribution: recorded in shard metadata; reject runs where fee-paying actions are near-uniform without a clear edge
## Evaluation

### What the trainer provides
The IQL trainer computes diagnostics at evaluation checkpoints:
| Metric | What it measures | What it does NOT measure |
|---|---|---|
| Agreement with optimal Q-action | How often the policy picks the action with the highest Q-value | Actual trading PnL or Sharpe |
| Action entropy | Diversity of the policy's action distribution | Out-of-sample performance |
These are diagnostics of training convergence, not trading performance metrics. Agreement and entropy tell you whether the policy has learned to distinguish actions, but they do not substitute for a true policy replay on held-out market data.
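These diagnostics amount to comparing the policy's greedy action with the Q-greedy action and measuring the policy's entropy, roughly as in the sketch below (names are illustrative, not the trainer's actual functions):

```python
import torch

@torch.no_grad()
def policy_diagnostics(policy_net, q1, q2, states: torch.Tensor) -> dict:
    """Agreement with the Q-greedy action and mean policy entropy (illustrative)."""
    probs = torch.softmax(policy_net(states), dim=-1)          # (B, n_actions)
    policy_actions = probs.argmax(dim=-1)
    q_actions = torch.min(q1(states), q2(states)).argmax(dim=-1)

    agreement = (policy_actions == q_actions).float().mean().item()
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1).mean().item()
    return {"q_action_agreement": agreement, "action_entropy": entropy}
```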
### Policy replay

`train_local_sharded.py` and `train_counterfactual_local.py` both run held-out replay through `BTCTradingEnv` when `--replay-episodes > 0`. The replay metrics (mean_pnl, fees, drawdown, action counts) are the primary acceptance criteria.
### Counterfactual Q path

Because the simulator computes rewards for every action at each state, `train_counterfactual_local.py` can train a direct action-value model on `all_action_rewards`. This avoids throwing away supervision by sampling one behavioral action per row. Deployment remains edge-gated: choose HOLD/FLAT unless the best directional action beats the no-new-risk baseline by `min_trade_edge`.
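In sketch form, that path is a regression of the 8 Q-outputs onto the simulated per-action rewards, plus an edge-gated action rule at deployment. Both functions below are illustrative assumptions, not the script's actual code.

```python
import torch
import torch.nn.functional as F

def counterfactual_q_loss(q_net, states: torch.Tensor, all_action_rewards: torch.Tensor) -> torch.Tensor:
    """Regress the 8 Q-outputs directly onto the simulated per-action rewards."""
    return F.mse_loss(q_net(states), all_action_rewards)   # both shaped (B, 8)

@torch.no_grad()
def edge_gated_action(q_net, state: torch.Tensor, min_trade_edge: float = 0.005) -> int:
    """Trade only if the best directional action (ids 2..7) beats the best
    no-new-risk action (HOLD=0 / FLAT=1) by at least min_trade_edge."""
    q = q_net(state.unsqueeze(0)).squeeze(0)                # (8,)
    baseline = q[:2].max()
    best_dir = int(q[2:].argmax()) + 2
    if q[best_dir] > baseline + min_trade_edge:
        return best_dir
    return int(q[:2].argmax())
```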
## Current State

### What works ✅
- IQL trainer with correct discrete-action architecture
- Behavioral dataset with diverse actions
- Counterfactual Q trainer using all-action reward labels
- Leakage-free temporal split with embargo gap
- Scaler fit on training data only
- LR scheduling, early stopping, best model checkpointing
- Cloud training scripts with Trackio monitoring
- HF Hub upload/download
### Known Challenges ⚠️

- Sparse positive rewards: most 5m windows don't have strong signals, so the model needs many epochs to learn meaningful patterns.
- High-dimensional state (1149 dims): the 30-step history window creates a large input.
- Counterfactual simulation is slow: dataset building takes time for the full config (30-day episodes).
- Sparse or no deployment trades: after removing settlement-row leakage, held-out replay may correctly select no-trade if directional actions do not clear fees and drawdown risk.
## How to Run

### Local training (small config, CPU/MPS)

```bash
cd /path/to/doug-data
python -m rl_btc_v4.train
```

Default local config: batch_size=256, dropout=0.0, epochs=100.

### Cloud training (GPU, full config)

```bash
python rl_btc_v4/train_cloud_v2.py
```

The cloud config uses batch_size=512, dropout=0.1.
### Via HF Jobs

```python
from huggingface_hub import HfApi

# Upload the package code to the HF Hub first
api = HfApi()
api.upload_folder(
    folder_path="rl_btc_v4",
    repo_id="fbzu/rl_btc_v4_iql",
    repo_type="model",
)
# Then run train_hf_job.py via hf_jobs with GPU hardware
```
## Artifacts

After training, the following files are saved to https://huggingface.co/fbzu/rl_btc_v4_iql:

| File | Contents |
|---|---|
| `iql_model.pt` | Q, V, and Policy network state dicts + config |
| `scaler.npz` | Feature normalization stats + reward mean/std |
| `train_report.json` | Full training metrics, config, and results |
## References

- IQL paper: Kostrikov et al. (2021), "Offline Reinforcement Learning with Implicit Q-Learning"
- Reference implementation: ikostrikov/implicit_q_learning (JAX/Flax)
- Discrete IQL: Chulabhaya/recurrent-discrete-iql (PyTorch)