rl_btc_v4: Offline Implicit Q-Learning for Bitcoin Trading

Status: v2 (fixed + improved); trainer runs and produces diagnostics
Based on: IQL (Kostrikov et al., 2021)
Data: 5m Binance OHLCV + derivatives data → 2.2M+ transitions


Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                        rl_btc_v4 Pipeline                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌────────────────────┐    ┌──────────────┐  │
│  │ Parquet      │───▶│ Dataset Builder    │───▶│ (s,a,r,s',d) │  │
│  │ Data         │    │ (behavioral policy │    │ Transitions  │  │
│  │ (5m bars)    │    │  counterfactuals)  │    │              │  │
│  └──────────────┘    └────────────────────┘    └──────┬───────┘  │
│                                                       │          │
│                     ┌─────────────────────────────────┘          │
│                     ▼                                            │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │              IQL Trainer                                 │    │
│  │                                                          │    │
│  │   DiscreteQ(s) → [Q(s,a₀), Q(s,a₁), ..., Q(s,a₇)]        │    │
│  │   ValueNet(s)  → V(s)                                    │    │
│  │   PolicyNet(s) → [π(a₀|s), ..., π(a₇|s)]                 │    │
│  │                                                          │    │
│  │   1. V-update: expectile regression on Q(s, a_data)      │    │
│  │   2. Q-update: TD backup using V(s')                     │    │
│  │   3. Policy: advantage-weighted BC (clipped weights)     │    │
│  └──────────────────────────────────────────────────────────┘    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Project Structure

rl_btc_v4/
├── __init__.py          # Package exports
├── __main__.py          # CLI entry point
├── constants.py         # Action space, features, defaults
├── env.py               # Gym-compatible BTC trading environment
├── dataset.py           # Offline RL dataset builder
├── iql_trainer.py       # IQL trainer (v2, fixed)
├── train.py             # Local training script
├── train_cloud.py       # Cloud training (v1, legacy)
├── train_cloud_v2.py    # Cloud training (v2, current)
├── train_hf_job.py      # HF Jobs training script
└── README.md            # This file

Action Space

ID  Name    Type    Side  Fraction  Description
0   HOLD    hold    -     0.00      Keep the current position
1   FLAT    target  -     0.00      Liquidate all positions
2   YES_10  target  YES   0.10      Buy YES with 10% of equity
3   YES_25  target  YES   0.25      Buy YES with 25% of equity
4   YES_50  target  YES   0.50      Buy YES with 50% of equity
5   NO_10   target  NO    0.10      Buy NO with 10% of equity
6   NO_25   target  NO    0.25      Buy NO with 25% of equity
7   NO_50   target  NO    0.50      Buy NO with 50% of equity

The 8 discrete actions are designed for a prediction-market microstructure where:

  • YES shares pay $1 if the market resolves UP, $0 otherwise
  • NO shares pay $1 if the market resolves DOWN, $0 otherwise
  • Positions settle at obs_pos == 4 (the end of the 5m observation window)
  • obs_pos == 4 is settlement-only: the simulator and the replay env do not allow opening or changing positions on the same row that resolves the market.

Feature Engineering

Market Features (38 dimensions)

Category | Features | Purpose
Time | obs_pos, seconds_since_open, seconds_to_close, market_progress, hour_sin, hour_cos | Intra-window and daily seasonality
Prices | yes_bid/ask, no_bid/ask, yes/no_mid, yes/no_spread | Order book state
Arbitrage | microprice_bias, abs_yes_mid_distance_from_even | Cross-market mispricing signals
Returns | price_return_from_open, mark/index_return_from_open_filled, abs_mark_return | Price momentum
Volatility | rolling_vol_15m, rolling_vol_60m | Risk estimation
Flow | taker_buy_ratio, buy_sell_imbalance, imbalance_x_vol | Order flow pressure
Volume | volume, num_trades, rolling_volume_15m, rolling_num_trades_15m | Liquidity signals
Risk-adjusted | return_over_vol_15m/60m, signed_move_x_time_remaining | Sharpe-like signals
Derivatives | funding_rate, funding_rate_prev, oi_delta_5m/15m/60m, long_short_ratio | Market sentiment

Portfolio Features (9 dimensions)

Feature | Description
cash_fraction | Cash / starting_cash
equity_fraction | Total equity / starting_cash
drawdown_fraction | Peak-to-trough drawdown
position_side | +1 (YES), -1 (NO), 0 (flat)
position_fraction | Position value / equity
position_shares | Raw share count
avg_entry_price | Cost basis per share
unrealized_pnl_fraction | PnL / starting_cash
steps_held_fraction | How long the position has been held

State Vector

state = [history_ordered (30 × 38), portfolio_vec (9)]
      = 1140 + 9 = 1149 dimensions (with history_length=30)

Market features are normalized using a StandardScaler fit only on the training-period data. The history is a rolling window of the last 30 observations (chronologically ordered).
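
As an illustration of this layout (a sketch, not the actual dataset-builder code), the flattening can be written as:

import numpy as np

HISTORY_LEN, N_MARKET, N_PORTFOLIO = 30, 38, 9

def build_state(history, portfolio_vec):
    """Illustrative flattening: the oldest-to-newest 30 x 38 market history
    followed by the 9 portfolio features -> a 1149-dim state vector."""
    assert history.shape == (HISTORY_LEN, N_MARKET)
    assert portfolio_vec.shape == (N_PORTFOLIO,)
    return np.concatenate([history.reshape(-1), portfolio_vec]).astype(np.float32)

state = build_state(np.zeros((30, 38)), np.zeros(9))
assert state.shape == (1149,)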


Reward Function

Risk-sensitive reward with multiple penalty terms:

pnl_reward     = clip(log(max(equity_after, floor) / max(equity_before, floor)), -2, 2)
dd_penalty     = 0.50 × max(0, current_dd - prev_dd)
risk_penalty   = 1.0  × max(0, current_dd - prev_dd)²
cvar_penalty   = 1.0  × |pnl_reward| × (1 + current_dd)   [only if pnl_reward < 0]
reward         = clip(pnl_reward - dd_penalty - risk_penalty - cvar_penalty, -4, 4)

Design rationale:

  • PnL reward: Bounded log-equity return with a 1% starting-cash floor, so rare binary-market jackpot payouts do not dominate the critic target
  • Drawdown penalty: Penalizes increasing drawdown (soft: linear + quadratic)
  • CVaR penalty: Amplifies losses during drawdown periods (tail risk) β€” subtracted from reward to penalize negative PnL more severely when already underwater

Fee Model

All trading operations (buy, liquidate, reduce) charge a taker fee based on the Crypto/BTC prediction market schedule:

fee_usdc = shares × fee_rate × price × (1 - price)

Parameter | Value | Source
taker_fee_rate | 0.072 | Crypto category, BTC markets taker fee
maker_fee | 0 | Not modelled (all orders assumed market/taker)
maker_rebate | 20% | Not applied

Quadratic fee shape: highest at price ≈ 0.50 (maximum uncertainty), zero at price = 0 or 1. This matches how prediction-market fees scale with outcome uncertainty.
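
A short sketch of the fee formula (the rate matches the table above; the function name is illustrative):

def taker_fee(shares, price, fee_rate=0.072):
    """Quadratic prediction-market taker fee: maximal near price = 0.50,
    zero at price = 0 or 1."""
    return shares * fee_rate * price * (1.0 - price)

# 100 shares at price 0.50 -> 100 * 0.072 * 0.25 = 1.80 USDC
# 100 shares at price 0.90 -> 100 * 0.072 * 0.09 = 0.648 USDC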

The fee is applied consistently in both:

  • BTCTradingEnv β€” live policy replay, deducts from cash on buys (added to cost) and from proceeds on sells/liquidation
  • dataset.py counterfactual simulator β€” offline training rewards include fee drag so the learned Q-values and policy account for trading costs

To disable fees (e.g. ablation): --taker-fee-rate 0.0 or TAKER_FEE_RATE = 0.0 in constants.py.


IQL Algorithm (v2, fixed)

What was wrong in v1

Component | v1 (buggy) | v2 (fixed)
Q-network | (s, one_hot(a)) → scalar | s → [Q(s,a₀), ..., Q(s,a₇)]
V-update | Q_target(next_s, one_hot(a)) | min(Q₁_target(s), Q₂_target(s)) gathered at dataset actions
Evaluation | return np.mean(rewards) (no policy) | Action agreement + entropy diagnostics
Dataset | Greedy best-action (98.8% HOLD) / random softmax churn | Conservative edge-gated behavioral policy with epsilon exploration
LR | Fixed | CosineAnnealingLR
Stopping | None | Early stopping with best-model restoration

Current Algorithm (per batch)

1. V-update:
   q_target = min(Q₁_target(s), Q₂_target(s)) gathered at dataset actions
   v_loss = expectile_loss(q_target - V(s), τ=0.7)

2. Q-update:
   q_target = r + γ × (1 - done) × V(s')
   q_loss = MSE(Q₁(s,a), q_target) + MSE(Q₂(s,a), q_target)
   soft_update(Q_target ← Q)

3. Policy update (every 2 critic steps):
   advantage = Q(s, a_dataset) - V(s)
   weights = clamp(exp(advantage / temperature), max=100)
   policy_loss = -mean(weights × log π(a_dataset | s))
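
A condensed PyTorch sketch of one training step under this scheme. Network classes, optimizer wiring, and tensor names are illustrative, and the delayed policy update (every 2 critic steps), LR scheduling, and dropout are omitted; the actual implementation lives in iql_trainer.py.

import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss: weights positive errors by tau, negative by 1 - tau
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_step(batch, q1, q2, q1_targ, q2_targ, v_net, policy,
             v_opt, q_opt, pi_opt, gamma=0.99, tau_polyak=0.005,
             expectile=0.7, temperature=3.0, max_weight=100.0):
    s, a, r, s_next, done = batch  # a: (B,) int64; r, done: (B,) float

    # 1. V-update: expectile regression toward min of twin target Q at dataset actions
    with torch.no_grad():
        q_min = torch.min(q1_targ(s), q2_targ(s)).gather(1, a.unsqueeze(1)).squeeze(1)
    v = v_net(s).squeeze(-1)
    v_loss = expectile_loss(q_min - v, expectile)
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 2. Q-update: TD backup using V(s') as the bootstrap target
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next).squeeze(-1)
    q1_sa = q1(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q2_sa = q2(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = F.mse_loss(q1_sa, target) + F.mse_loss(q2_sa, target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Polyak averaging of the target critics
    for net, targ in ((q1, q1_targ), (q2, q2_targ)):
        for p, pt in zip(net.parameters(), targ.parameters()):
            pt.data.mul_(1 - tau_polyak).add_(tau_polyak * p.data)

    # 3. Policy: advantage-weighted behavior cloning with clipped weights
    with torch.no_grad():
        adv = q_min - v_net(s).squeeze(-1)
        weights = torch.clamp(torch.exp(adv / temperature), max=max_weight)
    log_pi = torch.log_softmax(policy(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    pi_loss = -(weights * log_pi).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    return {"v_loss": v_loss.item(), "q_loss": q_loss.item(), "pi_loss": pi_loss.item()}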

Key hyperparameters

Parameter | Cloud Default | Local Default | Rationale
expectile | 0.7 | 0.7 | Moderate upper expectile; balances optimism with robustness
temperature | 3.0 | 3.0 | Low β, so the policy stays close to the behavioral policy (safer for financial data)
gamma | 0.99 | 0.99 | Standard discount for episodic tasks
tau | 0.005 | 0.005 | Slow target-network update for stability
lr | 3e-4 | 3e-4 | Standard Adam LR
batch_size | 512 | 256 | Large batches for stable gradients (cloud); smaller for local memory
hidden_dim | 256 | 256 | 2-layer MLP
dropout | 0.1 | 0.0 | Light regularization (cloud only)
policy_update_freq | 2 | 2 | TD3-style delayed policy updates
early_stopping_patience | 20 evals | 20 evals | Stops if eval reward doesn't improve
behavioral_policy_mode | conservative | conservative | Avoids fee-heavy uniform random trading
min_trade_edge | 0.005 | 0.005 | A directional action must beat HOLD/FLAT by this reward edge
behavioral_epsilon | 0.03 | 0.03 | Keeps limited action support for IQL without dominating the dataset

Dataset Building

Process

  1. Load parquet → filter to obs_pos ∈ {0,1,2,3,4}, sort by time
  2. Fill NaNs → per-column defaults (0.0 for most, 1.0 for long_short_ratio)
  3. Temporal split → last 20% of calendar days held out for test, with a purge/embargo gap of episode_span_days on each side to prevent overlapping windows from leaking information
  4. Fit StandardScaler → normalize market features using train-period data only
  5. Build episodes → sliding windows (span=30d, stride=15d) within each split
  6. Counterfactual simulation → for each step, simulate ALL 8 actions and compute rewards
  7. Settlement guard → obs_pos == 4 can settle existing inventory but cannot open new exposure, preventing same-row outcome leakage
  8. Behavioral policy → pick the best of HOLD/FLAT unless the best directional trade clears min_trade_edge; use small epsilon exploration for action support (see the sketch after this list)
  9. Shard metadata → stores the sampled action distribution and, by default, all_action_rewards for counterfactual supervised training
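
The edge-gated behavioral policy in step 8 could look like the following sketch (function and constant names are illustrative; the simulator's per-action counterfactual rewards are assumed as input):

import numpy as np

HOLD, FLAT = 0, 1  # no-new-risk actions; IDs 2-7 are directional YES/NO entries

def behavioral_action(all_action_rewards, min_trade_edge=0.005, epsilon=0.03, rng=None):
    """Pick a conservative behavioral action given the simulated (counterfactual)
    reward of every action at this state."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # small epsilon keeps action support for IQL without dominating the dataset
        return int(rng.integers(len(all_action_rewards)))
    baseline = max(all_action_rewards[HOLD], all_action_rewards[FLAT])
    best_dir = 2 + int(np.argmax(all_action_rewards[2:]))
    if all_action_rewards[best_dir] >= baseline + min_trade_edge:
        return best_dir  # the directional trade clears the required edge
    return HOLD if all_action_rewards[HOLD] >= all_action_rewards[FLAT] else FLAT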

Data Leakage Prevention

The dataset builder enforces a clean temporal separation:

[--- train days ---][<-- embargo -->][<-- embargo -->][--- test days ---]
   scaler fit here       gap              gap          held out
  • The StandardScaler is fit only on training-period rows.
  • An embargo buffer (default: one full episode_span_days, i.e. 30 days) is removed from both sides of the split boundary so that no 30-day sliding window can straddle the train/test line.
  • Episodes are built independently within each split's day list.

Scale (full config)

  • ~750 episodes from 2263 days of 5m data (before embargo removal)
  • State dim: 1149 (30 Γ— 38 + 9)
  • Action distribution: recorded in shard metadata; reject runs where fee-paying actions are near-uniform without a clear edge

Evaluation

What the trainer provides

The IQL trainer computes diagnostics at evaluation checkpoints:

Metric | What it measures | What it does NOT measure
Agreement with optimal Q-action | How often the policy picks the action with the highest Q-value | Actual trading PnL or Sharpe
Action entropy | Diversity of the policy's action distribution | Out-of-sample performance

These are diagnostics of training convergence, not trading performance metrics. Agreement and entropy tell you whether the policy has learned to distinguish actions, but they do not substitute for a true policy replay on held-out market data.
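
A sketch of how these two diagnostics can be computed from a batch of states, assuming (B, 8) policy logits and Q-values for the same batch (names are illustrative):

import torch

def policy_diagnostics(policy_logits, q_values):
    """policy_logits, q_values: (B, 8) tensors for the same batch of states."""
    probs = torch.softmax(policy_logits, dim=-1)
    # Agreement: how often the policy argmax matches the Q argmax
    agreement = (probs.argmax(dim=-1) == q_values.argmax(dim=-1)).float().mean()
    # Entropy: average diversity of the policy's action distribution (nats)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    return agreement.item(), entropy.item()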

Policy replay

train_local_sharded.py and train_counterfactual_local.py both run held-out replay through BTCTradingEnv when --replay-episodes > 0. The replay metrics (mean_pnl, fees, drawdown, action counts) are the primary acceptance criteria.

Counterfactual Q path

Because the simulator computes rewards for every action at each state, train_counterfactual_local.py can train a direct action-value model on all_action_rewards, rather than discarding that supervision by keeping only one sampled behavioral action per row. Deployment remains edge-gated: choose HOLD/FLAT unless the best directional action beats the no-new-risk baseline by min_trade_edge.
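
A sketch of that supervision: regress an 8-headed value model directly onto the simulated per-action rewards (a simplification of the actual counterfactual trainer):

import torch
import torch.nn.functional as F

def counterfactual_q_loss(model, states, all_action_rewards):
    """states: (B, 1149) float tensor; all_action_rewards: (B, 8) simulated reward
    for every action at each state. Every action provides a supervised target,
    instead of only the single behavioral action sampled per row."""
    pred = model(states)  # (B, 8) predicted per-action values
    return F.mse_loss(pred, all_action_rewards)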


Current State

What works ✅

  • IQL trainer with correct discrete-action architecture
  • Behavioral dataset with diverse actions
  • Counterfactual Q trainer using all-action reward labels
  • Leakage-free temporal split with embargo gap
  • Scaler fit on training data only
  • LR scheduling, early stopping, best-model checkpointing
  • Cloud training scripts with Trackio monitoring
  • HF Hub upload/download

Known Challenges ⚠️

  1. Sparse positive rewards: most 5m windows don't have strong signals. The model needs many epochs to learn meaningful patterns.
  2. High-dimensional state (1149 dims): the 30-step history window creates a large input.
  3. Counterfactual simulation is slow: dataset building takes time for the full config (30-day episodes).
  4. Sparse or no deployment trades: after removing settlement-row leakage, held-out replay may correctly select no-trade if directional actions do not clear fees and drawdown risk.

How to Run

Local training (small config, CPU/MPS)

cd /path/to/doug-data
python -m rl_btc_v4.train

Default local config: batch_size=256, dropout=0.0, epochs=100.

Cloud training (GPU, full config)

python rl_btc_v4/train_cloud_v2.py

Cloud config uses batch_size=512, dropout=0.1.

Via HF Jobs

from huggingface_hub import HfApi

# Upload code to HF Hub first
api = HfApi()
api.upload_folder(
    folder_path="rl_btc_v4",
    repo_id="fbzu/rl_btc_v4_iql",
    repo_type="model",
)

# Then run train_hf_job.py via hf_jobs with GPU hardware

Artifacts

After training, the following are saved to https://huggingface.co/fbzu/rl_btc_v4_iql:

File | Contents
iql_model.pt | Q, V, and Policy network state dicts + config
scaler.npz | Feature normalization stats + reward mean/std
train_report.json | Full training metrics, config, and results
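
To pull the artifacts back down for local evaluation, a minimal sketch using huggingface_hub (the checkpoint keys are an assumption about how iql_model.pt is organized):

import torch
from huggingface_hub import hf_hub_download

repo_id = "fbzu/rl_btc_v4_iql"
model_path = hf_hub_download(repo_id=repo_id, filename="iql_model.pt")
scaler_path = hf_hub_download(repo_id=repo_id, filename="scaler.npz")

checkpoint = torch.load(model_path, map_location="cpu")
# The exact keys depend on how the trainer saved the checkpoint, e.g.:
# policy.load_state_dict(checkpoint["policy"])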

References

  • Kostrikov, I., Nair, A., and Levine, S. (2021). Offline Reinforcement Learning with Implicit Q-Learning. arXiv:2110.06169.
