# rl_btc_v4: Offline Implicit Q-Learning for Bitcoin Trading

- Status: v2 (fixed + improved). Trainer runs and produces diagnostics.
- Based on: IQL (Kostrikov et al., 2021)
- Data: 5m Binance OHLCV + derivatives data, 2.2M+ transitions
## Architecture Overview

```text
rl_btc_v4 Pipeline

Parquet Data (5m bars)
        │
        ▼
Dataset Builder (behavioral policy + counterfactuals)
        │
        ▼
(s, a, r, s', d) Transitions
        │
        ▼
IQL Trainer
  DiscreteQ(s)  → [Q(s,a₁), Q(s,a₂), ..., Q(s,aₙ)]
  ValueNet(s)   → V(s)
  PolicyNet(s)  → [π(a₁|s), ..., π(aₙ|s)]

  1. V-update: expectile regression on Q(s, a_data)
  2. Q-update: TD backup using V(s')
  3. Policy:   advantage-weighted BC (clipped weights)
```
## Project Structure

```text
rl_btc_v4/
├── __init__.py         # Package exports
├── __main__.py         # CLI entry point
├── constants.py        # Action space, features, defaults
├── env.py              # Gym-compatible BTC trading environment
├── dataset.py          # Offline RL dataset builder
├── iql_trainer.py      # IQL trainer (v2, fixed)
├── train.py            # Local training script
├── train_cloud.py      # Cloud training (v1, legacy)
├── train_cloud_v2.py   # Cloud training (v2, current)
├── train_hf_job.py     # HF Jobs training script
└── README.md           # This file
```
## Action Space

| ID | Name | Type | Side | Fraction | Description |
|---|---|---|---|---|---|
| 0 | HOLD | hold | - | 0.00 | Keep current position |
| 1 | FLAT | target | - | 0.00 | Liquidate all positions |
| 2 | YES_10 | target | YES | 0.10 | Buy YES at 10% equity |
| 3 | YES_25 | target | YES | 0.25 | Buy YES at 25% equity |
| 4 | YES_50 | target | YES | 0.50 | Buy YES at 50% equity |
| 5 | NO_10 | target | NO | 0.10 | Buy NO at 10% equity |
| 6 | NO_25 | target | NO | 0.25 | Buy NO at 25% equity |
| 7 | NO_50 | target | NO | 0.50 | Buy NO at 50% equity |
8 discrete actions, designed for a prediction market microstructure where:

- YES shares pay $1 if the market resolves UP, $0 otherwise
- NO shares pay $1 if the market resolves DOWN, $0 otherwise
- Positions settle at `obs_pos == 4` (the end of the 5m observation window). `obs_pos == 4` is settlement-only: the simulator and replay env do not allow opening or changing positions on the same row that resolves the market.
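For reference, a minimal sketch of the action table as a Python mapping. The canonical definitions live in `rl_btc_v4/constants.py`; the class and field names below are illustrative assumptions, not the actual code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ActionSpec:
    name: str
    kind: str                # "hold" or "target"
    side: Optional[str]      # "YES", "NO", or None for HOLD/FLAT
    fraction: float          # target position size as a fraction of equity

# Hypothetical mirror of the 8-action table above
ACTIONS = {
    0: ActionSpec("HOLD",   "hold",   None,  0.00),
    1: ActionSpec("FLAT",   "target", None,  0.00),
    2: ActionSpec("YES_10", "target", "YES", 0.10),
    3: ActionSpec("YES_25", "target", "YES", 0.25),
    4: ActionSpec("YES_50", "target", "YES", 0.50),
    5: ActionSpec("NO_10",  "target", "NO",  0.10),
    6: ActionSpec("NO_25",  "target", "NO",  0.25),
    7: ActionSpec("NO_50",  "target", "NO",  0.50),
}
```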
## Feature Engineering

### Market Features (38 dimensions)

| Category | Features | Purpose |
|---|---|---|
| Time | obs_pos, seconds_since_open, seconds_to_close, market_progress, hour_sin, hour_cos | Intra-window and daily seasonality |
| Prices | yes_bid/ask, no_bid/ask, yes/no_mid, yes/no_spread | Order book state |
| Arbitrage | microprice_bias, abs_yes_mid_distance_from_even | Cross-market mispricing signals |
| Returns | price_return_from_open, mark/index_return_from_open_filled, abs_mark_return | Price momentum |
| Volatility | rolling_vol_15m, rolling_vol_60m | Risk estimation |
| Flow | taker_buy_ratio, buy_sell_imbalance, imbalance_x_vol | Order flow pressure |
| Volume | volume, num_trades, rolling_volume_15m, rolling_num_trades_15m | Liquidity signals |
| Risk-adjusted | return_over_vol_15m/60m, signed_move_x_time_remaining | Sharpe-like signals |
| Derivatives | funding_rate, funding_rate_prev, oi_delta_5m/15m/60m, long_short_ratio | Market sentiment |
### Portfolio Features (9 dimensions)

| Feature | Description |
|---|---|
| `cash_fraction` | Cash / starting_cash |
| `equity_fraction` | Total equity / starting_cash |
| `drawdown_fraction` | Peak-to-trough drawdown |
| `position_side` | +1 (YES), -1 (NO), 0 (flat) |
| `position_fraction` | Position value / equity |
| `position_shares` | Raw share count |
| `avg_entry_price` | Cost basis per share |
| `unrealized_pnl_fraction` | PnL / starting_cash |
| `steps_held_fraction` | How long the position has been held |
### State Vector

```text
state = [history_ordered (30 × 38), portfolio_vec (9)]
      = 1140 + 9 = 1149 dimensions (with history_length=30)
```

Market features are normalized using a StandardScaler fit only on the training-period data. The history is a rolling window of the last 30 observations (chronologically ordered).
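A minimal sketch of how such a state vector can be assembled, assuming the scaled history window and portfolio vector are already available as NumPy arrays (the function name and signature are illustrative, not the actual dataset.py code):

```python
import numpy as np

def build_state(history: np.ndarray, portfolio_vec: np.ndarray,
                history_length: int = 30, n_features: int = 38) -> np.ndarray:
    """Concatenate the rolling market-feature window with the portfolio vector.

    history:       (history_length, n_features), oldest row first, already passed
                   through the train-fit StandardScaler.
    portfolio_vec: (9,) portfolio features.
    Returns a flat state of history_length * n_features + 9 dims (1149 by default).
    """
    assert history.shape == (history_length, n_features)
    assert portfolio_vec.shape == (9,)
    return np.concatenate([history.reshape(-1), portfolio_vec]).astype(np.float32)
```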
## Reward Function

Risk-sensitive reward with multiple penalty terms:

```text
pnl_reward   = clip(log(max(equity_after, floor) / max(equity_before, floor)), -2, 2)
dd_penalty   = 0.50 × max(0, current_dd - prev_dd)
risk_penalty = 1.0 × max(0, current_dd - prev_dd)²
cvar_penalty = 1.0 × |pnl_reward| × (1 + current_dd)    [only if pnl_reward < 0]

reward = clip(pnl_reward - dd_penalty - risk_penalty - cvar_penalty, -4, 4)
```

Design rationale:

- PnL reward: bounded log-equity return with a 1% starting-cash floor, so rare binary-market jackpot payouts do not dominate the critic target
- Drawdown penalty: penalizes increasing drawdown (soft: linear + quadratic)
- CVaR penalty: amplifies losses during drawdown periods (tail risk); subtracted from the reward so that negative PnL is penalized more severely when the account is already underwater
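Read literally, the formulas above translate to roughly the following. This is a sketch only; the function and argument names, and the exact placement in env.py/dataset.py, are assumptions.

```python
import numpy as np

def compute_reward(equity_before: float, equity_after: float,
                   prev_dd: float, current_dd: float, starting_cash: float) -> float:
    """Illustrative re-implementation of the reward formula above (names are assumptions)."""
    floor = 0.01 * starting_cash                      # 1% starting-cash floor on the log return
    pnl_reward = float(np.clip(
        np.log(max(equity_after, floor) / max(equity_before, floor)), -2.0, 2.0))

    dd_increase = max(0.0, current_dd - prev_dd)
    dd_penalty = 0.50 * dd_increase                   # linear drawdown penalty
    risk_penalty = 1.0 * dd_increase ** 2             # quadratic drawdown penalty
    cvar_penalty = 1.0 * abs(pnl_reward) * (1.0 + current_dd) if pnl_reward < 0 else 0.0

    return float(np.clip(pnl_reward - dd_penalty - risk_penalty - cvar_penalty, -4.0, 4.0))
```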
## Fee Model

All trading operations (buy, liquidate, reduce) charge a taker fee based on the Crypto/BTC prediction market schedule:

```text
fee_usdc = shares × fee_rate × price × (1 - price)
```

| Parameter | Value | Source |
|---|---|---|
| `taker_fee_rate` | 0.072 | Crypto category, BTC markets taker fee |
| `maker_fee` | 0 | Not modelled (all orders assumed market/taker) |
| `maker_rebate` | 20% | Not applied |

Quadratic fee shape: the fee is highest at price ≈ 0.50 (maximum uncertainty) and zero at price = 0 or 1. This matches how prediction market fees scale with outcome uncertainty.

The fee is applied consistently in both:

- `BTCTradingEnv` (live policy replay): deducts the fee from cash on buys (added to cost) and from proceeds on sells/liquidation
- `dataset.py` counterfactual simulator: offline training rewards include fee drag, so the learned Q-values and policy account for trading costs

To disable fees (e.g. for an ablation): `--taker-fee-rate 0.0` or `TAKER_FEE_RATE = 0.0` in constants.py.
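The fee formula itself is a one-liner; the helper below (an illustrative sketch, not the env.py implementation) makes the quadratic shape explicit:

```python
def taker_fee_usdc(shares: float, price: float, fee_rate: float = 0.072) -> float:
    """Quadratic taker fee: highest near price = 0.50, zero at price = 0 or 1."""
    return shares * fee_rate * price * (1.0 - price)

# Example: 100 shares at a 0.50 price cost 100 * 0.072 * 0.25 = 1.80 USDC in fees,
# while the same size at 0.90 costs only 100 * 0.072 * 0.09 = 0.648 USDC.
```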
## IQL Algorithm (v2, fixed)

### What was wrong in v1

| Component | v1 (buggy) | v2 (fixed) |
|---|---|---|
| Q-network | `(s, one_hot(a)) → scalar` | `s → [Q(s,a₁), ..., Q(s,aₙ)]` |
| V-update | `Q_target(next_s, one_hot(a))` | `min(Q₁_target(s), Q₂_target(s))` gathered at dataset actions |
| Evaluation | `return np.mean(rewards)` (no policy) | Action agreement + entropy diagnostics |
| Dataset | Greedy best-action (98.8% HOLD) / random softmax churn | Conservative edge-gated behavioral policy with epsilon exploration |
| LR | Fixed | CosineAnnealingLR |
| Stopping | None | Early stopping with best-model restoration |
### Current Algorithm (per batch)

```text
1. V-update:
   q_target = min(Q₁_target(s), Q₂_target(s)) gathered at dataset actions
   v_loss   = expectile_loss(q_target - V(s), τ=0.7)

2. Q-update:
   q_target = r + γ × (1 - done) × V(s')
   q_loss   = MSE(Q₁(s,a), q_target) + MSE(Q₂(s,a), q_target)
   soft_update(Q_target ← Q)

3. Policy update (every 2 critic steps):
   advantage   = Q(s, a_dataset) - V(s)
   weights     = clamp(exp(advantage / temperature), max=100)
   policy_loss = -mean(weights × log π(a_dataset | s))
```
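For concreteness, here is a hedged PyTorch sketch of one such batch update. It assumes discrete Q-heads `q1`/`q2` with matching targets, a `value_net`, a `policy_net` that outputs logits, and a dict of optimizers; all names are illustrative, and the delayed policy update (every 2 critic steps) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # Asymmetric L2: weight tau on positive residuals, (1 - tau) on negative ones
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def iql_batch_update(q1, q2, q1_target, q2_target, value_net, policy_net,
                     optimizers, batch, gamma=0.99, tau_expectile=0.7,
                     temperature=3.0, polyak=0.005, max_weight=100.0):
    s, a, r, s_next, done = batch          # a: (B,) long tensor of dataset actions
    a_idx = a.unsqueeze(1)

    # 1. V-update: expectile regression toward min of the target Q-heads at dataset actions
    with torch.no_grad():
        q_t = torch.min(q1_target(s), q2_target(s)).gather(1, a_idx).squeeze(1)
    v = value_net(s).squeeze(-1)
    v_loss = expectile_loss(q_t - v, tau_expectile)
    optimizers["value"].zero_grad(); v_loss.backward(); optimizers["value"].step()

    # 2. Q-update: one-step TD backup through V(s')
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * value_net(s_next).squeeze(-1)
    q_loss = (F.mse_loss(q1(s).gather(1, a_idx).squeeze(1), td_target) +
              F.mse_loss(q2(s).gather(1, a_idx).squeeze(1), td_target))
    optimizers["q"].zero_grad(); q_loss.backward(); optimizers["q"].step()

    # Polyak-averaged target networks (soft_update(Q_target <- Q))
    for net, tgt in ((q1, q1_target), (q2, q2_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - polyak).add_(polyak * p.data)

    # 3. Policy update: advantage-weighted behavior cloning with clipped weights
    with torch.no_grad():
        adv = (torch.min(q1(s), q2(s)).gather(1, a_idx).squeeze(1)
               - value_net(s).squeeze(-1))
        weights = torch.clamp(torch.exp(adv / temperature), max=max_weight)
    log_probs = torch.log_softmax(policy_net(s), dim=-1).gather(1, a_idx).squeeze(1)
    policy_loss = -(weights * log_probs).mean()
    optimizers["policy"].zero_grad(); policy_loss.backward(); optimizers["policy"].step()

    return {"v_loss": v_loss.item(), "q_loss": q_loss.item(), "policy_loss": policy_loss.item()}
```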
### Key hyperparameters

| Parameter | Cloud Default | Local Default | Rationale |
|---|---|---|---|
| `expectile` | 0.7 | 0.7 | Moderate upper expectile: balances optimism with robustness |
| `temperature` | 3.0 | 3.0 | Low β (= 1/temperature): policy stays close to the behavioral policy (safer for financial data) |
| `gamma` | 0.99 | 0.99 | Standard discount for episodic tasks |
| `tau` | 0.005 | 0.005 | Slow target-network update for stability |
| `lr` | 3e-4 | 3e-4 | Standard Adam LR |
| `batch_size` | 512 | 256 | Large batches for stable gradients (cloud); smaller for local memory |
| `hidden_dim` | 256 | 256 | 2-layer MLP |
| `dropout` | 0.1 | 0.0 | Light regularization (cloud only) |
| `policy_update_freq` | 2 | 2 | TD3-style delayed policy updates |
| `early_stopping_patience` | 20 evals | 20 evals | Stops if eval reward doesn't improve |
| `behavioral_policy_mode` | conservative | conservative | Avoids fee-heavy uniform random trading |
| `min_trade_edge` | 0.005 | 0.005 | Directional action must beat HOLD/FLAT by this reward edge |
| `behavioral_epsilon` | 0.03 | 0.03 | Keeps limited action support for IQL without dominating the dataset |
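As a rough illustration of how these defaults fit together, the dicts below mirror the table; the key names are assumptions, and the authoritative values live in constants.py and the training scripts.

```python
# Hypothetical config dicts mirroring the table above; key names are illustrative.
CLOUD_CONFIG = dict(
    expectile=0.7, temperature=3.0, gamma=0.99, tau=0.005, lr=3e-4,
    batch_size=512, hidden_dim=256, dropout=0.1,
    policy_update_freq=2, early_stopping_patience=20,
    behavioral_policy_mode="conservative",
    min_trade_edge=0.005, behavioral_epsilon=0.03,
)

# Local runs differ only in batch size and dropout.
LOCAL_CONFIG = {**CLOUD_CONFIG, "batch_size": 256, "dropout": 0.0}
```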
## Dataset Building

### Process

- Load parquet → filter to `obs_pos ∈ {0,1,2,3,4}`, sort by time
- Fill NaNs → per-column defaults (0.0 for most, 1.0 for long_short_ratio)
- Temporal split → last 20% of calendar days held out for test, with a purge/embargo gap of `episode_span_days` on each side to prevent overlapping windows from leaking information
- Fit StandardScaler → normalize market features using train-period data only
- Build episodes → sliding windows (span=30d, stride=15d) within each split
- Counterfactual simulation → for each step, simulate ALL 8 actions and compute rewards
- Settlement guard → `obs_pos == 4` can settle existing inventory but cannot open new exposure, preventing same-row outcome leakage
- Behavioral policy → pick the best of HOLD/FLAT unless the best directional trade clears `min_trade_edge`; use small epsilon exploration for action support (see the sketch after this list)
- Shard metadata → stores the sampled action distribution and, by default, `all_action_rewards` for counterfactual supervised training
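A minimal sketch of the edge-gated behavioral policy described above, assuming the 8 counterfactual rewards for a step are already computed (the function and argument names are illustrative, not the dataset.py API):

```python
import numpy as np

def behavioral_action(action_rewards: np.ndarray, rng: np.random.Generator,
                      min_trade_edge: float = 0.005, epsilon: float = 0.03) -> int:
    """Conservative edge-gated behavioral policy over the 8 counterfactual rewards.

    action_rewards: (8,) rewards for HOLD, FLAT, YES_10 ... NO_50 at this step.
    """
    if rng.random() < epsilon:                        # epsilon exploration for action support
        return int(rng.integers(len(action_rewards)))
    baseline = action_rewards[:2].max()               # best of HOLD (id 0) / FLAT (id 1)
    best_dir = int(action_rewards[2:].argmax()) + 2   # best directional action (ids 2..7)
    if action_rewards[best_dir] > baseline + min_trade_edge:
        return best_dir
    return int(action_rewards[:2].argmax())           # otherwise stay with HOLD/FLAT
```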
### Data Leakage Prevention

The dataset builder enforces a clean temporal separation:

```text
[--- train days ---][<-- embargo -->][<-- embargo -->][--- test days ---]
  scaler fit here         gap              gap             held out
```

- The `StandardScaler` is fit only on training-period rows.
- An embargo buffer (default: one full `episode_span_days`, i.e. 30 days) is removed from both sides of the split boundary so that no 30-day sliding window can straddle the train/test line.
- Episodes are built independently within each split's day list.
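The split logic can be sketched as follows (illustrative only; the real builder works on the parquet's day column and its own parameter names):

```python
import numpy as np

def temporal_split(days: np.ndarray, test_frac: float = 0.2, embargo_days: int = 30):
    """Purged train/test split over sorted, unique calendar days.

    The last `test_frac` of days is held out, and `embargo_days` are dropped on each
    side of the boundary so no `embargo_days`-long episode window can straddle it.
    """
    n_test = int(len(days) * test_frac)
    boundary = len(days) - n_test
    train_days = days[: max(0, boundary - embargo_days)]
    test_days = days[min(len(days), boundary + embargo_days):]
    return train_days, test_days
```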
### Scale (full config)

- ~750 episodes from 2263 days of 5m data (before embargo removal)
- State dim: 1149 (30 × 38 + 9)
- Action distribution: recorded in shard metadata; reject runs where fee-paying actions are near-uniform without a clear edge
## Evaluation

### What the trainer provides
The IQL trainer computes diagnostics at evaluation checkpoints:
| Metric | What it measures | What it does NOT measure |
|---|---|---|
| Agreement with optimal Q-action | How often the policy picks the action with the highest Q-value | Actual trading PnL or Sharpe |
| Action entropy | Diversity of the policy's action distribution | Out-of-sample performance |
These are diagnostics of training convergence, not trading performance metrics. Agreement and entropy tell you whether the policy has learned to distinguish actions, but they do not substitute for a true policy replay on held-out market data.
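These diagnostics amount to comparing the policy's greedy action with the Q-greedy action and measuring the policy's entropy, roughly as in the sketch below (names are illustrative, not the trainer's actual functions):

```python
import torch

@torch.no_grad()
def policy_diagnostics(policy_net, q1, q2, states: torch.Tensor) -> dict:
    """Agreement with the Q-greedy action and mean policy entropy (illustrative)."""
    probs = torch.softmax(policy_net(states), dim=-1)          # (B, n_actions)
    policy_actions = probs.argmax(dim=-1)
    q_actions = torch.min(q1(states), q2(states)).argmax(dim=-1)

    agreement = (policy_actions == q_actions).float().mean().item()
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1).mean().item()
    return {"q_action_agreement": agreement, "action_entropy": entropy}
```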
### Policy replay

`train_local_sharded.py` and `train_counterfactual_local.py` both run held-out replay through `BTCTradingEnv` when `--replay-episodes > 0`. The replay metrics (mean_pnl, fees, drawdown, action counts) are the primary acceptance criteria.
### Counterfactual Q path

Because the simulator computes rewards for every action at each state, `train_counterfactual_local.py` can train a direct action-value model on `all_action_rewards`. This avoids throwing away supervision by sampling one behavioral action per row. Deployment remains edge-gated: choose HOLD/FLAT unless the best directional action beats the no-new-risk baseline by `min_trade_edge`.
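In sketch form, that path is a regression of the 8 Q-outputs onto the simulated per-action rewards, plus an edge-gated action rule at deployment. Both functions below are illustrative assumptions, not the script's actual code.

```python
import torch
import torch.nn.functional as F

def counterfactual_q_loss(q_net, states: torch.Tensor, all_action_rewards: torch.Tensor) -> torch.Tensor:
    """Regress the 8 Q-outputs directly onto the simulated per-action rewards."""
    return F.mse_loss(q_net(states), all_action_rewards)   # both shaped (B, 8)

@torch.no_grad()
def edge_gated_action(q_net, state: torch.Tensor, min_trade_edge: float = 0.005) -> int:
    """Trade only if the best directional action (ids 2..7) beats the best
    no-new-risk action (HOLD=0 / FLAT=1) by at least min_trade_edge."""
    q = q_net(state.unsqueeze(0)).squeeze(0)                # (8,)
    baseline = q[:2].max()
    best_dir = int(q[2:].argmax()) + 2
    if q[best_dir] > baseline + min_trade_edge:
        return best_dir
    return int(q[:2].argmax())
```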
## Current State

### What works ✅
- IQL trainer with correct discrete-action architecture
- Behavioral dataset with diverse actions
- Counterfactual Q trainer using all-action reward labels
- Leakage-free temporal split with embargo gap
- Scaler fit on training data only
- LR scheduling, early stopping, best model checkpointing
- Cloud training scripts with Trackio monitoring
- HF Hub upload/download
### Known Challenges ⚠️

- Sparse positive rewards: most 5m windows don't have strong signals, so the model needs many epochs to learn meaningful patterns.
- High-dimensional state (1149 dims): the 30-step history window creates a large input.
- Counterfactual simulation is slow: dataset building takes time for the full config (30-day episodes).
- Sparse or no deployment trades: after removing settlement-row leakage, held-out replay may correctly select no-trade if directional actions do not clear fees and drawdown risk.
## How to Run

### Local training (small config, CPU/MPS)

```bash
cd /path/to/doug-data
python -m rl_btc_v4.train
```

Default local config: batch_size=256, dropout=0.0, epochs=100.

### Cloud training (GPU, full config)

```bash
python rl_btc_v4/train_cloud_v2.py
```

The cloud config uses batch_size=512, dropout=0.1.
### Via HF Jobs

```python
from huggingface_hub import HfApi

# Upload the package code to the HF Hub first
api = HfApi()
api.upload_folder(
    folder_path="rl_btc_v4",
    repo_id="fbzu/rl_btc_v4_iql",
    repo_type="model",
)
# Then run train_hf_job.py via hf_jobs with GPU hardware
```
## Artifacts

After training, the following files are saved to https://huggingface.co/fbzu/rl_btc_v4_iql:

| File | Contents |
|---|---|
| `iql_model.pt` | Q, V, and Policy network state dicts + config |
| `scaler.npz` | Feature normalization stats + reward mean/std |
| `train_report.json` | Full training metrics, config, and results |
## References

- IQL paper: Kostrikov et al. (2021), "Offline Reinforcement Learning with Implicit Q-Learning"
- Reference implementation: ikostrikov/implicit_q_learning (JAX/Flax)
- Discrete IQL: Chulabhaya/recurrent-discrete-iql (PyTorch)