BTC 5-min Polymarket Predictor
LightGBM classifier that predicts swing events on Polymarket BTC binary markets.
Specifically: given that the YES token price has entered a cheap zone (≤ 0.35),
will it swing back up to ≥ 0.75 before market expiry?
Problem
Polymarket lists 5-minute BTC binary markets (e.g. "Will BTC be above $X at 14:05?").
YES tokens trade between 0 and 1. When YES drops to ≤ 0.35, the market is pricing
a ~35% or lower probability of BTC being above the strike. This model predicts whether
that cheap YES will swing back above 0.75, a return of roughly +114% on an entry at 0.35 (more for cheaper entries).
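
The target described above can be written down as a small labelling rule. This is a minimal sketch, not the repository's code: the function names are hypothetical, and it assumes the swing is checked only on ticks after the first cheap-zone tick.

```python
from typing import Optional, Sequence

CHEAP_ZONE = 0.35    # entry condition from the problem statement
SWING_TARGET = 0.75  # level the YES mid must reach before expiry


def first_cheap_tick(mids: Sequence[float], cheap: float = CHEAP_ZONE) -> Optional[int]:
    """Index of the first tick whose mid is in the cheap zone, or None."""
    for i, mid in enumerate(mids):
        if mid <= cheap:
            return i
    return None


def swing_label(mids: Sequence[float],
                cheap: float = CHEAP_ZONE,
                target: float = SWING_TARGET) -> Optional[int]:
    """1 if the mid reaches `target` after entering the cheap zone, 0 if it
    does not, None if the market never enters the cheap zone (no sample)."""
    entry = first_cheap_tick(mids, cheap)
    if entry is None:
        return None
    return int(any(mid >= target for mid in mids[entry + 1:]))


def token_return(entry_price: float, exit_price: float) -> float:
    """Fractional return on the token price between entry and exit."""
    return (exit_price - entry_price) / entry_price


if __name__ == "__main__":
    print(swing_label([0.50, 0.34, 0.40, 0.76]))  # swing after a cheap entry
    print(swing_label([0.50, 0.30, 0.20, 0.10]))  # cheap entry, no swing
    print(round(token_return(0.35, 0.75), 3))
```

`token_return(0.35, 0.75)` comes out at about 1.14, and cheaper entries return more for the same 0.75 exit.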
Model performance (latest)
| Metric | Value |
|---|---|
| CV ROC-AUC (5-fold) | 0.706 ± 0.016 |
| Test ROC-AUC | 0.715 |
| Test PR-AUC | 0.740 |
| Train/test gap | 0.100 (healthy) |
| Precision @0.63 threshold | ~85% |
| EV per trade @0.63 | +0.553 |
Trained on ~3,200 markets across multiple days of Polymarket tick data.
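
For readers who want to sanity-check the ROC-AUC figures without extra dependencies, the metric can be computed directly from its pairwise definition. This is a sketch under assumptions: the README does not define "train/test gap", so the helper below assumes it means train AUC minus test AUC, and both function names are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores above a random negative
    (ties count as half). O(P*N), fine for small checks."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def train_test_gap(train_auc: float, test_auc: float) -> float:
    """Assumed definition: how much better the model scores on data it saw."""
    return train_auc - test_auc


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A large positive gap usually points to overfitting; a small one suggests the model generalises to held-out markets.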
Model versions
| Version | CV AUC | Notes |
|---|---|---|
| v1 | n/a | initial build, order callback bug |
| v2 | 0.597 | dropped btc_price, aggregated OB sizes |
| v3 | 0.602 | dropped hour/minute, added prev_open_mid_delta, heavier regularisation |
| v4 | 0.706 | 4x more data (3,200 markets), volume features dominant, healthy train/test gap |
| v5–v8 | 0.706+ | incremental retrains on additional data |
Features (33 total)
One row per market, captured at the first tick where mid ≤ 0.35.
Market state
| Feature | Description |
|---|---|
| `mid` | Current YES mid price (0–1) |
| `secs_norm` | Normalised time remaining (1.0 = start, 0.0 = expiry) |
| `open_mid` | YES price at market open |
| `price_from_open` | mid - open_mid |
| `dist_from_low` | mid - running low since open |
| `spread` | Best ask - best bid |
| `is_first_tick` | 1 if this is the first tick of the market |
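
The market-state columns follow directly from the mid history and the top of book. A minimal sketch, assuming a 300-second market and a hypothetical `market_state` helper that receives all mids since open (this is not the code in src/features.py):

```python
from typing import Dict, Sequence

MARKET_SECONDS = 300  # assumption: a 5-minute market lasts 300 seconds


def market_state(mids_since_open: Sequence[float],
                 best_bid: float,
                 best_ask: float,
                 secs_remaining: float) -> Dict[str, float]:
    """Market-state features for the latest tick.

    `mids_since_open` holds every mid from market open up to and
    including the current tick, oldest first.
    """
    mid = mids_since_open[-1]
    open_mid = mids_since_open[0]
    return {
        "mid": mid,
        "secs_norm": secs_remaining / MARKET_SECONDS,  # 1.0 = start, 0.0 = expiry
        "open_mid": open_mid,
        "price_from_open": mid - open_mid,
        "dist_from_low": mid - min(mids_since_open),   # running low includes this tick
        "spread": best_ask - best_bid,
        "is_first_tick": int(len(mids_since_open) == 1),
    }


if __name__ == "__main__":
    print(market_state([0.50, 0.30, 0.34], best_bid=0.33, best_ask=0.35,
                       secs_remaining=150))
```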
Momentum
| Feature | Description |
|---|---|
| `mom_5s` | Price change over last 5 seconds |
| `mom_30s` | Price change over last 30 seconds |
| `mom_accel` | mom_5s - mom_30s (acceleration) |
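
Because the raw data has one row per second, the momentum columns can be sketched as simple differences over the mid series. The handling of short histories (falling back to the oldest available tick) is an assumption, and the helper names are hypothetical:

```python
from typing import Dict, Sequence


def momentum(mids: Sequence[float], lookback_secs: int) -> float:
    """Price change over the last `lookback_secs` one-second ticks.

    If fewer ticks are available, the oldest tick is used instead
    (an assumption; the real pipeline may handle this differently).
    """
    if not mids:
        raise ValueError("mids must contain at least one tick")
    start = max(0, len(mids) - 1 - lookback_secs)
    return mids[-1] - mids[start]


def momentum_features(mids: Sequence[float]) -> Dict[str, float]:
    mom_5s = momentum(mids, 5)
    mom_30s = momentum(mids, 30)
    return {
        "mom_5s": mom_5s,
        "mom_30s": mom_30s,
        "mom_accel": mom_5s - mom_30s,  # definition from the table above
    }


if __name__ == "__main__":
    series = [0.50] * 30 + [0.40, 0.39, 0.38, 0.37, 0.36, 0.35]
    print(momentum_features(series))
```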
Order book (aggregated)
| Feature | Description |
|---|---|
| `ob_imbalance` | (bid_vol - ask_vol) / total_vol |
| `depth_ratio` | bid depth / ask depth at top 5 levels |
| `wall_side` | 1=bid wall, -1=ask wall, 0=balanced |
| `top_bid_size` | Size at best bid |
| `top_ask_size` | Size at best ask |
| `total_bid_size` | Sum of bid sizes across 5 levels (engineered) |
| `total_ask_size` | Sum of ask sizes across 5 levels (engineered) |
| `size_imbalance` | (total_bid - total_ask) / total (engineered) |
| `bid_ask_slope` | How steeply the book widens on each side (engineered) |
Volume / tape
| Feature | Description |
|---|---|
| `vol_imbalance` | Buy volume - sell volume (normalised) |
| `vol_delta_rate` | Rate of change of vol_imbalance |
| `vol_buy_rate` | Buy trades per second |
| `vol_sell_rate` | Sell trades per second |
| `vol_acceleration` | Change in total volume rate |
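
Three of the tape features can be illustrated from a window of recent trades. Everything about the input here is an assumption: the `(side, size)` tuple format, the window length, and normalising the imbalance by total volume. `vol_delta_rate` and `vol_acceleration` are left out because the README does not define their time base.

```python
from typing import Dict, List, Tuple

Trade = Tuple[str, float]  # hypothetical format: ("buy" | "sell", size)


def tape_features(trades: List[Trade], window_secs: float) -> Dict[str, float]:
    """Volume imbalance and trade rates over one observation window."""
    if window_secs <= 0:
        raise ValueError("window_secs must be positive")
    buy_vol = sum(size for side, size in trades if side == "buy")
    sell_vol = sum(size for side, size in trades if side == "sell")
    buy_count = sum(1 for side, _ in trades if side == "buy")
    sell_count = sum(1 for side, _ in trades if side == "sell")
    total_vol = buy_vol + sell_vol
    return {
        # 0.0 when nothing traded, to avoid dividing by zero (assumption)
        "vol_imbalance": (buy_vol - sell_vol) / total_vol if total_vol else 0.0,
        "vol_buy_rate": buy_count / window_secs,
        "vol_sell_rate": sell_count / window_secs,
    }


if __name__ == "__main__":
    print(tape_features([("buy", 10.0), ("buy", 5.0), ("sell", 5.0)], 10))
```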
BTC reference
| Feature | Description |
|---|---|
| `btc_delta_norm` | BTC price change normalised to market open |
| `btc_mom_30s` | BTC price momentum over 30 seconds |
| `btc_aligned` | 1 if BTC momentum aligns with YES direction |
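
One plausible reading of the BTC reference columns is sketched below. The exact formulas are assumptions: `btc_delta_norm` is taken as the fractional change from `btc_open`, and "aligned" is taken to mean that BTC momentum and YES momentum have the same non-zero sign.

```python
from typing import Dict


def sign(x: float) -> int:
    """-1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def btc_features(btc_price: float,
                 btc_open: float,
                 btc_mom_30s: float,
                 yes_mom_30s: float) -> Dict[str, float]:
    """BTC reference features for one tick (assumed definitions)."""
    btc_dir = sign(btc_mom_30s)
    return {
        "btc_delta_norm": (btc_price - btc_open) / btc_open,
        "btc_mom_30s": btc_mom_30s,
        # aligned only when both series are moving, and in the same direction
        "btc_aligned": int(btc_dir != 0 and btc_dir == sign(yes_mom_30s)),
    }


if __name__ == "__main__":
    print(btc_features(60600.0, 60000.0, btc_mom_30s=50.0, yes_mom_30s=0.02))
```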
Streak / history
| Feature | Description |
|---|---|
| `streak_len` | Length of current price streak (ticks) |
| `streak_dir` | Direction of streak: 1=up, -1=down |
| `alt_streak` | Alternating tick count (choppiness signal) |
| `up_ratio_last5` | Fraction of last 5 ticks that were up |
| `up_ratio_last10` | Fraction of last 10 ticks that were up |
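
The streak columns can be derived from the sequence of tick-to-tick directions. The sketch below is one interpretation, not the repository's implementation: unchanged ticks are ignored for streaks but counted in the up-ratio windows, and `alt_streak` is read as the number of consecutive direction flips ending at the current tick.

```python
from typing import Dict, List, Sequence


def tick_directions(mids: Sequence[float]) -> List[int]:
    """+1, -1 or 0 for each consecutive pair of mids."""
    return [(b > a) - (b < a) for a, b in zip(mids, mids[1:])]


def up_ratio(dirs: Sequence[int], n: int) -> float:
    """Fraction of the last n tick moves that were up (0.0 if no moves)."""
    window = dirs[-n:]
    return sum(1 for d in window if d > 0) / len(window) if window else 0.0


def streak_features(mids: Sequence[float]) -> Dict[str, float]:
    dirs = tick_directions(mids)
    moves = [d for d in dirs if d != 0]  # assumption: flat ticks do not break a streak
    streak_dir = moves[-1] if moves else 0
    streak_len = 0
    for d in reversed(moves):
        if d != streak_dir:
            break
        streak_len += 1
    alt_streak = 0
    for i in range(len(moves) - 1, 0, -1):
        if moves[i] == moves[i - 1]:
            break
        alt_streak += 1  # one more consecutive direction flip
    return {
        "streak_len": streak_len,
        "streak_dir": streak_dir,
        "alt_streak": alt_streak,
        "up_ratio_last5": up_ratio(dirs, 5),
        "up_ratio_last10": up_ratio(dirs, 10),
    }


if __name__ == "__main__":
    print(streak_features([0.40, 0.41, 0.40, 0.39, 0.38, 0.37]))
```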
Cross-market
| Feature | Description |
|---|---|
| `prev_open_mid_delta` | open_mid - prev_open_mid (shift vs prior market) |
Engineered features
Derived inside src/features.py from raw order book columns:
total_bid_size = sum(bid_s1..bid_s5)
total_ask_size = sum(ask_s1..ask_s5)
size_imbalance = (total_bid - total_ask) / (total_bid + total_ask)
bid_ask_slope = (ask_p5 - ask_p1) - (bid_p1 - bid_p5)
prev_open_mid_delta = open_mid - prev_open_mid
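
The formulas above translate almost line for line into Python. This is an illustrative sketch operating on a plain dict per row, not the actual contents of src/features.py; the `engineer` name and the zero-size guard for `size_imbalance` are assumptions.

```python
from typing import Dict

LEVELS = range(1, 6)  # order book levels 1..5


def engineer(row: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `row` with the five engineered columns added."""
    total_bid = sum(row[f"bid_s{i}"] for i in LEVELS)
    total_ask = sum(row[f"ask_s{i}"] for i in LEVELS)
    total = total_bid + total_ask

    out = dict(row)
    out["total_bid_size"] = total_bid
    out["total_ask_size"] = total_ask
    # Guard against an empty book (assumption: treat it as balanced).
    out["size_imbalance"] = (total_bid - total_ask) / total if total else 0.0
    # Ask-side price span minus bid-side price span across the 5 levels.
    out["bid_ask_slope"] = (row["ask_p5"] - row["ask_p1"]) - (row["bid_p1"] - row["bid_p5"])
    out["prev_open_mid_delta"] = row["open_mid"] - row["prev_open_mid"]
    return out


if __name__ == "__main__":
    example = {"open_mid": 0.50, "prev_open_mid": 0.45}
    for i, (bp, bs, ap, as_) in enumerate(
            [(0.34, 10, 0.36, 5), (0.33, 20, 0.37, 5), (0.32, 30, 0.38, 10),
             (0.31, 40, 0.40, 10), (0.30, 50, 0.42, 20)], start=1):
        example.update({f"bid_p{i}": bp, f"bid_s{i}": bs,
                        f"ask_p{i}": ap, f"ask_s{i}": as_})
    print(engineer(example)["size_imbalance"])
```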
Usage
git clone https://huggingface.co/philippotiger/btc-5min-polymarket-predictor
cd btc-5min-polymarket-predictor
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
Train: auto version (increments from highest existing model)
python src/train.py # all CSVs in data/raw/
python src/train.py data/raw/my_data.csv # specific file
Train: explicit version (overwrites if exists)
python src/train.py data/raw/my_data.csv --model v9
python src/train.py --model v9 # all CSVs, named version
Score a file
python src/predict.py data/raw/new_day.csv
python src/predict.py data/raw/new_day.csv --threshold 0.55
python src/predict.py data/raw/new_day.csv --model v8
Live inference (single tick)
from src.predict import load_model, predict_live
model, feature_cols = load_model("v8")
result = predict_live(tick_dict, model, feature_cols, threshold=0.63)
if result["signal"]:
print(f"BUY conf={result['pred_proba']:.1%} EV={result['ev_per_trade']:+.3f}")
Sample data
A 24-hour sample of real Polymarket tick data is included at sample_data/sample.csv (~25MB).
This covers ~200 markets and is enough to test the full pipeline end-to-end.
It is not enough to train a competitive model; for that you need several weeks of data
collected from your own Polymarket feed.
# test the pipeline with the sample
python src/predict.py sample_data/sample.csv
Data format
Your CSV needs one row per second per market with these raw columns:
ts, hhmm, market_id, market_seq, is_first_tick, is_last_tick,
secs_norm, mid, spread, open_mid, price_from_open, dist_from_low,
swing_low, swing_high, price_range, dist_to_high,
mom_5s, mom_30s, mom_accel,
bid_p1..5, bid_s1..5, ask_p1..5, ask_s1..5,
ob_imbalance, depth_ratio, wall_side, top_bid_size, top_ask_size,
vol_imbalance, vol_delta_rate, vol_buy_rate, vol_sell_rate, vol_acceleration,
btc_price, btc_open, btc_delta_norm, btc_mom_30s, btc_aligned,
streak_len, streak_dir, alt_streak, up_ratio_last5, up_ratio_last10,
prev_open_mid, outcome, swing_occurred, max_mid_reached
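
Before training on your own data, it can help to check that a CSV header contains every raw column listed above. The sketch below expands the `bid_p1..5` shorthand and reports anything missing; the helper names are hypothetical and it only checks the header, not the values.

```python
import csv
import re
from typing import List

RAW_COLUMNS_SPEC = """
ts, hhmm, market_id, market_seq, is_first_tick, is_last_tick,
secs_norm, mid, spread, open_mid, price_from_open, dist_from_low,
swing_low, swing_high, price_range, dist_to_high,
mom_5s, mom_30s, mom_accel,
bid_p1..5, bid_s1..5, ask_p1..5, ask_s1..5,
ob_imbalance, depth_ratio, wall_side, top_bid_size, top_ask_size,
vol_imbalance, vol_delta_rate, vol_buy_rate, vol_sell_rate, vol_acceleration,
btc_price, btc_open, btc_delta_norm, btc_mom_30s, btc_aligned,
streak_len, streak_dir, alt_streak, up_ratio_last5, up_ratio_last10,
prev_open_mid, outcome, swing_occurred, max_mid_reached
"""


def expand_columns(spec: str) -> List[str]:
    """Turn the comma-separated spec into column names, expanding 'x1..5'."""
    columns: List[str] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = re.fullmatch(r"(\w+?)(\d+)\.\.(\d+)", token)
        if match:
            prefix, lo, hi = match.group(1), int(match.group(2)), int(match.group(3))
            columns.extend(f"{prefix}{i}" for i in range(lo, hi + 1))
        else:
            columns.append(token)
    return columns


REQUIRED_COLUMNS = expand_columns(RAW_COLUMNS_SPEC)


def missing_columns(header: List[str]) -> List[str]:
    """Required columns that are absent from a CSV header, in spec order."""
    present = set(header)
    return [c for c in REQUIRED_COLUMNS if c not in present]


def read_header(path: str) -> List[str]:
    """First row of a CSV file."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh))
```

For example, `missing_columns(read_header("data/raw/my_data.csv"))` should return an empty list when the file matches the format above.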
Thresholds
| Threshold | Precision | Trades (per 640 markets) | EV/trade |
|---|---|---|---|
| 0.46 | 63% | 282 | +0.335 |
| 0.53 | 76% | 194 | +0.463 |
| 0.63 | 85% | 136 | +0.553 |
Do not use thresholds below 0.46: the 50–60% confidence bucket is unreliable on fresh data.
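
A table like the one above can be rebuilt from any set of scored markets by sweeping the threshold. The sketch below is illustrative only. The README does not say how EV/trade is computed, so `ev_per_trade` takes the win and loss payoffs as explicit parameters. One observation, offered as an inference rather than the author's stated method: each EV/trade value in the table is close to precision minus about 0.30, which is what this formula gives if a win pays out 1 and the average entry price is near 0.30.

```python
from typing import Dict, List, Sequence


def sweep(probas: Sequence[float],
          labels: Sequence[int],
          thresholds: Sequence[float]) -> List[Dict[str, float]]:
    """Trade count and precision at each probability threshold."""
    rows = []
    for t in thresholds:
        picked = [y for p, y in zip(probas, labels) if p >= t]
        trades = len(picked)
        precision = sum(picked) / trades if trades else float("nan")
        rows.append({"threshold": t, "trades": trades, "precision": precision})
    return rows


def ev_per_trade(precision: float, win_payoff: float, loss: float) -> float:
    """Expected value per trade: win `win_payoff` with probability
    `precision`, otherwise lose `loss` (both in token-price units)."""
    return precision * win_payoff - (1.0 - precision) * loss


if __name__ == "__main__":
    probas = [0.90, 0.70, 0.60, 0.50, 0.40]
    labels = [1, 1, 0, 1, 0]
    for row in sweep(probas, labels, [0.46, 0.53, 0.63]):
        print(row)
    # Hypothetical payoffs: entry at 0.297, win treated as a payout of 1.
    print(round(ev_per_trade(0.85, win_payoff=1 - 0.297, loss=0.297), 3))
```

Raising the threshold trades fewer markets at higher precision, which is the trade-off the table shows.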
Project structure
├── src/
│   ├── features.py          # feature engineering, single source of truth
│   ├── train.py             # training pipeline with live rich dashboard
│   ├── predict.py           # backtest + live inference
│   └── split_last24h.py     # extract time windows for OOS testing
├── results/
│   ├── feature_importance_v4.csv
│   └── training_metrics_v4.csv
├── sample_data/
│   └── sample.csv           # 24h sample (~200 markets) for pipeline testing
├── requirements.txt
└── README.md
License
MIT: use freely, no warranty. Not financial advice.