BTC 5-min Polymarket Predictor

LightGBM classifier that predicts swing events on Polymarket BTC binary markets.
Specifically: given that the YES token price has entered a cheap zone (≤ 0.35),
will it swing back up to ≥ 0.75 before market expiry?


Problem

Polymarket lists 5-minute BTC binary markets (e.g. "Will BTC be above $X at 14:05?").
YES tokens trade between 0 and 1. When YES drops to ≤ 0.35, the market is pricing
a probability of roughly 35% or lower that BTC will be above the strike. This model predicts whether
that cheap YES will swing back up to ≥ 0.75, a return of at least +114% on the token price.
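
The size of the swing follows from simple arithmetic on the entry and exit prices. A minimal sketch, ignoring fees and slippage; the helper name token_return is illustrative and not part of the repository:

```python
def token_return(entry: float, target: float) -> float:
    """Fractional return from buying a token at `entry` and selling at `target`."""
    if entry <= 0:
        raise ValueError("entry price must be positive")
    return (target - entry) / entry


# Entering exactly at the 0.35 boundary and exiting at 0.75:
print(f"{token_return(0.35, 0.75):+.1%}")    # +114.3%
# A cheaper entry yields more for the same exit:
print(f"{token_return(0.3125, 0.75):+.1%}")  # +140.0%
```

Entries below 0.35 only raise the return, so the boundary case is the floor.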


Model performance (latest)

Metric                     Value
CV ROC-AUC (5-fold)        0.706 ± 0.016
Test ROC-AUC               0.715
Test PR-AUC                0.740
Train/test gap             0.100 (healthy)
Precision @0.63 threshold  ~85%
EV per trade @0.63         +0.553

Trained on ~3,200 markets across multiple days of Polymarket tick data.


Model versions

Version  CV AUC  Notes
v1       -       initial build, order callback bug
v2       0.597   dropped btc_price, aggregated OB sizes
v3       0.602   dropped hour/minute, added prev_open_mid_delta, heavier regularisation
v4       0.706   4x more data (3,200 markets), volume features dominant, healthy train/test gap
v5–v8    0.706+  incremental retrains on additional data

Features (33 total)

One row per market, captured at the first tick where mid ≤ 0.35.
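
As a rough illustration of this sampling rule, the sketch below picks the first cheap tick per market from a list of tick dicts. It relies only on the market_id, ts and mid columns from the data format; the function name first_cheap_ticks is hypothetical, and the actual logic in src/features.py may differ.

```python
CHEAP_ZONE = 0.35  # entry threshold stated above


def first_cheap_ticks(rows, cheap=CHEAP_ZONE):
    """Return {market_id: row} for the first tick (by ts) where mid <= cheap.

    Markets that never enter the cheap zone are omitted.
    """
    selected = {}
    for row in sorted(rows, key=lambda r: (r["market_id"], r["ts"])):
        if row["market_id"] not in selected and row["mid"] <= cheap:
            selected[row["market_id"]] = row
    return selected


ticks = [
    {"market_id": "A", "ts": 1, "mid": 0.50},
    {"market_id": "A", "ts": 2, "mid": 0.34},  # first cheap tick for A
    {"market_id": "A", "ts": 3, "mid": 0.30},
    {"market_id": "B", "ts": 1, "mid": 0.60},  # B never enters the cheap zone
]
entries = first_cheap_ticks(ticks)
print(entries["A"]["ts"], "B" in entries)  # 2 False
```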

Market state

Feature          Description
mid              Current YES mid price (0–1)
secs_norm        Normalised time remaining (1.0 = start, 0.0 = expiry)
open_mid         YES price at market open
price_from_open  mid - open_mid
dist_from_low    mid - running low since open
spread           Best ask - best bid
is_first_tick    1 if this is the first tick of the market

Momentum

Feature    Description
mom_5s     Price change over last 5 seconds
mom_30s    Price change over last 30 seconds
mom_accel  mom_5s - mom_30s (acceleration)
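
A minimal sketch of how these three values could be derived, assuming one mid price per second (as in the data format) and assuming that "price change over last N seconds" means the current mid minus the mid N seconds earlier. The source does not spell out the exact definition, and the function names are illustrative.

```python
def momentum(mids, lookback):
    """Price change over the last `lookback` seconds, given one mid per second.

    Assumption: when history is shorter than `lookback`, the oldest
    available mid is used as the reference.
    """
    if not mids:
        raise ValueError("need at least one mid price")
    past = mids[max(0, len(mids) - 1 - lookback)]
    return mids[-1] - past


def momentum_features(mids):
    mom_5s = momentum(mids, 5)
    mom_30s = momentum(mids, 30)
    return {"mom_5s": mom_5s, "mom_30s": mom_30s, "mom_accel": mom_5s - mom_30s}


# 31 seconds of a steady 0.01-per-second decline from 0.60 to 0.30.
mids = [0.60 - 0.01 * i for i in range(31)]
feats = momentum_features(mids)
print({name: round(value, 2) for name, value in feats.items()})
```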

Order book (aggregated)

Feature         Description
ob_imbalance    (bid_vol - ask_vol) / total_vol
depth_ratio     bid depth / ask depth at top 5 levels
wall_side       1=bid wall, -1=ask wall, 0=balanced
top_bid_size    Size at best bid
top_ask_size    Size at best ask
total_bid_size  Sum of bid sizes across 5 levels (engineered)
total_ask_size  Sum of ask sizes across 5 levels (engineered)
size_imbalance  (total_bid - total_ask) / total (engineered)
bid_ask_slope   How steeply the book widens on each side (engineered)

Volume / tape

Feature           Description
vol_imbalance     Buy volume - sell volume (normalised)
vol_delta_rate    Rate of change of vol_imbalance
vol_buy_rate      Buy trades per second
vol_sell_rate     Sell trades per second
vol_acceleration  Change in total volume rate

BTC reference

Feature         Description
btc_delta_norm  BTC price change normalised to market open
btc_mom_30s     BTC price momentum over 30 seconds
btc_aligned     1 if BTC momentum aligns with YES direction

Streak / history

Feature          Description
streak_len       Length of current price streak (ticks)
streak_dir       Direction of streak: 1=up, -1=down
alt_streak       Alternating tick count (choppiness signal)
up_ratio_last5   Fraction of last 5 ticks that were up
up_ratio_last10  Fraction of last 10 ticks that were up
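
One plausible reading of these definitions, sketched in plain Python. All of the following are assumptions, since the source gives only the one-line descriptions above: a tick's direction is the sign of its mid change, the streak is the trailing run of same-direction ticks, and alt_streak counts trailing ticks that each reverse the previous one. The function names are illustrative.

```python
def tick_directions(mids):
    """+1 for an up tick, -1 for a down tick, 0 for no change."""
    return [(b > a) - (b < a) for a, b in zip(mids, mids[1:])]


def streak_features(mids):
    dirs = tick_directions(mids)
    if not dirs:
        raise ValueError("need at least two mid prices")

    # streak_len / streak_dir: trailing run of ticks moving the same way.
    streak_dir = dirs[-1]
    streak_len = 0
    for d in reversed(dirs):
        if d != streak_dir:
            break
        streak_len += 1

    # alt_streak: trailing run of ticks that each reverse the previous one.
    alt_streak = 0
    for prev, cur in zip(reversed(dirs[:-1]), reversed(dirs[1:])):
        if cur == 0 or cur != -prev:
            break
        alt_streak += 1

    def up_ratio(n):
        window = dirs[-n:]  # shorter than n early in a market
        return sum(d > 0 for d in window) / len(window)

    return {
        "streak_len": streak_len,
        "streak_dir": streak_dir,
        "alt_streak": alt_streak,
        "up_ratio_last5": up_ratio(5),
        "up_ratio_last10": up_ratio(10),
    }


feats = streak_features([0.40, 0.38, 0.36, 0.37, 0.36, 0.35, 0.34])
print(feats["streak_len"], feats["streak_dir"], feats["alt_streak"],
      feats["up_ratio_last5"])  # 3 -1 0 0.2
```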

Cross-market

Feature              Description
prev_open_mid_delta  open_mid - prev_open_mid (shift vs prior market)

Engineered features

Derived inside src/features.py from raw order book columns:

total_bid_size      = sum(bid_s1..bid_s5)
total_ask_size      = sum(ask_s1..ask_s5)
size_imbalance      = (total_bid - total_ask) / (total_bid + total_ask)
bid_ask_slope       = (ask_p5 - ask_p1) - (bid_p1 - bid_p5)
prev_open_mid_delta = open_mid - prev_open_mid
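
The formulas above translate directly into Python. This is a sketch over a single tick held as a dict, not the code in src/features.py; the function name engineer_features and the zero fallback for an empty book are assumptions.

```python
def engineer_features(row):
    """Derive the five engineered columns from one raw tick (a dict)."""
    levels = range(1, 6)
    total_bid = sum(row[f"bid_s{i}"] for i in levels)
    total_ask = sum(row[f"ask_s{i}"] for i in levels)
    total = total_bid + total_ask
    return {
        "total_bid_size": total_bid,
        "total_ask_size": total_ask,
        # Assumption: an empty book maps to 0.0 instead of dividing by zero.
        "size_imbalance": (total_bid - total_ask) / total if total else 0.0,
        "bid_ask_slope": (row["ask_p5"] - row["ask_p1"])
                         - (row["bid_p1"] - row["bid_p5"]),
        "prev_open_mid_delta": row["open_mid"] - row["prev_open_mid"],
    }


# Toy values for illustration only.
row = {
    "bid_p1": 0.34, "bid_p5": 0.30, "ask_p1": 0.36, "ask_p5": 0.42,
    "open_mid": 0.50, "prev_open_mid": 0.45,
}
row.update({f"bid_s{i}": s for i, s in enumerate((100, 80, 60, 40, 20), start=1)})
row.update({f"ask_s{i}": s for i, s in enumerate((40, 30, 15, 10, 5), start=1)})

feats = engineer_features(row)
print(feats["total_bid_size"], feats["total_ask_size"], feats["size_imbalance"])
# 300 100 0.5
```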

Usage

git clone https://huggingface.co/philippotiger/btc-5min-polymarket-predictor
cd btc-5min-polymarket-predictor
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

Train: auto version (increments from highest existing model)

python src/train.py                              # all CSVs in data/raw/
python src/train.py data/raw/my_data.csv         # specific file
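
The auto-versioning rule could be sketched as follows. The source does not show how train.py discovers existing models, so this hypothetical helper takes a plain list of version labels such as "v8"; starting at v1 when none exist is also an assumption.

```python
import re


def next_version(existing):
    """Return the label one above the highest 'vN' in `existing`.

    Labels that do not match 'v<digits>' are ignored.
    Assumption: with no existing versions, numbering starts at v1.
    """
    numbers = [int(m.group(1)) for name in existing
               if (m := re.fullmatch(r"v(\d+)", name))]
    return f"v{max(numbers, default=0) + 1}"


print(next_version(["v1", "v4", "v8"]))  # v9
print(next_version([]))                  # v1
```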

Train: explicit version (overwrites if exists)

python src/train.py data/raw/my_data.csv --model v9
python src/train.py --model v9                   # all CSVs, named version

Score a file

python src/predict.py data/raw/new_day.csv
python src/predict.py data/raw/new_day.csv --threshold 0.55
python src/predict.py data/raw/new_day.csv --model v8

Live inference (single tick)

from src.predict import load_model, predict_live

model, feature_cols = load_model("v8")

result = predict_live(tick_dict, model, feature_cols, threshold=0.63)
if result["signal"]:
    print(f"BUY  conf={result['pred_proba']:.1%}  EV={result['ev_per_trade']:+.3f}")

Sample data

A 24-hour sample of real Polymarket tick data is included at sample_data/sample.csv (~25MB).
This covers ~200 markets and is enough to test the full pipeline end-to-end.

It is not enough to train a competitive model; for that you need several weeks of data
collected from your own Polymarket feed.

# test the pipeline with the sample
python src/predict.py sample_data/sample.csv

Data format

Your CSV needs one row per second per market with these raw columns:

ts, hhmm, market_id, market_seq, is_first_tick, is_last_tick,
secs_norm, mid, spread, open_mid, price_from_open, dist_from_low,
swing_low, swing_high, price_range, dist_to_high,
mom_5s, mom_30s, mom_accel,
bid_p1..5, bid_s1..5, ask_p1..5, ask_s1..5,
ob_imbalance, depth_ratio, wall_side, top_bid_size, top_ask_size,
vol_imbalance, vol_delta_rate, vol_buy_rate, vol_sell_rate, vol_acceleration,
btc_price, btc_open, btc_delta_norm, btc_mom_30s, btc_aligned,
streak_len, streak_dir, alt_streak, up_ratio_last5, up_ratio_last10,
prev_open_mid, outcome, swing_occurred, max_mid_reached
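
Before training or scoring, it can help to check a CSV header against this list. The sketch below uses only the standard library and expands the bid_p1..5 shorthand into explicit names; missing_columns is a hypothetical helper, not something the repository provides.

```python
import csv
import io

BASE_COLUMNS = """
ts hhmm market_id market_seq is_first_tick is_last_tick
secs_norm mid spread open_mid price_from_open dist_from_low
swing_low swing_high price_range dist_to_high
mom_5s mom_30s mom_accel
ob_imbalance depth_ratio wall_side top_bid_size top_ask_size
vol_imbalance vol_delta_rate vol_buy_rate vol_sell_rate vol_acceleration
btc_price btc_open btc_delta_norm btc_mom_30s btc_aligned
streak_len streak_dir alt_streak up_ratio_last5 up_ratio_last10
prev_open_mid outcome swing_occurred max_mid_reached
""".split()

# bid_p1..5, bid_s1..5, ask_p1..5, ask_s1..5
BOOK_COLUMNS = [f"{side}_{kind}{level}"
                for side in ("bid", "ask")
                for kind in ("p", "s")
                for level in range(1, 6)]

REQUIRED = BASE_COLUMNS + BOOK_COLUMNS


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, in list order."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED if col not in present]


# Toy header with two columns deliberately left out.
header = [c for c in REQUIRED if c not in ("btc_open", "outcome")]
sample = io.StringIO(",".join(header) + "\n")
print(missing_columns(sample))  # ['btc_open', 'outcome']
```

For a real file, pass an open handle, for example open("sample_data/sample.csv", newline="").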

Thresholds

Threshold  Precision  Trades (per 640 markets)  EV/trade
0.46       63%        282                       +0.335
0.53       76%        194                       +0.463
0.63       85%        136                       +0.553

Do not use thresholds below 0.46: the 50–60% confidence bucket is unreliable on fresh data.
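
To build a table like this for your own data, precision and trade count at a given threshold can be computed from scored predictions. A minimal sketch on made-up numbers, assuming a trade is taken when the predicted probability is at or above the threshold (the exact comparison used by predict.py is not shown in this README):

```python
def precision_at_threshold(probas, labels, threshold):
    """Return (precision, n_trades) for predictions with proba >= threshold.

    `labels` are 1 if the swing occurred, else 0. Precision is None when
    no prediction clears the threshold.
    """
    taken = [y for p, y in zip(probas, labels) if p >= threshold]
    if not taken:
        return None, 0
    return sum(taken) / len(taken), len(taken)


# Toy scores and outcomes, not real model output.
probas = [0.70, 0.65, 0.60, 0.50, 0.48, 0.30]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.46, 0.53, 0.63):
    precision, trades = precision_at_threshold(probas, labels, threshold)
    print(f"{threshold:.2f}  precision={precision:.0%}  trades={trades}")
```

Raising the threshold trades fewer markets for higher precision, which is the trade-off the table summarises.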


Project structure

├── src/
│   ├── features.py        # feature engineering (single source of truth)
│   ├── train.py           # training pipeline with live rich dashboard
│   ├── predict.py         # backtest + live inference
│   └── split_last24h.py   # extract time windows for OOS testing
├── results/
│   ├── feature_importance_v4.csv
│   └── training_metrics_v4.csv
├── sample_data/
│   └── sample.csv         # 24h sample (~200 markets) for pipeline testing
├── requirements.txt
└── README.md

License

MIT: use freely, no warranty. Not financial advice.
