BTC 5-min Polymarket Predictor

LightGBM classifier that predicts swing events on Polymarket BTC binary markets.
Specifically: given that the YES token price has entered a cheap zone (≤ 0.35),
will it swing back up to ≥ 0.75 before market expiry?

Problem

Polymarket lists 5-minute BTC binary markets (e.g. "Will BTC be above $X at 14:05?").
YES tokens trade between 0 and 1. When YES drops to ≤ 0.35, the market is pricing
a ~35% or lower probability of BTC being above the strike. This model predicts whether
that cheap YES will swing back to ≥ 0.75, a return of at least +114% on the token price (0.35 → 0.75).
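The size of that return depends on the entry price. A quick arithmetic sketch, assuming entry anywhere in the cheap zone and exit exactly at the 0.75 target, with fees and slippage ignored (`swing_return` is a hypothetical helper for illustration, not part of this repo):

```python
def swing_return(entry: float, exit_price: float = 0.75) -> float:
    """Fractional return from buying a YES token at `entry` and selling at `exit_price`."""
    if not 0.0 < entry < 1.0:
        raise ValueError("entry must be strictly between 0 and 1")
    return (exit_price - entry) / entry


# Entry at the top of the cheap zone gives the smallest return; cheaper entries give more.
for entry in (0.35, 0.30, 0.25):
    print(f"entry={entry:.2f}  return={swing_return(entry):+.0%}")
```

The lower the entry inside the cheap zone, the larger the return if the swing happens.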

Model performance (latest)

Metric	Value
CV ROC-AUC (5-fold)	0.706 ± 0.016
Test ROC-AUC	0.715
Test PR-AUC	0.740
Train/test gap	0.100 (healthy)
Precision @0.63 threshold	~85%
EV per trade @0.63	+0.553

Trained on ~3,200 markets across multiple days of Polymarket tick data.

Model versions

Version	CV AUC	Notes
v1	—	initial build, order callback bug
v2	0.597	dropped btc_price, aggregated OB sizes
v3	0.602	dropped hour/minute, added prev_open_mid_delta, heavier regularisation
v4	0.706	4x more data (3,200 markets), volume features dominant, healthy train/test gap
v5–v8	0.706+	incremental retrains on additional data

Features (33 total)

One row per market, captured at the first tick where mid ≤ 0.35.
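As a rough illustration of that sampling rule, the sketch below picks the earliest tick per market whose mid is at or below the cheap-zone bound. It assumes ticks are dicts with `ts`, `market_id` and `mid` keys (names taken from the data format); `first_cheap_ticks` is a hypothetical helper and may differ from what src/features.py actually does.

```python
from typing import Dict, Iterable, List

CHEAP_ZONE = 0.35  # entry bound from the problem statement


def first_cheap_ticks(ticks: Iterable[dict], bound: float = CHEAP_ZONE) -> List[dict]:
    """Return, for each market_id, the earliest tick whose mid is <= bound."""
    chosen: Dict[str, dict] = {}
    for tick in sorted(ticks, key=lambda t: t["ts"]):
        market = tick["market_id"]
        if market not in chosen and tick["mid"] <= bound:
            chosen[market] = tick
    return list(chosen.values())


# Made-up ticks for illustration only.
ticks = [
    {"ts": 1, "market_id": "m1", "mid": 0.52},
    {"ts": 2, "market_id": "m1", "mid": 0.34},  # first cheap tick for m1
    {"ts": 3, "market_id": "m1", "mid": 0.30},
    {"ts": 1, "market_id": "m2", "mid": 0.60},  # m2 never enters the cheap zone
]
rows = first_cheap_ticks(ticks)
print(rows)
```

Markets that never trade at or below the bound contribute no row.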

Market state

Feature	Description
`mid`	Current YES mid price (0–1)
`secs_norm`	Normalised time remaining (1.0 = start, 0.0 = expiry)
`open_mid`	YES price at market open
`price_from_open`	mid - open_mid
`dist_from_low`	mid - running low since open
`spread`	Best ask - best bid
`is_first_tick`	1 if this is the first tick of the market
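A minimal sketch of how the derived market-state values relate to each other, using the definitions in the table above. It assumes a 300-second market and that `secs_norm` is simply seconds remaining divided by market length; `market_state` and its parameters are hypothetical names, not the repo's API.

```python
MARKET_SECONDS = 300  # assumption: a 5-minute market lasts 300 seconds


def market_state(mid, open_mid, running_low, best_bid, best_ask, secs_remaining):
    """Build the derived market-state values for a single tick."""
    return {
        "mid": mid,
        "secs_norm": secs_remaining / MARKET_SECONDS,  # 1.0 = start, 0.0 = expiry
        "open_mid": open_mid,
        "price_from_open": mid - open_mid,
        "dist_from_low": mid - running_low,
        "spread": best_ask - best_bid,
    }


# Made-up values: halfway through the market, YES has fallen from 0.50 to 0.34.
state = market_state(
    mid=0.34, open_mid=0.50, running_low=0.32,
    best_bid=0.33, best_ask=0.35, secs_remaining=150,
)
print(state)
```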

Momentum

Feature	Description
`mom_5s`	Price change over last 5 seconds
`mom_30s`	Price change over last 30 seconds
`mom_accel`	mom_5s - mom_30s (acceleration)
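The momentum columns could be computed along these lines. This sketch assumes one mid per second (as in the data format) and that "price change" means the current mid minus the mid N seconds earlier; `momentum` is a hypothetical helper, and the actual collector may define these differently.

```python
from typing import List, Optional


def momentum(mids: List[float], lookback: int) -> Optional[float]:
    """Price change over the last `lookback` seconds, or None if history is too short."""
    if len(mids) <= lookback:
        return None
    return mids[-1] - mids[-1 - lookback]


# Made-up per-second mids, oldest first (31 samples = 30 seconds of history),
# falling steadily from 0.50 to 0.35.
mids = [0.50 - 0.005 * i for i in range(31)]
mom_5s = momentum(mids, 5)
mom_30s = momentum(mids, 30)
mom_accel = mom_5s - mom_30s
print(mom_5s, mom_30s, mom_accel)
```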

Order book (aggregated)

Feature	Description
`ob_imbalance`	(bid_vol - ask_vol) / total_vol
`depth_ratio`	bid depth / ask depth at top 5 levels
`wall_side`	1=bid wall, -1=ask wall, 0=balanced
`top_bid_size`	Size at best bid
`top_ask_size`	Size at best ask
`total_bid_size`	Sum of bid sizes across 5 levels (engineered)
`total_ask_size`	Sum of ask sizes across 5 levels (engineered)
`size_imbalance`	(total_bid - total_ask) / total (engineered)
`bid_ask_slope`	How steeply the book widens on each side (engineered)

Volume / tape

Feature	Description
`vol_imbalance`	Buy volume - sell volume (normalised)
`vol_delta_rate`	Rate of change of vol_imbalance
`vol_buy_rate`	Buy trades per second
`vol_sell_rate`	Sell trades per second
`vol_acceleration`	Change in total volume rate

BTC reference

Feature	Description
`btc_delta_norm`	BTC price change normalised to market open
`btc_mom_30s`	BTC price momentum over 30 seconds
`btc_aligned`	1 if BTC momentum aligns with YES direction

Streak / history

Feature	Description
`streak_len`	Length of current price streak (ticks)
`streak_dir`	Direction of streak: 1=up, -1=down
`alt_streak`	Alternating tick count (choppiness signal)
`up_ratio_last5`	Fraction of last 5 ticks that were up
`up_ratio_last10`	Fraction of last 10 ticks that were up
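One plausible reading of the streak columns, sketched under assumptions: a "tick" move is the sign of the change between consecutive mids, the streak counts the most recent moves sharing the latest direction, and the up ratio is the share of positive moves among the last five. `streak_features` is a hypothetical helper; the repo's definitions may differ (for example in how flat ticks are handled).

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return 1 for positive, -1 for negative, 0 for zero."""
    return (x > 0) - (x < 0)


def streak_features(mids: List[float]) -> Tuple[int, int, float]:
    """Return (streak_len, streak_dir, up_ratio_last5) from mids ordered oldest first."""
    moves = [sign(b - a) for a, b in zip(mids, mids[1:])]
    if not moves:
        return 0, 0, 0.0
    streak_dir = moves[-1]  # note: a flat last move yields 0 in this sketch
    streak_len = 0
    for move in reversed(moves):
        if move != streak_dir:
            break
        streak_len += 1
    last5 = moves[-5:]
    up_ratio_last5 = sum(m > 0 for m in last5) / len(last5)
    return streak_len, streak_dir, up_ratio_last5


# Made-up mids: one up tick followed by four down ticks.
features = streak_features([0.40, 0.41, 0.39, 0.38, 0.37, 0.36])
print(features)
```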

Cross-market

Feature	Description
`prev_open_mid_delta`	open_mid - prev_open_mid (shift vs prior market)

Engineered features

Derived inside src/features.py from raw order book columns (plus prev_open_mid for the cross-market delta):

total_bid_size      = sum(bid_s1..bid_s5)
total_ask_size      = sum(ask_s1..ask_s5)
size_imbalance      = (total_bid - total_ask) / (total_bid + total_ask)
bid_ask_slope       = (ask_p5 - ask_p1) - (bid_p1 - bid_p5)
prev_open_mid_delta = open_mid - prev_open_mid
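The formulas above translate directly into code. The sketch below applies them to one raw tick held as a dict; `engineer` is a hypothetical helper, not the function exported by src/features.py, and the zero-depth fallback is an assumption.

```python
def engineer(row: dict) -> dict:
    """Return a copy of one raw tick with the derived columns added."""
    total_bid = sum(row[f"bid_s{i}"] for i in range(1, 6))
    total_ask = sum(row[f"ask_s{i}"] for i in range(1, 6))
    total = total_bid + total_ask
    out = dict(row)
    out["total_bid_size"] = total_bid
    out["total_ask_size"] = total_ask
    # Assumption: an empty book maps to 0.0 instead of raising ZeroDivisionError.
    out["size_imbalance"] = (total_bid - total_ask) / total if total else 0.0
    out["bid_ask_slope"] = (row["ask_p5"] - row["ask_p1"]) - (row["bid_p1"] - row["bid_p5"])
    out["prev_open_mid_delta"] = row["open_mid"] - row["prev_open_mid"]
    return out


# Made-up five-level book for illustration only.
bid_prices = [0.34, 0.33, 0.31, 0.30, 0.28]
ask_prices = [0.36, 0.38, 0.40, 0.42, 0.44]
bid_sizes = [100, 80, 60, 40, 20]
ask_sizes = [50, 40, 30, 20, 10]

row = {"open_mid": 0.50, "prev_open_mid": 0.45}
for i in range(5):
    row[f"bid_p{i + 1}"] = bid_prices[i]
    row[f"ask_p{i + 1}"] = ask_prices[i]
    row[f"bid_s{i + 1}"] = bid_sizes[i]
    row[f"ask_s{i + 1}"] = ask_sizes[i]

engineered = engineer(row)
print(engineered["size_imbalance"], engineered["bid_ask_slope"])
```

A positive `bid_ask_slope` here means the ask side spans a wider price range across its five levels than the bid side does.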

Usage

git clone https://huggingface.co/philippotiger/btc-5min-polymarket-predictor
cd btc-5min-polymarket-predictor
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

Train — auto version (increments from highest existing model)

python src/train.py                              # all CSVs in data/raw/
python src/train.py data/raw/my_data.csv         # specific file

Train — explicit version (overwrites if exists)

python src/train.py data/raw/my_data.csv --model v9
python src/train.py --model v9                   # all CSVs, named version

Score a file

python src/predict.py data/raw/new_day.csv
python src/predict.py data/raw/new_day.csv --threshold 0.55
python src/predict.py data/raw/new_day.csv --model v8

Live inference (single tick)

from src.predict import load_model, predict_live

model, feature_cols = load_model("v8")

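# tick_dict: one tick's values as a dict, presumably keyed by the raw column names listed under "Data format"
result = predict_live(tick_dict, model, feature_cols, threshold=0.63)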
if result["signal"]:
    print(f"BUY  conf={result['pred_proba']:.1%}  EV={result['ev_per_trade']:+.3f}")

Sample data

A 24-hour sample of real Polymarket tick data is included at sample_data/sample.csv (~25MB).
This covers ~200 markets and is enough to test the full pipeline end-to-end.

It is not enough to train a competitive model — for that you need several weeks of data
collected from your own Polymarket feed.

# test the pipeline with the sample
python src/predict.py sample_data/sample.csv

Data format

Your CSV needs one row per second per market with these raw columns:

ts, hhmm, market_id, market_seq, is_first_tick, is_last_tick,
secs_norm, mid, spread, open_mid, price_from_open, dist_from_low,
swing_low, swing_high, price_range, dist_to_high,
mom_5s, mom_30s, mom_accel,
bid_p1..5, bid_s1..5, ask_p1..5, ask_s1..5,
ob_imbalance, depth_ratio, wall_side, top_bid_size, top_ask_size,
vol_imbalance, vol_delta_rate, vol_buy_rate, vol_sell_rate, vol_acceleration,
btc_price, btc_open, btc_delta_norm, btc_mom_30s, btc_aligned,
streak_len, streak_dir, alt_streak, up_ratio_last5, up_ratio_last10,
prev_open_mid, outcome, swing_occurred, max_mid_reached
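Before training or scoring, it can help to check that a CSV header carries every column listed above. The sketch below is a standalone check, not part of the repo; it assumes the shorthand `bid_p1..5` expands to `bid_p1` through `bid_p5` (and likewise for `bid_s`, `ask_p`, `ask_s`), and `missing_columns` is a hypothetical helper.

```python
import csv
import io

SCALAR_COLUMNS = """
ts hhmm market_id market_seq is_first_tick is_last_tick
secs_norm mid spread open_mid price_from_open dist_from_low
swing_low swing_high price_range dist_to_high
mom_5s mom_30s mom_accel
ob_imbalance depth_ratio wall_side top_bid_size top_ask_size
vol_imbalance vol_delta_rate vol_buy_rate vol_sell_rate vol_acceleration
btc_price btc_open btc_delta_norm btc_mom_30s btc_aligned
streak_len streak_dir alt_streak up_ratio_last5 up_ratio_last10
prev_open_mid outcome swing_occurred max_mid_reached
""".split()

# Five price levels and five size levels on each side of the book.
LEVEL_COLUMNS = [
    f"{prefix}{level}"
    for prefix in ("bid_p", "bid_s", "ask_p", "ask_s")
    for level in range(1, 6)
]
REQUIRED = SCALAR_COLUMNS + LEVEL_COLUMNS


def missing_columns(handle) -> list:
    """Return the required column names absent from the CSV header read from `handle`."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED if name not in present]


# Illustration: a header that lacks btc_open.
header = ",".join(name for name in REQUIRED if name != "btc_open")
missing = missing_columns(io.StringIO(header + "\n"))
print(missing)
```

For a real file, pass an open handle instead, e.g. `open("data/raw/my_data.csv", newline="")`.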

Thresholds

Threshold	Precision	Trades (per 640 markets)	EV/trade
0.46	63%	282	+0.335
0.53	76%	194	+0.463
0.63	85%	136	+0.553

Do not use thresholds below 0.46 — the 50–60% confidence bucket is unreliable on fresh data.
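To see how a threshold trades off precision against trade count on your own predictions, a sweep like the following may be useful. It is a sketch under assumptions: a trade is taken whenever the predicted probability is at or above the threshold, labels are 1 when the swing occurred, and `precision_at` is a hypothetical helper. The scores below are made up and unrelated to the table above.

```python
from typing import List, Tuple


def precision_at(probas: List[float], labels: List[int], threshold: float) -> Tuple[float, int]:
    """Return (precision, number_of_trades) when trading every row with proba >= threshold."""
    taken = [y for p, y in zip(probas, labels) if p >= threshold]
    if not taken:
        return float("nan"), 0
    return sum(taken) / len(taken), len(taken)


# Made-up scores and outcomes for illustration only (1 = swing occurred).
probas = [0.40, 0.48, 0.55, 0.58, 0.64, 0.70, 0.81, 0.90]
labels = [0, 1, 0, 1, 1, 0, 1, 1]

for threshold in (0.46, 0.53, 0.63):
    precision, trades = precision_at(probas, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.0%}  trades={trades}")
```

Raising the threshold generally means fewer trades, so precision estimates at high thresholds rest on smaller samples.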

Project structure

├── src/
│   ├── features.py        # feature engineering — single source of truth
│   ├── train.py           # training pipeline with live rich dashboard
│   ├── predict.py         # backtest + live inference
│   └── split_last24h.py   # extract time windows for OOS testing
├── results/
│   ├── feature_importance_v4.csv
│   └── training_metrics_v4.csv
├── sample_data/
│   └── sample.csv         # 24h sample (~200 markets) for pipeline testing
├── requirements.txt
└── README.md

License

MIT — use freely, no warranty. Not financial advice.
