IPL Match Winner Prediction Model

This repository contains machine learning models for predicting IPL (Indian Premier League) cricket match outcomes using historical ball-by-ball data from 2008–2025.

Dataset

Source: prasad-gade05/ipl-enriched-2008-2025 on Hugging Face
Coverage: 278,205 ball-by-ball records across 1,169 matches (2008–2025)
Splits: Time-based train/test (seasons 2008–2023 train, 2024–2025 test)
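
A time-based split keeps every test match later than every training match, so no future information leaks into training. The sketch below shows the idea on plain Python records; the `season` and `match_id` keys and the `split_by_season` helper are assumptions for illustration, not names confirmed by the dataset.

```python
def split_by_season(matches, last_train_season=2023):
    """Split match records chronologically: no shuffling, no overlap."""
    train = [m for m in matches if m["season"] <= last_train_season]
    test = [m for m in matches if m["season"] > last_train_season]
    return train, test


# Hypothetical match records; real rows would come from the dataset.
matches = [
    {"match_id": 1, "season": 2008},
    {"match_id": 2, "season": 2023},
    {"match_id": 3, "season": 2024},
    {"match_id": 4, "season": 2025},
]

train, test = split_by_season(matches)
print(len(train), "train matches,", len(test), "test matches")
```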

Models

1. Pre-Match Winner Prediction (`ipl_model_v2/`)

Predicts the winner before the match starts using engineered features:

Elo ratings for each team
Multi-window rolling form (last 3, 5, 10 matches)
Season-level performance stats
Venue-specific chasing win rates and average scores
Toss impact features
Head-to-head historical records
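
As an illustration of the first feature, here is a minimal sketch of a standard Elo update. The K-factor of 20, the starting rating of 1500, and the function names are assumptions; the repository's actual parameters are not stated here. The ratings returned are the ones held before the match, which is what a pre-match feature should use.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that team A beats team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, team_a, team_b, a_won, k=20.0, base=1500.0):
    """Update ratings in place and return the pre-match ratings.

    k and base are assumed values, chosen only for this sketch.
    """
    ra = ratings.get(team_a, base)
    rb = ratings.get(team_b, base)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_score(ra, rb))
    ratings[team_a] = ra + delta
    ratings[team_b] = rb - delta
    return ra, rb


ratings = {}
pre_a, pre_b = update_elo(
    ratings, "Mumbai Indians", "Chennai Super Kings", a_won=True
)
print(pre_a, pre_b, ratings)
```

Processing matches in date order and recording the returned pre-match values keeps the feature free of information from the match being predicted.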

Models trained:

XGBoost
Random Forest
Gradient Boosting
Logistic Regression
Weighted Ensemble of all four
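
One common way to build a weighted ensemble is to average the predicted win probabilities of the individual models. The sketch below assumes that approach; the weights, the probabilities, and the `weighted_ensemble` helper are hypothetical and are not taken from the trained models.

```python
def weighted_ensemble(probas, weights):
    """Weighted average of per-model win probabilities for one match."""
    if set(probas) != set(weights):
        raise ValueError("probas and weights must cover the same models")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * probas[name] for name in probas) / total


# Hypothetical per-model probabilities that team A wins.
probas = {
    "xgboost": 0.62,
    "random_forest": 0.58,
    "gradient_boosting": 0.60,
    "logistic_regression": 0.55,
}
# Hypothetical weights, e.g. proportional to validation performance.
weights = {
    "xgboost": 0.4,
    "random_forest": 0.2,
    "gradient_boosting": 0.2,
    "logistic_regression": 0.2,
}

p = weighted_ensemble(probas, weights)
print(f"Ensemble win probability: {p:.3f}")
```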

2. In-Match Win Probability (`ipl_inmatch_model/`)

Predicts the win probability during the match at different stages:

After 5 overs: ~67% accuracy, AUC 0.70
After 10 overs: ~69% accuracy, AUC 0.75
After 15 overs: ~82% accuracy, AUC 0.89
After 20 overs: ~92% accuracy, AUC 0.98

Each stage uses the current match state:

Runs scored, wickets lost, run rates
Required run rate vs current run rate
Balls and wickets remaining
Target score
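
The chase-related quantities follow from the usual cricket definitions (run rate is runs per six balls). The sketch below derives them from a second-innings state; the key names mirror those in the Usage section, while the `chase_state` helper and the 120-ball, 10-wicket defaults are assumptions about how the features were computed.

```python
def chase_state(target, t2_runs, t2_balls, t2_wickets,
                total_balls=120, total_wickets=10):
    """Derive chase features from the current second-innings state."""
    runs_remaining = target - t2_runs
    balls_remaining = total_balls - t2_balls
    # Run rates are expressed per over (six balls).
    current_rr = 6.0 * t2_runs / t2_balls if t2_balls > 0 else 0.0
    # With no balls left the rate is undefined; 0.0 is a placeholder choice.
    required_rr = (
        6.0 * runs_remaining / balls_remaining if balls_remaining > 0 else 0.0
    )
    return {
        "runs_remaining": runs_remaining,
        "balls_remaining": balls_remaining,
        "t2_run_rate": current_rr,
        "required_run_rate": required_rr,
        "t2_wickets_remaining": total_wickets - t2_wickets,
    }


state = chase_state(target=180, t2_runs=80, t2_balls=60, t2_wickets=2)
print(state)
```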

Performance Summary

| Model | Stage | Accuracy | AUC-ROC | Log Loss |
|---|---|---|---|---|
| XGBoost | 5 overs | 0.671 | 0.700 | 0.656 |
| XGBoost | 10 overs | 0.685 | 0.754 | 0.648 |
| XGBoost | 15 overs | 0.818 | 0.892 | 0.427 |
| XGBoost | 20 overs | 0.916 | 0.978 | 0.204 |

Baseline accuracy (always predicting the majority class): ~52.4%
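
For reference, accuracy, log loss, and the majority-class baseline can be computed as in the sketch below. The labels and probabilities are made up for illustration, and the 0.5 decision threshold is an assumption; the repository's own evaluation code may differ.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of matches where the thresholded prediction is correct."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the more frequent class."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


# Hypothetical labels (1 = batting-first team won) and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.4, 0.2]

acc = accuracy(y_true, y_prob)
ll = log_loss(y_true, y_prob)
base = majority_baseline_accuracy(y_true)
print(f"accuracy={acc:.3f} log_loss={ll:.3f} baseline={base:.3f}")
```

A constant forecast of 0.5 has a log loss of ln 2 ≈ 0.693, which gives a useful reference point when reading the Log Loss column.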

Files

xgboost_stage_5.pkl — In-match model at 5 overs
xgboost_stage_10.pkl — In-match model at 10 overs
xgboost_stage_15.pkl — In-match model at 15 overs
xgboost_stage_20.pkl — In-match model at 20 overs
team_encoder.pkl — Label encoder for team names
stage_results.json — Evaluation metrics per stage
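
Since there is one model file per stage, a caller has to decide which file applies to the current match state. The helper below is a hypothetical sketch that picks the latest stage already reached; the selection rule and the `stage_model_filename` name are assumptions, and only the filename pattern comes from the list above.

```python
STAGES = (5, 10, 15, 20)


def stage_model_filename(overs_completed):
    """Return the model file for the latest stage already reached."""
    eligible = [s for s in STAGES if s <= overs_completed]
    if not eligible:
        raise ValueError("no stage model is available before 5 overs")
    return f"xgboost_stage_{max(eligible)}.pkl"


print(stage_model_filename(12))
```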

Usage

```python
import joblib
import pandas as pd

# Load the model and the team-name encoder for a specific stage (here: 15 overs)
model = joblib.load("xgboost_stage_15.pkl")
encoder = joblib.load("team_encoder.pkl")

# Build a single-row feature frame describing the current match state.
# The values are illustrative; column names must match those used in training.
features = pd.DataFrame({
    'batting_first_enc': [encoder.transform(["Royal Challengers Bengaluru"])[0]],
    'batting_second_enc': [encoder.transform(["Mumbai Indians"])[0]],
    'toss_decision_bat': [1],
    't1_runs': [120],
    't1_balls': [90],
    't1_wickets': [3],
    't1_boundaries': [12],
    't1_dots': [25],
    't1_run_rate': [8.0],
    't2_runs': [80],
    't2_balls': [60],
    't2_wickets': [2],
    't2_boundaries': [8],
    't2_dots': [20],
    't2_run_rate': [8.0],
    'target': [180],
    'second_innings_active': [1],
    'runs_remaining': [100],
    'balls_remaining': [60],
    'required_run_rate': [10.0],
    'rr_diff': [0.0],
    't2_wickets_remaining': [8],
})

# Column 1 of predict_proba is read as the probability that the team batting first wins
prob = model.predict_proba(features)[0][1]
print(f"Batting first team win probability: {prob:.2%}")
```

License

Dataset license: CC0-1.0 (public domain)

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
