# 🧠 Chess Game Length Predictor

▶️ **Project Presentation Video:** https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d
*Predicting whether a chess game will be short, medium, or long using machine learning*
This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess. The goal was to predict the length of a chess game (in number of half-moves) by classifying each game into one of three categories:
- 0 → Short game
- 1 → Medium-length game
- 2 → Long game
The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.
## 📌 Project Overview
Chess games vary greatly in structure and duration. Some games end quickly due to opening traps or mistakes, while others develop into long positional struggles.
This model predicts a game's length category using information available before or shortly after the opening phase, including:
- Player ratings
- Time control increment
- Opening characteristics
- Engineered features describing opening length
- Rating differences and averages
The final exported artifact is a trained Random Forest Classifier.
## 📊 Dataset Summary
The dataset contains approximately 19,800 chess games with:
- Game metadata (ID, number of turns, time increment)
- Player ratings (white, black)
- Opening information (short names, ECO-like categories)
- Game outcome and victory status
The target variable (`turns_class`) was created by binning the number of turns into three quantile-based classes.
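The quantile binning described above can be sketched with `pandas.qcut` (toy data; the `turns` and `turns_class` names follow the description, while the real dataset has ~19,800 rows):

```python
import pandas as pd

# Toy sample of game lengths (turns); the real dataset has ~19,800 games
games = pd.DataFrame({"turns": [12, 25, 38, 47, 55, 61, 72, 88, 95, 120]})

# Bin turns into three quantile-based classes: 0 = short, 1 = medium, 2 = long
games["turns_class"] = pd.qcut(games["turns"], q=3, labels=[0, 1, 2]).astype(int)

print(games["turns_class"].value_counts().sort_index())
```

Quantile-based bins keep the three classes roughly balanced, which avoids a trivially skewed target.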
## 🔍 Exploratory Data Analysis
Key insights included:
- **Game length shows patterns across different opening families.** Some openings consistently lead to longer or shorter games.
- **Player ratings mildly influence length.** Higher-rated matchups tend to produce longer, higher-quality games.
- **Victory status interacts with game duration.** Games ending by resignation or timeout skew shorter.
- **Opening names were highly diverse.** Hundreds of unique openings required careful preprocessing and encoding.
## 🛠️ Feature Engineering
Engineered features include:
**✅ Opening Length Group**

`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.
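A minimal sketch of this grouping with `pandas.cut` (the cut points below are illustrative; the project's actual thresholds are not stated in this README):

```python
import pandas as pd

moves = pd.Series([2, 4, 7, 11, 16])  # hypothetical opening_moves values

# Illustrative thresholds -- the project's actual cut points may differ
opening_len_group = pd.cut(
    moves,
    bins=[0, 3, 6, 10, float("inf")],
    labels=["very_short", "short", "medium", "long"],
)
print(list(opening_len_group))  # ['very_short', 'short', 'medium', 'long', 'long']
```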
**✅ Rating Difference & Average**

- `rating_diff = abs(white_rating - black_rating)`
- `rating_avg = (white_rating + black_rating) / 2`
**✅ Draw Indicator**

A Boolean feature marking whether a game ended in a draw.
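The rating features and the draw flag can be derived in a few lines (toy frame; the `winner` column name is an assumption about the Lichess schema):

```python
import pandas as pd

df = pd.DataFrame({
    "white_rating": [1500, 1800],
    "black_rating": [1620, 1750],
    "winner": ["black", "draw"],  # assumed column name for the game outcome
})

# Engineered features from the README: absolute rating gap, average rating,
# and a 0/1 draw indicator
df["rating_diff"] = (df["white_rating"] - df["black_rating"]).abs()
df["rating_avg"] = (df["white_rating"] + df["black_rating"]) / 2
df["is_draw"] = (df["winner"] == "draw").astype(int)
```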
**✅ Optional Opening Clusters**

KMeans clustering was used to group openings with similar behavior.
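One way to cluster openings, sketched with scikit-learn's `KMeans` on made-up per-opening aggregates (the README does not specify which features were actually clustered):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-opening aggregates: [mean turns, mean opening_moves]
opening_stats = np.array([
    [30.0, 4.0],
    [32.0, 5.0],
    [70.0, 10.0],
    [72.0, 11.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(opening_stats)
# Openings with similar average behavior land in the same cluster,
# giving a compact categorical feature in place of hundreds of raw names
```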
**✅ One-Hot Encoding**

Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.
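A plausible version of the filter-then-encode step, assuming rare openings are collapsed into an `Other` bucket before `pd.get_dummies` (the exact filtering rule is not given in this README):

```python
import pandas as pd

df = pd.DataFrame({"opening_name": [
    "Sicilian Defense", "Sicilian Defense", "French Defense", "Rare Gambit",
]})

# Collapse openings seen fewer than min_count times (threshold is illustrative)
min_count = 2
counts = df["opening_name"].value_counts()
common = counts[counts >= min_count].index
df["opening_name"] = df["opening_name"].where(
    df["opening_name"].isin(common), "Other"
)

encoded = pd.get_dummies(df, columns=["opening_name"], prefix="opening")
print(sorted(encoded.columns))
```

Collapsing rare categories first keeps the dummy matrix from exploding to hundreds of near-empty columns.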
## 🤖 Models Trained
Three models were evaluated:
1. **Logistic Regression** (baseline). Accuracy: ~46–47%.
2. **Random Forest Classifier** (best overall). Accuracy: ~47–48%.
3. **Gradient Boosting Classifier**. Accuracy slightly below Random Forest.
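The train-and-compare loop can be sketched as below, with synthetic stand-in data in place of the engineered chess features (hyperparameters here are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the engineered chess features
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```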
## 🏆 Winner: Random Forest
Random Forest achieved the strongest performance due to:
- Good generalization
- Stable accuracy across all classes
- Robust handling of high-dimensional categorical features
This repository hosts the exported model: `random_forest_model.pkl`
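Loading the exported pickle is a one-liner; the round trip below trains a throwaway stand-in model so the snippet is self-contained (in practice you would open the repository's `random_forest_model.pkl` directly and skip the training step):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model; in practice, skip this and open the downloaded pickle
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("random_forest_model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:3]))  # predicted length classes in {0, 1, 2}
```

Note that the feature columns passed to `predict` must match the names and order used at training time.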
## 🗂️ Repository Contents
| File | Description |
|---|---|
| `random_forest_model.pkl` | Final trained model |
| `README.md` | Project documentation |
| `.gitattributes` | Managed by HuggingFace |
| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |
## 🧩 Limitations
- Medium-length games show substantial overlap with other classes.
- High variance in opening names increases sparsity.
- The model is limited to information available early in the game.
## 🙏 Acknowledgments
Thanks to Lichess for providing open game data and to HuggingFace for model hosting.