# 🧠 Chess Game Length Predictor
▢️ **Project Presentation Video:** https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d
### Predicting whether a chess game will be **short**, **medium**, or **long** using machine learning
This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess.
The goal was to predict the **length of a chess game** (in number of half-moves) by classifying each game into one of three categories:
* **0 – Short game**
* **1 – Medium-length game**
* **2 – Long game**
The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.
---
# 📘 Project Overview
Chess games vary greatly in structure and duration.
Some games end quickly due to opening traps or mistakes, while others develop into long positional struggles.
This model predicts a game's *length category* using information available before or shortly after the opening phase, including:
* Player ratings
* Time control increment
* Opening characteristics
* Engineered features describing opening length
* Rating differences and averages
The final exported artifact is a trained **Random Forest Classifier**.
---
# 🔍 Dataset Summary
The dataset contains approximately 19,800 chess games with:
* Game metadata (ID, number of turns, time increment)
* Player ratings (white, black)
* Opening information (short names, ECO-like categories)
* Game outcome and victory status
The target variable (`turns_class`) was created by **binning number of turns** into three quantile-based classes.
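
The binning step can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the turn count lives in a column named `turns` and uses `pandas.qcut` on toy data, since the README does not state which function or cut points were used.

```python
import pandas as pd

# Toy stand-in for the games table; `turns` is an assumed column name.
games = pd.DataFrame({"turns": [12, 25, 31, 44, 58, 63, 79, 95, 120]})

# q=3 cuts at the 1/3 and 2/3 quantiles, giving three roughly equal-sized bins.
# The labels follow the README's coding: 0 = short, 1 = medium, 2 = long.
games["turns_class"] = pd.qcut(games["turns"], q=3, labels=[0, 1, 2]).astype(int)

print(games)
```

Because the cut points are quantiles of the data, the three classes come out approximately balanced.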
---
# 📊 Exploratory Data Analysis
Key insights included:
### • Game length shows patterns across different opening families.
Some openings consistently lead to longer or shorter games.
### • Player ratings mildly influence length.
Higher-rated matchups tend to produce longer, higher-quality games.
### • Victory status interacts with game duration.
Games ending by resignation or timeout skew shorter.
### • Opening names were highly diverse.
Hundreds of unique openings required careful preprocessing and encoding.
---
# 🛠️ Feature Engineering
Engineered features include:
### ✔ Opening Length Group
`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.
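
A grouping like this could be done with `pandas.cut`. The bin edges below are hypothetical, chosen only for illustration; the README does not give the actual cut points, and the output column name `opening_length_group` is likewise an assumption.

```python
import pandas as pd

games = pd.DataFrame({"opening_moves": [1, 3, 5, 8, 12, 20]})

# Hypothetical bin edges (right-inclusive): (0, 2], (2, 5], (5, 10], (10, inf).
edges = [0, 2, 5, 10, float("inf")]
labels = ["very_short", "short", "medium", "long"]

games["opening_length_group"] = pd.cut(
    games["opening_moves"], bins=edges, labels=labels
)

print(games)
```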
### ✔ Rating Difference & Average
* `rating_diff = abs(white_rating - black_rating)`
* `rating_avg = (white_rating + black_rating) / 2`
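
In pandas, these two formulas translate directly into vectorized column operations. The rating values below are toy data for illustration.

```python
import pandas as pd

games = pd.DataFrame({
    "white_rating": [1500, 1800, 2100],
    "black_rating": [1450, 2000, 2100],
})

# Absolute rating gap between the two players.
games["rating_diff"] = (games["white_rating"] - games["black_rating"]).abs()

# Mean rating of the pairing, a rough proxy for overall playing strength.
games["rating_avg"] = (games["white_rating"] + games["black_rating"]) / 2

print(games)
```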
### ✔ Draw Indicator
Boolean feature marking whether a game ended in a draw.
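
One way to derive such a flag, assuming the outcome is stored in a `winner` column with the values `"white"`, `"black"`, and `"draw"`. Both the column name and the new `is_draw` name are assumptions, not confirmed by this README.

```python
import pandas as pd

# Assumed outcome encoding: "white", "black", or "draw".
games = pd.DataFrame({"winner": ["white", "draw", "black", "draw"]})

# True where the game was drawn, False otherwise.
games["is_draw"] = games["winner"].eq("draw")

print(games)
```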
### ✔ Optional Opening Clusters
KMeans clustering was used to group openings with similar behavior.
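
The sketch below shows one plausible way to cluster openings by behavior: summarize each opening with a few per-opening averages, standardize them, and run scikit-learn's `KMeans`. The profile features (`turns`, `opening_moves`), the number of clusters, and the toy data are all assumptions; the README does not say which features were clustered.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

games = pd.DataFrame({
    "opening_name": ["Sicilian", "Sicilian", "Italian", "Italian",
                     "Scholar", "Scholar", "French", "French"],
    "turns": [80, 90, 60, 70, 8, 12, 85, 95],
    "opening_moves": [8, 10, 5, 6, 2, 2, 9, 9],
})

# One row per opening: its average game length and average opening length.
profile = games.groupby("opening_name")[["turns", "opening_moves"]].mean()

# Standardize so neither feature dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(profile)

# n_clusters=2 is a hypothetical choice for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
profile["opening_cluster"] = kmeans.fit_predict(scaled)

# Attach each opening's cluster id back onto the individual games.
games["opening_cluster"] = games["opening_name"].map(profile["opening_cluster"])

print(profile)
```

The cluster id can then replace or complement the raw opening name as a much lower-cardinality categorical feature.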
### ✔ One-Hot Encoding
Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.
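
A common way to keep one-hot encoding from exploding in width is to collapse rare categories into a single bucket before encoding. The sketch below illustrates that idea; the `min_count` threshold and the `"other"` bucket name are assumptions, not the notebook's exact filtering rule.

```python
import pandas as pd

games = pd.DataFrame({
    "opening_name": ["Sicilian"] * 3 + ["Italian"] * 2 + ["Rare A", "Rare B"],
})

min_count = 2  # hypothetical frequency threshold

# Keep openings seen at least `min_count` times; relabel the rest as "other".
counts = games["opening_name"].value_counts()
frequent = counts[counts >= min_count].index
games["opening_name"] = games["opening_name"].where(
    games["opening_name"].isin(frequent), "other"
)

# One indicator column per remaining category.
encoded = pd.get_dummies(games, columns=["opening_name"], prefix="opening")

print(encoded.columns.tolist())
```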
---
# 🤖 Models Trained
Three models were evaluated:
### 1️⃣ Logistic Regression
Baseline model.
Accuracy: ~46–47%.
### 2️⃣ Random Forest Classifier
Best-performing model overall.
Accuracy: ~47–48%.
### 3️⃣ Gradient Boosting Classifier
Slightly below Random Forest.
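
A comparison of this kind can be sketched as below. The data here is synthetic (generated with `make_classification`), and the hyperparameters and split are assumptions, so the printed scores will not match the figures above. The point is only the shape of the evaluation loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the encoded chess features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical hyperparameters; the notebook's settings are not listed here.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the same split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```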
---
# 🏆 Winner: Random Forest
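Random Forest was chosen as the final model, although its margin over the Logistic Regression baseline is small (~47–48% vs. ~46–47% accuracy; chance level for three quantile-based classes is roughly 33%). Its advantages were: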
* Good generalization
* Stable accuracy across all classes
* Robust handling of high-dimensional categorical features
This repository hosts the exported model:
```
random_forest_model.pkl
```
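
Loading the file might look like the sketch below. It assumes the model was saved with Python's standard `pickle` module and that scikit-learn is installed, ideally in the version used for training. The helper name `load_model` is hypothetical. Only unpickle files from sources you trust, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_model(path="random_forest_model.pkl"):
    """Unpickle and return the object stored at `path`."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Only attempt the load if the file has been downloaded next to this script.
if Path("random_forest_model.pkl").exists():
    model = load_model()
    # The input must have the same feature columns, in the same order,
    # as the matrix the model was trained on:
    # predictions = model.predict(X_new)  # values in {0, 1, 2}
    print(type(model).__name__)
```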
---
# 🗂️ Repository Contents
| File | Description |
| ------------------------- | -------------------------------- |
| `random_forest_model.pkl` | Final trained model |
| `README.md` | Project documentation |
| `.gitattributes` | Managed by HuggingFace |
| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |
---
# 🧩 Limitations
* Medium-length games show substantial overlap with other classes.
* High variance in opening names increases sparsity.
* Model is limited by information available early in the game.
---
# 🙌 Acknowledgments
Thanks to Lichess for providing open game data and to HuggingFace for model hosting.