# Chess Game Length Predictor

**Project Presentation Video:** https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d

### Predicting whether a chess game will be **short**, **medium**, or **long** using machine learning

This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess.
The goal was to predict the **length of a chess game** (in number of half-moves) by classifying each game into one of three categories:

* **0: Short game**
* **1: Medium-length game**
* **2: Long game**

The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.

---
# Project Overview

Chess games vary greatly in structure and duration.
Some games end quickly due to opening traps or mistakes, while others develop into long positional struggles.

This model predicts a game's *length category* using information available before or shortly after the opening phase, including:

* Player ratings
* Time control increment
* Opening characteristics
* Engineered features describing opening length
* Rating differences and averages

The final exported artifact is a trained **Random Forest Classifier**.

---
# Dataset Summary

The dataset contains approximately 19,800 chess games with:

* Game metadata (ID, number of turns, time increment)
* Player ratings (white, black)
* Opening information (short names, ECO-like categories)
* Game outcome and victory status

The target variable (`turns_class`) was created by **binning the number of turns** into three quantile-based classes.

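As a rough illustration of quantile-based binning, the sketch below uses `pandas.qcut` on a toy table. The column name `turns` and the toy values are assumptions for illustration; the notebook may compute the bins differently.

```python
import pandas as pd

# Toy stand-in for the games table (assumed column name: "turns").
games = pd.DataFrame({"turns": [12, 25, 31, 44, 58, 63, 79, 95, 120]})

# q=3 cuts at the 1/3 and 2/3 quantiles, so each class receives roughly
# the same number of games. Labels 0/1/2 stand for short/medium/long.
games["turns_class"] = pd.qcut(games["turns"], q=3, labels=[0, 1, 2]).astype(int)

print(games["turns_class"].value_counts().sort_index())
```

Because the cut points come from the data itself, the three classes stay roughly balanced, which makes plain accuracy easier to interpret.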

---

# Exploratory Data Analysis

Key insights included:

### Game length shows patterns across different opening families.
Some openings consistently lead to longer or shorter games.

### Player ratings mildly influence length.
Higher-rated matchups tend to produce longer, higher-quality games.

### Victory status interacts with game duration.
Games ending by resignation or timeout skew shorter.

### Opening names were highly diverse.
Hundreds of unique openings required careful preprocessing and encoding.

---
# Feature Engineering

Engineered features include:

### Opening Length Group
`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.

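A minimal sketch of this grouping with `pandas.cut` follows. The bin edges and the output column name `opening_len_group` are hypothetical; the README does not state the actual thresholds.

```python
import pandas as pd

# Toy data; "opening_moves" is the column named in this README.
games = pd.DataFrame({"opening_moves": [1, 3, 4, 6, 9, 14]})

# Hypothetical right-inclusive edges: 1-2, 3-5, 6-10, and 11 or more moves.
edges = [0, 2, 5, 10, float("inf")]
labels = ["very_short", "short", "medium", "long"]

# pd.cut assigns each value to the interval it falls into.
games["opening_len_group"] = pd.cut(games["opening_moves"], bins=edges, labels=labels)

print(games)
```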
### Rating Difference & Average
* `rating_diff = abs(white_rating - black_rating)`
* `rating_avg = (white_rating + black_rating) / 2`

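The two formulas above translate directly into vectorized pandas operations. The ratings below are made-up toy values.

```python
import pandas as pd

# Toy ratings for three games.
games = pd.DataFrame({
    "white_rating": [1500, 1820, 1200],
    "black_rating": [1430, 1900, 1200],
})

# Absolute rating gap between the two players.
games["rating_diff"] = (games["white_rating"] - games["black_rating"]).abs()

# Mean rating of the pairing, a rough proxy for overall playing strength.
games["rating_avg"] = (games["white_rating"] + games["black_rating"]) / 2

print(games)
```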
### Draw Indicator
Boolean feature marking whether a game ended in a draw.

### Optional Opening Clusters
KMeans clustering was used to group openings with similar behavior.

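One way such clustering could look is sketched below. The per-opening profile (mean `opening_moves` and mean `rating_avg`), the number of clusters, and the column names `opening_name` and `opening_cluster` are all assumptions; the README does not say which features defined "similar behavior".

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy games table with two games per opening.
games = pd.DataFrame({
    "opening_name": [
        "Sicilian Defense", "Sicilian Defense",
        "Queen's Gambit", "Queen's Gambit",
        "Italian Game", "Italian Game",
        "Scandinavian Defense", "Scandinavian Defense",
    ],
    "opening_moves": [8, 10, 9, 11, 3, 4, 2, 3],
    "rating_avg": [1880, 1920, 1840, 1860, 1290, 1310, 1240, 1260],
})

# Summarize each opening by the mean of its numeric features.
profile = games.groupby("opening_name")[["opening_moves", "rating_avg"]].mean()

# Scale first so that ratings (large numbers) do not dominate the distance.
scaled = StandardScaler().fit_transform(profile)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
profile["opening_cluster"] = kmeans.fit_predict(scaled)

# Attach the cluster id of each opening back onto the individual games.
games["opening_cluster"] = games["opening_name"].map(profile["opening_cluster"])

print(profile)
```

The cluster id can then replace or complement the raw opening name, which keeps the number of encoded columns small.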
### One-Hot Encoding
Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.

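The filtering step is not described in detail here. One common approach, shown as a sketch, is to collapse rare categories into a single `Other` bucket before encoding. The threshold `MIN_COUNT` and the column names are hypothetical.

```python
import pandas as pd

# Toy data: two frequent openings and two that appear only once.
games = pd.DataFrame({
    "opening_name": ["Sicilian Defense"] * 3
    + ["Italian Game"] * 2
    + ["Bird Opening", "Grob Opening"]
})

MIN_COUNT = 2  # hypothetical threshold for keeping a category

# Keep categories seen at least MIN_COUNT times; label the rest "Other".
counts = games["opening_name"].value_counts()
frequent = counts[counts >= MIN_COUNT].index
games["opening_reduced"] = games["opening_name"].where(
    games["opening_name"].isin(frequent), "Other"
)

# One indicator column per remaining category.
encoded = pd.get_dummies(games["opening_reduced"], prefix="opening")

print(encoded.columns.tolist())
```

With hundreds of distinct openings, this keeps the encoded matrix narrow while still giving frequent openings their own column.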

---

# Models Trained

Three models were evaluated. Because the three classes are quantile-based and therefore roughly balanced, random guessing would score about 33%.

### 1. Logistic Regression
Baseline model.
Accuracy: ~46–47%.

### 2. Random Forest Classifier
Best-performing model overall.
Accuracy: ~47–48%.

### 3. Gradient Boosting Classifier
Slightly below Random Forest.

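A comparison of this kind can be sketched as follows. The data is synthetic (`make_classification`), and the hyperparameters and split are assumptions, so the printed scores say nothing about the chess dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the engineered chess features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Stratify so each class keeps the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model on the same split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```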

---

# Winner: Random Forest

Random Forest achieved the strongest performance due to:

* Good generalization
* Stable accuracy across all classes
* Robust handling of high-dimensional categorical features

This repository hosts the exported model:

```
random_forest_model.pkl
```
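
A minimal sketch of loading the pickle and mapping predictions to readable labels is shown below. The helper names `load_model` and `predict_length` are hypothetical, and the feature columns the model expects are defined in the notebook, not in this README.

```python
import pickle
from pathlib import Path

# Class ids as documented in this README.
CLASS_NAMES = {0: "short", 1: "medium", 2: "long"}


def load_model(path="random_forest_model.pkl"):
    """Unpickle the exported classifier. Only unpickle files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def predict_length(model, features):
    """Return a readable label for each row of `features`.

    `features` must contain the same columns, in the same order,
    that were used when the model was trained.
    """
    return [CLASS_NAMES[int(c)] for c in model.predict(features)]
```

Typical use would be `model = load_model()` followed by `predict_length(model, X_new)`. If the model was fitted on a DataFrame, scikit-learn stores the expected column names in `model.feature_names_in_`. Loading generally works best with the same scikit-learn version that produced the file.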

---

# Repository Contents

| File | Description |
| --- | --- |
| `random_forest_model.pkl` | Final trained model |
| `README.md` | Project documentation |
| `.gitattributes` | Managed by Hugging Face |
| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |

---
# Limitations

* Medium-length games show substantial overlap with other classes.
* The large number of distinct opening names increases feature sparsity.
* The model is limited to information available early in the game.

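Overlap between classes can be inspected with a confusion matrix and per-class recall. The labels below are hand-written toy values chosen only to show the mechanics; they are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Hand-made toy labels (0 = short, 1 = medium, 2 = long).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 0, 1, 1, 2, 2, 2, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Recall per class: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall)
```

In this toy case the middle row spreads into both neighbors, which is the pattern to look for when the medium class is hard to separate.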

---

# Acknowledgments

Thanks to Lichess for providing open game data and to Hugging Face for model hosting.