# 🧠 Chess Game Length Predictor

▶️ **Project Presentation Video:** https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d

### Predicting whether a chess game will be **short**, **medium**, or **long** using machine learning

This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess. The goal was to predict the **length of a chess game** (in number of half-moves) by classifying each game into one of three categories:

* **0 – Short game**
* **1 – Medium-length game**
* **2 – Long game**

The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.

---

# 📘 Project Overview

Chess games vary greatly in structure and duration. Some games end quickly due to opening traps or early mistakes, while others develop into long positional struggles.

This model predicts a game's *length category* using information available before or shortly after the opening phase, including:

* Player ratings
* Time control increment
* Opening characteristics
* Engineered features describing opening length
* Rating differences and averages

The final exported artifact is a trained **Random Forest Classifier**.

---

# 🔍 Dataset Summary

The dataset contains approximately 19,800 chess games with:

* Game metadata (ID, number of turns, time increment)
* Player ratings (white, black)
* Opening information (short names, ECO-like categories)
* Game outcome and victory status

The target variable (`turns_class`) was created by **binning the number of turns** into three quantile-based classes.

---

# 📊 Exploratory Data Analysis

Key insights included:

### • Game length shows patterns across different opening families.
Some openings consistently lead to longer or shorter games.

### • Player ratings mildly influence length.
Higher-rated matchups tend to produce longer, higher-quality games.
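For concreteness, the quantile binning that produces the `turns_class` target (see the Dataset Summary) can be sketched in plain Python. The function names and the integer-index quantile logic here are illustrative assumptions, not taken from the notebook:

```python
# Sketch: bin game lengths (half-moves) into three quantile-based classes.
# Labels match the README: 0 = short, 1 = medium, 2 = long.

def quantile_thresholds(turns):
    """Cut points splitting `turns` into three roughly equal-count bins."""
    s = sorted(turns)
    return s[len(s) // 3], s[2 * len(s) // 3]

def turns_class(n_turns, thresholds):
    """Map a game's half-move count to class 0, 1, or 2."""
    lo, hi = thresholds
    if n_turns < lo:
        return 0  # short
    if n_turns < hi:
        return 1  # medium
    return 2      # long

# Toy sample of half-move counts (illustrative data only).
games = [23, 41, 57, 62, 70, 88, 95, 104, 120]
th = quantile_thresholds(games)
labels = [turns_class(t, th) for t in games]
```

In the project itself this kind of binning would typically be done with `pandas.qcut`; the sketch above just makes the tertile logic explicit.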
### • Victory status interacts with game duration.
Games ending by resignation or timeout skew shorter.

### • Opening names were highly diverse.
Hundreds of unique openings required careful preprocessing and encoding.

---

# 🛠️ Feature Engineering

Engineered features include:

### ✔ Opening Length Group
`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.

### ✔ Rating Difference & Average
* `rating_diff = abs(white_rating - black_rating)`
* `rating_avg = (white_rating + black_rating) / 2`

### ✔ Draw Indicator
Boolean feature marking whether a game ended in a draw.

### ✔ Optional Opening Clusters
KMeans clustering was used to group openings with similar behavior.

### ✔ One-Hot Encoding
Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.

---

# 🤖 Models Trained

Three models were evaluated:

### 1️⃣ Logistic Regression
Baseline model. Accuracy: ~46–47%.

### 2️⃣ Random Forest Classifier
Best-performing model overall. Accuracy: ~47–48%.

### 3️⃣ Gradient Boosting Classifier
Slightly below Random Forest.

---

# 🏆 Winner: Random Forest

Random Forest achieved the strongest performance due to:

* Good generalization
* Stable accuracy across all classes
* Robust handling of high-dimensional categorical features

This repository hosts the exported model:

```
random_forest_model.pkl
```

---

# 🗂️ Repository Contents

| File | Description |
| ------------------------- | -------------------------------- |
| `random_forest_model.pkl` | Final trained model |
| `README.md` | Project documentation |
| `.gitattributes` | Managed by HuggingFace |
| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |

---

# 🧩 Limitations

* Medium-length games show substantial overlap with other classes.
* High variance in opening names increases sparsity.
* The model is limited by information available early in the game.
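For reference, the hand-engineered features from the Feature Engineering section can be reproduced in a few lines of plain Python. The `rating_diff` and `rating_avg` formulas are exactly as stated above; the cut points in `opening_length_group`, the function names, and the `"draw"` winner value are assumptions for illustration — the notebook's exact bins and field values may differ:

```python
def opening_length_group(opening_moves):
    """Bucket opening length into the four training groups.
    The cut points (3 / 6 / 10 moves) are illustrative assumptions."""
    if opening_moves <= 3:
        return "very_short"
    if opening_moves <= 6:
        return "short"
    if opening_moves <= 10:
        return "medium"
    return "long"

def rating_features(white_rating, black_rating):
    """Rating difference and average, as defined in the README."""
    return {
        "rating_diff": abs(white_rating - black_rating),
        "rating_avg": (white_rating + black_rating) / 2,
    }

def is_draw(winner):
    """Boolean draw indicator; the 'draw' field value is assumed."""
    return winner == "draw"

# Build the feature row for one toy game.
features = rating_features(1850, 1720)
features["opening_group"] = opening_length_group(8)
features["is_draw"] = is_draw("white")
```

Categorical outputs such as `opening_group` would then be one-hot encoded before training, as described above.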
---

# 🙌 Acknowledgments

Thanks to Lichess for providing open game data and to HuggingFace for model hosting.
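As a closing note, here is a sketch of how the exported artifact can be consumed with the standard-library `pickle` module. To keep the snippet self-contained it pickles a stand-in object first; in practice you would open `random_forest_model.pkl` from this repository, and the feature row must match the exact engineered/one-hot encoded columns (and column order) used in the training notebook:

```python
import os
import pickle
import tempfile

class StubModel:
    """Stand-in for the trained RandomForestClassifier (illustration only)."""
    def predict(self, rows):
        return [1 for _ in rows]  # placeholder: always class 1 ("medium")

# In practice: path = "random_forest_model.pkl", downloaded from this repo.
path = os.path.join(tempfile.gettempdir(), "random_forest_model.pkl")
with open(path, "wb") as f:
    pickle.dump(StubModel(), f)

with open(path, "rb") as f:
    model = pickle.load(f)

# The real model expects the notebook's feature columns; this row is a dummy.
pred = model.predict([[1850, 1720, 130, 1785.0]])[0]
```

Only unpickle files from sources you trust — `pickle.load` can execute arbitrary code from a malicious file.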