# 🧠 Chess Game Length Predictor

▶️ **Project Presentation Video:** https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d

### Predicting whether a chess game will be **short**, **medium**, or **long** using machine learning

This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess. The goal was to predict the **length of a chess game** (in number of half-moves) by classifying each game into one of three categories:

* **0 – Short game**
* **1 – Medium-length game**
* **2 – Long game**

The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.

---

# 📘 Project Overview

Chess games vary greatly in structure and duration. Some games end quickly due to opening traps or early mistakes, while others develop into long positional struggles.

This model predicts a game's *length category* using information available before or shortly after the opening phase, including:

* Player ratings
* Time control increment
* Opening characteristics
* Engineered features describing opening length
* Rating differences and averages

The final exported artifact is a trained **Random Forest Classifier**.

---

# 🔍 Dataset Summary

The dataset contains approximately 19,800 chess games with:

* Game metadata (ID, number of turns, time increment)
* Player ratings (white, black)
* Opening information (short names, ECO-like categories)
* Game outcome and victory status

The target variable (`turns_class`) was created by **binning the number of turns** into three quantile-based classes.

---

# 📊 Exploratory Data Analysis

Key insights included:

### • Game length shows patterns across different opening families.
Some openings consistently lead to longer or shorter games.

### • Player ratings mildly influence length.
Higher-rated matchups tend to produce longer, higher-quality games.
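For concreteness, the quantile binning that produces the `turns_class` target (see the Dataset Summary) can be sketched in plain Python. The function names and the integer-index quantile logic here are illustrative assumptions, not taken from the notebook:

```python
# Sketch: bin game lengths (half-moves) into three quantile-based classes.
# Labels match the README: 0 = short, 1 = medium, 2 = long.

def quantile_thresholds(turns):
    """Cut points splitting `turns` into three roughly equal-count bins."""
    s = sorted(turns)
    return s[len(s) // 3], s[2 * len(s) // 3]

def turns_class(n_turns, thresholds):
    """Map a game's half-move count to class 0, 1, or 2."""
    lo, hi = thresholds
    if n_turns < lo:
        return 0  # short
    if n_turns < hi:
        return 1  # medium
    return 2      # long

# Toy sample of half-move counts (illustrative data only).
games = [23, 41, 57, 62, 70, 88, 95, 104, 120]
th = quantile_thresholds(games)
labels = [turns_class(t, th) for t in games]
```

In the project itself this kind of binning would typically be done with `pandas.qcut`; the sketch above just makes the tertile logic explicit.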
### • Victory status interacts with game duration.
Games ending by resignation or timeout skew shorter.

### • Opening names were highly diverse.
Hundreds of unique openings required careful preprocessing and encoding.

---

# 🛠️ Feature Engineering

Engineered features include:

### ✔ Opening Length Group
`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.

### ✔ Rating Difference & Average
* `rating_diff = abs(white_rating - black_rating)`
* `rating_avg = (white_rating + black_rating) / 2`

### ✔ Draw Indicator
Boolean feature marking whether a game ended in a draw.

### ✔ Optional Opening Clusters
KMeans clustering was used to group openings with similar behavior.

### ✔ One-Hot Encoding
Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.

---

# 🤖 Models Trained

Three models were evaluated:

### 1️⃣ Logistic Regression
Baseline model. Accuracy: ~46–47%.

### 2️⃣ Random Forest Classifier
Best-performing model overall. Accuracy: ~47–48%.

### 3️⃣ Gradient Boosting Classifier
Slightly below Random Forest.

---

# 🏆 Winner: Random Forest

Random Forest achieved the strongest performance due to:

* Good generalization
* Stable accuracy across all classes
* Robust handling of high-dimensional categorical features

This repository hosts the exported model:

```
random_forest_model.pkl
```

---

# 🗂️ Repository Contents

| File | Description |
| ------------------------- | -------------------------------- |
| `random_forest_model.pkl` | Final trained model |
| `README.md` | Project documentation |
| `.gitattributes` | Managed by HuggingFace |
| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |

---

# 🧩 Limitations

* Medium-length games show substantial overlap with other classes.
* High variance in opening names increases sparsity.
* The model is limited by information available early in the game.
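For reference, the hand-engineered features from the Feature Engineering section can be reproduced in a few lines of plain Python. The `rating_diff` and `rating_avg` formulas are exactly as stated above; the cut points in `opening_length_group`, the function names, and the `"draw"` winner value are assumptions for illustration — the notebook's exact bins and field values may differ:

```python
def opening_length_group(opening_moves):
    """Bucket opening length into the four training groups.
    The cut points (3 / 6 / 10 moves) are illustrative assumptions."""
    if opening_moves <= 3:
        return "very_short"
    if opening_moves <= 6:
        return "short"
    if opening_moves <= 10:
        return "medium"
    return "long"

def rating_features(white_rating, black_rating):
    """Rating difference and average, as defined in the README."""
    return {
        "rating_diff": abs(white_rating - black_rating),
        "rating_avg": (white_rating + black_rating) / 2,
    }

def is_draw(winner):
    """Boolean draw indicator; the 'draw' field value is assumed."""
    return winner == "draw"

# Build the feature row for one toy game.
features = rating_features(1850, 1720)
features["opening_group"] = opening_length_group(8)
features["is_draw"] = is_draw("white")
```

Categorical outputs such as `opening_group` would then be one-hot encoded before training, as described above.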
---

# 🙌 Acknowledgments

Thanks to Lichess for providing open game data and to HuggingFace for model hosting.
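As a closing note, here is a sketch of how the exported artifact can be consumed with the standard-library `pickle` module. To keep the snippet self-contained it pickles a stand-in object first; in practice you would open `random_forest_model.pkl` from this repository, and the feature row must match the exact engineered/one-hot encoded columns (and column order) used in the training notebook:

```python
import os
import pickle
import tempfile

class StubModel:
    """Stand-in for the trained RandomForestClassifier (illustration only)."""
    def predict(self, rows):
        return [1 for _ in rows]  # placeholder: always class 1 ("medium")

# In practice: path = "random_forest_model.pkl", downloaded from this repo.
path = os.path.join(tempfile.gettempdir(), "random_forest_model.pkl")
with open(path, "wb") as f:
    pickle.dump(StubModel(), f)

with open(path, "rb") as f:
    model = pickle.load(f)

# The real model expects the notebook's feature columns; this row is a dummy.
pred = model.predict([[1850, 1720, 130, 1785.0]])[0]
```

Only unpickle files from sources you trust — `pickle.load` can execute arbitrary code from a malicious file.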