	# 🧠 Chess Game Length Predictor

	▶️ Project Presentation Video: https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d


	### Predicting whether a chess game will be short, medium, or long using machine learning

	This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess.
	The goal was to predict the length of a chess game (in number of half-moves) by classifying each game into one of three categories:

	* 0 – Short game
	* 1 – Medium-length game
	* 2 – Long game

	The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.

	---

	# 📘 Project Overview

	Chess games vary greatly in structure and duration.
	Some games end quickly due to opening traps or mistakes, while others develop into long positional struggles.

	This model predicts a game’s length category using information available before or shortly after the opening phase, including:

	* Player ratings
	* Time control increment
	* Opening characteristics
	* Engineered features describing opening length
	* Rating differences and averages

	The final exported artifact is a trained Random Forest Classifier.

	---

	# 🔍 Dataset Summary

	The dataset contains approximately 19,800 chess games with:

	* Game metadata (ID, number of turns, time increment)
	* Player ratings (white, black)
	* Opening information (short names, ECO-like categories)
	* Game outcome and victory status

	The target variable (`turns_class`) was created by binning the number of turns into three quantile-based classes, so each class holds roughly a third of the games.
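The quantile binning described above can be sketched with `pandas.qcut`. This is a minimal illustration, not the notebook's actual code: the sample values are made up and the `turns` column name is an assumption.

```python
import pandas as pd

# Made-up game lengths (number of turns); not the real dataset.
games = pd.DataFrame(
    {"turns": [12, 25, 33, 41, 48, 56, 63, 77, 90, 120, 150, 200]}
)

# Split into three quantile-based bins: 0 = short, 1 = medium, 2 = long.
games["turns_class"] = pd.qcut(games["turns"], q=3, labels=[0, 1, 2]).astype(int)

# Quantile bins give roughly equal class sizes.
class_counts = games["turns_class"].value_counts().sort_index()
print(class_counts)
```

Because the bin edges come from the data itself, they should be computed on the training split and reused unchanged when labeling new games.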

	---

	# 📊 Exploratory Data Analysis

	Key insights included:

	### • Game length shows patterns across different opening families.

	Some openings consistently lead to longer or shorter games.

	### • Player ratings mildly influence length.

	Games between higher-rated players tend to run somewhat longer.

	### • Victory status interacts with game duration.

	Games ending by resignation or timeout skew shorter.

	### • Opening names were highly diverse.

	Hundreds of unique openings required careful preprocessing and encoding.

	---

	# 🛠️ Feature Engineering

	Engineered features include:

	### ✔ Opening Length Group

	`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.
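One way to build such a grouping is `pandas.cut` with fixed bin edges. The edges below and the `opening_len_group` column name are assumptions chosen for illustration; the thresholds actually used in the project are not stated here.

```python
import pandas as pd

# Made-up opening lengths; not the real dataset.
games = pd.DataFrame({"opening_moves": [1, 2, 3, 5, 6, 9, 12, 20]})

# Assumed bin edges: (0, 2], (2, 4], (4, 8], (8, inf).
bins = [0, 2, 4, 8, float("inf")]
labels = ["very_short", "short", "medium", "long"]

games["opening_len_group"] = pd.cut(games["opening_moves"], bins=bins, labels=labels)
print(games)
```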

	### ✔ Rating Difference & Average

	* `rating_diff = abs(white_rating - black_rating)`
	* `rating_avg = (white_rating + black_rating) / 2`
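Applied to a DataFrame, the two formulas above might look like the following sketch. The ratings are made up, and the `white_rating` and `black_rating` column names are taken from the formulas.

```python
import pandas as pd

# Made-up ratings for three games.
games = pd.DataFrame(
    {"white_rating": [1500, 1800, 2100], "black_rating": [1450, 1900, 2100]}
)

# Absolute rating gap between the two players.
games["rating_diff"] = (games["white_rating"] - games["black_rating"]).abs()

# Mean rating of the two players.
games["rating_avg"] = (games["white_rating"] + games["black_rating"]) / 2

print(games)
```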

	### ✔ Draw Indicator

	Boolean feature marking whether a game ended in a draw. Unlike the other features, this is known only once the game is over.

	### ✔ Optional Opening Clusters

	KMeans clustering was used to group openings with similar behavior.
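The exact clustering inputs are not described here, so the following is only one plausible sketch: aggregate per-opening statistics, scale them, and cluster with scikit-learn's `KMeans`. The opening names, the column names, the choice of statistics, and `n_clusters=2` are all assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up games; names and columns are hypothetical.
games = pd.DataFrame({
    "opening_name": ["Opening A", "Opening A", "Opening B",
                     "Opening B", "Opening C", "Opening C"],
    "turns": [80, 90, 60, 70, 10, 14],
    "opening_moves": [6, 6, 5, 5, 2, 2],
})

# One row per opening, summarizing how games with that opening behave.
per_opening = games.groupby("opening_name").agg(
    mean_turns=("turns", "mean"),
    mean_opening_moves=("opening_moves", "mean"),
)

# Scale first so that no single statistic dominates the distance metric.
features = StandardScaler().fit_transform(per_opening)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
per_opening["opening_cluster"] = kmeans.fit_predict(features)

# Attach the cluster id back onto each game.
games["opening_cluster"] = games["opening_name"].map(per_opening["opening_cluster"])
print(per_opening)
```

If statistics derived from `turns` feed the clustering, they should be computed on the training split only; otherwise the cluster id leaks the target into the features.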

	### ✔ One-Hot Encoding

	Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.
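A common way to keep one-hot encoding from exploding in width is to fold rare categories into a single bucket before encoding. The sketch below assumes a hypothetical `opening_short` column and a `min_count` threshold of 2; the filtering rule used in the project may differ.

```python
import pandas as pd

# Made-up opening labels; the column name is hypothetical.
games = pd.DataFrame({
    "opening_short": ["Sicilian"] * 3 + ["Italian"] * 2 + ["Bird", "Grob"]
})

# Keep categories seen at least `min_count` times; fold the rest into "Other".
min_count = 2  # assumed threshold
counts = games["opening_short"].value_counts()
frequent = counts[counts >= min_count].index
games["opening_short"] = games["opening_short"].where(
    games["opening_short"].isin(frequent), "Other"
)

# One indicator column per remaining category.
encoded = pd.get_dummies(games, columns=["opening_short"], prefix="opening")
print(encoded.columns.tolist())
```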

	---

	# 🤖 Models Trained

	Three models were evaluated. With three quantile-based classes, chance-level accuracy is roughly 33%:

	### 1️⃣ Logistic Regression

	Baseline model.
	Accuracy: ~46–47%.

	### 2️⃣ Random Forest Classifier

	Best-performing model overall.
	Accuracy: ~47–48%.

	### 3️⃣ Gradient Boosting Classifier

	Accuracy: slightly below Random Forest.
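A comparison of this kind can be outlined as follows. The sketch uses synthetic data from `make_classification` as a stand-in for the engineered chess features, and the hyperparameters are scikit-learn defaults or arbitrary picks, not the settings used in the project. The scores it prints therefore say nothing about the accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the same split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```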

	---

	# 🏆 Winner: Random Forest

	Random Forest was selected as the final model. It edged out Logistic Regression by about one percentage point and offered:

	* Good generalization
	* Stable accuracy across all classes
	* Robust handling of high-dimensional categorical features

	This repository hosts the exported model:

	```
	random_forest_model.pkl
	```
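Loading the pickle follows the usual `pickle.load` pattern. The sketch below is self-contained, so it first writes a tiny stand-in model to a temporary file instead of reading the real `random_forest_model.pkl`. The `load_model` helper and the two-column toy features are hypothetical. The real model expects the same feature columns, in the same order, as in the training notebook.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier


def load_model(path):
    """Load a pickled scikit-learn model. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in model so the sketch runs without the real file.
# The toy features are [rating_avg, rating_diff]; labels are made up.
demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit([[1500, 50], [1600, 200], [2100, 10], [2200, 300]], [0, 0, 2, 2])

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "random_forest_model.pkl"
    demo_path.write_bytes(pickle.dumps(demo_model))
    model = load_model(demo_path)

# n_features_in_ reports how many input columns the model was fitted on.
print(model.n_features_in_)
predicted = model.predict([[2150, 20]])
print(predicted)
```

Pickled scikit-learn models are not guaranteed to load across library versions, so it is safest to load the file with the scikit-learn version used for training.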

	---

	# 🗂️ Repository Contents

	| File | Description |
	| --- | --- |
	| `random_forest_model.pkl` | Final trained model |
	| `README.md` | Project documentation |
	| `.gitattributes` | Managed by HuggingFace |
	| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |

	---

	# 🧩 Limitations

	* Medium-length games show substantial overlap with other classes.
	* High cardinality of opening names increases sparsity after one-hot encoding.
	* The model is limited to information available early in the game.
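Class overlap of the kind noted above is easiest to see in a confusion matrix and in per-class recall. The labels below are invented purely to show the mechanics; they are not results from this model.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Invented true and predicted classes (0 = short, 1 = medium, 2 = long).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 0, 1, 2, 1, 2, 2, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Recall per class: the share of each true class that was predicted correctly.
per_class_recall = recall_score(y_true, y_pred, labels=[0, 1, 2], average=None)
print(per_class_recall)
```

In this toy example the middle row spreads into both neighboring columns, which is the pattern to look for when checking how the medium class behaves on real predictions.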

	---

	# 🙌 Acknowledgments

	Thanks to Lichess for providing open game data and to HuggingFace for model hosting.