# 🧠 Chess Game Length Predictor

▶️ **Project Presentation Video:** https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d
*Predicting whether a chess game will be short, medium, or long using machine learning*
This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess. The goal was to predict the length of a chess game (in number of half-moves) by classifying each game into one of three categories:
- 0 → Short game
- 1 → Medium-length game
- 2 → Long game
The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.
## 📌 Project Overview
Chess games vary greatly in structure and duration. Some games end quickly due to opening traps or mistakes, while others develop into long positional struggles.
This model predicts a game's length category using information available before or shortly after the opening phase, including:
- Player ratings
- Time control increment
- Opening characteristics
- Engineered features describing opening length
- Rating differences and averages
The final exported artifact is a trained Random Forest Classifier.
## 📊 Dataset Summary
The dataset contains approximately 19,800 chess games with:
- Game metadata (ID, number of turns, time increment)
- Player ratings (white, black)
- Opening information (short names, ECO-like categories)
- Game outcome and victory status
The target variable (`turns_class`) was created by binning the number of turns into three quantile-based classes.
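The quantile binning described above can be sketched with `pandas.qcut` (toy data; the `turns` and `turns_class` names follow the description, while the real dataset has ~19,800 rows):

```python
import pandas as pd

# Toy sample of game lengths (turns); the real dataset has ~19,800 games
games = pd.DataFrame({"turns": [12, 25, 38, 47, 55, 61, 72, 88, 95, 120]})

# Bin turns into three quantile-based classes: 0 = short, 1 = medium, 2 = long
games["turns_class"] = pd.qcut(games["turns"], q=3, labels=[0, 1, 2]).astype(int)

print(games["turns_class"].value_counts().sort_index())
```

Quantile-based bins keep the three classes roughly balanced, which avoids a trivially skewed target.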
## 🔍 Exploratory Data Analysis
Key insights included:
- **Game length shows patterns across different opening families.** Some openings consistently lead to longer or shorter games.
- **Player ratings mildly influence length.** Higher-rated matchups tend to produce longer, higher-quality games.
- **Victory status interacts with game duration.** Games ending by resignation or timeout skew shorter.
- **Opening names were highly diverse.** Hundreds of unique openings required careful preprocessing and encoding.
## 🛠️ Feature Engineering
Engineered features include:
**✅ Opening Length Group**

`opening_moves` was grouped into four categories: `very_short`, `short`, `medium`, `long`.
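A minimal sketch of this grouping with `pandas.cut` (the cut points below are illustrative; the project's actual thresholds are not stated in this README):

```python
import pandas as pd

moves = pd.Series([2, 4, 7, 11, 16])  # hypothetical opening_moves values

# Illustrative thresholds -- the project's actual cut points may differ
opening_len_group = pd.cut(
    moves,
    bins=[0, 3, 6, 10, float("inf")],
    labels=["very_short", "short", "medium", "long"],
)
print(list(opening_len_group))  # ['very_short', 'short', 'medium', 'long', 'long']
```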
**✅ Rating Difference & Average**

- `rating_diff = abs(white_rating - black_rating)`
- `rating_avg = (white_rating + black_rating) / 2`
**✅ Draw Indicator**

A Boolean feature marking whether a game ended in a draw.
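The rating features and the draw flag can be derived in a few lines (toy frame; the `winner` column name is an assumption about the Lichess schema):

```python
import pandas as pd

df = pd.DataFrame({
    "white_rating": [1500, 1800],
    "black_rating": [1620, 1750],
    "winner": ["black", "draw"],  # assumed column name for the game outcome
})

# Engineered features from the README: absolute rating gap, average rating,
# and a 0/1 draw indicator
df["rating_diff"] = (df["white_rating"] - df["black_rating"]).abs()
df["rating_avg"] = (df["white_rating"] + df["black_rating"]) / 2
df["is_draw"] = (df["winner"] == "draw").astype(int)
```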
**✅ Optional Opening Clusters**

KMeans clustering was used to group openings with similar behavior.
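One way to cluster openings, sketched with scikit-learn's `KMeans` on made-up per-opening aggregates (the README does not specify which features were actually clustered):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-opening aggregates: [mean turns, mean opening_moves]
opening_stats = np.array([
    [30.0, 4.0],
    [32.0, 5.0],
    [70.0, 10.0],
    [72.0, 11.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(opening_stats)
# Openings with similar average behavior land in the same cluster,
# giving a compact categorical feature in place of hundreds of raw names
```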
**✅ One-Hot Encoding**

Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.
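A plausible version of the filter-then-encode step, assuming rare openings are collapsed into an `Other` bucket before `pd.get_dummies` (the exact filtering rule is not given in this README):

```python
import pandas as pd

df = pd.DataFrame({"opening_name": [
    "Sicilian Defense", "Sicilian Defense", "French Defense", "Rare Gambit",
]})

# Collapse openings seen fewer than min_count times (threshold is illustrative)
min_count = 2
counts = df["opening_name"].value_counts()
common = counts[counts >= min_count].index
df["opening_name"] = df["opening_name"].where(
    df["opening_name"].isin(common), "Other"
)

encoded = pd.get_dummies(df, columns=["opening_name"], prefix="opening")
print(sorted(encoded.columns))
```

Collapsing rare categories first keeps the dummy matrix from exploding to hundreds of near-empty columns.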
## 🤖 Models Trained
Three models were evaluated:
1. **Logistic Regression** (baseline). Accuracy: ~46–47%.
2. **Random Forest Classifier** (best overall). Accuracy: ~47–48%.
3. **Gradient Boosting Classifier**. Accuracy slightly below Random Forest.
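The train-and-compare loop can be sketched as below, with synthetic stand-in data in place of the engineered chess features (hyperparameters here are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the engineered chess features
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```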
## 🏆 Winner: Random Forest
Random Forest achieved the strongest performance due to:
- Good generalization
- Stable accuracy across all classes
- Robust handling of high-dimensional categorical features
This repository hosts the exported model: `random_forest_model.pkl`
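Loading the exported pickle is a one-liner; the round trip below trains a throwaway stand-in model so the snippet is self-contained (in practice you would open the repository's `random_forest_model.pkl` directly and skip the training step):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model; in practice, skip this and open the downloaded pickle
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("random_forest_model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:3]))  # predicted length classes in {0, 1, 2}
```

Note that the feature columns passed to `predict` must match the names and order used at training time.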
## 🗂️ Repository Contents
| File | Description |
|---|---|
| `random_forest_model.pkl` | Final trained model |
| `README.md` | Project documentation |
| `.gitattributes` | Managed by HuggingFace |
| `Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb` | Google Colab notebook |
## 🧩 Limitations
- Medium-length games show substantial overlap with other classes.
- High variance in opening names increases sparsity.
- The model is limited to information available early in the game.
## 🙏 Acknowledgments
Thanks to Lichess for providing open game data and to HuggingFace for model hosting.