
🧠 Chess Game Length Predictor

▶️ Project Presentation Video: https://www.loom.com/share/7e7a5a47fec8469f9b457a910fc52c0d

Predicting whether a chess game will be short, medium, or long using machine learning

This repository contains the final trained model from an end-to-end machine learning project analyzing ~20K chess games sourced from Lichess. The goal was to predict the length of a chess game (in number of half-moves) by classifying each game into one of three categories:

  • 0 – Short game
  • 1 – Medium-length game
  • 2 – Long game

The project includes exploratory data analysis, feature engineering, categorical encoding, model training, and exporting the final model as a pickle file.


📘 Project Overview

Chess games vary greatly in structure and duration. Some games end quickly due to opening traps or mistakes, while others develop into long positional struggles.

This model predicts a game’s length category using information available before or shortly after the opening phase, including:

  • Player ratings
  • Time control increment
  • Opening characteristics
  • Engineered features describing opening length
  • Rating differences and averages

The final exported artifact is a trained Random Forest Classifier.


🔍 Dataset Summary

The dataset contains approximately 19,800 chess games with:

  • Game metadata (ID, number of turns, time increment)
  • Player ratings (white, black)
  • Opening information (short names, ECO-like categories)
  • Game outcome and victory status

The target variable (turns_class) was created by binning the number of turns (half-moves) into three quantile-based classes.
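The quantile binning described above can be sketched as follows. This is a minimal illustration, assuming a pandas Series of turn counts; the column name `turns` and the toy values are assumptions, not the project's actual data.

```python
import pandas as pd

# Toy stand-in for the number of half-moves per game (assumed column name).
turns = pd.Series([20, 35, 45, 52, 60, 68, 75, 90, 110])

# pd.qcut splits the values into roughly equal-sized quantile bins:
# 0 = short, 1 = medium, 2 = long
turns_class = pd.qcut(turns, q=3, labels=[0, 1, 2]).astype(int)

print(turns_class.tolist())  # → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because the bins are quantile-based, each class covers about a third of the games, which keeps the three-way classification problem balanced.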


📊 Exploratory Data Analysis

Key insights included:

• Game length shows patterns across different opening families.

Some openings consistently lead to longer or shorter games.

• Player ratings mildly influence length.

Higher-rated matchups tend to produce longer, higher-quality games.

• Victory status interacts with game duration.

Games ending by resignation or timeout skew shorter.

• Opening names were highly diverse.

Hundreds of unique openings required careful preprocessing and encoding.


🛠️ Feature Engineering

Engineered features include:

✔ Opening Length Group

opening_moves was grouped into four categories: very_short, short, medium, long.
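The grouping can be sketched with pd.cut. The cut points below are illustrative assumptions; the README does not state the project's exact bin edges.

```python
import pandas as pd

# Toy opening lengths (number of book moves); values are illustrative.
opening_moves = pd.Series([2, 4, 7, 12, 20])

# Assumed bin edges: (0, 3], (3, 6], (6, 10], (10, inf)
opening_len_group = pd.cut(
    opening_moves,
    bins=[0, 3, 6, 10, float("inf")],
    labels=["very_short", "short", "medium", "long"],
)

print(opening_len_group.tolist())
# → ['very_short', 'short', 'medium', 'long', 'long']
```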

✔ Rating Difference & Average

  • rating_diff = abs(white_rating - black_rating)
  • rating_avg = (white_rating + black_rating) / 2
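Both formulas translate directly into pandas column operations; the two example games below are made up for illustration.

```python
import pandas as pd

# Two toy games; column names follow the dataset description.
df = pd.DataFrame({"white_rating": [1500, 1850], "black_rating": [1620, 1700]})

df["rating_diff"] = (df["white_rating"] - df["black_rating"]).abs()
df["rating_avg"] = (df["white_rating"] + df["black_rating"]) / 2

print(df["rating_diff"].tolist())  # → [120, 150]
print(df["rating_avg"].tolist())   # → [1560.0, 1775.0]
```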

✔ Draw Indicator

Boolean feature marking whether a game ended in a draw.

✔ Optional Opening Clusters

KMeans clustering was used to group openings with similar behavior.
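A minimal sketch of this idea, assuming each opening is summarized by simple per-opening statistics (here: mean game length and mean average rating). The choice of statistics is an assumption; the README only says KMeans grouped openings with similar behavior.

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per opening: [mean game length, mean rating average] (toy values).
opening_stats = np.array([
    [40.0, 1500.0],  # e.g. a sharp gambit line: short games
    [42.0, 1520.0],
    [80.0, 1600.0],  # e.g. a closed positional opening: long games
    [78.0, 1580.0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(opening_stats)
print(labels)  # the two short-game openings share one cluster, the two long-game openings the other
```

The resulting cluster label can then replace hundreds of raw opening names with a handful of behavioral groups.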

✔ One-Hot Encoding

Categorical variables were transformed using one-hot encoding, with filtering to avoid high dimensionality.
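One common way to do this filtering is to keep only the most frequent categories and bucket the rest into an "Other" column before encoding. The top-k cutoff and opening names below are illustrative assumptions.

```python
import pandas as pd

# Toy opening column; counts are deliberately uneven.
openings = pd.Series(["Sicilian", "Sicilian", "Sicilian", "French", "French", "Budapest"])

# Keep the 2 most common openings, map everything else to "Other".
top = openings.value_counts().nlargest(2).index
reduced = openings.where(openings.isin(top), other="Other")

dummies = pd.get_dummies(reduced, prefix="opening")
print(sorted(dummies.columns))
# → ['opening_French', 'opening_Other', 'opening_Sicilian']
```

This caps the encoded dimensionality regardless of how many rare openings appear in the raw data.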


🤖 Models Trained

Three models were evaluated:

1️⃣ Logistic Regression

Baseline model. Accuracy: ~46–47%.

2️⃣ Random Forest Classifier

Best-performing model overall. Accuracy: ~47–48%.

3️⃣ Gradient Boosting Classifier

Accuracy slightly below Random Forest.
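The three-model comparison can be sketched as below. The data here is synthetic (make_classification stands in for the engineered chess features), so the printed accuracies will not match the ~46–48% figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the chess feature matrix.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, round(results[name], 3))
```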


πŸ† Winner: Random Forest

Random Forest achieved the strongest performance due to:

  • Good generalization
  • Stable accuracy across all classes
  • Robust handling of high-dimensional categorical features

This repository hosts the exported model:

random_forest_model.pkl
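The export/load round-trip behind this artifact is a standard pickle serialization. The sketch below uses an in-memory buffer and toy training data for illustration; in practice you would open the shipped random_forest_model.pkl file instead.

```python
import io
import pickle

from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for the real trained classifier.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit([[0.0], [1.0], [2.0], [3.0]], [0, 1, 2, 2])

buf = io.BytesIO()
pickle.dump(clf, buf)      # equivalent of exporting random_forest_model.pkl
buf.seek(0)
model = pickle.load(buf)   # equivalent of loading the shipped file:
                           #   with open("random_forest_model.pkl", "rb") as f:
                           #       model = pickle.load(f)

preds = model.predict([[0.0], [3.0]])  # class labels: 0 = short, 1 = medium, 2 = long
```

Note that unpickling executes arbitrary code, so only load pickle files from sources you trust.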

🗂️ Repository Contents

| File | Description |
| --- | --- |
| random_forest_model.pkl | Final trained model |
| README.md | Project documentation |
| .gitattributes | Managed by Hugging Face |
| Copy_of_Assignment_2_Classification,_Regression,_Clustering,_Evaluation.ipynb | Google Colab notebook |

🧩 Limitations

  • Medium-length games show substantial overlap with other classes.
  • High variance in opening names increases sparsity.
  • Model is limited by information available early in the game.

🙌 Acknowledgments

Thanks to Lichess for providing open game data and to Hugging Face for model hosting.
