Hotel Booking Cancellation Prediction

video :

Dataset selection

Project Overview

In this project, I followed the high-level flow step by step. The main goal was to look for correlations between the features and canceled reservations. The chosen dataset contained missing values and many outliers, all of which I fixed before working with the data; the details are in the Colab notebook.
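As a rough illustration of this kind of cleaning, here is a minimal sketch that assumes median imputation and IQR-based clipping. The notebook may use a different approach, and the column names and values below are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; column names are hypothetical.
df = pd.DataFrame({
    "lead_time": [10, 25, np.nan, 40, 900, 15],
    "adr": [80.0, 95.0, 110.0, np.nan, 100.0, 5000.0],
})

def fill_and_clip(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Fill missing numeric values with the median, then clip to the IQR fences."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        out[col] = out[col].fillna(out[col].median())
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Values beyond k * IQR from the quartiles are pulled back to the fence.
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out

clean = fill_and_clip(df)
print(clean)
```

Clipping keeps every row while limiting the influence of extreme values; dropping outlier rows instead is an equally valid choice when they are likely data errors.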

After univariate analysis, I separated the canceled reservations and looked at their most common features, for example:

I noticed that some features are more common among canceled orders. For those, I created additional plots to check whether they are connected to other features, for example:

At this point, I posed more in-depth research questions.

Research Questions

1. Are there individual features that significantly impact booking cancellations?

2. Do specific customer segments significantly influence the likelihood of booking cancellations?

You can view the full visuals in the Colab notebook. My primary finding is that bookings with certain customer characteristics, particularly segment type and group composition, show a higher probability of cancellation, for example:

3. Which stay-related features are the most prominent predictors of booking cancellations?

Longer stays, especially those spanning both weekdays and weekends, show a higher cancellation risk, while 1-2 night bookings are the most stable. (Full graph available in the notebook due to size).
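A comparison like this can be sketched as a grouped cancellation rate. The snippet below is only an illustration: the column names, the stay-length buckets, and the toy values are assumptions, and its output says nothing about the real dataset.

```python
import pandas as pd

# Toy bookings; column names are hypothetical stand-ins for the dataset's fields.
stays = pd.DataFrame({
    "weekend_nights": [0, 1, 2, 2, 0, 4],
    "week_nights":    [1, 1, 5, 3, 2, 10],
    "is_canceled":    [0, 0, 1, 1, 0, 1],
})

stays["total_nights"] = stays["weekend_nights"] + stays["week_nights"]
# True when the stay covers at least one weekend night and one weekday night.
stays["spans_both"] = (stays["weekend_nights"] > 0) & (stays["week_nights"] > 0)

# Bucket stay length into 1-2, 3-7, and 8+ nights (bucket edges are an assumption).
stays["stay_bucket"] = pd.cut(
    stays["total_nights"],
    bins=[0, 2, 7, float("inf")],
    labels=["1-2", "3-7", "8+"],
)

# The mean of a 0/1 flag is the cancellation rate within each group.
rate_by_bucket = stays.groupby("stay_bucket", observed=True)["is_canceled"].mean()
rate_by_span = stays.groupby("spans_both")["is_canceled"].mean()
print(rate_by_bucket)
print(rate_by_span)
```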

Models & Performance

Baseline Model

Feature Engineering

- Guest_Composition
- Planning_Index - after clustering
- premium_index - after clustering
- Booking Channel Structure - after clustering
- adr cluster

After clustering, I turned the results into new features.
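One way to turn clustering results into a feature is to store each row's cluster label as a new column. The sketch below assumes k-means on standardized inputs; the clustering algorithm, the number of clusters, and the column names are all assumptions, since none of them are stated here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the booking data; column names are hypothetical.
df = pd.DataFrame({
    "lead_time": np.concatenate([rng.normal(10, 3, 50), rng.normal(200, 20, 50)]),
    "adr": np.concatenate([rng.normal(80, 5, 50), rng.normal(150, 10, 50)]),
})

# Scale first so no single column dominates the distance computation.
scaled = StandardScaler().fit_transform(df[["lead_time", "adr"]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Store the cluster label as a new feature (hypothetical name).
df["planning_cluster"] = kmeans.fit_predict(scaled)
print(df["planning_cluster"].value_counts())
```

Cluster labels are arbitrary integers, so for linear models they are usually one-hot encoded; tree-based models can often use them directly.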

1. Regression Model

Winning Model: Gradient Boosting Regressor
Objective: To predict the specific probability of a cancellation (0.0 to 1.0).
Key Metric: Optimized for the lowest RMSE to ensure precise probability estimates.
File: winning_model.pkl
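For orientation, here is a minimal sketch of fitting a gradient boosting regressor, scoring it with RMSE, and serializing it. It uses synthetic data and default hyperparameters, and it assumes the `.pkl` file was produced with `pickle`; the actual training setup is in the notebook.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic features and a target in [0, 1]; this is not the project's data.
X = rng.uniform(0, 1, size=(400, 3))
y = np.clip(0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 400), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# A regressor's output is not bounded, so clip predictions into [0, 1].
pred = np.clip(model.predict(X_test), 0.0, 1.0)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"RMSE: {rmse:.4f}")

# Serialize the fitted model; these bytes could be written to a .pkl file.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

Pickled scikit-learn models should be loaded with the same library version they were saved with, and only from a trusted source.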

Feature Importance Analysis

In accordance with the project guidelines, I analyzed which variables the models relied on most, to identify the primary drivers of the target variable. This step was repeated for each model, including the baseline model, for example:
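A minimal sketch of how such an analysis might look with scikit-learn, on synthetic data with hypothetical feature names. Note that `feature_importances_` on tree ensembles is impurity-based (it measures how much each feature reduces the loss across splits), which is not the same as counting how often a feature is used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic features; the names are hypothetical stand-ins.
X = pd.DataFrame({
    "lead_time": rng.uniform(0, 1, 300),
    "total_nights": rng.uniform(0, 1, 300),
    "noise": rng.uniform(0, 1, 300),
})
# The target depends mostly on lead_time and not at all on noise.
y = 0.8 * X["lead_time"] + 0.2 * X["total_nights"]

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Pair each importance with its column name and rank from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased toward high-cardinality features; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.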

2. Classification Model

Winning Model: Random Forest Classifier
Objective: To categorize bookings into three balanced risk tiers:
- Low Risk: Stable bookings.
- Medium Risk: Uncertain bookings (the "gray area").
- High Risk: High likelihood of cancellation.
Methodology: Used quantile binning (terciles) so that each tier holds about 33.3% of the bookings and no class dominates training.
Why Random Forest?: Selected for its robustness and stability, which come from combining the predictions of many decision trees. It achieved a high F1-score, effectively balancing precision and recall.
File: winning_classification_model.pkl
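The tercile binning and classifier described above can be sketched roughly as follows. This is an illustration on synthetic data: the continuous risk score being binned, the tier labels, and the hyperparameters are assumptions, not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(600, 3))
# Hypothetical continuous risk score to be split into tiers.
risk = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 600)

# Tercile binning: three equal-frequency bins, so the classes are balanced.
tiers = pd.qcut(risk, q=3, labels=["low", "medium", "high"])
tier_counts = pd.Series(tiers).value_counts()
y = np.asarray(tiers)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=1)
clf.fit(X_train, y_train)

# Macro-averaged F1 weights the three tiers equally.
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(tier_counts)
print(f"macro F1: {macro_f1:.3f}")
```

Because the bin edges are computed from the data, they should be derived from the training split only and then reused on new bookings, to avoid leaking information from the test set.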
