Hotel Booking Cancellation Prediction
video :
Dataset selection
Project Overview
In this project, I followed the high-level flow step by step. The main goal was to look for correlations between the features and the canceled reservations. The chosen dataset contained missing values and many outliers. I handled all of them before working with the data; you can see this in the Colab notebook.
After the univariate analysis, I separated the canceled reservations and looked at their most common features, for example -
I noticed that some features are more common among canceled orders. For those, I created additional plots to check whether they are connected to other features, for example -
At this point, I posed more in-depth research questions.
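A minimal sketch of the kind of cleaning described above, assuming median imputation for missing values and IQR-based clipping for outliers. The exact method used is in the Colab notebook; the helper `clean_numeric` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_numeric(df, cols):
    """Fill missing values with the median, then clip outliers to the 1.5*IQR fences."""
    out = df.copy()
    for col in cols:
        # Assumption: median imputation (the notebook may use a different strategy)
        out[col] = out[col].fillna(out[col].median())
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Assumption: outliers are clipped to the fences rather than dropped
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return out


# Toy data for illustration only, not the project's dataset
toy = pd.DataFrame({"adr": [80.0, 95.0, np.nan, 110.0, 100.0, 5000.0]})
cleaned = clean_numeric(toy, ["adr"])
print(cleaned)
```

Clipping keeps every row, which may matter when the outlier rows still carry a valid cancellation label; dropping them is the other common choice.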
Research Questions
1. Are there individual features that significantly impact booking cancellations?
2. Do specific customer segments significantly influence the likelihood of booking cancellations?
You can view the full visuals in the Colab notebook. My primary finding is that customer characteristics, particularly segment type and group composition, are associated with a higher probability of canceling. For example -
3. Which stay-related features are the most prominent predictors of booking cancellations?
Longer stays, especially those spanning both weekdays and weekends, show a higher cancellation risk, while 1-2 night bookings are the most stable. (Full graph available in the notebook due to size).
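One way to check this kind of pattern is to compare cancellation rates across stay-length buckets. The sketch below assumes the columns are named `stays_in_week_nights`, `stays_in_weekend_nights`, and `is_canceled`; the bucket edges and the toy rows are made up to show the computation and are not the project's results.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "stays_in_week_nights":    [1, 2, 5, 4, 1, 6, 2, 0],
    "stays_in_weekend_nights": [0, 0, 2, 2, 1, 2, 0, 1],
    "is_canceled":             [0, 0, 1, 1, 0, 1, 0, 0],
})

# Total stay length and a flag for stays that cover both weekdays and weekends
toy["total_nights"] = toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"]
toy["spans_both"] = (toy["stays_in_week_nights"] > 0) & (toy["stays_in_weekend_nights"] > 0)

# Hypothetical bucket edges: 1-2 nights, 3-7 nights, 8 or more
toy["stay_bucket"] = pd.cut(
    toy["total_nights"],
    bins=[0, 2, 7, float("inf")],
    labels=["1-2", "3-7", "8+"],
)

# Mean of a 0/1 column is the cancellation rate within each group
rate_by_bucket = toy.groupby("stay_bucket", observed=True)["is_canceled"].mean()
rate_by_span = toy.groupby("spans_both")["is_canceled"].mean()
print(rate_by_bucket)
print(rate_by_span)
```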
Models & Performance
Baseline Model
Feature Engineering
- premium_index - after clustering
- Booking Channel Structure - after clustering
- adr cluster
After clustering, I turned the results into new features.
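Turning cluster assignments into a feature can be sketched as follows. This assumes K-Means with three clusters on a standardized `adr` column; the clustering algorithm, the number of clusters, and the column name `adr_cluster` are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic average-daily-rate values in three loose price bands
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "adr": np.concatenate([
        rng.normal(60, 5, 50),
        rng.normal(120, 5, 50),
        rng.normal(250, 10, 50),
    ])
})

# Scale before clustering so distances are not dominated by raw units
scaled = StandardScaler().fit_transform(toy[["adr"]])

# Assumption: K-Means with k=3; the notebook may use another method or k
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["adr_cluster"] = kmeans.fit_predict(scaled)

print(toy.groupby("adr_cluster")["adr"].agg(["count", "mean"]))
```

The cluster label is an arbitrary integer, so for linear models it is usually safer to one-hot encode it; tree-based models can often use it directly.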
1. Regression Model
- Winning Model: Gradient Boosting Regressor
- Objective: To predict the specific probability of a cancellation (0.0 to 1.0).
- Key Metric: Optimized for the lowest RMSE to ensure precise probability estimates.
- File:
winning_model.pkl
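A minimal sketch of fitting a gradient boosting regressor and scoring it with RMSE, using scikit-learn defaults on synthetic data. The features, the target, and the hyperparameters are assumptions; the actual training code is in the notebook. The pickle round trip at the end mirrors how a file such as `winning_model.pkl` could be saved and reloaded.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and a target in (0, 1) standing in for a cancellation probability
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# A regressor is not bounded, so clip predictions to the valid probability range
pred = np.clip(model.predict(X_test), 0.0, 1.0)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"RMSE: {rmse:.4f}")

# In-memory pickle round trip; writing to disk would use pickle.dump with a file handle
restored = pickle.loads(pickle.dumps(model))
```

Only unpickle files from a trusted source, and load them with the same scikit-learn version that produced them where possible.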
Feature Importance Analysis
In accordance with the project guidelines, I analyzed which variables were most frequently utilized by the models to identify the primary drivers of the target variable. This step was repeated for each model, including the baseline model, for example -
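For tree ensembles in scikit-learn, one common way to run this analysis is the fitted model's `feature_importances_` attribute. Note that this attribute is impurity-based rather than a raw count of how often a feature is used, so it is only an approximation of the analysis described above. The column names `lead_time`, `adr`, and `noise` and the synthetic target are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends strongly on lead_time, weakly on adr, not on noise
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "lead_time": rng.normal(size=300),
    "adr": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 0.8 * X["lead_time"] + 0.2 * X["adr"] + rng.normal(scale=0.05, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort to rank the drivers
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

The same pattern applies to a fitted `RandomForestClassifier`, which exposes the same attribute.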
2. Classification Model
- Winning Model: Random Forest Classifier
- Objective: To categorize bookings into three balanced risk tiers:
- Low Risk: Stable bookings.
- Medium Risk: Uncertain bookings (the "gray area").
- High Risk: High likelihood of cancellation.
- Methodology: Used Quantile Binning (Terciles) to achieve a perfect 33.3% class balance, ensuring the model learns each category equally.
- Why Random Forest?: Selected for its robustness and stability, which come from aggregating the votes of many decision trees. It achieved a high F1-score, balancing precision and recall.
- File:
winning_classification_model.pkl
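The tercile binning and classifier described above can be sketched as follows, assuming a continuous risk score is split with `pandas.qcut` and the tiers are then learned by a random forest. The synthetic score, the feature matrix, and the hyperparameters are assumptions, not the project's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic features and a continuous score in (0, 1)
rng = np.random.default_rng(7)
X = rng.normal(size=(600, 3))
score = 1 / (1 + np.exp(-(1.5 * X[:, 0] + X[:, 1])))

# Quantile binning into terciles gives three equally sized classes
risk = pd.qcut(score, q=3, labels=["Low", "Medium", "High"])
counts = pd.Series(risk).value_counts()
y = np.asarray(risk)

# Stratify so the train and test splits keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

clf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_train, y_train)
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")

print(counts)
print(f"Macro F1: {macro_f1:.3f}")
```

Macro averaging weights the three tiers equally, which matches the balanced design; with tercile bins the class boundaries depend on the data the quantiles were computed from, so they should be fixed from the training set before scoring new bookings.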






















