Hotel Booking Cancellation Prediction

video :

Dataset selection

Project Overview

In this project, I followed the high-level flow step by step. The main goal was to look for correlations between the features and the canceled reservations. The chosen dataset contained missing values and many outliers; I fixed all of them before working with the data. The details are in the Colab notebook.
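A minimal sketch of one common way to handle missing values and outliers in numeric columns: median imputation followed by clipping to the IQR fences. This is an assumption about the approach, not a copy of the notebook; the helper `clean_numeric` and the column names `adr` and `lead_time` are illustrative.

```python
import numpy as np
import pandas as pd


def clean_numeric(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Fill missing values with the median, then clip values to the IQR fences."""
    out = df.copy()
    for col in columns:
        # Median imputation is robust to the extreme values handled below.
        out[col] = out[col].fillna(out[col].median())
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Values outside [q1 - k*iqr, q3 + k*iqr] are pulled back to the fence.
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


# Toy data; the column names are placeholders, not necessarily the dataset's.
toy = pd.DataFrame({
    "adr": [80.0, 95.0, np.nan, 110.0, 5400.0],
    "lead_time": [3.0, 40.0, 12.0, np.nan, 25.0],
})
cleaned = clean_numeric(toy, ["adr", "lead_time"])
print(cleaned)
```

Clipping keeps every row, which matters when the rows with extreme values are also the ones that tend to cancel; dropping them instead is a reasonable alternative when they are data-entry errors.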

After the univariate analysis, I separated the canceled reservations and looked at their most common features, for example:

image

I recognized that some features are more common among canceled orders. For those, I created additional plots to check whether they are connected to other features, for example:

image

image
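The comparison described above can be sketched as follows. This is an illustrative example on toy data, assuming a binary `is_canceled` column and a categorical `market_segment` column (both names are assumptions); the toy numbers are not the project's results.

```python
import pandas as pd

# Toy data; `is_canceled` and `market_segment` are assumed column names.
bookings = pd.DataFrame({
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 1],
    "market_segment": ["Groups", "Groups", "Direct", "Direct",
                       "Direct", "Direct", "Direct", "Groups"],
})

# Most common values among canceled reservations only.
canceled = bookings[bookings["is_canceled"] == 1]
canceled_counts = canceled["market_segment"].value_counts()
print(canceled_counts)

# Cancellation rate per category, which accounts for how common each category is overall.
rate = (
    bookings.groupby("market_segment")["is_canceled"]
    .mean()
    .sort_values(ascending=False)
)
print(rate)
```

Raw counts among canceled bookings can be misleading when one category dominates the dataset, so the per-category rate is usually the safer number to plot.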

At this point, I posed more in-depth research questions.

Research Questions

1. Are there individual features that significantly impact booking cancellations?

image

2. Do specific customer segments significantly influence the likelihood of booking cancellations?

The full visuals are in the Colab notebook. My primary finding is that certain customer characteristics, particularly segment type and group composition, are associated with a higher probability of canceling. For example:

image
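One way to check whether a segment effect is statistically significant is a chi-square test of independence on the segment-by-cancellation contingency table. The sketch below uses synthetic counts and assumed column names (`market_segment`, `is_canceled`); it is not taken from the notebook, which may rely on visual comparison only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic counts; `market_segment` and `is_canceled` are assumed column names.
bookings = pd.DataFrame({
    "market_segment": ["Groups"] * 100 + ["Direct"] * 100,
    "is_canceled": [1] * 60 + [0] * 40 + [1] * 20 + [0] * 80,
})

# Rows are segments, columns are the cancellation outcome.
table = pd.crosstab(bookings["market_segment"], bookings["is_canceled"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")
```

A small p-value indicates that cancellation rates differ between segments by more than chance would explain; it says nothing about how large or useful the difference is.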

3. Which stay-related features are the most prominent predictors of booking cancellations?

Longer stays, especially those spanning both weekdays and weekends, show a higher cancellation risk, while 1-2 night bookings are the most stable. (The full graph is available in the notebook due to its size.)
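A sketch of how such a stay-length comparison could be computed, assuming the dataset has separate weekend-night and week-night columns (the names `stays_in_weekend_nights` and `stays_in_week_nights` are assumptions, and the bucket boundaries are illustrative). The toy rows only demonstrate the mechanics.

```python
import pandas as pd

# Toy data; all column names are assumptions.
stays = pd.DataFrame({
    "stays_in_weekend_nights": [0, 0, 1, 2, 2, 0, 2, 1],
    "stays_in_week_nights":    [1, 2, 0, 5, 4, 1, 6, 1],
    "is_canceled":             [0, 0, 0, 1, 1, 0, 1, 1],
})

stays["total_nights"] = stays["stays_in_weekend_nights"] + stays["stays_in_week_nights"]

# Flag stays that include both weekday and weekend nights.
spans_both = (stays["stays_in_weekend_nights"] > 0) & (stays["stays_in_week_nights"] > 0)
stays["stay_type"] = spans_both.map({True: "weekdays+weekend", False: "one kind only"})

# Bucket the stay length; intervals are right-inclusive: (0, 2], (2, 7], (7, inf).
stays["length_bucket"] = pd.cut(
    stays["total_nights"],
    bins=[0, 2, 7, float("inf")],
    labels=["1-2", "3-7", "8+"],
)

# Cancellation rate and sample size per bucket and stay type.
summary = (
    stays.groupby(["length_bucket", "stay_type"], observed=True)["is_canceled"]
    .agg(["mean", "count"])
)
print(summary)
```

Reporting the count next to the mean helps avoid reading too much into buckets that contain only a handful of bookings.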

Models & Performance

Baseline Model

image

image

image

Feature Engineering

  1. Guest_Composition

image

  2. Planning_Index - after clustering

image

  3. premium_index - after clustering

image

  4. Booking Channel Structure - after clustering

image

  5. adr cluster

image

After clustering, I turned the results into new features.

image

image

image

image
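Turning clustering results into a feature can be sketched as below. K-means, the choice of two clusters, the input columns (`lead_time`, `booking_changes`), and the output name `planning_cluster` are all assumptions for illustration; the notebook may use a different algorithm or different inputs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data with two well-separated groups; column names are assumed.
df = pd.DataFrame({
    "lead_time": np.concatenate([rng.normal(10, 3, 50), rng.normal(200, 20, 50)]),
    "booking_changes": np.concatenate([rng.normal(0.2, 0.1, 50), rng.normal(2.0, 0.3, 50)]),
})

# Scale first so that no single column dominates the distance computation.
scaled = StandardScaler().fit_transform(df[["lead_time", "booking_changes"]])

# Fit k-means and store the cluster label as a new categorical feature.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["planning_cluster"] = kmeans.fit_predict(scaled)

print(df.groupby("planning_cluster")[["lead_time", "booking_changes"]].mean())
```

The cluster ids themselves are arbitrary, so downstream models should treat the new column as categorical (for example via one-hot encoding) rather than as an ordered number.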

1. Regression Model

image

image

  • Winning Model: Gradient Boosting Regressor
  • Objective: To predict the specific probability of a cancellation (0.0 to 1.0).
  • Key Metric: Optimized for the lowest RMSE to ensure precise probability estimates.
  • File: winning_model.pkl
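The setup in the bullets above can be sketched roughly as follows: a Gradient Boosting Regressor fitted on a 0/1 cancellation target, scored with RMSE, and serialized. The synthetic data, default hyperparameters, clipping step, and use of `pickle` are assumptions for illustration, not the settings behind winning_model.pkl.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and a 0/1 cancellation target driven by the first two columns.
X = rng.normal(size=(500, 4))
true_prob = 1 / (1 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))
y = (rng.random(500) < true_prob).astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# A regressor is not constrained to [0, 1], so clip before reading outputs as probabilities.
pred = np.clip(model.predict(X_test), 0.0, 1.0)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"RMSE: {rmse:.3f}")

# Round-trip through pickle; writing these bytes to disk would give a .pkl file.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that produced it, so recording that version alongside the file is worthwhile.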

Feature Importance Analysis

In accordance with the project guidelines, I analyzed which variables the models used most frequently, in order to identify the primary drivers of the target variable. This step was repeated for each model, including the baseline model, for example:

image
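For tree-based scikit-learn models, this kind of analysis typically reads the fitted model's `feature_importances_` attribute. A minimal sketch on synthetic data, with made-up feature names; whether the notebook uses this attribute or another method (such as permutation importance) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic data: only the `informative` column drives the target.
X = pd.DataFrame({
    "informative": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
y = (X["informative"] > 0).astype(float)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importance = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importance)
```

Impurity-based importances can overstate high-cardinality features, so comparing them across several models, as described above, is a useful sanity check.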

2. Classification Model

image

image

image

  • Winning Model: Random Forest Classifier
  • Objective: To categorize bookings into three balanced risk tiers:
    • Low Risk: Stable bookings.
    • Medium Risk: Uncertain bookings (the "gray area").
    • High Risk: High likelihood of cancellation.
  • Methodology: Used Quantile Binning (Terciles) to split the bookings into three classes of roughly 33.3% each, so that the model sees each category equally often.
  • Why Random Forest?: Selected for its robustness and stability, which come from aggregating many decision trees. It achieved a high F1-Score, effectively balancing precision and recall.
  • File: winning_classification_model.pkl
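The tercile approach described above can be sketched as below. The continuous `risk_score` being binned is hypothetical (the source does not say which quantity is binned), and the data, hyperparameters, and macro-averaged F1 are illustrative assumptions rather than the configuration of winning_classification_model.pkl.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic features and a hypothetical continuous risk score derived from them.
X = rng.normal(size=(600, 3))
risk_score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)

# Quantile binning into terciles: each tier receives one third of the rows.
tiers = pd.Series(pd.qcut(risk_score, q=3, labels=["Low", "Medium", "High"])).astype(str)
tier_counts = tiers.value_counts()
print(tier_counts)

# Stratify so the train and test sets keep the same balanced class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, tiers, test_size=0.25, stratify=tiers, random_state=7
)

clf = RandomForestClassifier(n_estimators=200, random_state=7)
clf.fit(X_train, y_train)

# Macro F1 weights the three tiers equally, which suits balanced classes.
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"macro F1: {macro_f1:.3f}")
```

Because the tier edges come from the data's own quantiles, they should be computed on the training portion and reused unchanged when new bookings are scored.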