Hotel Bookings — ADR (Average Daily Rate) Prediction

Predicts the Average Daily Rate (adr, in €) for a hotel booking using booking metadata, customer profile, and stay details.

Model Details

Model type: sklearn.ensemble.GradientBoostingRegressor
Training data: mathsian/hotel-bookings — Hotel Bookings Demand (Portugal, 2015-2017)
Sample size used: 12,520 bookings (a random sample of 15,000 rows, of which 12,520 remained after cleaning)
Train/Test split: 80/20, simple random sampling, random_state=42
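
The split described above could be reproduced roughly as follows. This is a minimal sketch on placeholder arrays: `X` and `y` here are synthetic stand-ins, not the real preprocessed booking data, and only `test_size=0.2` and `random_state=42` come from this card.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the preprocessed feature matrix and the adr target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# 80/20 simple random split with the seed stated in this card.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```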

Hyperparameters

n_estimators : 200
max_depth : 6
learning_rate : 0.05
subsample : 0.85
min_samples_split : 5
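
For reference, an estimator with these settings can be constructed as sketched below. The five hyperparameters are taken from the list above; `random_state=42` on the estimator is an assumption (the card only states that seed for the train/test split), and all other arguments are left at scikit-learn defaults.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hyperparameters as listed in this card.
# random_state is an assumption added for reproducibility (subsample < 1.0
# makes training stochastic); the card does not state it for the estimator.
model = GradientBoostingRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.85,
    min_samples_split=5,
    random_state=42,
)
```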

Test-set Performance

Metric	Value
RMSE	19.79 €
MAE	13.18 €
R²	0.830

(Improvement over Linear Regression baseline: RMSE −32%, MAE −39%, R² +31%.)
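
The three metrics above follow their standard definitions, which can be sketched with NumPy alone. The function name `regression_metrics` and the toy values are illustrative only; they are not the test set behind the reported numbers.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))                          # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Toy example: every prediction is off by exactly 10 €.
metrics = regression_metrics([100.0, 120.0, 80.0, 100.0],
                             [110.0, 110.0, 90.0, 90.0])
```

RMSE and MAE are in the same unit as adr (€), while R² is unitless.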

Features Required (86 features after preprocessing)

Categorical (One-Hot): hotel, meal, market_segment, distribution_channel, reserved_room_type, assigned_room_type, deposit_type, customer_type, arrival_date_month

High-cardinality (Target Encoding with KFold smoothing): country
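
One common way to do K-fold target encoding with smoothing is sketched below. It is an assumption about the general technique, not the notebook's exact code: the function name, the additive smoothing formula, the `smoothing=10.0` weight and the 5 folds are all hypothetical choices. Each row is encoded using category means computed on the other folds, which keeps a row's own target out of its encoding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold smoothed mean encoding of `col` against `target`."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        prior = fit[target].mean()  # global mean on the fitting folds
        stats = fit.groupby(col)[target].agg(["mean", "count"])
        # Shrink each category mean toward the prior; rare categories shrink more.
        smooth = (stats["count"] * stats["mean"] + smoothing * prior) / (
            stats["count"] + smoothing
        )
        # Categories unseen in the fitting folds fall back to the prior.
        values = df.iloc[enc_idx][col].map(smooth).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Toy data, for illustration only.
toy = pd.DataFrame({
    "country": ["PRT"] * 6 + ["GBR"] * 3 + ["FRA"],
    "adr": [90.0, 95.0, 100.0, 105.0, 110.0, 100.0, 120.0, 130.0, 125.0, 80.0],
})
toy["country_te"] = kfold_target_encode(toy, "country", "adr")
```

At inference time there is no target to hold out, so new rows would be encoded with a mapping fitted on the full training set.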

Numeric (Standard Scaled): 17 original features (lead_time, arrival_date_year, stays_in_*_nights, adults, children, babies, is_repeated_guest, previous_*, booking_changes, days_in_waiting_list, required_car_parking_spaces, total_of_special_requests, has_agent)

Plus 13 engineered features: total_nights, total_guests, is_family, is_solo, nights_per_guest, had_previous_cancellation, had_previous_booking, room_type_changed, lead_time_log, month_sin, month_cos, is_summer, is_high_season
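
A few of the engineered features might be derived as sketched below. The definitions are assumptions inferred from the feature names (for example, `lead_time_log` as `log1p(lead_time)` and the month encoded on a 12-step circle); the notebook remains the authoritative source. The helper name `add_engineered_features` is hypothetical.

```python
import numpy as np
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def add_engineered_features(df):
    """Add a subset of the engineered columns (assumed definitions)."""
    out = df.copy()
    out["total_nights"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    out["total_guests"] = out["adults"] + out["children"].fillna(0) + out["babies"]
    out["is_solo"] = (out["total_guests"] == 1).astype(int)
    out["room_type_changed"] = (
        out["reserved_room_type"] != out["assigned_room_type"]
    ).astype(int)
    out["lead_time_log"] = np.log1p(out["lead_time"])
    # Cyclical month encoding so that December and January end up adjacent.
    month_num = out["arrival_date_month"].map({m: i + 1 for i, m in enumerate(MONTHS)})
    angle = 2 * np.pi * month_num / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out

# Single illustrative booking.
sample = pd.DataFrame({
    "stays_in_weekend_nights": [1], "stays_in_week_nights": [2],
    "adults": [1], "children": [0.0], "babies": [0],
    "reserved_room_type": ["A"], "assigned_room_type": ["D"],
    "lead_time": [0], "arrival_date_month": ["March"],
})
features = add_engineered_features(sample)
```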

Usage

import pickle
import numpy as np

with open("hotel_bookings_adr_model.pkl", "rb") as f:
    model = pickle.load(f)

# X_new must be a (n_rows, 86) matrix produced by the same preprocessing
# pipeline as the training notebook (see Part 3.2 + Part 4.2 of the
# accompanying notebook for the exact transformations).
predictions = model.predict(X_new)   # → array of predicted adr in €
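
Because the pickle accepts any numeric matrix of the right width, a small guard before calling `model.predict` can catch shape mistakes early. The helper below is a hypothetical sketch; only the expected width of 86 comes from this card.

```python
import numpy as np

EXPECTED_FEATURES = 86  # feature count after preprocessing, per this card

def check_feature_matrix(X, expected_features=EXPECTED_FEATURES):
    """Validate that X is a finite 2-D matrix with the expected column count."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] != expected_features:
        raise ValueError(
            f"expected shape (n_rows, {expected_features}), got {X.shape}"
        )
    if not np.isfinite(X).all():
        raise ValueError("feature matrix contains NaN or infinite values")
    return X

# Placeholder input: three all-zero rows, for illustration only.
X_checked = check_feature_matrix(np.zeros((3, 86)))
```

A passing check only confirms the shape. It cannot confirm that the columns are in the training order or were produced by the same fitted transformers.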

Important Caveats

Preprocessing is NOT included in this pickle. You must apply the same StandardScaler, One-Hot Encoding, and Target Encoding steps that were applied during training, using the transformers fitted on the training data. Without them, the predictions will be meaningless.
Data leakage columns must be removed before applying preprocessing: reservation_status, reservation_status_date, is_canceled. These are post-event labels that are only known after the booking outcome.
Date validity: the model was trained on bookings with arrival dates 2015-2017. Predictions for arrivals outside this window are extrapolations and should be interpreted with caution.
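
The leakage and date caveats above can be applied to a raw bookings frame along these lines. This is a minimal sketch: the helper name `prepare_raw_bookings` is hypothetical, and it assumes the raw frame carries the dataset's `arrival_date_year` column.

```python
import pandas as pd

LEAKAGE_COLUMNS = ["reservation_status", "reservation_status_date", "is_canceled"]

def prepare_raw_bookings(df):
    """Drop post-event columns and flag arrivals outside the 2015-2017 window."""
    cleaned = df.drop(columns=LEAKAGE_COLUMNS, errors="ignore")
    # True where a prediction would be an extrapolation beyond the training years.
    outside_window = ~cleaned["arrival_date_year"].between(2015, 2017)
    return cleaned, outside_window

# Illustrative rows only.
raw = pd.DataFrame({
    "arrival_date_year": [2016, 2019],
    "lead_time": [10, 30],
    "is_canceled": [0, 1],
    "reservation_status": ["Check-Out", "Canceled"],
    "reservation_status_date": ["2016-07-03", "2019-01-01"],
})
cleaned, outside_window = prepare_raw_bookings(raw)
```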

Citation

If you use this model in academic work, please cite the original dataset and the accompanying notebook.
