Bike Sharing Demand – Washington D.C. (Hour-Level Analysis)
Lior Feinstin – Assignment.
I selected the Bike Sharing in Washington D.C. dataset, and I am using only the "hour.csv" file for my analysis. The hourly dataset contains about 17,000 records, where each row represents one hour and includes:
Target variable
- cnt – total number of bike rentals (casual + registered) in that hour
Time-related features
- yr – year (0 = 2011, 1 = 2012)
- mnth – month (1–12)
- hr – hour of the day (0–23)
- weekday – day of the week
- season – season (1: spring, 2: summer, 3: fall, 4: winter)
- holiday – whether the day is a holiday
- workingday – whether the day is a working day (not weekend/holiday)
Weather-related features
- temp – normalized temperature
- atemp – normalized “feels like” temperature
- hum – normalized humidity
- windspeed – normalized wind speed
- weathersit – weather situation category (clear, mist, light snow/rain, heavy rain/snow)
Main Questions
Regression:
Given the time and weather conditions, how many bikes will be rented in that hour?
Classification:
Based on the same features, will demand in that hour be low, medium, or high?
Part 2: EDA
Data Cleaning
- No duplicates or missing values were found in the dataset.
- Coded categorical variables were checked for inconsistencies.
All entries were consistent; for example, season contained only values 1–4 (no invalid codes such as 5 or 6).
Because the data was already clean and well-structured, no rows had to be dropped or imputed.
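These checks can be expressed in a few lines of pandas. The sketch below runs on a tiny hypothetical frame standing in for hour.csv (the real file would be loaded with pd.read_csv); the values are illustrative only:

```python
import pandas as pd

# Tiny stand-in for hour.csv (hypothetical rows; the real file has ~17k records)
df = pd.DataFrame({
    "season": [1, 2, 3, 4, 2],
    "hr": [0, 8, 17, 23, 12],
    "cnt": [16, 210, 450, 30, 120],
})

# Duplicate and missing-value checks
n_duplicates = df.duplicated().sum()
n_missing = df.isna().sum().sum()

# Coded categorical consistency, e.g. season must be in {1, 2, 3, 4}
invalid_season = ~df["season"].isin([1, 2, 3, 4])

print(n_duplicates, n_missing, invalid_season.sum())
```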
Outlier Detection
Method
- Outliers were detected using the z-score method on numerical variables.
- Potential outliers were then visualized using plots (boxplots / scatterplots).
It is possible, but not likely, that the humidity at this instant was actually zero; since it cannot be ruled out as a genuine reading, I won't be deleting this data point.
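A minimal sketch of the z-score approach, using hypothetical humidity values that include one suspicious zero like the record discussed above:

```python
import numpy as np

# Hypothetical normalized humidity values; one exact zero like the suspicious record
hum = np.array([0.55, 0.62, 0.48, 0.71, 0.0, 0.60, 0.58, 0.66])

# z-score: how many standard deviations each value lies from the mean
z = (hum - hum.mean()) / hum.std()

# Flag values more than 2 standard deviations from the mean as potential outliers
outlier_mask = np.abs(z) > 2
print(hum[outlier_mask])
```

Only the zero-humidity value is flagged here; a boxplot or scatterplot of the flagged points would then support the keep/drop decision.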
Decision
Outliers were found, especially in the rental counts during certain hours/Casual Riders/Registered Riders. However, after examining these data points, they appeared to be genuine high-demand hours, such as:
- Commuting rush hours for registered users.
- Warm-weather weekend spikes for casual users.
These observations are not data errors but real behavioral patterns. Therefore, they were kept in the dataset rather than removed.
Correlation Heatmap
To better understand relationships between variables, a correlation heatmap was created (see attached figure in the full report).
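The heatmap itself is built from a correlation matrix. The sketch below computes one on synthetic stand-ins for temp, atemp, hum, and cnt (illustrative values, not the real data); passing corr to a plotting function such as seaborn's heatmap produces the figure:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the real columns (illustrative, not the actual dataset)
temp = rng.uniform(0, 1, n)
atemp = temp + rng.normal(0, 0.02, n)        # "feels like" tracks temp closely
hum = rng.uniform(0, 1, n)
cnt = 300 * temp - 150 * hum + rng.normal(0, 40, n)

df = pd.DataFrame({"temp": temp, "atemp": atemp, "hum": hum, "cnt": cnt})
corr = df.corr()

# A heatmap of `corr` (e.g. seaborn's sns.heatmap(corr, annot=True)) gives the figure
print(corr.round(2))
```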
Key Insights
- temp and atemp have the strongest positive correlation (~0.99), confirming that perceived temperature closely follows actual temperature.
- cnt (total rentals) shows strong positive correlations with:
  - registered (~0.97) – most rentals come from registered users.
  - casual (~0.69) – casual users also drive demand.
  - temp and atemp (~0.40) – warmer conditions are associated with higher usage.
- casual correlates positively with temp (0.46) and atemp (0.45), suggesting that casual riders are more weather-sensitive.
- Humidity (hum) has a moderate negative correlation with cnt (~–0.32), indicating that high humidity tends to reduce bike usage.
Distribution of Values
To understand the distribution of key variables, several histograms were plotted.
Temperature (temp)
The distribution of temperature is approximately bell-shaped, indicating that most hours occur around moderate temperatures, with fewer hours at extreme cold or heat.
Rental count (cnt)
The distribution of cnt is right-skewed:
- Many hours have relatively low rental counts.
- High-demand hours are less frequent but important.
This suggests that while most hours have moderate or low demand, there are distinct peak periods with very high usage.
Average Monthly Bike Rentals
Insights
- Bike rentals gradually increase from month 2 to month 6, suggesting rising activity as the year progresses.
- Months 6, 8, and 9 show the highest average rental levels, indicating peak usage periods in late spring and summer.
- After month 9, rentals decline steadily through month 12, showing reduced demand later in the year.
- Overall, the pattern reflects a strong seasonal cycle, with a clear mid-year peak and lower rental volumes at the beginning and end of the year.
Research Questions and Insights
Q1: How do weather conditions affect bike rentals?
Bike rentals decrease steadily as weather conditions worsen:
- Clear weather (weathersit = 1) has the highest average rentals.
- As conditions move to mist, light rain/snow, and finally heavy rain/snow, average rentals drop.
- The lowest usage is observed in severe weather (weathersit = 4).
This suggests that unfavorable weather strongly suppresses demand.
Q2: Does bike demand vary across different hours of the day?
Insights:
- Bike rentals show clear commuter behavior:
- A sharp morning peak around 8:00.
- A strong evening peak around 17:00–18:00.
- During late night and early morning hours, demand is very low.
- Midday usage increases gradually, especially during non-working hours.
This indicates that commuting patterns are one of the main drivers of hourly bike demand.
Q3: Do working days show different rental patterns compared to weekends and holidays?
The average number of total rentals is slightly higher on working days than on weekends and holidays, but the difference is not dramatic:
- Working days benefit from commuting usage (consistent peaks).
- Weekends and holidays show more leisure-oriented usage.
Overall, bike demand remains relatively steady throughout the week, with commuting patterns boosting working-day usage and leisure activities balancing demand on non-working days.
Q4: Are casual riders more influenced by weather than registered riders?
Casual riders show a sharp decline in usage as weather worsens:
- From about ~40 rentals in clear weather to nearly 0 in heavy rain/snow.
- This indicates that casual riders are highly sensitive to weather conditions.
Registered riders also reduce their usage in poor weather, but the decline is more gradual:
- From roughly ~165 rentals in clear weather to ~72 rentals in severe weather.
- This suggests that commuting needs and routine trips make registered riders less affected by bad weather compared to casual riders.
These patterns show that weather has a stronger impact on casual, leisure-based usage than on registered, commuting-based usage.
Part 3: Define and Train a baseline model
Regression Goal
The goal of the baseline model is to predict the number of bikes rented per hour (cnt) using weather conditions, time-related features, and day-type information.
Feature Selection
I selected features that have a clear relationship with bike demand, based on exploratory data analysis (EDA):
- Weather features: temp, hum, windspeed, weathersit
- Time features: hr, weekday, mnth, season, yr
- Day-type features: holiday, workingday
This feature set is designed to capture environmental, temporal, and behavioral factors that influence hourly rental counts.
Train–Test Split and Excluded Features
The data was split into training and test sets (80/20). Several columns were excluded before modeling:
- cnt – target variable (to be predicted, not used as input)
- dteday – date string; not directly useful for the baseline model
- instant – record index; acts like an ID and carries no predictive information
- casual, registered – their sum is cnt, so using them would leak the answer
- atemp – almost perfectly correlated with temp (~0.99), so it was dropped to avoid redundancy
Scaling Step
The dataset includes features on different scales (e.g., hr ranges 0–23, while temp, hum, windspeed are normalized between 0 and 1). To prevent features with larger numeric ranges from dominating the Linear Regression model, I applied StandardScaler to the numeric variables. This puts all features on a comparable scale and makes the baseline model more stable and interpretable.
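A minimal sketch of this scaling step, assuming two hypothetical numeric columns (hr on a 0–23 scale, temp normalized to 0–1):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns on very different scales: hr (0–23) and temp (0–1)
X = np.array([
    [0, 0.24],
    [8, 0.52],
    [17, 0.80],
    [23, 0.30],
], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and unit variance, so no feature dominates by range
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```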
Baseline Linear Regression Model
A Linear Regression model was trained on the selected, scaled features.
Model performance was evaluated on the test set using:
- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- R² (Coefficient of Determination)
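A sketch of how these four metrics can be computed with scikit-learn, on hypothetical actual/predicted hourly counts:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted hourly rental counts
y_true = np.array([120, 300, 45, 510, 230])
y_pred = np.array([150, 280, 60, 460, 250])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # back in the original units (bikes)
r2 = r2_score(y_true, y_pred)               # share of variance explained
print(mae, rmse, r2)
```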
Model Performance Visuals
1. Actual vs. Predicted Plot
(Actual vs. Predicted plot – image to be added here.)
This plot compares predicted rental counts to the true values. Points lying far from the diagonal line indicate prediction errors and show that the baseline linear model struggles with some high-demand hours.
2. Residuals Plot
A residual is defined as:
residual = actual value − predicted value
- A negative residual means the prediction was too high.
- A positive residual means the prediction was too low.
The residual plot shows poorly behaved residuals, with large positive and negative values and no clear random pattern. This suggests that the baseline Linear Regression model does not fully capture the complexity of the relationships in the data (non-linearity, interactions, etc.).
Feature Importance (Linear Regression Coefficients)
Using the absolute values of the Linear Regression coefficients, the most influential features in the baseline model are:
- Temperature (temp) – strongest positive coefficient; higher temperatures are strongly associated with more rentals.
- Hour of the day (hr) – important predictor, confirming strong time-of-day patterns in demand.
- Year (yr) – positive coefficient, indicating that overall usage increased from 2011 to 2012.
- Humidity (hum) – large negative coefficient; higher humidity tends to reduce rentals.
- Season (season) – positive contribution, suggesting higher demand in certain seasons (e.g., summer and fall).
Other features such as holiday, weekday, and windspeed have relatively small coefficients, indicating a weaker direct effect in this simple linear baseline model. These findings motivated further feature engineering and the use of more flexible models in later parts of the project.
Part 4: Feature Engineering
4.1 Polynomial Weather Features
I created polynomial features for the main weather variables:
- temp2 = temp²
- hum2 = hum²
- windspeed2 = windspeed²
Logic:
The effect of weather on bike rentals is not purely linear. Very low and very high temperatures tend to reduce demand, while moderate temperatures increase it. Squared terms allow the model to capture these curved (U-shaped or inverted U-shaped) relationships between weather conditions and rental counts.
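A minimal sketch of creating the squared terms, on hypothetical normalized weather values:

```python
import pandas as pd

# Hypothetical normalized weather values (illustrative rows only)
df = pd.DataFrame({
    "temp": [0.2, 0.5, 0.9],
    "hum": [0.8, 0.6, 0.4],
    "windspeed": [0.1, 0.3, 0.5],
})

# Squared terms let a linear model fit curved (U-shaped / inverted-U) weather effects
for col in ["temp", "hum", "windspeed"]:
    df[col + "2"] = df[col] ** 2

print(df[["temp", "temp2"]])
```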
4.2 Cyclical Encoding for Hour and Month
I transformed the hour of the day (hr) and the month (mnth) into cyclical features using sine and cosine:
hr_sin, hr_cos, mnth_sin, mnth_cos
Logic:
Time-of-day and seasonality are cyclical: after hour 23 comes hour 0, and after December comes January. A simple numeric encoding (0–23, 1–12) makes the model treat “23” and “0” as far apart, which is not true in time. Cyclical encoding preserves the circular structure and helps the model understand that “late night” and “early morning” are close to each other.
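A sketch of the sine/cosine transform, showing that hours 23 and 0 end up close together in the encoded space:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hr": [0, 6, 12, 23]})

# Map the hour onto a circle so that 23:00 and 00:00 end up close together
df["hr_sin"] = np.sin(2 * np.pi * df["hr"] / 24)
df["hr_cos"] = np.cos(2 * np.pi * df["hr"] / 24)

# Distance between hour 23 and hour 0 in (sin, cos) space is small,
# whereas the raw encoding puts them 23 apart
p23 = df.loc[df["hr"] == 23, ["hr_sin", "hr_cos"]].to_numpy()[0]
p0 = df.loc[df["hr"] == 0, ["hr_sin", "hr_cos"]].to_numpy()[0]
print(np.linalg.norm(p23 - p0))
```

The same transform applies to mnth with a period of 12 instead of 24.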
4.3 Time-of-Day and Calendar Flags
I added several binary flags to capture different usage regimes, for example:
- Morning peak vs. evening peak
- Working day vs. weekend/holiday
- Weekend evenings vs. other periods
Logic:
Bike usage patterns differ between commuting times and leisure times. Morning and evening peaks on working days usually reflect commuting behavior, while weekend evenings are more related to leisure rides. These flags help the model distinguish these patterns instead of treating every hour in the same way.
4.4 Weather Comfort and Bad-Weather Flags & Interaction Features
Weather Comfort / Bad-Weather Flags
I defined indicators such as:
- is_mild_temp – comfortable temperature range
- is_too_cold / is_too_hot – extreme temperatures
- is_bad_weather – rainy/snowy or otherwise unpleasant conditions
Logic:
Users are more likely to rent bikes when the weather is comfortable and less likely during very cold, very hot, or rainy/snowy conditions. Instead of expecting the model to infer all thresholds from raw numbers, explicit comfort and bad-weather flags make these patterns easier to learn.
Interaction Features
I added interaction terms, for example:
- yr_season – interaction between year and season
- Indicators capturing peak hours on working days vs. weekends
Logic:
The effect of one variable often depends on another. For example, the morning peak matters mainly on working days, and bad weather may have a stronger impact during typical commuting times. Interaction features allow the model to capture these “it depends on…” relationships and changes in demand patterns across the year.
4.5 One-Hot Encoding (OHE)
For the categorical variables:
season, mnth, hr, weekday, weathersit
I used One-Hot Encoding within a ColumnTransformer. Each category value (for example, season = summer) is converted into its own binary column that takes the value 0 or 1.
Logic:
Linear and tree-based models do not work well with raw integer codes for categories, because they may incorrectly assume an order or distance (e.g., “season 4” > “season 1”). One-Hot Encoding represents categories in a way that does not impose any artificial ordering.
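A minimal sketch of the One-Hot Encoding step inside a ColumnTransformer, on a hypothetical two-column frame (sparse_threshold=0 simply forces a dense array for readability):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: one coded categorical column, one numeric column
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "temp": [0.2, 0.5, 0.8, 0.3],
})

# One binary column per season value; numeric columns pass through unchanged
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["season"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X = ct.fit_transform(df)
print(X)  # 4 one-hot columns for season, then the untouched temp column
```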
4.6 Applying Clustering (Unsupervised Learning)
To capture typical usage regimes (combinations of time, calendar, and weather), I applied K-Means clustering on a subset of features.
Features Used for Clustering
Time / calendar:
weekday, workingday, season, yr
Weather:
temp, hum, windspeed, weathersit
Engineered features:
- hr_sin, hr_cos
- is_morning_peak, is_evening_peak
- is_weekend, is_weekend_evening
- is_mild_temp, is_too_cold, is_too_hot
- is_bad_weather
Note: The raw hr column was removed from the clustering features, and only hr_sin and hr_cos were used to represent the circular nature of the hour.
Choosing the Number of Clusters (k)
I plotted:
- The elbow curve (inertia vs. k)
- The silhouette score for different values of k
By combining both plots, I chose k = 8, which:
- Corresponds to an “elbow” in the inertia curve, and
- Achieves one of the best silhouette scores.
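The elbow/silhouette procedure can be sketched as below; synthetic blobs with four well-separated groups stand in for the engineered features, so here the silhouette score peaks at the true number of groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D data with four well-separated groups (illustrative stand-in
# for the engineered clustering features)
centers = [[0, 0], [5, 5], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # input to the elbow curve
    silhouettes[k] = silhouette_score(X, km.labels_)

# Inertia always decreases with k; the silhouette score peaks at the "natural" k
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```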
4.7 Cluster Visualization
Principal Component Analysis (PCA)
I used PCA to reduce the clustering feature space to two principal components, and plotted the observations in this 2D space, colored by their K-Means cluster ID.
Insight:
The PCA plot looks relatively messy because it compresses many features into just two linear components, so clusters that are well separated in higher dimensions can overlap in 2D. To better show local structure and separation between clusters, I also used t-SNE.
t-SNE Visualization
In the t-SNE map, the clusters form several compact “islands” of points with relatively clear boundaries between many of the colors. This suggests that the K-Means algorithm has found groups of hours that are locally well separated in terms of time, calendar, and weather characteristics.
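A minimal sketch of the PCA projection step on a synthetic high-dimensional stand-in for the clustering features (t-SNE would be applied the same way via sklearn.manifold.TSNE, but is slower to run):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic high-dimensional stand-in for the clustering feature matrix
X, labels = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)

# Project to two principal components; a scatter of X2 colored by cluster ID
# gives the 2D visualization described above
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape, pca.explained_variance_ratio_.round(3))
```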
4.8 Cluster Interpretation
I created a cluster interpretation table (average features and average cnt per cluster) to understand the characteristics of each of the 8 clusters.
Summary of Cluster Profiles
Cluster 3 – Cold, low-demand off-peak hours (~47 bikes)
- Very low temperature (temp ≈ 0.19, is_too_cold = 1.00), mostly clear weather.
- No morning/evening peak flags → off-peak times.
- Lowest average demand → cold, off-peak hours with very few rentals.
Cluster 0 – Bad-weather hours with suppressed demand (~111 bikes)
- weathersit = 3, is_bad_weather = 1.00 → rain/snow conditions.
- Mix of working and non-working days, some peak hours.
- Demand still relatively low → rainy/snowy hours where weather strongly reduces rentals.
Cluster 1 – Working-day off-peak with mild conditions (~118 bikes)
- Always working days (workingday = 1.00), no peak flags.
- Mild, reasonable weather (temp ≈ 0.50, hum ≈ 0.65).
- Quiet working-day off-peak periods with decent weather but low usage.
Cluster 6 – Weekend daytime with mild weather (~173 bikes)
- Pure weekend/holiday (is_weekend = 1.00, workingday = 0.00).
- Comfortable temperatures (is_mild_temp = 0.44).
- No commute peaks → weekend daytime leisure usage.
Cluster 5 – Weekend evenings, warm weather (~221 bikes)
- Weekend and evening (is_weekend = 1.00, is_weekend_evening = 1.00).
- Often warm or slightly hot (temp ≈ 0.51, is_too_hot = 0.15).
- Higher demand than other weekend clusters → weekend evening leisure cluster.
Cluster 2 – Hot mid-day hours with high demand (~257 bikes)
- Hot conditions (temp ≈ 0.80, is_too_hot = 1.00) with mostly good weather.
- Mixed working/non-working days, no peak flags → mid-day hours.
- Hot daytime hours with relatively strong demand.
Cluster 4 – Working-day morning commute (~352 bikes)
- All working days (workingday = 1.00, is_weekend = 0.00).
- is_morning_peak = 1.00, mostly mild temperatures.
- Very high demand → weekday morning rush hour cluster.
Cluster 7 – Working-day evening commute, warm (~450 bikes)
- All working days, is_evening_peak = 1.00.
- Warm to hot (temp ≈ 0.58, is_too_hot = 0.25).
- Highest average cnt → weekday evening commute with very high rentals.
4.9 Are the Clusters Useful?
Yes, the clusters are useful and align well with real-world patterns:
- They separate cold off-peak hours, bad-weather hours, weekend leisure periods, weekday commute peaks, and hot daytime hours.
- The average cnt varies strongly across clusters (from ~47 to ~450 rentals), which means that cluster ID is strongly related to demand.
For these reasons, the cluster assignments (and distances to cluster centroids) were used as additional features in the regression and classification models to help capture complex, non-linear demand regimes.
Part 5: Train and Evaluate Three Improved Models
5.1 Selected Models
For the improved regression task, I trained and compared three models on the engineered + cluster features:
- Linear Regression
- Random Forest Regressor
- Gradient Boosting Regressor
All models used the same feature set and the same train–test split.
5.2 Modeling Setup
Features (X):
Engineered features + clustering features (time, calendar, weather, cyclic encodings, flags, and cluster ID).
Target (y):
cnt – total number of bike rentals per hour.
Preprocessing:
- One-Hot Encoding (OHE) for categorical variables (e.g., season, mnth, hr, weekday, weathersit).
- Scaling for numeric variables using StandardScaler.
Pipelines:
For each model, I built a scikit-learn Pipeline containing: preprocessing → regressor (Linear Regression, Random Forest, or Gradient Boosting).
Evaluation:
A helper function was used to fit each model and compute metrics on train and test sets.
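The pipeline-plus-helper setup can be sketched as follows, on synthetic regression data standing in for the engineered features; the evaluate helper and its metric names are illustrative, not the exact code used:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data stands in for the engineered feature matrix
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (400, 5))
y = 300 * X[:, 0] - 150 * X[:, 1] + rng.normal(0, 20, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

def evaluate(model):
    """Fit a preprocessing + regressor pipeline and report train/test metrics."""
    pipe = Pipeline([("scale", StandardScaler()), ("reg", model)])
    pipe.fit(X_tr, y_tr)
    return {
        "train_mae": mean_absolute_error(y_tr, pipe.predict(X_tr)),
        "test_mae": mean_absolute_error(y_te, pipe.predict(X_te)),
        "test_r2": r2_score(y_te, pipe.predict(X_te)),
    }

results = {name: evaluate(m) for name, m in [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
]}
print(results)
```

Comparing train vs. test metrics in the returned dict is also what supports the overfitting check in the next subsection.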
Random Forest achieves the lowest RMSE and MAE among all models, which means it makes the smallest prediction errors on hourly bike rentals on the test set. It also has the highest R², indicating that it explains the largest proportion of the variance in bike demand compared to the other models (i.e., its predictions follow the true rental patterns most closely).
5.3 Overfitting Check
Overfitting occurs when a model fits the training data too closely (very low training error) but performs worse on new data, while underfitting happens when a model is too simple and has relatively high errors on both train and test sets.
Linear Regression with engineered features
Shows similar but moderate errors on train and test, which indicates underfitting: the model cannot fully capture the complexity of the data.
Random Forest Regressor
Fits the training data very strongly and shows a small increase in error on the test set, which reflects some overfitting.
However, Random Forest still has by far the lowest test RMSE and MAE and the highest R² of all models, so it is selected as the best model despite this mild overfitting.
5.4 Feature Importance
- For Gradient Boosting and Random Forest, the most important feature is num__hr (the hour of the day after numeric preprocessing), confirming the strong time-of-day pattern in demand.
- For Linear Regression, temp has the largest positive coefficient, making it the most important feature in the linear model.
These results highlight the importance of both time-of-day and temperature in explaining bike rentals.
5.5 Discussion of Improvements
Baseline vs. engineered Linear Regression
The baseline Linear Regression on original features achieved test RMSE ≈ 139.4, MAE ≈ 105.0, and R² ≈ 0.39.
After adding engineered features (cyclical time, peak-hour flags, comfort/bad-weather indicators, clusters), Linear Regression improved to RMSE ≈ 92.9, MAE ≈ 69.5, and R² ≈ 0.73, showing that the new features capture important structure in the data.
Tree-based models on engineered data
- Gradient Boosting Regressor: test RMSE ≈ 54.4, MAE ≈ 36.7, R² ≈ 0.91.
- Random Forest Regressor: test RMSE ≈ 41.0, MAE ≈ 24.6, R² ≈ 0.95.
These results indicate that non-linear tree models can exploit the engineered features more effectively than Linear Regression.
Reason for overall improvement
The main factor behind the improvement over the baseline model is the feature engineering step. By adding richer features (cyclical hour encoding, peak-hour and weekend indicators, comfort vs. extreme weather flags, and cluster-based features), the models receive a much more informative description of each hour and can better learn how time, day type, and weather jointly affect demand. Linear Regression already benefits significantly from these engineered features, and the tree-based models (Random Forest and Gradient Boosting) can exploit them even further by modeling non-linear relationships and interactions that the baseline linear model could not capture.
5.6 Final Winner (Regression)
Random Forest Regressor is the best model for this dataset because:
- It achieves the lowest test errors (RMSE and MAE) and the highest test R²,
- It predicts hourly bike rentals much more accurately than the baseline Linear Regression and the Gradient Boosting model, and
- Its overfitting is mild and acceptable, with very strong training performance and still excellent generalization on the test set.
For these reasons, Random Forest was selected as the final winning regression model.
Visuals of the winning model vs Baseline model:
The plots clearly show that the Random Forest with engineered features fits the data much better than the baseline Linear Regression without feature engineering.
Link to my Hugging Face model page: https://huggingface.co/Liori25/bike-sharing-random-forest/tree/main
Part 7: Regression-to-Classification
7.1 Strategy – Quantile Binning
I reframed the regression problem as a three-class classification task by discretizing the hourly rental counts into:
- 0 – low demand
- 1 – medium demand
- 2 – high demand
using the 33% and 66% quantiles of cnt computed on the training set.
Why this strategy?
Quantile-based binning produces roughly equal-sized classes, which helps the classifier learn effectively and avoids severe class imbalance.
The resulting classes are indeed well balanced: each class contains about one third of the observations in both the training and test sets.
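A minimal sketch of the quantile-binning step, on a synthetic right-skewed stand-in for cnt; note that the thresholds come from the training data only and are then applied to any split:

```python
import numpy as np

rng = np.random.default_rng(0)
cnt_train = rng.gamma(shape=2.0, scale=90.0, size=1000)  # right-skewed like cnt

# Thresholds from the TRAINING set only
q33, q66 = np.quantile(cnt_train, [0.33, 0.66])

def to_class(cnt):
    # 0 = low, 1 = medium, 2 = high
    return np.digitize(cnt, [q33, q66])

labels = to_class(cnt_train)
shares = np.bincount(labels) / labels.size
print(shares.round(2))  # roughly one third per class
```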
Part 8: Train & Evaluate Classification Models
8.1 Metric Priorities
Precision vs. Recall (for this task)
For the bike-rental task, recall for the high-demand class is more important than precision.
Missing a true high-demand hour (low recall) means stations may not have enough bikes, leading to lost revenue and unhappy users. Over-predicting high demand (lower precision) is less harmful, because having a few extra bikes available is usually acceptable. Therefore, recall (and F1) for the high-demand class is a key evaluation metric.
False Positives vs. False Negatives
A False Negative (predicting “not high” when demand is actually high) is more critical, because it directly harms service reliability: stations can run out of bikes, frustrating users and damaging trust.
A False Positive (predicting high demand when demand is normal) mainly causes over-allocation of bikes, which has some operational cost but does not usually hurt the customer experience. Thus, reducing False Negatives is the priority.
8.2 Trained Classification Models
I trained three different classification models on the engineered + cluster features:
- Logistic Regression – a linear classifier that separates the classes using linear decision boundaries in the transformed feature space.
- RandomForestClassifier – an ensemble of many decision trees trained on random subsets of data and features; it can capture complex, non-linear interactions between time, weather and day-type features.
- GradientBoostingClassifier – builds a sequence of small decision trees where each tree focuses on correcting the errors of the previous ones, often achieving strong performance on tabular data.
For each model, I computed the classification report (precision, recall, F1-score, support) and plotted a confusion matrix.
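These reports can be produced as sketched below, on hypothetical true/predicted demand classes (0 = low, 1 = medium, 2 = high):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted demand classes for a handful of hours
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 0, 1, 0, 2, 2, 2, 1, 2, 0])

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, F1-score, and support
print(classification_report(y_true, y_pred, target_names=["low", "medium", "high"]))
```

In this toy example, the errors sit next to the diagonal (medium ↔ low/high), the same "off by one level" pattern discussed for the real confusion matrices.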
8.3 Confusion-Matrix Insights
Logistic Regression
Classifies most hours correctly but struggles with the medium-demand class: medium hours are often predicted as low or high, while direct confusions between low and high are rare. Most errors are off by one level (medium ↔ low/high).
RandomForestClassifier
Shows a very strong diagonal in the confusion matrix: the vast majority of low, medium, and high-demand hours are correctly classified. Remaining errors are mainly between neighboring classes, with almost no low↔high confusions.
GradientBoostingClassifier
Also predicts most hours correctly; errors occur mainly between adjacent classes (low vs. medium, medium vs. high), and direct low↔high mistakes are very rare.
8.4 Best Classification Model
Based on the test-set metrics (accuracy, macro precision, macro recall, macro F1), the RandomForestClassifier is the best of the three classifiers:
- Highest test accuracy
- Highest macro precision, recall, and F1-score
- Confusion matrix shows few severe errors and mostly “off by one level” mistakes
Technically, RandomForestClassifier performs best because it is an ensemble of many decision trees, which can model complex non-linear relationships and interactions between time variables and weather conditions that a single linear model cannot represent.
The winning classification model (RandomForestClassifier pipeline) is available here:
https://huggingface.co/Liori25/bike-sharing-random-forest/tree/main
Thank you.