---

{}
---
# Uber and Lyft Dataset Boston

## By Idan Khen


<video controls width="700">
  <source src="https://huggingface.co/Idankhen/Winning_Model/resolve/main/Presentation%204.mp4">
  Your browser does not support the video tag.
</video>



# Model Summary
This project analyzes Uber and Lyft ride data to understand price patterns, build predictive models, and convert the problem from regression to classification.
I begin with data cleaning and exploratory analysis, move to regression modeling with engineered features, and finally reframe price prediction into a multiclass classification task.

## Model Details
This model predicts the *price category* of a ride (cheap / medium / expensive) based on engineered features such as distance, surge multiplier, weather conditions, and cluster-based location features.

## Model Information
The dataset contains over 500,000 rows and 57 features.
It includes detailed ride information such as price, distance, timestamps, pickup and dropoff locations, weather conditions, surge multiplier features, and engineered variables.

### Model Description


- **Developed by:** Idan Khen
- **Input type:** Numeric tabular features
- **Output type:** Class label


## Exploratory Data Analysis (EDA)

Before modeling, exploratory analysis was performed to understand the structure and behavior of the data.
This included checking distributions, identifying extreme values, and validating key relationships.
To ensure the dataset was usable, several cleaning steps were performed:

- Removed rows with missing values (price alone had 55,095 missing)
- Removed duplicate columns
- Extracted hour, weekday, and month from the timestamp
- Dropped irrelevant or unused columns
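The cleaning steps above can be sketched with pandas. This is a minimal, illustrative version: the column names `price` and `timestamp` match the dataset's description, but the exact schema of the real file may differ.

```python
import pandas as pd

def clean_rides(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above (illustrative column names)."""
    df = df.dropna(subset=["price"])           # drop rows with a missing price
    df = df.loc[:, ~df.columns.duplicated()]   # remove duplicate columns
    ts = pd.to_datetime(df["timestamp"])       # parse the ride timestamp
    return df.assign(hour=ts.dt.hour,          # extract hour, weekday, month
                     weekday=ts.dt.weekday,
                     month=ts.dt.month)
```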

### Outlier Handling
Outliers were detected in several numerical features, especially in distance and surge-related columns. 
These extreme values represent real but rare ride scenarios (such as very long trips or periods of heavy surge pricing), so removing them would distort the true behavior of the data. 
Therefore, the outliers were kept to preserve the integrity and variability of the dataset.
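A common way to *detect* such outliers without removing them is the 1.5×IQR rule; a small sketch (the function name and threshold are illustrative, not from the original notebook):

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside the classic k*IQR fences (detection only; rows are kept)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)
```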


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/nX_qg1krKFdxHxJ_0Xy4S.png" width="500">


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/LbZfnQZEARXZ-93hHWA9q.png" width="500">


### Visual Exploration
After cleaning the dataset, several visualizations were created to better understand feature behavior and relationships.

*Correlation heatmap*

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Z08ys7YF-nnjaYReVid8R.png" width="600">

*Distribution plots*

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/JvG5iCw7Muku-TPNeFjGC.png" width="600">


*Scatter Plot*


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/K3aUK7C_smA7SV_wqTIwk.png" width="600">



# Q&A
### 1. Are certain hours of the day associated with higher ride prices?

The graph shows that average ride prices remain roughly constant throughout the day.
This indicates that the hour of the day does not meaningfully affect ride pricing.


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/vKfwMnMYtYmP5RlEIRJSj.png" width="600">


### 2. How do weather conditions affect ride prices?

Both the temperature scatterplot and the cold-vs-warm comparison showed that prices are almost the same across cold, mild, and warm weather.
Temperature does not appear to affect ride prices.


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/HP7RnS2rTq7VLBMFS-8TX.png" width="600">


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/YIXjRC95l532mNhe341b-.png" width="600">


### 3. Which pickup locations tend to have higher ride prices?

Pickups from Boston University, Fenway, and the Financial District are the most expensive on average.
Haymarket Square and the North End are the cheapest, showing clear price differences by location.


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/aaXYuxtRQPIzdmog9EJgZ.png" width="600">


### 4. Are there price differences between Uber and Lyft rides?

Lyft shows a wider and higher price distribution than Uber, meaning Lyft rides tend to be more expensive.


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/IJoqL7w6fYisWdDkcI50q.png" width="600">



# Baseline Model

The goal was to build a simple first model using Linear Regression. I split the data into 80% train / 20% test, encoded categorical variables, selected the features (X), and set price as the target (y). 
After training the model, I evaluated it using MAE, MSE, RMSE, and R².
I then reviewed the residual distribution, the Actual vs. Predicted plot, and the feature coefficients to understand model errors and which variables influenced price the most.
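The baseline workflow can be sketched as follows. Synthetic data stands in for the real feature matrix (the actual X/y come from the cleaned dataset), but the split, model, and metrics mirror the steps described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the ride features (e.g. distance, surge_multiplier).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
y = 3.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 1, 1000)

# 80% train / 20% test split, as in the baseline.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
```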

*Model behavior:*

- Residual distribution: showed how far predictions were from the true values.

- Actual vs. Predicted plot: revealed clear underestimation for high-price rides.

- Coefficient plot: showed that surge_multiplier and distance were the strongest predictors.

### Conclusion
The baseline Linear Regression model captured general trends but struggled with the non-linear structure of the data, especially for expensive rides. The residuals showed noticeable spread, and the R² score confirmed limited explanatory power.
This indicated the need for feature engineering and more advanced models in later stages.


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/x8ZkElILIrdkuLfEacnDs.png" width="600">


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/zrT7egOPfGi_psoBCIaIo.png" width="600">


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/_oxvMAfpvxZ4xo7m3BYV-.png" width="600">


# Feature Engineering
For feature engineering, I focused on the numeric columns and defined a list of numeric features that would be used for modeling. 
After preparing the base numeric inputs, I generated polynomial features to help the model capture simple non-linear relationships that the original variables alone might miss. 
This expanded the feature space and gave the later models more expressive power.
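For a degree-2 expansion of two numeric inputs, scikit-learn's `PolynomialFeatures` produces the squared and interaction terms described above (the two input columns here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # e.g. [distance, surge_multiplier]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# degree-2 expansion yields [x1, x2, x1^2, x1*x2, x2^2]
```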


## Applying Clustering
To improve the feature set, I used K-Means clustering on the scaled polynomial features. I applied the Elbow Method and found that four clusters offered a good balance between model complexity and explained variation. After fitting K-Means with k=4, I added each ride's cluster label back into the dataset.
To better understand the structure of the clusters, I visualized them using PCA for linear dimensionality reduction and UMAP for clearer non-linear separation, both of which clearly displayed distinct cluster groupings. 
Finally, I enhanced the dataset by calculating each ride's distance to its cluster centroid and creating cluster-probability features, which provided the later models with additional information about cluster confidence and structure.
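The clustering step above can be sketched like this; random data stands in for the scaled polynomial features, but the k=4 fit, the per-ride labels, and the distance-to-centroid feature follow the description:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 4))  # stand-in for the polynomial feature matrix

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

labels = km.labels_  # cluster label per ride, added back to the dataset
# distance from each ride to its own cluster centroid (extra feature)
dists = np.linalg.norm(X_scaled - km.cluster_centers_[labels], axis=1)
```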


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/BtMycLgbDEOZkHZH14c4D.png" width="600">

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/dItyvpJvX5HMlXcp26kkP.png" width="600">

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/dU0o9qMt7GhZVNxaFIsI0.png" width="600">

# Train Three Models

I trained three improved regression models using the engineered dataset: Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.
Each model was fitted on the training data and evaluated on the test set using RMSE, MAE, and R² to measure predictive performance.
All three improved models performed far better than the baseline, reducing error dramatically. 
Performance across Linear Regression, Random Forest, and Gradient Boosting was very similar, with Gradient Boosting achieving the best overall balance of RMSE, MAE, and R², making it the strongest model in this comparison. 
Its boosted tree structure allowed it to capture nonlinear interactions more effectively than the other models.
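The three-model comparison can be sketched as below. The synthetic target contains a nonlinear interaction (standing in for the real price signal), and each model is scored with the same RMSE / MAE / R² metrics used above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + rng.normal(0, 0.5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}
scores = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    p = m.predict(X_te)
    scores[name] = {"RMSE": float(np.sqrt(mean_squared_error(y_te, p))),
                    "MAE": float(mean_absolute_error(y_te, p)),
                    "R2": float(r2_score(y_te, p))}
```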


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Otb1cFsJT2ZMHRsWdTjWk.png" width="600">


*Gradient Boosting features importance*


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/egfcECAhQbl7E_omsY7vT.png" width="600">



# Regression to Classification
To transform the problem from predicting a continuous price into predicting price categories, I converted the numeric target into discrete classes using three different strategies:

- *Median Split* – converted the target into a binary class (0 = below median, 1 = above median).

- *Quantile Binning* – created three balanced classes based on the 33% and 66% percentiles of the training set.

- *Business-Rule Threshold* – defined “expensive” rides using a simple rule: price > 0.

Before training classification models, I examined the class distributions for train and test to ensure they were reasonably balanced.
Visualizations confirmed that the median split and quantile binning produced well-distributed classes, while the business-rule split created a more imbalanced dataset.
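The median-split and quantile-binning strategies can be sketched with pandas; the tiny price series here is illustrative, not taken from the dataset:

```python
import numpy as np
import pandas as pd

y_train = pd.Series([5.0, 9.0, 12.0, 16.0, 25.0, 30.0])  # illustrative prices

# Median split: 0 = below median, 1 = at/above median.
median = y_train.median()
y_binary = (y_train >= median).astype(int)

# Quantile binning: three classes from the 33% / 66% training percentiles.
q33, q66 = y_train.quantile([0.33, 0.66])
y_tertile = pd.cut(y_train, bins=[-np.inf, q33, q66, np.inf], labels=[0, 1, 2])
```

Computing the cut points on the training set only (and reusing them for the test set) avoids leaking test-set information into the class boundaries.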


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/6TBw50gy-mw3bsMMx_B0_.png" width="600">


# Train & Eval Classification Models

After converting the continuous target into categorical classes, three different classifiers from scikit-learn were trained: Logistic Regression, Random Forest Classifier, and Gradient Boosting Classifier.
To keep computation manageable, a 100,000-row subsample of the training data was used. Each model was trained and evaluated using Accuracy, Macro F1-score, and a full classification report, followed by confusion matrix visualizations.
Logistic Regression showed high confusion between all classes and struggled with the middle class. 
Random Forest improved separation but still mixed boundaries, especially for Class 1. 
Gradient Boosting delivered the most balanced predictions, with the best stability across all classes. 
### Winner: Gradient Boosting Classifier, achieving the strongest overall performance.
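The evaluation loop for one of the classifiers can be sketched as follows, with a synthetic three-class target standing in for the binned prices; the metrics (accuracy, macro F1, confusion matrix) match those used above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(900, 4))
# Three synthetic price classes (0/1/2) derived from the features.
y = np.digitize(X[:, 0] + 0.5 * X[:, 1] ** 2, bins=[-0.5, 0.8])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
macro_f1 = f1_score(y_te, pred, average="macro")
cm = confusion_matrix(y_te, pred)  # rows = true class, columns = predicted
```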


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/9AHm6ZOqwHH6wKOXGx8Eo.png" width="600">





# Logistic Regression


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/D5ilyilgyJhl0eVeNxopI.png" width="600">


# Random Forest


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/SmrQbCw7eRnmyX6vJXCIH.png" width="600">


# Gradient Boosting


<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/c9Nl6GPiF3Q5I5Uj1MMpT.png" width="600">





# Conclusion

This project brought together real-world data exploration, feature engineering, clustering, and predictive modeling into one complete workflow. 
Through exploratory data analysis, I found the main factors that influence ride prices. 
Feature engineering and clustering uncovered hidden patterns in the data and improved how the model learned. 
I trained several regression and classification models, compared how they performed, and chose the best ones based on clear metrics. 

Overall, this process deepened my understanding of practical machine-learning pipelines, from raw data to clear insights.
It also showed me the value of experimentation, iteration, and careful evaluation.



# Video Link: https://www.loom.com/share/6a96400976ba486c89091b9d75738884