Update README.md
README.md
After cleaning the dataset, several visualizations were created to better understand the data.

*Correlation heatmap*

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Z08ys7YF-nnjaYReVid8R.png" width="600">
*Distribution plots*

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/JvG5iCw7Muku-TPNeFjGC.png" width="600">
*Scatter plot*

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/K3aUK7C_smA7SV_wqTIwk.png" width="600">
## 1. Does the hour of the day affect ride prices?

The graph shows that average ride prices remain constant throughout the day.
This indicates that the hour of the day does not affect ride pricing.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/vKfwMnMYtYmP5RlEIRJSj.png" width="600">
## 2. How do weather conditions affect ride prices?
Both the temperature scatterplot and the cold-warm comparison showed that the price barely changes.
Temperature doesn't affect ride prices.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/HP7RnS2rTq7VLBMFS-8TX.png" width="600">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/YIXjRC95l532mNhe341b-.png" width="600">
## 3. Which pickup locations tend to have higher ride prices?

Pickups from Boston University, Fenway, and the Financial District are the most expensive.
Haymarket Square and the North End are the cheapest. We can see clear differences by location.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/aaXYuxtRQPIzdmog9EJgZ.png" width="600">
## 4. Are there price differences between Uber and Lyft rides?

Lyft shows a wider and higher price distribution than Uber, meaning Lyft rides tend to be more expensive.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/IJoqL7w6fYisWdDkcI50q.png" width="600">
# Baseline Model
The goal was to build a simple first model using Linear Regression. I split the data into 80% train / 20% test, encoded categorical variables, selected the features (X), and set price as the target (y).

After training the model, I evaluated it using MAE, MSE, RMSE, and R².

I then reviewed the residual distribution, the Actual vs. Predicted plot, and the feature coefficients to understand model errors and which variables influenced price the most.
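A minimal sketch of this baseline setup with scikit-learn. The file name and column names (`distance`, `surge_multiplier`, `cab_type`, `source`) are illustrative assumptions, not the project's exact schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("rideshare.csv")  # hypothetical file name

# One-hot encode the categorical variables; keep numeric features as-is.
X = pd.get_dummies(df[["distance", "surge_multiplier", "cab_type", "source"]],
                   drop_first=True)
y = df["price"]  # target

# 80% train / 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print(f"MAE  = {mean_absolute_error(y_test, pred):.2f}")
print(f"MSE  = {mse:.2f}")
print(f"RMSE = {np.sqrt(mse):.2f}")
print(f"R²   = {r2_score(y_test, pred):.3f}")
```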
*Model's behavior:*

- Residual distribution: showed how far predictions were from the true values.
- Actual vs. Predicted plot: revealed clear underestimation for high-price rides.
- Coefficient plot: showed that surge_multiplier and distance were the strongest predictors.
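The first two diagnostics can be reproduced with a short matplotlib sketch, continuing from the baseline code above:

```python
import matplotlib.pyplot as plt

residuals = y_test - pred
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residual distribution: centered near 0 is good; long tails mean big misses.
axes[0].hist(residuals, bins=50)
axes[0].set_title("Residual distribution")

# Actual vs. Predicted: points below the diagonal are underestimated rides.
axes[1].scatter(y_test, pred, s=2, alpha=0.3)
lims = [y_test.min(), y_test.max()]
axes[1].plot(lims, lims, "r--")
axes[1].set_xlabel("Actual price")
axes[1].set_ylabel("Predicted price")
axes[1].set_title("Actual vs. Predicted")
plt.show()
```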
### Conclusion

The baseline Linear Regression model captured general trends but struggled with the non-linear structure of the data, especially for expensive rides. The residuals showed noticeable spread, and the R² score confirmed limited explanatory power.

This indicated the need for feature engineering and more advanced models in later stages.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/x8ZkElILIrdkuLfEacnDs.png" width="600">

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/zrT7egOPfGi_psoBCIaIo.png" width="600">

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/_oxvMAfpvxZ4xo7m3BYV-.png" width="600">
# Feature Engineering

For feature engineering, I focused on the numeric columns and defined a list of numeric features that would be used for modeling.

After preparing the base numeric inputs, I generated polynomial features to help the model capture simple non-linear relationships that the original variables alone might miss.

This expanded the feature space and gave the later models more expressive power.
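A sketch of that expansion with scikit-learn's PolynomialFeatures; the numeric column names here are placeholders:

```python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

numeric_cols = ["distance", "surge_multiplier", "temperature"]  # illustrative list

# Degree-2 expansion: adds squares and pairwise interaction terms.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[numeric_cols])

# Scale the expanded features; K-Means below also relies on scaled inputs.
X_scaled = StandardScaler().fit_transform(X_poly)
print(poly.get_feature_names_out(numeric_cols))
```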
## Applying Clustering

To improve the feature set, I used K-Means clustering on the scaled polynomial features. I applied the Elbow Method and found that four clusters offered a good balance between model complexity and explained variation. After fitting K-Means with k=4, I added each ride's cluster label back into the dataset.
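Roughly, the elbow search and final fit look like this, continuing from `X_scaled` above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method: inertia for k = 1..10; the "bend" suggests a good k.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
            for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()

# Final model with the chosen k = 4; labels go back into the dataset.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
df["cluster"] = kmeans.labels_
```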
To better understand the structure of the clusters, I visualized them using PCA for linear dimensionality reduction and UMAP for clearer non-linear separation, both of which clearly displayed distinct cluster groupings.
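A sketch of both projections (UMAP comes from the separate `umap-learn` package):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

# Linear 2-D projection.
pca_2d = PCA(n_components=2).fit_transform(X_scaled)

# Non-linear 2-D projection; often separates clusters more cleanly.
umap_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in [(axes[0], pca_2d, "PCA"), (axes[1], umap_2d, "UMAP")]:
    ax.scatter(emb[:, 0], emb[:, 1], c=kmeans.labels_, s=2, cmap="viridis")
    ax.set_title(title)
plt.show()
```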
Finally, I enhanced the dataset by calculating each ride's distance to its cluster centroid and creating cluster-probability features, which provided the later models with additional information about cluster confidence and structure.
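K-Means itself does not output probabilities, so the exact recipe is the notebook's choice; one common sketch derives pseudo-probabilities from `kmeans.transform` distances via a softmax over their negatives:

```python
import numpy as np

# transform() gives each ride's distance to all four centroids.
dists = kmeans.transform(X_scaled)                       # shape (n_samples, 4)
df["dist_to_centroid"] = dists[np.arange(len(dists)), kmeans.labels_]

# Pseudo-probabilities: a closer centroid gets a higher weight.
weights = np.exp(-dists)
probs = weights / weights.sum(axis=1, keepdims=True)
for c in range(probs.shape[1]):
    df[f"cluster_prob_{c}"] = probs[:, c]
```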
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/BtMycLgbDEOZkHZH14c4D.png" width="600">

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/dItyvpJvX5HMlXcp26kkP.png" width="600">
# Train Three Models

I trained three improved regression models using the engineered dataset: Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.
Each model was fitted on the training data and evaluated on the test set using RMSE, MAE, and R² to measure predictive performance.
All three improved models performed far better than the baseline, reducing error dramatically.
Performance across the three was very similar, with Gradient Boosting achieving the best overall balance of RMSE, MAE, and R², making it the strongest model in this comparison.
Its boosted tree structure allowed it to capture nonlinear interactions more effectively than the other models.
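The comparison loop, sketched with near-default hyperparameters (the real runs may have been tuned) on the engineered train/test split:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
for name, m in models.items():
    m.fit(X_train, y_train)  # engineered feature matrix from the steps above
    pred = m.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: RMSE={rmse:.2f}  MAE={mean_absolute_error(y_test, pred):.2f}  "
          f"R²={r2_score(y_test, pred):.3f}")
```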
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Otb1cFsJT2ZMHRsWdTjWk.png" width="600">
*Gradient Boosting feature importances*

<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/egfcECAhQbl7E_omsY7vT.png" width="600">
# Regression to Classification

To transform the problem from predicting a continuous price into predicting price categories, I converted the numeric target into discrete classes using three different strategies (see the sketch after the list):
- *Median Split* – converted the target into a binary class (0 = below median, 1 = above median).
- *Quantile Binning* – created three balanced classes based on the 33rd and 66th percentiles of the training set.
- *Business-Rule Threshold* – defined “expensive” rides using a simple rule: price > 0.
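A sketch of the three conversions, reusing the `y_train`/`y_test` split from earlier and the thresholds as stated above:

```python
import numpy as np
import pandas as pd

# 1. Median split: binary target.
median = y_train.median()
y_med_train = (y_train > median).astype(int)
y_med_test = (y_test > median).astype(int)

# 2. Quantile binning: three classes from training-set percentiles only,
#    so no information leaks from the test set.
q33, q66 = np.percentile(y_train, [33, 66])
bins = [-np.inf, q33, q66, np.inf]
y_q_train = pd.cut(y_train, bins=bins, labels=[0, 1, 2]).astype(int)
y_q_test = pd.cut(y_test, bins=bins, labels=[0, 1, 2]).astype(int)

# 3. Business rule, exactly as stated in the README.
y_rule_train = (y_train > 0).astype(int)
y_rule_test = (y_test > 0).astype(int)
```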
Before training classification models, I examined the class distributions for train and test to ensure they were reasonably balanced.

Visualizations confirmed that the median split and quantile binning produced well-distributed classes, while the business-rule split created a more imbalanced dataset.
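The quick check itself can be as simple as printing class proportions per target variant:

```python
# Proportion of each class for every target variant.
for name, target in {"median": y_med_train,
                     "quantile": y_q_train,
                     "rule": y_rule_train}.items():
    print(name, target.value_counts(normalize=True).round(3).to_dict())
```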
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/6TBw50gy-mw3bsMMx_B0_.png" width="600">
# Train & Eval Classification Models

After converting the continuous target into categorical classes, three different classifiers from scikit-learn were trained: Logistic Regression, Random Forest Classifier, and Gradient Boosting Classifier.

To keep computation manageable, a 100,000-row subsample of the training data was used. Each model was trained and evaluated using Accuracy, Macro F1-score, and a full classification report, followed by confusion matrix visualizations.
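A sketch of that loop on the quantile-binned target; the subsample size and model choices follow the description above, the hyperparameters are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score

# 100,000-row subsample of the training data.
sample_idx = X_train.sample(n=100_000, random_state=42).index
X_sub, y_sub = X_train.loc[sample_idx], y_q_train.loc[sample_idx]

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_sub, y_sub)
    pred = clf.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_q_test, pred):.3f}  "
          f"macro-F1={f1_score(y_q_test, pred, average='macro'):.3f}")
    print(classification_report(y_q_test, pred))
```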
Logistic Regression showed high confusion between all classes and struggled with the middle class.
Random Forest improved separation but still mixed boundaries, especially for Class 1.
Gradient Boosting delivered the most balanced predictions, with the best stability across all classes.

### Winner: Gradient Boosting Classifier, achieving the strongest overall performance.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/9AHm6ZOqwHH6wKOXGx8Eo.png" width="600">
# Logistic Regression
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/D5ilyilgyJhl0eVeNxopI.png" width="600">
# Random Forest
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/SmrQbCw7eRnmyX6vJXCIH.png" width="600">
# Gradient Boosting
<img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/c9Nl6GPiF3Q5I5Uj1MMpT.png" width="600">