Idankhen committed · Commit 392a205 · verified · Parent(s): f06ead7

Update README.md

Files changed (1): README.md (+119 −7)

README.md CHANGED
@@ -43,17 +43,17 @@ After cleaning the dataset, several visualizations were created to better unders

 *Correlation heatmap*

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Z08ys7YF-nnjaYReVid8R.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Z08ys7YF-nnjaYReVid8R.png" width="600">

 *Distribution plots*

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/JvG5iCw7Muku-TPNeFjGC.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/JvG5iCw7Muku-TPNeFjGC.png" width="600">

 *Scatter Plot*

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/K3aUK7C_smA7SV_wqTIwk.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/K3aUK7C_smA7SV_wqTIwk.png" width="600">
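A minimal sketch of how plots like these can be produced, assuming the cleaned data is a pandas DataFrame loaded from a hypothetical `cab_rides.csv` with `price` and `distance` columns (file path and column names are assumptions, not taken from the repo):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("cab_rides.csv")  # hypothetical path to the cleaned dataset

# Correlation heatmap over the numeric columns.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Distribution plot of the target variable.
sns.histplot(df["price"], bins=50, kde=True)
plt.show()

# Scatter plot of distance against price.
df.plot.scatter(x="distance", y="price", s=2, alpha=0.3)
plt.show()
```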

@@ -64,7 +64,7 @@ The graph shows that average ride prices remain constant throughout the day.

 This indicates that the hour of the day does not affect ride pricing.

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/vKfwMnMYtYmP5RlEIRJSj.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/vKfwMnMYtYmP5RlEIRJSj.png" width="600">
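A small sketch of how such an hourly comparison can be reproduced, assuming an `hour` column extracted from the ride timestamp (the column name is an assumption; the README does not name it):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cab_rides.csv")  # hypothetical path

# Average price per hour of day; a flat line supports the conclusion above.
df.groupby("hour")["price"].mean().plot(marker="o", figsize=(8, 4))
plt.xlabel("Hour of day")
plt.ylabel("Average price ($)")
plt.show()
```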

 ### 2. How do weather conditions affect ride prices?

@@ -73,10 +73,10 @@ Both the temperature scatterplot and the cold-warm comparison showed that the pr

 Temperature doesn't affect ride prices.

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/HP7RnS2rTq7VLBMFS-8TX.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/HP7RnS2rTq7VLBMFS-8TX.png" width="600">

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/YIXjRC95l532mNhe341b-.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/YIXjRC95l532mNhe341b-.png" width="600">

 ## 3. Which pickup locations tend to have higher ride prices?

@@ -85,20 +85,132 @@ Pickup from Boston Uni, Fenway and the Financial District are the most expensive

 Haymarket Square and North End are the cheapest. We can see clear differences by location.

- ![image](https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/aaXYuxtRQPIzdmog9EJgZ.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/aaXYuxtRQPIzdmog9EJgZ.png" width="600">
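A sketch of the underlying comparison, assuming the pickup location is stored in a `source` column (an assumed name):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cab_rides.csv")  # hypothetical path

# Mean price per pickup location, most expensive first.
price_by_pickup = df.groupby("source")["price"].mean().sort_values(ascending=False)
price_by_pickup.plot(kind="bar", figsize=(10, 4))
plt.ylabel("Average price ($)")
plt.show()
```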
+
+
+ ## 4. Are there price differences between Uber and Lyft rides?
+
+ Lyft shows a wider and higher price distribution than Uber, meaning Lyft rides tend to be more expensive.
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/IJoqL7w6fYisWdDkcI50q.png" width="600">
+
+
+ # Baseline Model
+
+ The goal was to build a simple first model using Linear Regression. I split the data into 80% train / 20% test, encoded categorical variables, selected the features (X), and set price as the target (y).
+ After training the model, I evaluated it using MAE, MSE, RMSE, and R².
+ I then reviewed the residual distribution, the Actual vs. Predicted plot, and the feature coefficients to understand model errors and which variables influenced price the most.
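A minimal sketch of the baseline pipeline described above; the file path and encoding details are assumptions, not the repo's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("cab_rides.csv")  # hypothetical path

# One-hot encode categorical variables; price is the target.
X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)
y = df["price"]

# 80% train / 20% test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print(f"MAE  = {mean_absolute_error(y_test, pred):.2f}")
print(f"MSE  = {mse:.2f}")
print(f"RMSE = {np.sqrt(mse):.2f}")
print(f"R²   = {r2_score(y_test, pred):.3f}")
```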
+
+ *Model's behavior* (a diagnostics sketch follows the list):
+
+ - Residual distribution: showed how far predictions were from the true values.
+
+ - Actual vs. Predicted plot: revealed clear underestimation for high-price rides.
+
+ - Coefficient plot: showed that surge_multiplier and distance were the strongest predictors.
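A sketch of those three diagnostics, continuing from the baseline fit above (`model`, `X_test`, `y_test`, and `pred` assumed available):

```python
import matplotlib.pyplot as plt
import pandas as pd

residuals = y_test - pred

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residual distribution: how far predictions fall from the true values.
axes[0].hist(residuals, bins=50)
axes[0].set_title("Residual distribution")

# Actual vs. Predicted: points below the diagonal are underestimated rides.
axes[1].scatter(y_test, pred, s=2, alpha=0.3)
lims = [y_test.min(), y_test.max()]
axes[1].plot(lims, lims, "r--")
axes[1].set(xlabel="Actual price", ylabel="Predicted price", title="Actual vs. Predicted")
plt.tight_layout()
plt.show()

# Coefficients ranked by absolute size.
coefs = pd.Series(model.coef_, index=X_test.columns)
print(coefs.sort_values(key=abs, ascending=False).head(10))
```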
+
+ ### Conclusion
+ The baseline Linear Regression model captured general trends but struggled with the non-linear structure of the data, especially for expensive rides. The residuals showed noticeable spread, and the R² score confirmed limited explanatory power.
+ This indicated the need for feature engineering and more advanced models in later stages.
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/x8ZkElILIrdkuLfEacnDs.png" width="600">
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/zrT7egOPfGi_psoBCIaIo.png" width="600">
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/_oxvMAfpvxZ4xo7m3BYV-.png" width="600">
+
+
+ # Feature Engineering
+ For feature engineering, I focused on the numeric columns and defined a list of numeric features that would be used for modeling.
+ After preparing the base numeric inputs, I generated polynomial features to help the model capture simple non-linear relationships that the original variables alone might miss.
+ This expanded the feature space and gave the later models more expressive power.
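A sketch of the expansion step; the feature list is hypothetical, since the README does not enumerate the numeric columns:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

df = pd.read_csv("cab_rides.csv")  # hypothetical path

# Hypothetical numeric feature list.
numeric_features = ["distance", "surge_multiplier", "temperature"]

# Degree-2 expansion adds squared terms and pairwise interactions.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[numeric_features])
print(poly.get_feature_names_out(numeric_features))

# Scale the expanded features (K-Means in the next step needs this).
X_scaled = StandardScaler().fit_transform(X_poly)
```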
+
+
+ ## Applying Clustering
+ To improve the feature set, I used K-Means clustering on the scaled polynomial features. I applied the Elbow Method and found that four clusters offered a good balance between model complexity and explained variation. After fitting K-Means with k=4, I added each ride's cluster label back into the dataset.
+ To better understand the structure of the clusters, I visualized them using PCA for linear dimensionality reduction and UMAP for non-linear separation, both of which clearly displayed distinct cluster groupings.
+ Finally, I enhanced the dataset by calculating each ride's distance to its cluster centroid and creating cluster-probability features, which provided the later models with additional information about cluster confidence and structure.
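A sketch of the clustering features, assuming `X_scaled` from the previous step. K-Means itself only yields hard assignments, so the "probability" features below use a softmax over negative centroid distances, which is an assumption about how they were derived:

```python
import numpy as np
from scipy.special import softmax
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Elbow Method: inspect inertia over a range of k and look for the bend
# (found at k=4 above).
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
            for k in range(2, 9)]

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
labels = kmeans.labels_  # cluster label per ride, added back to the dataset

# Distance from each ride to every centroid; the row-min is its own centroid.
dists = kmeans.transform(X_scaled)
dist_to_centroid = dists.min(axis=1)

# Probability-like memberships (assumed construction, see note above).
cluster_probs = softmax(-dists, axis=1)

# 2-D view for plotting the clusters (UMAP would need the umap-learn package).
coords = PCA(n_components=2).fit_transform(X_scaled)
```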
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/BtMycLgbDEOZkHZH14c4D.png" width="600">
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/dItyvpJvX5HMlXcp26kkP.png" width="600">
+
+
+ # Train Three Models
+
+ I trained three improved regression models using the engineered dataset: Linear Regression, Random Forest Regressor, and Gradient Boosting Regressor.
+ Each model was fitted on the training data and evaluated on the test set using RMSE, MAE, and R² to measure predictive performance.
+ All three improved models performed far better than the baseline, reducing error dramatically.
+ Performance across the three was very similar, with Gradient Boosting achieving the best overall balance of RMSE, MAE, and R², making it the strongest model in this comparison.
+ Its boosted tree structure allowed it to capture nonlinear interactions more effectively than the other models.
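A sketch of the comparison loop, assuming the engineered splits `X_train`, `X_test`, `y_train`, `y_test`; hyperparameters are illustrative defaults, not the repo's settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the training data and score it on the held-out test set.
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name:18s} RMSE={rmse:.2f} "
          f"MAE={mean_absolute_error(y_test, pred):.2f} "
          f"R²={r2_score(y_test, pred):.3f}")
```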
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/Otb1cFsJT2ZMHRsWdTjWk.png" width="600">
+
+
+ *Gradient Boosting feature importances*
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/egfcECAhQbl7E_omsY7vT.png" width="600">
+
+
+
+ # Regression to Classification
+ To transform the problem from predicting a continuous price into predicting price categories, I converted the numeric target into discrete classes using three different strategies (a conversion sketch appears below):
+
+ - *Median Split* – converted the target into a binary class (0 = below median, 1 = above median).
+
+ - *Quantile Binning* – created three balanced classes based on the 33% and 66% percentiles of the training set.
+
+ - *Business-Rule Threshold* – defined “expensive” rides using a simple rule: price > 0.
+
+ Before training classification models, I examined the class distributions for train and test to ensure they were reasonably balanced.
+ Visualizations confirmed that the median split and quantile binning produced well-distributed classes, while the business-rule split created a more imbalanced dataset.
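A sketch of the three conversions, assuming continuous targets `y_train` and `y_test` as pandas Series; thresholds are always taken from the training set:

```python
import numpy as np
import pandas as pd

# 1) Median split -> binary classes.
median = y_train.median()
y_train_med = (y_train > median).astype(int)
y_test_med = (y_test > median).astype(int)

# 2) Quantile binning -> three classes cut at the train 33% / 66% percentiles.
q33, q66 = y_train.quantile([0.33, 0.66])
bins = [-np.inf, q33, q66, np.inf]
y_train_q = pd.cut(y_train, bins=bins, labels=[0, 1, 2])
y_test_q = pd.cut(y_test, bins=bins, labels=[0, 1, 2])

# 3) Business-rule threshold, as stated above (price > 0).
y_train_biz = (y_train > 0).astype(int)
y_test_biz = (y_test > 0).astype(int)

# Check class balance before training.
print(y_train_q.value_counts(normalize=True))
```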
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/6TBw50gy-mw3bsMMx_B0_.png" width="600">
+
+
+ # Train & Eval Classification Models
+
+ After converting the continuous target into categorical classes, three different classifiers from scikit-learn were trained: Logistic Regression, Random Forest Classifier, and Gradient Boosting Classifier.
+ To keep computation manageable, a 100,000-row subsample of the training data was used. Each model was trained and evaluated using Accuracy, Macro F1-score, and a full classification report, followed by confusion matrix visualizations.
+ Logistic Regression showed high confusion between all classes and struggled with the middle class.
+ Random Forest improved separation but still mixed boundaries, especially for Class 1.
+ Gradient Boosting delivered the most balanced predictions, with the best stability across all classes.
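A sketch of the training loop, assuming the quantile-binned targets from the previous step and DataFrame splits `X_train`, `X_test`; model settings are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score,
                             classification_report, ConfusionMatrixDisplay)

# 100,000-row subsample of the training data keeps fitting time manageable.
idx = X_train.sample(n=100_000, random_state=42).index
X_sub, y_sub = X_train.loc[idx], y_train_q.loc[idx]

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, clf in classifiers.items():
    clf.fit(X_sub, y_sub)
    pred = clf.predict(X_test)
    print(name,
          f"accuracy={accuracy_score(y_test_q, pred):.3f}",
          f"macro-F1={f1_score(y_test_q, pred, average='macro'):.3f}")
    print(classification_report(y_test_q, pred))
    ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test_q)  # confusion matrix plot
```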
+
+ ### Winner: Gradient Boosting Classifier, achieving the strongest overall performance.
+
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/9AHm6ZOqwHH6wKOXGx8Eo.png" width="600">
+
+
+ # Logistic Regression
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/D5ilyilgyJhl0eVeNxopI.png" width="600">
+
+
+ # Random Forest
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/SmrQbCw7eRnmyX6vJXCIH.png" width="600">
+
+
+ # Gradient Boosting
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6914bfee85498cde4e532078/c9Nl6GPiF3Q5I5Uj1MMpT.png" width="600">