KalsusEvening committed on
Commit e269e5f · verified · 1 Parent(s): 5e33d14

Create README.md

Files changed (1)
  1. README.md +329 -0
README.md ADDED
@@ -0,0 +1,329 @@
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - regression
5
+ - classification
6
+ - housing-prices
7
+ - gradient-boosting
8
+ - sklearn
9
+ - clustering
10
+ ---
11
+
12
+ # Melbourne Housing Price Prediction
13
+
14
+ ## 📹 Video Presentation
15
+
16
+ [YOUR VIDEO LINK HERE - Add after recording]
17
+
18
+ ---
19
+
20
+ ## 📋 Project Overview
21
+
22
+ This project builds a complete machine learning pipeline to predict Melbourne housing prices using both **regression** (exact price) and **classification** (price category) models.
23
+
24
+ | | |
25
+ |---|---|
26
+ | **Dataset** | Melbourne Housing Snapshot (Kaggle) |
27
+ | **Original Size** | 13,580 properties, 21 features |
28
+ | **Final Size** | 11,139 properties (82% retained) |
29
+ | **Target** | Price |
30
+
31
+ ### Goals
32
+ 1. Build a baseline regression model and improve it through feature engineering
33
+ 2. Apply K-Means clustering to discover property segments
34
+ 3. Convert the price target into categories and train classification models
35
+ 4. Compare models and identify the best performers
36
+
37
+ ---
38
+
39
+ ## 📊 Part 1-2: Exploratory Data Analysis
40
+
41
+ ### Data Cleaning Summary
42
+
43
+ | Step | Action | Impact |
44
+ |------|--------|--------|
45
+ | Missing Values | Dropped BuildingArea (47% missing), YearBuilt (40% missing) | - |
46
+ | Imputation | Car (median), CouncilArea (mode) | ~1,200 rows |
47
+ | Outliers | Removed using IQR method | 2,441 rows |
48
+ | **Final** | **11,139 rows retained** | **18% removed** |
49
+
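The IQR rule in the table above can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper name `remove_iqr_outliers`, the conventional multiplier `k=1.5`, and the choice of columns to filter are all assumptions, since the README does not state them.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # A row survives only if it is inside the fences for all columns checked so far.
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Toy data (hypothetical): one extreme price that the rule should drop.
toy = pd.DataFrame({"Price": [500_000, 650_000, 700_000, 800_000, 900_000, 9_000_000]})
cleaned = remove_iqr_outliers(toy, ["Price"])
print(len(toy), "->", len(cleaned))
```

Filtering on several columns at once removes a row if it is an outlier in any of them, which is one plausible way a rule like this ends up discarding a sizeable share of the data.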
50
+ ### Price Distribution
51
+
52
+ ![Price Distribution](./01_price_distribution.png)
53
+
54
+ **Statistics:** Mean $1.12M | Median $976K | Range $131K - $3.42M
55
+
56
+ ---
57
+
58
+ ### Research Question 1: Property Type vs Price
59
+
60
+ ![Price by Type](./02_price_by_type.png)
61
+
62
+ | Type | Count | Mean Price |
63
+ |------|-------|------------|
64
+ | House | 9,055 (81%) | $1,203,259 |
65
+ | Townhouse | 952 (9%) | $936,054 |
66
+ | Unit | 1,132 (10%) | $640,529 |
67
+
68
+ **Finding:** Houses cost about $563K more than units on average ($1,203,259 vs $640,529).
69
+
70
+ ---
71
+
72
+ ### Research Question 2: Distance from CBD
73
+
74
+ ![Price vs Distance](./03_price_vs_distance.png)
75
+
76
+ | Distance | Avg Price |
77
+ |----------|-----------|
78
+ | 0-5 km | $1,361K |
79
+ | 10-15 km | $1,046K |
80
+ | 30+ km | $597K |
81
+
82
+ **Finding:** Each additional 5 km from the CBD is associated with an average price roughly $100-150K lower. Correlation between Distance and Price: -0.31
83
+
84
+ ---
85
+
86
+ ### Research Question 3: Regional Price Differences
87
+
88
+ ![Price by Region](./04_price_by_region.png)
89
+
90
+ **Finding:** Southern Metropolitan commands a 3.5× price premium over Western Victoria.
91
+
92
+ ---
93
+
94
+ ### Correlation Analysis
95
+
96
+ ![Correlation Heatmap](./05_correlation_heatmap.png)
97
+
98
+ **Top Correlations with Price:**
99
+ - Rooms: +0.41
100
+ - Bathroom: +0.40
101
+ - Distance: -0.31
102
+
103
+ ---
104
+
105
+ ## 📈 Part 3: Baseline Model
106
+
107
+ | Metric | Value |
108
+ |--------|-------|
109
+ | Algorithm | Linear Regression |
110
+ | Features | 7 numeric |
111
+ | R² Score | 0.4048 |
112
+ | MAE | $323,527 |
113
+ | RMSE | $425,453 |
114
+
115
+ **Interpretation:** The model explains only about 40% of price variance, leaving significant room for improvement through feature engineering.
116
+
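For readers who want to reproduce the three metrics in the table, here is a minimal sketch of a linear-regression baseline. It runs on synthetic data, so the numbers it prints are unrelated to the results above. The 80/20 split and the random seeds are assumptions, not settings taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 7))                      # stand-in for the 7 numeric features
coef = rng.normal(size=7)
y = X @ coef + rng.normal(scale=2.0, size=500)     # noisy linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)                        # share of variance explained
mae = mean_absolute_error(y_test, pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_test, pred))   # penalises large errors more than MAE
print(f"R2={r2:.4f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```

RMSE is never smaller than MAE, so a large gap between the two (as in the table) points to a minority of large errors.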
117
+ ---
118
+
119
+ ## 🔧 Part 4: Feature Engineering
120
+
121
+ ### Features Expanded: 7 → 43
122
+
123
+ | Category | Features | Count |
124
+ |----------|----------|-------|
125
+ | Original Numeric | Rooms, Distance, Bathroom, etc. | 7 |
126
+ | One-Hot Encoded | Type, Method, Regionname | 16 |
127
+ | Derived Features | Ratios, indicators, bins | 7 |
128
+ | Cluster Features | Labels + distances to centroids | 8 |
129
+ | **Total** | | **43** |
130
+
131
+ ### New Derived Features
132
+
133
+ | Feature | Purpose |
134
+ |---------|---------|
135
+ | Rooms_per_Bathroom | Property efficiency |
136
+ | Total_Spaces | Overall size indicator |
137
+ | Land_per_Room | Land generosity |
138
+ | Is_Inner_City | Location premium flag |
139
+ | Luxury_Score | Amenity indicator |
140
+
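The README names these features but does not give their formulas. The sketch below shows one plausible reading for four of them. Every formula here is an assumption inferred from the feature names, including the 10 km inner-city threshold, and `add_derived_features` is a hypothetical helper. `Luxury_Score` is left out because its definition cannot be inferred from the name.

```python
import numpy as np
import pandas as pd


def add_derived_features(df: pd.DataFrame, inner_city_km: float = 10.0) -> pd.DataFrame:
    """Return a copy of df with assumed versions of four derived features."""
    out = df.copy()
    # Treat 0 bathrooms/rooms as missing so the ratios do not become infinite.
    bathrooms = out["Bathroom"].replace(0, np.nan)
    rooms = out["Rooms"].replace(0, np.nan)
    out["Rooms_per_Bathroom"] = out["Rooms"] / bathrooms
    out["Total_Spaces"] = out["Rooms"] + out["Bathroom"] + out["Car"]
    out["Land_per_Room"] = out["Landsize"] / rooms
    out["Is_Inner_City"] = (out["Distance"] < inner_city_km).astype(int)
    return out


# Two hypothetical listings.
toy = pd.DataFrame(
    {
        "Rooms": [3, 2],
        "Bathroom": [1, 0],
        "Car": [2, 1],
        "Landsize": [600.0, 0.0],
        "Distance": [5.0, 15.0],
    }
)
features = add_derived_features(toy)
print(features[["Rooms_per_Bathroom", "Total_Spaces", "Land_per_Room", "Is_Inner_City"]])
```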
141
+ ---
142
+
143
+ ### K-Means Clustering (k=4)
144
+
145
+ ![Elbow Method](./07_elbow_method.png)
146
+
147
+ We used the Elbow Method and Silhouette Score to determine k=4 clusters.
148
+
149
+ ![Cluster Profiles](./08_cluster_profiles.png)
150
+
151
+ | Cluster | Profile | Avg Price | Avg Distance | Avg Rooms |
152
+ |---------|---------|-----------|--------------|-----------|
153
+ | 0 | Compact Inner Units | $835K | 8.0 km | 2.0 |
154
+ | 1 | Premium Family Estates | $1.48M | 12.9 km | 4.2 |
155
+ | 2 | Outer Suburban Affordable | $998K | 13.7 km | 3.0 |
156
+ | 3 | Inner City Houses | $1.18M | 7.9 km | 3.1 |
157
+
158
+ **Key Insight:** The cluster profiles suggest two distinct pricing drivers:
159
+ - **Location premium** (Clusters 0 & 3): Close to CBD
160
+ - **Size premium** (Clusters 1 & 2): Larger properties
161
+
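A minimal sketch of the clustering step, run on synthetic blobs rather than the housing data. It shows how inertia (for the elbow plot) and the silhouette score can be computed per candidate `k`, and how 8 cluster features can be derived for `k=4`. The split into 4 one-hot labels plus 4 centroid distances is an assumption consistent with the feature count above, and the scaling step and seeds are also assumed.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X)   # K-Means is distance based, so scale first

# Inertia (elbow method) and silhouette score for each candidate k.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, round(km.inertia_, 1), round(silhouette_score(X_scaled, km.labels_), 3))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
labels = pd.get_dummies(kmeans.labels_, prefix="Cluster")    # 4 one-hot label columns
distances = pd.DataFrame(
    kmeans.transform(X_scaled),                              # distance to each centroid
    columns=[f"Dist_to_Cluster_{i}" for i in range(4)],
)
cluster_features = pd.concat([labels, distances], axis=1)
print(cluster_features.shape)
```

Fitting K-Means on the training split only, then calling `transform` on the test split, avoids leaking test information into these features.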
162
+ ---
163
+
164
+ ## 🎯 Part 5: Improved Regression Models
165
+
166
+ ![Regression Comparison](./06_regression_comparison.png)
167
+
168
+ | Model | R² Score | MAE | R² Improvement vs. Baseline |
169
+ |-------|----------|-----|-------------|
170
+ | Baseline Linear Reg | 0.4048 | $323,527 | - |
171
+ | Improved Linear Reg | 0.6302 | $244,654 | +55.7% |
172
+ | Random Forest | 0.7752 | $178,455 | +91.5% |
173
+ | **Gradient Boosting** | **0.7900** | **$172,891** | **+95.1%** |
174
+
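A comparison like the one in the table can be set up with a small loop over estimators. The sketch below uses synthetic non-linear data and default hyperparameters, both of which are assumptions. The notebook's actual settings are not stated in this README, and the scores printed here will not match the table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 5))
# Non-linear target: a linear model cannot capture the squared term.
y = np.sin(X[:, 0]) * 3 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {"r2": r2_score(y_test, pred), "mae": mean_absolute_error(y_test, pred)}
    print(f"{name:18s} R2={results[name]['r2']:.3f}  MAE={results[name]['mae']:.3f}")
```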
175
+ ### Feature Importance (Random Forest)
176
+
177
+ ![Feature Importance](./09_feature_importance.png)
178
+
179
+ **Top 5 Most Important Features:**
180
+
181
+ | Rank | Feature | Importance |
182
+ |------|---------|------------|
183
+ | 1 | Regionname_Southern Metropolitan | 0.242 |
184
+ | 2 | Distance | 0.172 |
185
+ | 3 | Type_h (House) | 0.137 |
186
+ | 4 | Dist_to_Cluster_0 | 0.099 |
187
+ | 5 | Landsize | 0.062 |
188
+
189
+ **Key Insights:**
190
+ - Location dominates (Region + Distance)
191
+ - Cluster-derived features appear in the top 15, with Dist_to_Cluster_0 ranked 4th (supports the clustering approach)
192
+ - Engineered features proved valuable
193
+
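For reference, a ranking like the one above can be read from a fitted random forest's `feature_importances_` attribute, which holds impurity-based importances that sum to 1. The sketch uses synthetic data with made-up column names and a target deliberately dominated by one feature. It illustrates the mechanics only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["Distance", "Landsize", "Rooms", "Noise"])
# Synthetic target driven mostly by Distance, so it should rank first.
y = -5.0 * X["Distance"] + 1.0 * X["Rooms"] + rng.normal(scale=0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.round(3))
```

Impurity-based importances can be split across correlated or one-hot encoded columns, so the ranks of individual dummy variables are best read with some caution.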
194
+ ---
195
+
196
+ ## 🏆 Part 6: Regression Winner
197
+
198
+ ### Gradient Boosting Regressor
199
+
200
+ | Metric | Value |
201
+ |--------|-------|
202
+ | R² Score | 0.7900 |
203
+ | MAE | $172,891 |
204
+ | RMSE | $252,728 |
205
+ | R² Improvement over Baseline | +95.1% |
206
+
207
+ **Why Gradient Boosting Won:**
208
+ - Captures non-linear relationships
209
+ - Sequential learning corrects errors iteratively
210
+ - Best balance of accuracy and generalization
211
+
212
+ **Saved as:** `regression_model_gradient_boosting.pkl`
213
+
214
+ ---
215
+
216
+ ## 🔄 Part 7: Regression to Classification
217
+
218
+ We converted continuous Price into 3 balanced categories using quantile binning:
219
+
220
+ ![Class Distribution](./11_class_distribution.png)
221
+
222
+ | Class | Price Range | Count | Percentage |
223
+ |-------|-------------|-------|------------|
224
+ | Low | < $800K | 3,593 | 32.3% |
225
+ | Medium | $800K - $1.24M | 3,759 | 33.7% |
226
+ | High | > $1.24M | 3,787 | 34.0% |
227
+
228
+ **Balance:** The imbalance ratio (largest class count / smallest class count) is 1.05, so the classes are well balanced.
229
+
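Quantile binning into three classes can be sketched with `pandas.qcut`. The example uses synthetic log-normal prices, so its bin edges are not the thresholds in the table. The class labels match the README, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
prices = pd.Series(rng.lognormal(mean=13.8, sigma=0.4, size=3000), name="Price")

# q=3 splits at the 1/3 and 2/3 quantiles; retbins=True also returns the bin edges.
price_class, edges = pd.qcut(prices, q=3, labels=["Low", "Medium", "High"], retbins=True)

counts = price_class.value_counts()
imbalance_ratio = counts.max() / counts.min()   # largest class / smallest class
print(counts.to_dict(), edges.round(-3), round(imbalance_ratio, 3))
```

Computing the bin edges on the training split and reusing them on the test split keeps the class definition independent of the test data.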
230
+ ---
231
+
232
+ ### Precision vs Recall Analysis
233
+
234
+ **For housing price prediction, Precision is more important:**
235
+
236
+ | Error Type | Meaning | Consequence |
237
+ |------------|---------|-------------|
238
+ | **False Positive** | Predict High, actually Low | Buyer overpays significantly |
239
+ | False Negative | Predict Low, actually High | Seller underprices |
240
+
241
+ **Conclusion:** False Positives are worse for buyers - prioritize Precision.
242
+
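With three classes, precision and recall are computed per class. Precision for High is the share of properties predicted High that really are High. Recall for High is the share of truly High properties that the model found. The sketch below works this out on ten hand-made labels (hypothetical data) using scikit-learn's `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

classes = ["Low", "Medium", "High"]
y_true = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High", "High"]
y_pred = ["Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High", "High", "Medium"]

# labels=classes fixes the order of the returned per-class arrays.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=classes)
for name, p, r in zip(classes, precision, recall):
    print(f"{name:6s} precision={p:.2f} recall={r:.2f}")
```

Here four properties are predicted High and three of them are truly High, so precision for High is 0.75. The one Medium property predicted as High is the false positive that the table above warns about.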
243
+ ---
244
+
245
+ ## 📊 Part 8: Classification Models
246
+
247
+ ![Classification Comparison](./10_classification_comparison.png)
248
+
249
+ | Model | Accuracy |
250
+ |-------|----------|
251
+ | Logistic Regression | 71.1% |
252
+ | Random Forest | 77.1% |
253
+ | **Gradient Boosting** | **78.9%** |
254
+
255
+ ### Winner Performance: Gradient Boosting Classifier
256
+
257
+ | Class | Precision | Recall | F1-Score |
258
+ |-------|-----------|--------|----------|
259
+ | Low | 0.85 | 0.86 | 0.85 |
260
+ | Medium | 0.69 | 0.70 | 0.70 |
261
+ | High | 0.83 | 0.81 | 0.82 |
262
+
263
+ **Observations:**
264
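+ - The Medium class is the hardest to predict (F1 0.70) because it borders both Low and High
265
+ - Precision of 0.83 for the High class means about 83% of properties predicted High are actually High, which limits the false positives that hurt buyers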
266
+ - Gradient Boosting wins both regression AND classification
267
+
268
+ **Saved as:** `classification_model_gradient_boosting.pkl`
269
+
270
+ ---
271
+
272
+ ## 📁 Repository Files
273
+
274
+ | File | Description | Size |
275
+ |------|-------------|------|
276
+ | `regression_model_gradient_boosting.pkl` | Regression model (R²=0.79) | 419 KB |
277
+ | `classification_model_gradient_boosting.pkl` | Classification model (78.9%) | 1.15 MB |
278
+ | `scaler.pkl` | Regression StandardScaler | 2.32 KB |
279
+ | `classification_scaler.pkl` | Classification StandardScaler | 2.32 KB |
280
+ | `feature_names.pkl` | 43 feature names | 762 B |
281
+ | `Assignment_2_....ipynb` | Complete Jupyter notebook | 6.48 MB |
282
+
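The README does not show how the saved artifacts are meant to be used together. The sketch below illustrates one plausible workflow with a toy model: save a scaler, a model and the feature-name list, reload them, and apply the scaler before predicting. It assumes the `.pkl` files were written with Python's `pickle` module (they may instead have been written with `joblib`) and that the scaler was fitted on the same feature columns the model expects. The file names inside the temporary directory are placeholders, not the repository's files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
feature_names = ["Rooms", "Distance", "Landsize"]        # toy stand-in for the 43 names
X = rng.normal(size=(200, len(feature_names)))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=200)

scaler = StandardScaler().fit(X)
model = GradientBoostingRegressor(random_state=3).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    artifacts = {"model.pkl": model, "scaler.pkl": scaler, "feature_names.pkl": feature_names}
    for filename, obj in artifacts.items():
        with open(Path(tmp) / filename, "wb") as fh:
            pickle.dump(obj, fh)

    # Reload everything from disk, as a consumer of the repository would.
    loaded = {}
    for filename in artifacts:
        with open(Path(tmp) / filename, "rb") as fh:
            loaded[filename] = pickle.load(fh)

# New rows must have the columns in the saved order, and be scaled before predicting.
new_rows = rng.normal(size=(3, len(loaded["feature_names.pkl"])))
reloaded_pred = loaded["model.pkl"].predict(loaded["scaler.pkl"].transform(new_rows))
print(reloaded_pred)
```

Pickled scikit-learn objects are generally only safe to load with a compatible library version, and pickle files from untrusted sources should not be loaded at all.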
283
+ ---
284
+
285
+ ## 💡 Key Takeaways
286
+
287
+ ### What Worked Well
288
+ 1. **Feature engineering was crucial** - Linear Regression R² improved from 0.40 to 0.63 (+55.7%)
289
+ 2. **Clustering added value** - 4 cluster features ranked in the top 15 by importance
290
+ 3. **Ensemble methods excel** - Gradient Boosting won both tasks
291
+ 4. **Location is paramount** - Region and distance dominate predictions
292
+
293
+ ### Challenges
294
+ 1. Medium price class hardest to predict (boundary cases)
295
+ 2. Multicollinearity between Rooms and Bedroom2 (0.94 correlation)
296
+ 3. Right-skewed price distribution required careful outlier handling
297
+
298
+ ### Lessons Learned
299
+ 1. Always establish a baseline before feature engineering
300
+ 2. EDA guides modeling decisions
301
+ 3. Clustering reveals hidden patterns
302
+ 4. The same algorithm can perform dramatically differently with good features
303
+
304
+ ---
305
+
306
+ ## 📊 Final Summary
307
+
308
+ | Task | Baseline | Final Model | Improvement |
309
+ |------|----------|-------------|-------------|
310
+ | Regression R² | 0.4048 | 0.7900 | +95.1% |
311
+ | Regression MAE | $323,527 | $172,891 | -46.6% |
312
+ | Classification Accuracy | - | 78.9% | - |
313
+ | Features Used | 7 | 43 | +36 |
314
+
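The Improvement figures appear to be relative changes against the baseline, that is (final - baseline) / baseline. A quick check of that reading, using the rounded values from the table (the helper `relative_change` is a hypothetical name):

```python
def relative_change(baseline: float, final: float) -> float:
    """Percentage change of `final` relative to `baseline`."""
    return (final - baseline) / baseline * 100


r2_change = relative_change(0.4048, 0.7900)
mae_change = relative_change(323_527, 172_891)
print(f"R2: {r2_change:+.1f}%  MAE: {mae_change:+.1f}%")
```

Because the table values are rounded, the R² figure recomputed this way can differ from the reported one in the last decimal place. The MAE change reproduces the reported -46.6%.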
315
+ ---
316
+
317
+ ## 👤 Author
318
+
319
+ **David Wilfand**
320
+
321
+ Assignment #2: Classification, Regression, Clustering, Evaluation
322
+
323
+ ---
324
+
325
+ ## 📚 References
326
+
327
+ - **Dataset:** [Melbourne Housing Snapshot - Kaggle](https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot)
328
+ - **Tools:** scikit-learn, pandas, numpy, matplotlib, seaborn
329
+ - **Algorithms:** Linear Regression, Random Forest, Gradient Boosting, K-Means