---
language:
  - en
tags:
  - salary-prediction
  - regression
  - classification
  - clustering
  - tabular
  - scikit-learn
  - stack-overflow
  - developer-survey
  - feature-engineering
  - gradient-boosting
datasets:
  - stack-overflow-developer-survey-2025
base_model: None
---

# Assignment 2 – Developer Salary Prediction
### Stack Overflow Developer Survey 2025 | Classification, Regression & Clustering

---

## Video

<video src="https://huggingface.co/BentoUniAcc/Stack_Overflow_Salary_Predicting_Model/resolve/main/Data%20Analysis%20Assignment%202%20Video%20Project%201.mp4" controls="controls" style="max-width: 720px;"></video>

---

## Overview

This project uses the **Stack Overflow Annual Developer Survey 2025** (49,123 responses, 170 features) to predict a software developer's annual salary. The pipeline covers end-to-end data science: exploratory analysis, feature engineering, unsupervised clustering, regression, and multi-class classification.

**Research Question:** Can we predict a software developer's annual salary from their professional profile, and which factors matter most?

---

## Dataset

| Property | Value |
|----------|-------|
| Source | Stack Overflow Developer Survey 2025 (Kaggle) |
| Raw rows | 49,123 |
| Raw columns | 170 |
| Target column | `ConvertedCompYearly` (annual salary in USD) |
| Final feature count | 253 (after engineering + cluster feature) |

---

## Part 1 – Setup

- Environment: Google Colab compatible
- Reproducibility seed: `SEED = 42`
- Key libraries: `pandas`, `numpy`, `scikit-learn`, `matplotlib`
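The reproducibility setup above can be sketched as follows (a minimal sketch; the `set_seed` helper is illustrative, not code taken from the notebook):

```python
import random

import numpy as np

SEED = 42  # single seed reused for every split, model, and clustering run

def set_seed(seed: int = SEED) -> None:
    """Seed the Python and NumPy RNGs; scikit-learn estimators additionally
    take random_state=SEED per call."""
    random.seed(seed)
    np.random.seed(seed)

set_seed()
```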

---

## Part 2 – Exploratory Data Analysis

### 2.1 Data Cleaning

- Removed rows with no salary value
- Clipped extreme outliers at the 1st and 99th percentiles (final median salary ≈ $75K)
- Removed 44 columns with >60% missing values (170 → 126 columns), protecting the top-15 salary correlates regardless of missingness
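The cleaning steps above can be sketched roughly like this; the column name and thresholds follow the card, while the helper itself is illustrative:

```python
import pandas as pd

def clean_salary_frame(df: pd.DataFrame,
                       target: str = "ConvertedCompYearly",
                       protected: tuple = (),
                       max_missing: float = 0.60) -> pd.DataFrame:
    """Drop rows without a salary, clip the target to the 1st-99th
    percentiles, and drop columns above the missingness threshold
    (protected columns are kept regardless)."""
    out = df.dropna(subset=[target]).copy()
    lo, hi = out[target].quantile([0.01, 0.99])
    out[target] = out[target].clip(lo, hi)
    missing = out.isna().mean()
    drop = [c for c in out.columns
            if missing[c] > max_missing and c not in protected and c != target]
    return out.drop(columns=drop)
```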

### 2.2 Missing Value Analysis


![01_Bar_chart_top-40_columns_by_missing](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/Yidxz76808ecCkC9bDRFM.png)

Several columns exceed 60% missingness and are dropped. The protected essential columns are retained despite high missingness and imputed later.

### 2.3 Descriptive Statistics

| Statistic | Value |
|-----------|-------|
| Median salary | ~$75,000 |
| Distribution | Right-skewed |
| Outlier treatment | 1st–99th percentile clip |

### 2.4 Salary Distribution


![02_Distribution_of_Annual_Developer_Salary](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/pL0L7cvLkouusXvuTrFHp.png)

The raw distribution is heavily right-skewed with a long tail above $200K.

### 2.5 Research Questions & Findings

**Q1: Does coding experience predict salary?**


![03_Does_Coding_Experience_Predict_Salary](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/98VvwfqURZ5RcnpdbIIku.png)

Salary increases steeply through the first 15–20 years of experience then flattens. There is wide variance at every experience level, suggesting experience alone is not sufficient to predict salary.

**Q2: How does education level affect salary?**


![04_How_Does_Education_Level_Affect_Salary](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/ZE5DovlexhruZZ4n6oQm8.png)

Median salary rises with education level, but the gap between a Bachelor's and Master's degree is smaller than expected. Professional degrees and doctoral holders show the highest median salaries.

**Q3: Which countries pay developers the most?**


![05_Which_Countries_Pay_Developers_the_Most](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/Ly4_3GC1EKt7agGdNPTD1.png)

The US dominates with a median salary roughly 2–3× the global median. Israeli, Western European, and Australian developers cluster in a second tier, while developers in Asia and South America earn less.

**Q4: Do remote workers earn more?**


![06_Do_Remote_Workers_Earn_More](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/xExDwYtGmmheimXVMfnby.png)

Fully remote developers show a slight salary premium over hybrid and in-office roles. The difference is modest, suggesting remote work correlates with higher-paying companies rather than being a direct cause.

**Q5: How does salary vary across developer roles?**


![07_How_Does_Salary_Vary_Across_Developer_Roles](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/4zMNkvffCW-SZ0t3JT6-C.png)

C-Suite and ML/Data Science roles have the widest salary ranges and highest medians. Full-stack and front-end developers cluster around the global median with less variance.

### 2.6 Final Feature Selection (~20 Columns)

| Category | Features |
|----------|----------|
| Target | `ConvertedCompYearly` |
| Numeric | `YearsCode`, `WorkExp`, `JobSat`, `JobSatPoints_11`, `JobSatPoints_4` |
| Demographics | `Age`, `Country`, `EdLevel`, `MainBranch` |
| Work profile | `Employment`, `RemoteWork`, `DevType`, `OrgSize` |
| Tech & AI | `LanguageHaveWorkedWith`, `AISelect` |
| Learning | `LearnCodeChoose`, `SOVisitFreq` |

### 2.7 EDA Takeaways

1. **Salary** is right-skewed; median ~$75K after cleaning
2. **Work experience** (`WorkExp`) and **coding experience** (`YearsCode`) are the strongest numeric predictors
3. **Country** is the dominant signal – geography explains more variance than any other feature
4. **Remote work** carries a small positive premium
5. **Developer role and education** have meaningful but secondary effects

---

## Part 3 – Baseline Model

A simple Linear Regression trained on raw numeric columns only – no encoding, no feature engineering.

### Train/Test Split
- 80/20 random split, `SEED=42`
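Under the stated split and seed, the baseline amounts to something like the following (data loading is assumed; `baseline_regression` is an illustrative helper, not the notebook's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

SEED = 42

def baseline_regression(X, y) -> dict:
    """80/20 random split, plain LinearRegression, MAE / RMSE / R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=SEED)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return {
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
        "R2": r2_score(y_te, pred),
    }
```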

### Results

| Metric | Baseline |
|--------|----------|
| MAE    | $45,810  |
| RMSE   | $61,947  |
| R²     | 0.1598   |

### Predicted vs. Actual


![08_Plot_1_Predicted_vs_Actual](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/6rJTbIavUrv0mQz2cZz4B.png)

The baseline model struggles with high earners – predictions cluster around the mean and fail to capture the upper salary range. The scatter is wide, consistent with an R² of only 0.16.

---

## Part 4 – Feature Engineering & Clustering

### Engineering Steps

| Step | Description |
|------|-------------|
| 4.1 Numeric features | Derived ratio/interaction features |
| 4.2 Ordinal encoding | `EdLevel`, `OrgSize` mapped to integers |
| 4.3 One-hot encoding | `Country`, `RemoteWork`, `Employment`, `MainBranch`, `AISelect`, `SOVisitFreq`, `Age`, `PrimaryDevType` |
| 4.4 Language flags | Binary flag for each of the top-10 programming languages |
| 4.5 Imputation & scaling | Median imputation + `StandardScaler` → 249 features |
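Step 4.4 (language flags) can be sketched as below; the `lang_` column prefix and the naive substring match are assumptions, not necessarily the notebook's exact implementation:

```python
import pandas as pd

def add_language_flags(df: pd.DataFrame,
                       col: str = "LanguageHaveWorkedWith",
                       top_n: int = 10) -> pd.DataFrame:
    """Split the semicolon-delimited language column and add one 0/1 flag
    per top-N language (e.g. a hypothetical lang_Python column)."""
    langs = df[col].str.split(";").explode().str.strip()
    top = langs.value_counts().head(top_n).index
    out = df.copy()
    for lang in top:
        # naive substring check; fine for a sketch, not for exact matching
        out[f"lang_{lang}"] = (df[col].fillna("")
                               .str.contains(lang, regex=False).astype(int))
    return out
```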

### KMeans Elbow Method


![09_46_KMeans_Elbow](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/BE9Tv-2Wu7kJuA1j7JNDd.png)

The inertia curve decreases gradually without a sharp elbow, reflecting the high-dimensional and overlapping nature of the data. k=4 was selected as a reasonable balance between cluster granularity and interpretability.

### Silhouette Score Comparison


![10_Silhouette_scores_across_k_for_each_clustering_method](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/X4f8cT2a8e9baBeyVGmnA.png)

Silhouette scores are low across all values of k, confirming that natural cluster separation is weak in this dataset. Agglomerative clustering consistently outperforms KMeans, peaking around k=4.

### Three Clustering Algorithms

| Algorithm | k / params | Silhouette |
|-----------|-----------|------------|
| KMeans | k=4 | 0.0109 |
| DBSCAN | eps=5, min_samples=25 | 0.0912 (7 clusters) |
| Agglomerative Ward | k=4 | **0.0224** |

The data's high dimensionality (249 features) makes density-based clustering (DBSCAN) impractical – inter-point distances are too large for meaningful core-point detection.
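The KMeans-vs-Agglomerative comparison at a fixed k can be sketched as follows (`compare_clusterers` is an illustrative helper; only k=4, Ward linkage, and the seed come from the card):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

SEED = 42

def compare_clusterers(X, k: int = 4) -> dict:
    """Silhouette scores for KMeans vs. Ward agglomerative at a fixed k."""
    km = KMeans(n_clusters=k, random_state=SEED, n_init=10).fit(X)
    agg = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(X)
    return {
        "kmeans": silhouette_score(X, km.labels_),
        "agglomerative": silhouette_score(X, agg.labels_),
    }
```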

### Cluster Visualisations (PCA 2D)


![11_48_Separate_scatter_plots](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/DFPqTW9luodTCEWaLOJI1.png)

KMeans splits the data into four roughly equal blobs with significant overlap in the PCA projection. The clusters correspond loosely to salary level but boundaries are indistinct.

![12_48_Separate_scatter_plots](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/GmVq3G3FJO2ucpXWGywAl.png)

DBSCAN classifies the vast majority of points as noise, forming 7 clusters. High dimensionality makes distance-based density estimation ineffective on this dataset.


![13_48_Separate_scatter_plots](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/j6b9SMso4DovtAbKgqnSj.png)

Agglomerative clustering produces the clearest separation, isolating a distinct high-salary cluster on the right of the PCA plot. The four tiers align visually with low, mainstream, high-mid, and elite salary groups.

### Cluster Profiles – Agglomerative (Chosen)

| Cluster | Mean Salary | Median Salary | Count |
|---------|------------|---------------|-------|
| 0 | $86,041 | $74,000 | 18,192 |
| 1 | $109,980 | $95,000 | 4,069 |
| 2 | $27,574 | $13,949 | 877 |
| 3 | $101,735 | $93,387 | 317 |

**Winner: Agglomerative Ward (k=4)** – the best silhouette score among the methods that assign every point to a cluster (DBSCAN's higher score is computed after labelling most points as noise), plus four interpretable salary tiers (low-income, mainstream, high-mid, elite).

### Cluster Feature Added

`cluster_id` one-hot encoded and appended → **253 final features**
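Appending the cluster feature might look like this (the `cluster_` column prefix and helper name are assumptions):

```python
import numpy as np
import pandas as pd

def append_cluster_dummies(X: pd.DataFrame, labels) -> pd.DataFrame:
    """One-hot encode cluster labels and append them as extra feature columns."""
    dummies = pd.get_dummies(pd.Series(labels, index=X.index),
                             prefix="cluster").astype(int)
    return pd.concat([X, dummies], axis=1)
```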

---

## Part 5 – Improved Regression Models

Three models trained on the full 253-feature matrix (249 engineered features + 4 cluster dummies).

### Results

| Model | MAE | RMSE | R² |
|-------|-----|------|----|
| Baseline Linear Regression | $45,810 | $61,947 | 0.1598 |
| Improved Linear Regression | $30,688 | $44,314 | 0.5701 |
| Random Forest (200 trees) | $31,998 | $45,784 | 0.5411 |
| **HistGradientBoosting (300 iters)** | **$28,991** | **$43,039** | **0.5944** |

### Model Performance Comparison


![14_Comparison_table](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/UBAzzh-I7z_S46Oq2vySN.png)

HistGradientBoosting wins on all three metrics. The jump from baseline to improved linear regression is dramatic – encoding Country alone accounts for the majority of the R² improvement from 0.16 to 0.57.

### Feature Importance


![15_Feature_importance_for_all_three_models](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/mqxEApvV_TWicLuvpR2sh.png)

Country dummies (especially US) dominate feature importance across all three models. Work experience and years of coding rank consistently high. The cluster feature appears in the top 20 for linear regression, validating the clustering step.

### Winning Model – Predicted vs. Actual


![16_Declare_the_winner_based_on_R²_highest_and_MAE_lowest](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/yNJ302EBP1qFzXOaT4awP.png)

The HistGradientBoosting model tracks the perfect-prediction diagonal much more closely than the baseline. It still under-predicts some very high earners above $300K but captures the mid-range salary distribution well.

### Discussion

- **Baseline → Improved Linear Regression (+0.41 R²):** One-hot encoding `Country` was the single biggest improvement. Geography is the dominant salary signal.
- **Random Forest vs. Linear Regression:** Non-linear feature interactions (e.g. senior developer × US location) are captured naturally by trees.
- **HistGradientBoosting wins:** Sequential boosting focuses on the hardest predictions. It natively handles missing values and is 10–100× faster than standard GradientBoosting.
- **Cluster feature:** Pre-computed salary-tier signal from Part 4 particularly boosts Linear Regression.

### Winner: HistGradientBoosting Regressor

| Metric | Value |
|--------|-------|
| MAE | $28,991 |
| RMSE | $43,039 |
| R² | 0.5944 |

---

## Part 6 – Winning Regression Model Export

The winning regression model is saved to `winning_model_regression.pkl`.
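A sketch of the export/reload round-trip; `pickle` is an assumed choice of serialisation library, while the filename comes from the card:

```python
import pickle

def export_model(model, path: str) -> None:
    """Serialise a fitted estimator to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path: str):
    """Reload a previously exported estimator."""
    with open(path, "rb") as f:
        return pickle.load(f)
```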

---

## Part 7 – Salary Classification Setup

The continuous salary target is binned into four ordered classes:

| Class | Label | Range (USD/year) |
|-------|-------|------------------|
| 0 | Low | < $30,000 |
| 1 | Mid | $30,000 – $90,000 |
| 2 | High | $90,000 – $160,000 |
| 3 | Very High | > $160,000 |
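The binning above can be reproduced with `pd.cut`; assigning each boundary value to the left-closed bin (so exactly $30,000 is Mid) is an assumption, since the table leaves boundary membership ambiguous:

```python
import pandas as pd

BIN_EDGES = [-float("inf"), 30_000, 90_000, 160_000, float("inf")]
LABELS = ["Low", "Mid", "High", "Very High"]

def bin_salary(salary: pd.Series) -> pd.Series:
    """Map continuous salary onto the four ordered classes; right=False
    makes bins left-closed (boundary convention assumed)."""
    return pd.cut(salary, bins=BIN_EDGES, labels=LABELS, right=False)
```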

### Class Distribution


![17_Bar_chart](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/lYp3HW7LWzddsmpPvnsDt.png)

Mid-salary developers make up nearly 40% of the dataset. Very High earners are the smallest class at 13.6%, creating a mild class imbalance that the models must handle.

### Salary Distribution per Class


![18_Salary_Distribution_per_Class](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/DcYHbvYjHLwj9qqq07te8.png)

Each bin shows a clean salary range with minimal overlap at the boundaries, confirming the thresholds were well-chosen. The Very High class has the widest spread, reflecting high variability among top earners.

---

## Part 8 – Classification Models

Same 253-feature matrix as regression, with a stratified 80/20 train/test split.

### Precision vs. Recall & False Positives vs. False Negatives

**Recall is prioritised over precision** in this task. Misclassifying a developer into a lower salary tier (a false negative) carries real-world cost (under-negotiation, poor benchmarking, missed career leverage), whereas a false positive (over-predicting a tier) is relatively benign.

**False Negatives are more critical than False Positives.** Predicting "Mid" when a developer is truly "High" or "Very High" obscures their earning potential. For this reason, evaluation uses **weighted F1-score**, which balances precision and recall across all four classes with particular attention to recall in the minority tiers (Low and Very High).


### Results

| Model | Accuracy | F1 (weighted) |
|-------|----------|---------------|
| Logistic Regression | 0.598 | 0.597 |
| Random Forest (200 trees) | 0.605 | 0.597 |
| **HistGradientBoosting (300 iters)** | **0.611** | **0.610** |

### Classification Model Comparison


![19_Summary_table](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/d8PLnQGx-aXyF4cHVlMXJ.png)

HistGradientBoosting leads on both accuracy and weighted F1, though the margin between all three models is narrow. The gap is larger on F1, reflecting better handling of the minority classes.

### Confusion Matrices


![20_graph](https://cdn-uploads.huggingface.co/production/uploads/69d8c774af594a45bf54cc48/C8kfKc3bb0tQLheqNEaBM.png)

All three models struggle most with the High class ($90K–$160K), frequently confusing it with Mid. HistGradientBoosting shows the best recall on the Low and Very High tiers, the most actionable classes, with misclassifications mostly occurring between adjacent salary bands.

### Per-Class Performance – HistGradientBoosting (Winner)

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| Low (<$30K) | 0.66 | 0.76 | 0.71 |
| Mid ($30K–$90K) | 0.69 | 0.59 | 0.64 |
| High ($90K–$160K) | 0.53 | 0.49 | 0.51 |
| Very High (>$160K) | 0.51 | 0.67 | 0.58 |

### Winner: HistGradientBoosting Classifier

Sequential boosting handles the sparse one-hot encoded feature space well, focuses capacity on the most difficult salary boundaries, and outperforms both Logistic Regression and Random Forest on accuracy and F1.

---

## Final Model Files

| File | Contents |
|------|----------|
| `winning_model_regression.pkl` | HistGradientBoosting Regressor (MAE $28,991, R² 0.59) |
| `winning_model_classifier.pkl` | HistGradientBoosting Classifier (Accuracy 0.61, F1 0.61) |