---
license: mit
tags:
- regression
- classification
- clustering
- xgboost
- random-forest
- salary-prediction
- developer-survey
- stack-overflow
language:
- en
---
# Stack Overflow Salary Prediction - Developer Survey 2024
## Project Demo
[Watch the demo video here](/razsarusi/stackoverflow-salary-prediction/blob/main/./%D7%A1%D7%A8%D7%98%D7%95%D7%9F%20%D7%9E%D7%A9%D7%99%D7%9E%D7%94%202%20%D7%93%D7%90%D7%98%D7%94%20%D7%A1%D7%99%D7%99%D7%A0%D7%A1.mp4)
> A 6-minute walkthrough covering the entire project: data exploration,
> feature engineering, model training, and key insights.
## Project Overview
This project predicts annual developer compensation (salary) based on factors
like experience, location, technologies, education, and AI tool adoption.
The data comes from the Stack Overflow Annual Developer Survey 2024,
covering 65,437 developers worldwide.
## Objectives
1. **Regression**: Predict exact salary in USD
2. **Classification**: Categorize developers into salary tiers (Low/Medium/High)
3. **Clustering**: Discover natural developer segments
## Dataset
- **Source**: Stack Overflow Annual Developer Survey 2024
- **Size**: 65,437 rows × 114 columns (raw)
- **After cleaning**: 22,765 rows × 68 features
- **Target**: `ConvertedCompYearly` (USD)
- **Available on Kaggle**: <https://www.kaggle.com/datasets/berkayalan/stack-overflow-annual-developer-survey-2024>
## Key Findings (EDA)
### Target Variable Analysis
- 23,435 valid salary responses (35.8% of dataset)
- Highly right-skewed distribution
- Range: $1 to $16,256,603 (extreme outliers exist)
- Median salary: $65,000
- Mean salary: $86,155
- 97.1% in realistic range ($1K-$500K)
- **Decision:** Apply log transformation + filter outliers

*Salary distribution showing extreme right-skewness and the value of log transformation for modeling.*
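
The decision above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the `clean_target` helper and the sample values are hypothetical, the $1K and $500K bounds are the ones stated under Data Preprocessing, and `log1p` is assumed because the usage example inverts predictions with `np.expm1`.

```python
import math

LOW, HIGH = 1_000, 500_000  # outlier bounds stated in this README


def clean_target(salaries):
    """Drop missing or out-of-range salaries; return (kept, log1p-transformed)."""
    kept = [s for s in salaries if s is not None and LOW <= s <= HIGH]
    return kept, [math.log1p(s) for s in kept]


# Hypothetical sample that includes the extremes reported in the EDA
raw = [1, 25_000, 65_000, 141_000, None, 16_256_603]
kept, log_target = clean_target(raw)

# expm1 inverts log1p, so log-scale predictions can be mapped back to USD
restored = [math.expm1(v) for v in log_target]
```

Training on `log_target` compresses the long right tail, so a handful of very high salaries no longer dominates the squared-error loss.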
### Data Structure
- 100 categorical (object) columns
- 13 float columns
- 1 integer column (ResponseId)
- Most predictive features need conversion from text to numeric
### Top Paying Countries (by median salary)
| Rank | Country | Median Salary | Sample Size |
|------|---------|--------------|-------------|
| 1 | USA | $141,000 | 4,596 |
| 2 | Israel | $113,334 | 217 |
| 3 | Switzerland | $111,417 | 385 |
| 4 | Australia | $95,796 | 505 |
| 5 | Ireland | $91,295 | 120 |
| 6 | Denmark | $88,993 | 211 |
| 7 | Canada | $87,231 | 861 |
| 8 | UK | $84,038 | 1,376 |
**Key insight**: Geographic location is the most powerful predictor of salary.
The same role can earn 5-10x more in the US/Israel/Switzerland compared to
emerging economies.
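
A ranking like the table above could be produced along these lines. The sketch uses only the standard library; `median_by_group`, the `min_n` cutoff, and the sample rows are hypothetical and only illustrate the group-then-median step.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, min_n=1):
    """rows: iterable of (group, salary).

    Returns [(group, median_salary, sample_size)] sorted by median, descending.
    Groups with fewer than min_n rows are dropped to avoid noisy medians.
    """
    groups = defaultdict(list)
    for group, salary in rows:
        groups[group].append(salary)
    table = [(g, median(v), len(v)) for g, v in groups.items() if len(v) >= min_n]
    return sorted(table, key=lambda t: t[1], reverse=True)


# Hypothetical rows, not survey data
rows = [
    ("USA", 120_000), ("USA", 141_000), ("USA", 160_000),
    ("India", 15_000), ("India", 25_000),
    ("Germany", 75_000),
]
ranking = median_by_group(rows, min_n=2)
```

Reporting the sample size next to each median matters here because some countries in the table have only a few hundred respondents.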
### Geographic Salary Variance
Boxplot analysis revealed:
- **USA**: Median $140K with high variance ($100K-$200K interquartile range),
many high-end outliers reaching $500K+
- **Western Europe** (Germany, UK): Median $70-85K, moderate variance
- **Eastern Europe** (Poland, Ukraine): Median $35-55K, but with significant
high-end outliers (likely remote workers for foreign companies)
- **Emerging markets** (India, Brazil): Median $15-25K, low variance
- **Salary range from highest to lowest country median: ~10x difference**

*Salary distributions across the top 10 countries (by sample size). USA dominates both in median salary and variance.*
### Top Paying Developer Roles
| Rank | Role | Median Salary |
|------|------|---------------|
| 1 | Senior Executive (C-Suite, VP) | $120K |
| 2 | Engineering Manager | $115K |
| 3 | Engineer, Site Reliability (SRE) | $98K |
| 4 | Cloud Infrastructure Engineer | $96K |
| 5 | Security Professional | $80K |
| 6 | Data Engineer | $77K |
| 7 | Developer, AI | $75K |
| 8 | Data Scientist / ML Specialist | $73K |
| 9 | Back-end Developer | $68K |
| 10 | Full-stack Developer | $64K |
**Key insights**:
- **Specialization pays**: Infrastructure roles (SRE, Cloud) earn 30-50% more
than general development roles
- **Management track**: Engineering managers and executives top the list
- **Counter-intuitive finding**: AI Developer ranks 7th, not at top despite
the AI hype - market still developing
- **Full-stack paradox**: Largest group (18,260 respondents) but the lowest median
among the ten roles listed above, suggesting market saturation

*Top 15 developer roles ranked by median salary. Note how specialized infrastructure roles (SRE, Cloud) outperform general development roles.*
### Experience vs Salary Relationship
- Overall correlation: **0.38** (moderate, due to country variance)
- Career growth pattern observed:
  - Years 0-10: Steep growth ($25K → $78K, 3x increase)
  - Years 10-20: Continued growth ($78K → $95K)
  - Years 20+: Plateau effect (~$100-110K, role-dependent)
- **Within-country correlations range from 0.27 to 0.44** (listed in the next section), so they are not uniformly stronger than the overall correlation
- Median professional experience in dataset: 8 years

*The career growth curve: rapid early growth followed by plateau effect after ~20 years.*
### Country-Experience Interaction (Simpson's Paradox)
Within-country correlations between experience and salary:
- Germany: 0.438 (highest - structured market)
- India: 0.394 (experience matters)
- USA: 0.319 (role/company matter more)
- Canada: 0.299
- UK: 0.271
**Insight**: The same career trajectory yields vastly different outcomes
depending on geography. A junior developer in the USA ($65K) earns more than a
senior developer in India ($45K after 25 years). This makes country a
critical feature for the model.

*The "geography is destiny" effect: same experience yields drastically different salaries across countries.*
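
The difference between a pooled correlation and per-country correlations can be sketched as follows. The `pearson` helper and the two toy countries are hypothetical; the numbers are chosen to make the effect visible and do not reproduce the survey figures.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy data: (years of experience, salary in $K) per country
data = {
    "A": ([1, 5, 10, 20], [60, 90, 120, 150]),
    "B": ([1, 5, 10, 20], [10, 15, 20, 25]),
}

# Correlation computed separately inside each country
within = {country: pearson(x, y) for country, (x, y) in data.items()}

# Correlation computed after stacking all countries together
all_x = [v for x, _ in data.values() for v in x]
all_y = [v for _, y in data.values() for v in y]
pooled = pearson(all_x, all_y)
```

In this toy case the pooled value is lower than either within-country value, because the gap in salary level between the two countries adds variance that experience cannot explain.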
### Technology Indicators (Linear Correlation with Salary)
| Technology | Users | % of respondents | Correlation |
|------------|-------|------------------|-------------|
| AWS | 9,894 | 43.5% | **+0.139** |
| Go | 3,388 | 14.9% | +0.087 |
| Rust | 2,853 | 12.5% | +0.082 |
| Copilot | 8,203 | 36.0% | +0.060 |
| Scala | 669 | 2.9% | +0.058 |
| Azure | 5,825 | 25.6% | +0.047 |
| Python | 11,142 | 48.9% | +0.044 |
| Kubernetes | 4,180 | 18.4% | -0.004 |
| Docker | 11,591 | 50.9% | -0.002 |
| **ChatGPT** | 14,827 | 65.1% | **-0.102** |
**Insights**:
- **AWS is the strongest single technology indicator** - likely because
AWS adoption correlates with established tech companies in higher-paying countries
- **Docker, Kubernetes, Terraform show ~0 linear correlation** despite
being valuable skills - they have become industry standards (commoditized)
- **ChatGPT usage is negatively correlated** - consistent with junior
developers relying more on AI tools than senior engineers
- These features still provide value through **non-linear interactions**
in tree-based models (Random Forest, XGBoost)
### Key Predictive Features Identified
- **YearsCodePro** - Years of professional coding experience
- **Country** - Geographic location (massive impact)
- **EdLevel** - Education level (8 ordered categories)
- **DevType** - Developer role type (34 categories - needs grouping)
- **OrgSize** - Company size (10 ordered categories)
- **RemoteWork** - Remote/Hybrid/In-person
## Methodology
### Data Preprocessing
- Kept rows with valid salary data (65,437 → 23,435 rows; 22,765 remain after outlier removal)
- Removed extreme outliers (<$1K and >$500K)
- Applied log transformation to target (handles right-skewed distribution)
- Converted text-based numeric columns (YearsCode, YearsCodePro)
- Median imputation for missing experience values
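
The text-to-numeric conversion and median imputation might look like the sketch below. It is not the notebook's code: the survey stores experience as strings that include answers such as "Less than 1 year" and "More than 50 years", and the numeric stand-ins chosen here (0.5 and 51) as well as the helper names are assumptions.

```python
from statistics import median

# Assumed numeric stand-ins for the two text answers (illustrative choice)
TEXT_YEARS = {"Less than 1 year": 0.5, "More than 50 years": 51.0}


def parse_years(value):
    """Convert a YearsCodePro-style answer to float; None if missing or unparseable."""
    if value is None:
        return None
    value = str(value).strip()
    if value in TEXT_YEARS:
        return TEXT_YEARS[value]
    try:
        return float(value)
    except ValueError:
        return None


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


raw = ["3", "Less than 1 year", None, "12", "NA", "More than 50 years"]
years = impute_median([parse_years(v) for v in raw])
```

In a real pipeline the median should be computed on the training split only and reused for new data, so that test rows do not influence the fill value.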
### Feature Engineering
- **Ordinal Encoding**: EdLevel (8 levels), OrgSize (10 sizes), Age (8 groups)
- **Country Grouping**: 185 countries → 11 regions based on geography and economy
- **DevType Grouping**: 34 roles → 7 broader categories
- **Multi-select handling**:
  - Created 5 binary indicators for Employment status
  - Count features for technologies (num_languages, num_databases, etc.)
  - Binary flags for high-value technologies (uses_AWS, uses_Python, etc.)
- **One-Hot Encoding**: Applied to Region, DevCategory, RemoteWork, Industry
- **Final dataset**: 22,765 samples × 68 features
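
The multi-select and one-hot steps listed above can be sketched as follows. The helpers are hypothetical, and the sketch assumes that multi-select answers arrive as semicolon-separated strings in a column such as `LanguageHaveWorkedWith`; the output names follow the `num_*`, `uses_*`, and `Region_*` patterns used elsewhere in this README.

```python
def multiselect_features(answer, flagged, prefix):
    """Turn a semicolon-separated answer into a count plus binary flags."""
    items = {a.strip() for a in answer.split(";") if a.strip()} if answer else set()
    feats = {f"num_{prefix}": len(items)}
    for name in flagged:
        feats[f"uses_{name}"] = int(name in items)
    return feats


def one_hot(value, categories, prefix):
    """One binary column per known category; an unseen value yields all zeros."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Hypothetical respondent
row = {"LanguageHaveWorkedWith": "Python;Go;SQL", "Region": "Western_Europe"}

features = {}
features.update(
    multiselect_features(row["LanguageHaveWorkedWith"], ["Python", "Rust"], "languages")
)
features.update(one_hot(row["Region"], ["North_America", "Western_Europe"], "Region"))
```

Counts capture breadth of a stack while flags capture specific technologies, which is why both kinds of feature can coexist without being redundant.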
### Models Trained
- **Regression**: Linear Regression, Random Forest, XGBoost
- **Classification**: Logistic Regression, Random Forest, XGBoost
- **Clustering**: K-Means with K=4 (chosen via Silhouette analysis)
## Results
### Regression Model Performance
| Model | R² (log) | R² ($) | MAE ($) | RMSE ($) | Training Time |
|-------|----------|--------|---------|----------|---------------|
| Linear Regression | 0.5319 | 0.4333 | 30,917 | 49,592 | <1s |
| Random Forest | 0.5698 | 0.5121 | 28,005 | 46,018 | 30s |
| **XGBoost (best)** | **0.5840** | **0.5326** | **27,513** | **45,039** | **2.6s** |
**Best Model: XGBoost** with R² = 0.5326 on the dollar scale (explains 53% of salary variance)
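
The table reports R² on two scales because the models are fit on log salary and then converted back to dollars. The sketch below shows how the metrics could be computed on each scale; the helper functions and the four sample salaries are hypothetical, and `log1p`/`expm1` are assumed as the transform pair.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical true salaries and log-scale model outputs
y_usd = [30_000, 60_000, 90_000, 200_000]
pred_log = [math.log1p(v) for v in (33_000, 55_000, 99_000, 150_000)]

y_log = [math.log1p(v) for v in y_usd]
pred_usd = [math.expm1(v) for v in pred_log]

r2_log = r2(y_log, pred_log)   # scale the model is trained on
r2_usd = r2(y_usd, pred_usd)   # scale the predictions are used on
```

The dollar-scale score is usually the lower of the two, since a large miss on one high earner counts far more in dollars than in log units.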
### Feature Importance Analysis
Top features driving predictions (XGBoost):
| Rank | Feature | Importance |
|------|---------|------------|
| 1 | Region_North_America | 36.58% |
| 2 | Region_Western_Europe | 8.89% |
| 3 | Region_Asia_Developing | 6.86% |
| 4 | Region_Asia_Pacific_Developed | 4.62% |
| 5 | YearsCodePro | 3.28% |
### Feature Importance by Category
| Category | Total Importance |
|----------|------------------|
| **Region (Geography)** | **67.0%** |
| Tech indicators | 6.9% |
| Experience | 5.7% |
| Industry | 5.5% |
| Employment status | 4.9% |
| Other | 3.4% |
| Developer Category | 3.1% |
| Tech counts | 1.9% |
| Demographics | 1.6% |
**Key insight**: Geography is the dominant predictor (67%), confirming our EDA finding
that location matters more than skills, experience, or role for salary determination.
The same developer in different regions can have 5-10x salary differences.

*Top 20 most important features in XGBoost. Region_North_America alone accounts for 36.6% of total feature importance.*
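
Category totals like the ones above can be obtained by summing per-feature importances under a prefix rule. The sketch below is an assumption about how that grouping might be done: `importance_by_category` and the prefix map are hypothetical, and only the five importances listed in the top-features table are used, so the Region total is lower than the 67.0% reported for all region columns.

```python
from collections import defaultdict


def importance_by_category(importances, prefix_map, default="Other"):
    """Sum feature importances per category, matching features by name prefix."""
    totals = defaultdict(float)
    for feature, value in importances.items():
        category = next(
            (cat for prefix, cat in prefix_map if feature.startswith(prefix)),
            default,
        )
        totals[category] += value
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Top five importances (%) as listed in this README
top5 = {
    "Region_North_America": 36.58,
    "Region_Western_Europe": 8.89,
    "Region_Asia_Developing": 6.86,
    "Region_Asia_Pacific_Developed": 4.62,
    "YearsCodePro": 3.28,
}
prefix_map = [("Region_", "Region (Geography)"), ("Years", "Experience")]
grouped = importance_by_category(top5, prefix_map)
```

Grouping matters for one-hot encoded variables, because a single concept such as region is spread over many columns whose individual importances understate its combined weight.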
### Classification Model Performance
Salary categorized into 3 classes (33%/33%/33%):
- **Low**: < $46,185
- **Medium**: $46,185 - $91,719
- **High**: > $91,719
| Model | Accuracy |
|-------|----------|
| Logistic Regression | 68.72% |
| Random Forest | 69.38% |
| **XGBoost (best)** | **70.39%** |
**Best Classifier: XGBoost** with 70.39% accuracy (vs 33% baseline)
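
The tier assignment can be written as a small function using the two cutoffs listed above. The function name and the boundary handling (both cutoffs fall into Medium, following the ranges as written) are assumptions; `statistics.quantiles` with `n=3` is shown as one way such tertile cutoffs could be derived from data.

```python
from statistics import quantiles

LOW_MAX = 46_185    # below this: Low
HIGH_MIN = 91_719   # above this: High


def salary_tier(salary_usd):
    """Map a salary in USD to Low / Medium / High using the README cutoffs."""
    if salary_usd < LOW_MAX:
        return "Low"
    if salary_usd <= HIGH_MIN:
        return "Medium"
    return "High"


tiers = [salary_tier(s) for s in (30_000, 46_185, 91_719, 150_000)]

# Deriving two tertile cut points from a hypothetical salary sample
sample = [20_000, 35_000, 50_000, 65_000, 80_000, 95_000, 120_000, 180_000, 250_000]
tertile_edges = quantiles(sample, n=3)
```

With equal-sized tiers, always guessing one class gives roughly 33% accuracy, which is the baseline the 70.39% figure is compared against.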
### Per-Class Performance (XGBoost)
| Category | Accuracy | Precision | Recall | F1-Score |
|----------|----------|-----------|--------|----------|
| Low | 75.77% | 0.7602 | 0.7577 | 0.7589 |
| High | 74.14% | 0.7595 | 0.7414 | 0.7503 |
| Medium | 61.57% | 0.5999 | 0.6157 | 0.6077 |
**Key insights**:
- Model excels at distinguishing extreme categories (Low/High)
- Misclassifications between Low ↔ High are rare (~4%)
- Medium category is hardest to classify (boundary cases)
- Model tends to predict Medium when uncertain (conservative strategy)

*XGBoost confusion matrix. The model rarely confuses Low with High (~4% error rate), but Medium is harder to classify.*
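
Per-class precision, recall, and F1 follow directly from a confusion matrix. The sketch below uses hypothetical counts, not the project's actual matrix. Note that the Accuracy column in the per-class table equals the Recall column, which is what a row-normalized confusion matrix diagonal gives.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    out = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                    # true class k, predicted otherwise
        fp = sum(row[k] for row in cm) - tp     # predicted k, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = {"precision": precision, "recall": recall, "f1": f1}
    return out


labels = ["Low", "Medium", "High"]
# Hypothetical counts (rows: true class, columns: predicted class)
cm = [
    [76, 20, 4],
    [19, 62, 19],
    [4, 22, 74],
]
metrics = per_class_metrics(cm, labels)
```

The same F1 formula reproduces the reported values: with precision 0.7602 and recall 0.7577 for Low, 2PR / (P + R) comes out at about 0.7589.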
### Clustering Analysis (K-Means, K=4)
K-Means clustering identified 4 distinct developer personas:
| Cluster | Persona | Size | Median Salary | Years Pro | Top Region |
|---------|---------|------|---------------|-----------|------------|
| 0 | **Mainstream Developer** | 45.1% | $58,375 | 6 | Western Europe |
| 1 | **Junior / Eastern Europe** | 10.2% | $42,962 | 6 | Western/Eastern Europe |
| 2 | **Modern Tech Worker** | 25.1% | $66,000 | 7 | North America |
| 3 | **Elite / Senior** | 19.6% | **$105,258** | **22** | North America |
**Key clustering insights**:
- **ChatGPT usage falls with seniority**: Cluster 2 (Modern Tech Worker) has an
88% usage rate, versus 44% for Cluster 3 (Elite / Senior)
- **Cluster 3 (Elite / Senior)** stands out with about 22 years of professional
experience, a North American location, and the highest salary - the "veteran developer" persona
- **Cluster 2 (Modern Tech Worker)** represents AI-era developers who make heavy use
of modern tools (TypeScript, AWS, Copilot, ChatGPT)
- Silhouette scores are low (~0.04), meaning the clusters overlap substantially in the
high-dimensional feature space; the personas remain interpretable but are best read as
descriptive summaries rather than sharply separated groups

*Elbow Method and Silhouette Score analysis used to determine optimal K=4.*

*4 developer personas visualized in 2D using PCA. Despite low variance explained (11.7%), the clusters show meaningful separation.*

*Salary distributions per cluster reveal the clear hierarchy: Elite/Senior cluster has dramatically higher salaries with tighter distribution.*
## Usage
### Loading the Models
```python
import pickle

# Load regression model (predicts log-scale salary; convert to USD with np.expm1)
with open('regression_model.pkl', 'rb') as f:
    reg_model = pickle.load(f)

# Load classification model (predicts Low/Medium/High)
with open('classification_model.pkl', 'rb') as f:
    cls_model = pickle.load(f)

# Load clustering model (assigns to 1 of 4 personas)
with open('kmeans_model.pkl', 'rb') as f:
    kmeans_model = pickle.load(f)

# Load preprocessing tools
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
with open('label_encoder.pkl', 'rb') as f:
    label_encoder = pickle.load(f)
with open('feature_names.pkl', 'rb') as f:
    feature_names = pickle.load(f)
```
### Making Predictions
```python
import numpy as np
# Prepare your features (must match feature_names order)
# X_new must have shape (n_samples, 68)
# Regression prediction (returns log-scale salary)
log_salary_pred = reg_model.predict(X_new)
salary_usd = np.expm1(log_salary_pred) # Convert back to USD
# Classification prediction
class_pred = cls_model.predict(X_new)
class_label = label_encoder.inverse_transform(class_pred) # Low/Medium/High
# Clustering (which persona?)
X_scaled = scaler.transform(X_new)
cluster = kmeans_model.predict(X_scaled)
```
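
Because the models expect columns in the exact order of `feature_names`, it helps to build each input row from a dictionary rather than by position. The helper below is a hypothetical sketch: the three-name list stands in for the real 68-name list, and filling unspecified features with 0 is an assumption that suits binary flags and one-hot columns but not necessarily ordinal or count features.

```python
def to_feature_row(record, feature_names, default=0.0):
    """Order a {feature: value} dict to match feature_names; reject unknown keys."""
    unknown = set(record) - set(feature_names)
    if unknown:
        raise KeyError(f"Unknown features: {sorted(unknown)}")
    return [float(record.get(name, default)) for name in feature_names]


# Stand-in for the list loaded from feature_names.pkl (hypothetical, shortened)
feature_names = ["YearsCodePro", "uses_AWS", "Region_North_America"]

record = {"Region_North_America": 1, "YearsCodePro": 8}
X_new = [to_feature_row(record, feature_names)]  # shape (1, n_features)
```

Wrapping the result with `np.asarray(X_new)` before calling `predict` gives the 2D array shape the prediction code above expects.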
## Project Structure
```
data_science_project/
├── StackOverflow_Salary_Prediction.ipynb   # Main notebook with full pipeline
├── README.md                               # This file
├── models/
│   ├── regression_model.pkl                # XGBoost regressor (1.2 MB)
│   ├── classification_model.pkl            # XGBoost classifier (3.3 MB)
│   ├── kmeans_model.pkl                    # K-Means cluster model (92 KB)
│   ├── scaler.pkl                          # StandardScaler for preprocessing
│   ├── label_encoder.pkl                   # LabelEncoder for class names
│   └── feature_names.pkl                   # List of 68 feature names
└── images/
    ├── 01_salary_distribution.png          # Target variable analysis
    ├── 02_salary_by_country.png            # Country-level boxplot
    ├── 03_top_developer_roles.png          # Roles ranked by salary
    ├── 04_salary_vs_experience.png         # Career growth curve
    ├── 05_experience_by_country.png        # Country comparison
    ├── 06_feature_importance.png           # XGBoost top features
    ├── 07_confusion_matrix.png             # Classification results
    ├── 08_elbow_method.png                 # Optimal K selection
    ├── 09_clusters_pca.png                 # 2D cluster visualization
    └── 10_salary_by_cluster.png            # Salary per persona
```
## Author
**Raz Sarusi**
*Data Science Course Project - Assignment #2*
## Date
*Project completed: May 2026*