Stack Overflow Salary Prediction - Developer Survey 2024

🎬 Project Demo

Watch the demo video here

📺 A 6-minute walkthrough covering the entire project: data exploration, feature engineering, model training, and key insights.

📊 Project Overview

This project predicts annual developer compensation (salary) based on factors like experience, location, technologies, education, and AI tool adoption.

The data comes from the Stack Overflow Annual Developer Survey 2024, covering 65,437 developers worldwide.

🎯 Objectives

  1. Regression: Predict exact salary in USD
  2. Classification: Categorize developers into salary tiers (Low/Medium/High)
  3. Clustering: Discover natural developer segments

📁 Dataset

🔍 Key Findings (EDA)

Target Variable Analysis

  • 23,435 valid salary responses (35.8% of dataset)
  • Highly right-skewed distribution
  • Range: $1 to $16,256,603 (extreme outliers exist)
  • Median salary: $65,000
  • Mean salary: $86,155
  • 97.1% in realistic range ($1K-$500K)
  • Decision: Apply log transformation + filter outliers

Salary Distribution

Salary distribution showing extreme right-skewness and the value of log transformation for modeling.

Data Structure

  • 100 categorical (object) columns
  • 13 float columns
  • 1 integer column (ResponseId)
  • Most predictive features need conversion from text to numeric

Top Paying Countries (by median salary)

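| Rank | Country | Median Salary | Sample Size |
|---|---|---|---|
| 1 | USA | $141,000 | 4,596 |
| 2 | Israel | $113,334 | 217 |
| 3 | Switzerland | $111,417 | 385 |
| 4 | Australia | $95,796 | 505 |
| 5 | Ireland | $91,295 | 120 |
| 6 | Denmark | $88,993 | 211 |
| 7 | Canada | $87,231 | 861 |
| 8 | UK | $84,038 | 1,376 |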

Key insight: Geographic location is the most powerful predictor of salary. The same role can earn 5-10x more in the US/Israel/Switzerland compared to emerging economies.
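
A ranking like the one above can be sketched with a pandas group-by. This is a minimal illustration, not the notebook's actual code: the `Salary` column name, the toy rows, and the `MIN_RESPONDENTS` threshold are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the survey: one row per respondent.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Israel", "Israel", "India"],
    "Salary": [120_000, 141_000, 160_000, 100_000, 126_000, 20_000],
})

# Illustrative threshold (an assumption): medians from tiny samples are noisy.
MIN_RESPONDENTS = 2

# Median salary and respondent count per country.
ranking = df.groupby("Country")["Salary"].agg(
    median_salary="median", sample_size="count"
)

# Drop thin samples, then rank by median.
ranking = ranking[ranking["sample_size"] >= MIN_RESPONDENTS]
ranking = ranking.sort_values("median_salary", ascending=False)
print(ranking)
```

Reporting the sample size next to each median, as the table does, lets readers judge how much weight a country's figure can bear.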

Geographic Salary Variance

Boxplot analysis revealed:

  • USA: Median $140K with high variance ($100K-$200K interquartile range), many high-end outliers reaching $500K+
  • Western Europe (Germany, UK): Median $70-85K, moderate variance
  • Eastern Europe (Poland, Ukraine): Median $35-55K, but with significant high-end outliers (likely remote workers for foreign companies)
  • Emerging markets (India, Brazil): Median $15-25K, low variance
  • Salary range from highest to lowest country median: ~10x difference

Salary by Country

Salary distributions across the top 10 countries (by sample size). USA dominates both in median salary and variance.

Top Paying Developer Roles

| Rank | Role | Median Salary |
|---|---|---|
| 1 | Senior Executive (C-Suite, VP) | $120K |
| 2 | Engineering Manager | $115K |
| 3 | Engineer, Site Reliability (SRE) | $98K |
| 4 | Cloud Infrastructure Engineer | $96K |
| 5 | Security Professional | $80K |
| 6 | Data Engineer | $77K |
| 7 | Developer, AI | $75K |
| 8 | Data Scientist / ML Specialist | $73K |
| 9 | Back-end Developer | $68K |
| 10 | Full-stack Developer | $64K |

Key insights:

  • Specialization pays: Infrastructure roles (SRE, Cloud) earn 30-50% more than general development roles
  • Management track: Engineering managers and executives top the list
  • Counter-intuitive finding: AI Developer ranks only 7th despite the AI hype, which may reflect a market that is still developing
  • Full-stack paradox: Largest group (18,260 respondents) yet ranks last among the ten roles listed above, suggesting market saturation

Top Developer Roles

Top 15 developer roles ranked by median salary. Note how specialized infrastructure roles (SRE, Cloud) outperform general development roles.

Experience vs Salary Relationship

  • Overall correlation: 0.38 (moderate, due to country variance)
  • Career growth pattern observed:
    • Years 0-10: Steep growth ($25K → $78K, 3x increase)
    • Years 10-20: Continued growth ($78K → $95K)
    • Years 20+: Plateau effect (~$100-110K, role-dependent)
  • Within-country correlations range from 0.27 (UK) to 0.44 (Germany), so only some countries exceed the pooled value
  • Median professional experience in dataset: 8 years
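
The growth pattern above can be reproduced by bucketing experience and taking the median salary per bucket. The sketch below uses invented rows and a hypothetical `Salary` column; only `YearsCodePro` and the band edges (10 and 20 years) come from this README.

```python
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({
    "YearsCodePro": [1, 3, 8, 12, 18, 25, 30],
    "Salary": [25_000, 40_000, 78_000, 85_000, 95_000, 105_000, 110_000],
})

# Left-closed bands: [0, 10), [10, 20), [20, inf).
bins = [0, 10, 20, float("inf")]
labels = ["0-10", "10-20", "20+"]
df["ExperienceBand"] = pd.cut(
    df["YearsCodePro"], bins=bins, labels=labels, right=False
)

# Median salary per band, plus the pooled Pearson correlation.
band_medians = df.groupby("ExperienceBand", observed=True)["Salary"].median()
overall_corr = df["YearsCodePro"].corr(df["Salary"])
print(band_medians)
print(round(overall_corr, 3))
```

Medians are used rather than means because the salary distribution is heavily right-skewed.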

Salary vs Experience

The career growth curve: rapid early growth followed by plateau effect after ~20 years.

Country-Experience Interaction (Simpson's Paradox)

Within-country correlations between experience and salary:

  • Germany: 0.438 (highest - structured market)
  • India: 0.394 (experience matters)
  • USA: 0.319 (role/company matter more)
  • UK: 0.271
  • Canada: 0.299

Insight: The same career trajectory yields vastly different outcomes depending on geography. A junior developer in the USA ($65K) earns more than a senior developer in India ($45K after 25 years), which makes country a critical feature for the model.
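
Pooled and per-country correlations can be compared with a few lines of pandas. The data below is a toy example, deliberately built so that a large gap between two countries dilutes the pooled correlation; the real survey values are the ones listed above. The `Salary` column name is an assumption.

```python
import pandas as pd

# Toy data: identical experience profiles, very different pay levels.
df = pd.DataFrame({
    "Country": ["USA"] * 4 + ["India"] * 4,
    "YearsCodePro": [1, 5, 10, 20, 1, 5, 10, 20],
    "Salary": [65_000, 110_000, 150_000, 170_000,
               8_000, 18_000, 30_000, 45_000],
})

# Correlation over all rows, ignoring country.
pooled = df["YearsCodePro"].corr(df["Salary"])

# Correlation computed separately inside each country.
within = {
    country: group["YearsCodePro"].corr(group["Salary"])
    for country, group in df.groupby("Country")
}

print(f"pooled: {pooled:.3f}")
for country, value in within.items():
    print(f"{country}: {value:.3f}")
```

In this toy case the between-country salary gap adds variance that experience cannot explain, so the pooled figure comes out lower than either within-country figure.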

Experience by Country

The "geography is destiny" effect: same experience yields drastically different salaries across countries.

Technology Indicators (Linear Correlation with Salary)

| Technology | Users | Share of Sample | Correlation |
|---|---|---|---|
| AWS | 9,894 | 43.5% | +0.139 |
| Go | 3,388 | 14.9% | +0.087 |
| Rust | 2,853 | 12.5% | +0.082 |
| Copilot | 8,203 | 36.0% | +0.060 |
| Scala | 669 | 2.9% | +0.058 |
| Azure | 5,825 | 25.6% | +0.047 |
| Python | 11,142 | 48.9% | +0.044 |
| Kubernetes | 4,180 | 18.4% | -0.004 |
| Docker | 11,591 | 50.9% | -0.002 |
| ChatGPT | 14,827 | 65.1% | -0.102 |

Insights:

  • AWS is the strongest single technology indicator - likely because AWS adoption correlates with established tech companies in higher-paying countries
  • Docker, Kubernetes, Terraform show ~0 linear correlation despite being valuable skills - they have become industry standards (commoditized)
  • ChatGPT usage is negatively correlated - consistent with junior developers relying more on AI tools than senior engineers
  • These features still provide value through non-linear interactions in tree-based models (Random Forest, XGBoost)
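
One way to obtain such an indicator is to turn a multi-select answer into a 0/1 flag and correlate that flag with salary. The sketch below assumes a hypothetical semicolon-delimited `Platforms` column and a hypothetical `Salary` column. It also assumes the correlation is taken against raw salary; the README does not say whether raw or log salary was used.

```python
import pandas as pd

# Toy rows (invented); missing answers are represented as None.
df = pd.DataFrame({
    "Platforms": ["AWS;Azure", "Azure", None, "AWS"],
    "Salary": [150_000, 60_000, 40_000, 120_000],
})

# 1 if the respondent listed AWS, else 0 (missing answers count as 0).
df["uses_AWS"] = df["Platforms"].apply(
    lambda cell: int(isinstance(cell, str) and "AWS" in cell.split(";"))
)

# Pearson correlation between a binary flag and a numeric target
# (equivalent to the point-biserial correlation).
corr = df["uses_AWS"].corr(df["Salary"])
print(df["uses_AWS"].tolist(), round(corr, 3))
```

Splitting on the delimiter before testing membership avoids false matches on substrings of longer option names.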

Key Predictive Features Identified

  • YearsCodePro - Years of professional coding experience
  • Country - Geographic location (massive impact)
  • EdLevel - Education level (8 ordered categories)
  • DevType - Developer role type (34 categories - needs grouping)
  • OrgSize - Company size (10 ordered categories)
  • RemoteWork - Remote/Hybrid/In-person

🛠️ Methodology

Data Preprocessing

  • Filtered rows with valid salary data (65,437 → 22,765 after outlier removal)
  • Removed extreme outliers (<$1K and >$500K)
  • Applied log transformation to target (handles right-skewed distribution)
  • Converted text-based numeric columns (YearsCode, YearsCodePro)
  • Median imputation for missing experience values
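
The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the `Salary` column name is hypothetical, the numeric values chosen for the text answers (0.5 and 51) are guesses, and the order of filtering before imputing is assumed. `np.log1p` is used because the usage section inverts predictions with `np.expm1`.

```python
import numpy as np
import pandas as pd

# Toy raw data (invented rows).
raw = pd.DataFrame({
    "Salary": [500, 65_000, None, 2_000_000, 141_000],
    "YearsCodePro": ["Less than 1 year", "8", "12", "More than 50 years", None],
})

# 1. Keep rows with a salary inside the $1K-$500K range (NaN is dropped too).
df = raw[raw["Salary"].between(1_000, 500_000)].copy()

# 2. Log-transform the right-skewed target.
df["LogSalary"] = np.log1p(df["Salary"])

# 3. Convert text answers to numbers (mapping values are assumptions).
text_to_years = {"Less than 1 year": 0.5, "More than 50 years": 51}
years = df["YearsCodePro"].map(lambda v: text_to_years.get(v, v))
df["YearsCodePro"] = pd.to_numeric(years, errors="coerce")

# 4. Fill missing experience with the median of the remaining rows.
df["YearsCodePro"] = df["YearsCodePro"].fillna(df["YearsCodePro"].median())
print(df)
```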

Feature Engineering

  • Ordinal Encoding: EdLevel (8 levels), OrgSize (10 sizes), Age (8 groups)
  • Country Grouping: 185 countries → 11 regions based on geography and economy
  • DevType Grouping: 34 roles → 7 broader categories
  • Multi-select handling:
    • Created 5 binary indicators for Employment status
    • Count features for technologies (num_languages, num_databases, etc.)
    • Binary flags for high-value technologies (uses_AWS, uses_Python, etc.)
  • One-Hot Encoding: Applied to Region, DevCategory, RemoteWork, Industry
  • Final dataset: 22,765 samples × 68 features
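
A compact sketch of three of these transformations (ordinal encoding, country-to-region grouping with one-hot encoding, and a technology count) is shown below. The two-level education order, the three-country region map, and the `Languages` column are hypothetical stand-ins for the real 8 levels, 185 countries, and survey columns.

```python
import pandas as pd

# Toy rows (invented).
df = pd.DataFrame({
    "EdLevel": ["Bachelor", "Master", "Bachelor"],
    "Country": ["USA", "Germany", "India"],
    "Languages": ["Python;Go", "Rust", None],
})

# Hypothetical, shortened lookups.
ED_ORDER = ["Bachelor", "Master"]
REGION_OF = {
    "USA": "North_America",
    "Germany": "Western_Europe",
    "India": "Asia_Developing",
}

# Ordinal encoding: position in the ordered list becomes the value.
df["EdLevel_ord"] = df["EdLevel"].map({lvl: i for i, lvl in enumerate(ED_ORDER)})

# Country grouping: unmapped countries fall into a catch-all region.
df["Region"] = df["Country"].map(REGION_OF).fillna("Other")

# Count feature from a semicolon-delimited multi-select answer.
df["num_languages"] = df["Languages"].fillna("").apply(
    lambda s: len(s.split(";")) if s else 0
)

# One-hot encode the region; numeric columns pass through unchanged.
features = pd.get_dummies(
    df[["EdLevel_ord", "num_languages", "Region"]],
    columns=["Region"],
    dtype=int,
)
print(features.columns.tolist())
```

The resulting column names follow the `Region_<name>` pattern that appears in the feature importance table.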

Models Trained

  • Regression: Linear Regression, Random Forest, XGBoost
  • Classification: Logistic Regression, Random Forest, XGBoost
  • Clustering: K-Means with K=4 (chosen via Silhouette analysis)

📈 Results

Regression Model Performance

| Model | R² (log) | R² ($) | MAE ($) | RMSE ($) | Training Time |
|---|---|---|---|---|---|
| Linear Regression | 0.5319 | 0.4333 | 30,917 | 49,592 | <1s |
| Random Forest | 0.5698 | 0.5121 | 28,005 | 46,018 | 30s |
| XGBoost (best) | 0.5840 | 0.5326 | 27,513 | 45,039 | 2.6s |

Best Model: XGBoost with R² = 0.5326 on the dollar scale (explains about 53% of salary variance)
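
The two R² columns differ because one is computed on log-salary and the other after converting predictions back to dollars. The sketch below shows how both, plus MAE and RMSE, could be computed with NumPy alone. The salaries and prediction errors are invented, and `r2` is a helper written for this example.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical true salaries and log-scale predictions.
salary_true = np.array([20_000.0, 45_000.0, 65_000.0, 90_000.0, 250_000.0])
log_true = np.log1p(salary_true)
log_pred = log_true + np.array([0.10, -0.10, 0.05, -0.05, -0.40])

# Back-transform to dollars before computing dollar-scale metrics.
salary_pred = np.expm1(log_pred)

r2_log = r2(log_true, log_pred)
r2_usd = r2(salary_true, salary_pred)
mae_usd = np.mean(np.abs(salary_true - salary_pred))
rmse_usd = np.sqrt(np.mean((salary_true - salary_pred) ** 2))
print(r2_log, r2_usd, mae_usd, rmse_usd)
```

In this toy case the largest log-scale miss falls on the highest salary, and exponentiating magnifies it in dollars, so the dollar-scale R² comes out lower than the log-scale one.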

Feature Importance Analysis

Top features driving predictions (XGBoost):

| Rank | Feature | Importance |
|---|---|---|
| 1 | Region_North_America | 36.58% |
| 2 | Region_Western_Europe | 8.89% |
| 3 | Region_Asia_Developing | 6.86% |
| 4 | Region_Asia_Pacific_Developed | 4.62% |
| 5 | YearsCodePro | 3.28% |

Feature Importance by Category

| Category | Total Importance |
|---|---|
| 🌍 Region (Geography) | 67.0% |
| 💻 Tech indicators | 6.9% |
| ⏰ Experience | 5.7% |
| 🏭 Industry | 5.5% |
| 💼 Employment status | 4.9% |
| 🏒 Other | 3.4% |
| 💼 Developer Category | 3.1% |
| 📊 Tech counts | 1.9% |
| 👀 Demographics | 1.6% |

Key insight: Geography is the dominant predictor (67%), confirming our EDA finding that location matters more than skills, experience, or role for salary determination. The same developer in different regions can have 5-10x salary differences.
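
Category totals like these can be obtained by summing per-feature importances under a naming rule. The sketch below uses only the five features published in the top-features table, so its Region total (about 57%) is lower than the full-model 67%, which also includes region features outside the top five. The `category_of` rule is an assumption about how features might be grouped.

```python
import pandas as pd

# The five importances published above, as fractions.
importances = pd.Series({
    "Region_North_America": 0.3658,
    "Region_Western_Europe": 0.0889,
    "Region_Asia_Developing": 0.0686,
    "Region_Asia_Pacific_Developed": 0.0462,
    "YearsCodePro": 0.0328,
})

def category_of(feature):
    """Map a feature name to a coarse category (hypothetical rule)."""
    if feature.startswith("Region_"):
        return "Region"
    if feature.startswith("Years"):
        return "Experience"
    return "Other"

# Passing a function to groupby applies it to each index label.
by_category = importances.groupby(category_of).sum().sort_values(ascending=False)
print(by_category)
```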

Feature Importance

Top 20 most important features in XGBoost. Region_North_America alone accounts for 36.6% of model decisions.

Classification Model Performance

Salary categorized into 3 classes (33%/33%/33%):

  • Low: < $46,185
  • Medium: $46,185 - $91,719
  • High: > $91,719

| Model | Accuracy |
|---|---|
| Logistic Regression | 68.72% |
| Random Forest | 69.38% |
| XGBoost (best) | 70.39% |

Best Classifier: XGBoost with 70.39% accuracy (vs 33% baseline)
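
Equal-sized tiers of this kind are typically derived from the tertiles of the salary distribution, which is also why the baseline is about 33%: with three balanced classes, always predicting one class is right a third of the time. A minimal sketch with `pandas.qcut` follows, using invented salaries; the cut-offs in this README came from the real data.

```python
import pandas as pd

# Toy salaries (invented), already sorted for readability.
salaries = pd.Series(
    [12_000, 30_000, 46_000, 60_000, 75_000, 91_000, 120_000, 150_000, 200_000]
)

# Split into three equal-frequency bins and keep the bin edges.
tiers, edges = pd.qcut(
    salaries, q=3, labels=["Low", "Medium", "High"], retbins=True
)

# Class shares; the largest share is the majority-class baseline accuracy.
shares = tiers.value_counts(normalize=True)
baseline = shares.max()
print(edges)
print(shares)
```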

Per-Class Performance (XGBoost)

| Category | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Low | 75.77% | 0.7602 | 0.7577 | 0.7589 |
| High | 74.14% | 0.7595 | 0.7414 | 0.7503 |
| Medium | 61.57% | 0.5999 | 0.6157 | 0.6077 |

Key insights:

  • Model excels at distinguishing extreme categories (Low/High)
  • Misclassifications between Low ↔ High are rare (~4%)
  • Medium category is hardest to classify (boundary cases)
  • Model tends to predict Medium when uncertain (conservative strategy)
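
Per-class precision, recall, and F1 all follow from the confusion matrix. The sketch below uses an invented 3×3 matrix, not the project's actual one. Note that the Accuracy column in the table above matches Recall for each class, which suggests it was computed as the share of each true class predicted correctly.

```python
import numpy as np

labels = ["Low", "Medium", "High"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [76, 20,  4],
    [19, 62, 19],
    [ 4, 22, 74],
])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)      # correct / all rows of that true class
precision = true_pos / cm.sum(axis=0)   # correct / all predictions of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"overall accuracy={accuracy:.4f}")
```

In a matrix shaped like this one, most errors sit in the Medium row and column, mirroring the pattern described above.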

Confusion Matrix

XGBoost confusion matrix. The model rarely confuses Low with High (~4% error rate), but Medium is harder to classify.

Clustering Analysis (K-Means, K=4)

K-Means clustering identified 4 distinct developer personas:

Cluster Persona Size Median Salary Years Pro Top Region
0 Mainstream Developer 45.1% $58,375 6 Western Europe
1 Junior / Eastern Europe 10.2% $42,962 6 Western/Eastern Europe
2 Modern Tech Worker 25.1% $66,000 7 North America
3 Elite / Senior 19.6% $105,258 22 North America

Key clustering insights:

  • ChatGPT usage falls with seniority: 88% of Cluster 2 (Modern Tech Worker) uses it, versus only 44% of Cluster 3 (Elite / Senior)
  • Cluster 3 (Elite) stands out with 22+ years experience, North American location, and high salary - the "veteran developer" persona
  • Cluster 2 (Modern Tech Worker) represents AI-era developers using all modern tools (TypeScript, AWS, Copilot, ChatGPT) heavily
  • Silhouette scores are low (~0.04) due to high-dimensional data, but clusters remain interpretable and actionable

Elbow Method

Elbow Method and Silhouette Score analysis used to determine optimal K=4.
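
A K-selection loop of this kind can be sketched with scikit-learn. The example below runs on synthetic, well-separated 2D blobs, so its silhouette scores are far higher than the ~0.04 reported for the real 68-feature data. The data, the K range, and the `random_state` values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three tight blobs around known centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

inertias = {}  # within-cluster sum of squares, used for the elbow plot
scores = {}    # mean silhouette per K (higher is better)
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_
    scores[k] = silhouette_score(X_scaled, km.labels_)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Inertia always falls as K grows, so the elbow plot shows where the gains flatten, while the silhouette score gives a single value per K that can be maximized directly.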

PCA Clusters

4 developer personas visualized in 2D using PCA. Despite low variance explained (11.7%), the clusters show meaningful separation.

Salary by Cluster

Salary distributions per cluster reveal the clear hierarchy: Elite/Senior cluster has dramatically higher salaries with tighter distribution.

🚀 Usage

Loading the Models

import pickle
from pathlib import Path

# Per the project structure, the .pkl files live in models/; adjust if yours differ.
# Unpickling these files requires xgboost and scikit-learn to be installed.
MODEL_DIR = Path("models")

def load_pickle(filename):
    """Unpickle one artifact. Only unpickle files from a source you trust."""
    with open(MODEL_DIR / filename, "rb") as f:
        return pickle.load(f)

# Regression model (predicts log-scale salary; convert to USD with np.expm1)
reg_model = load_pickle("regression_model.pkl")

# Classification model (predicts Low/Medium/High)
cls_model = load_pickle("classification_model.pkl")

# Clustering model (assigns to 1 of 4 personas)
kmeans_model = load_pickle("kmeans_model.pkl")

# Preprocessing tools
scaler = load_pickle("scaler.pkl")
label_encoder = load_pickle("label_encoder.pkl")
feature_names = load_pickle("feature_names.pkl")

Making Predictions

import numpy as np

# Prepare your features (must match feature_names order)
# X_new must have shape (n_samples, 68)

# Regression prediction (returns log-scale salary)
log_salary_pred = reg_model.predict(X_new)
salary_usd = np.expm1(log_salary_pred)  # Convert back to USD

# Classification prediction
class_pred = cls_model.predict(X_new)
class_label = label_encoder.inverse_transform(class_pred)  # Low/Medium/High

# Clustering (which persona?)
X_scaled = scaler.transform(X_new)
cluster = kmeans_model.predict(X_scaled)
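
The snippet above leaves `X_new` undefined. One possible way to build it is to start from a dictionary of known feature values and reindex it to the saved column order. In the sketch below, the four-item `feature_names` list is a stand-in for the real 68 names loaded from `feature_names.pkl`, and filling unspecified features with 0 is an assumption: it is natural for one-hot and flag columns but may be a poor default for numeric ones.

```python
import pandas as pd

# Stand-in for the 68 names loaded from feature_names.pkl.
feature_names = [
    "YearsCodePro",
    "uses_AWS",
    "Region_North_America",
    "Region_Western_Europe",
]

# Known values for one developer; anything omitted defaults to 0.
developer = {"YearsCodePro": 8, "uses_AWS": 1, "Region_North_America": 1}

# Guard against typos: every supplied key must be a known feature.
unknown = set(developer) - set(feature_names)
if unknown:
    raise KeyError(f"Unknown feature names: {sorted(unknown)}")

# One row, columns in exactly the order the models were trained on.
X_new = pd.DataFrame([developer]).reindex(columns=feature_names, fill_value=0)
print(X_new)
```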

📚 Project Structure

data_science_project/
├── StackOverflow_Salary_Prediction.ipynb  # Main notebook with full pipeline
├── README.md                                # This file
├── models/
│   ├── regression_model.pkl                 # XGBoost regressor (1.2 MB)
│   ├── classification_model.pkl             # XGBoost classifier (3.3 MB)
│   ├── kmeans_model.pkl                     # K-Means cluster model (92 KB)
│   ├── scaler.pkl                           # StandardScaler for preprocessing
│   ├── label_encoder.pkl                    # LabelEncoder for class names
│   └── feature_names.pkl                    # List of 68 feature names
└── images/
    ├── 01_salary_distribution.png           # Target variable analysis
    ├── 02_salary_by_country.png             # Country-level boxplot
    ├── 03_top_developer_roles.png           # Roles ranked by salary
    ├── 04_salary_vs_experience.png          # Career growth curve
    ├── 05_experience_by_country.png         # Country comparison
    ├── 06_feature_importance.png            # XGBoost top features
    ├── 07_confusion_matrix.png              # Classification results
    ├── 08_elbow_method.png                  # Optimal K selection
    ├── 09_clusters_pca.png                  # 2D cluster visualization
    └── 10_salary_by_cluster.png             # Salary per persona

👀 Author

Raz Sarusi

Data Science Course Project - Assignment #2

📅 Date

Project completed: May 2026
