Stack Overflow Salary Prediction - Developer Survey 2024

🎬 Project Demo

📺 A 6-minute walkthrough covering the entire project: data exploration, feature engineering, model training, and key insights.

📊 Project Overview

This project predicts annual developer compensation (salary) based on factors like experience, location, technologies, education, and AI tool adoption.

The data comes from the Stack Overflow Annual Developer Survey 2024, covering 65,437 developers worldwide.

🎯 Objectives

Regression: Predict exact salary in USD
Classification: Categorize developers into salary tiers (Low/Medium/High)
Clustering: Discover natural developer segments

📁 Dataset

Source: Stack Overflow Annual Developer Survey 2024
Size: 65,437 rows × 114 columns (raw)
After cleaning: 22,765 rows × 68 features
Target: ConvertedCompYearly (USD)
Available on Kaggle: https://www.kaggle.com/datasets/berkayalan/stack-overflow-annual-developer-survey-2024

🔍 Key Findings (EDA)

Target Variable Analysis

23,435 valid salary responses (35.8% of dataset)
Highly right-skewed distribution
Range: $1 to $16,256,603 (extreme outliers exist)
Median salary: $65,000
Mean salary: $86,155
97.1% in realistic range ($1K-$500K)
Decision: Apply log transformation + filter outliers

Salary distribution showing extreme right-skewness and the value of log transformation for modeling.

Data Structure

100 categorical (object) columns
13 float columns
1 integer column (ResponseId)
Most predictive features need conversion from text to numeric

Top Paying Countries (by median salary)

Rank	Country	Median Salary	Sample Size
1	USA	$141,000	4,596
2	Israel	$113,334	217
3	Switzerland	$111,417	385
4	Australia	$95,796	505
5	Ireland	$91,295	120
6	Denmark	$88,993	211
7	Canada	$87,231	861
8	UK	$84,038	1,376

Key insight: Geographic location is the most powerful predictor of salary. The same role can earn 5-10x more in the US/Israel/Switzerland compared to emerging economies.

Geographic Salary Variance

Boxplot analysis revealed:

USA: Median $140K with high variance ($100K-$200K interquartile range), many high-end outliers reaching $500K+
Western Europe (Germany, UK): Median $70-85K, moderate variance
Eastern Europe (Poland, Ukraine): Median $35-55K, but with significant high-end outliers (likely remote workers for foreign companies)
Emerging markets (India, Brazil): Median $15-25K, low variance
Salary range from highest to lowest country median: ~10x difference

Salary distributions across the top 10 countries (by sample size). USA dominates both in median salary and variance.

Top Paying Developer Roles

Rank	Role	Median Salary
1	Senior Executive (C-Suite, VP)	$120K
2	Engineering Manager	$115K
3	Engineer, Site Reliability (SRE)	$98K
4	Cloud Infrastructure Engineer	$96K
5	Security Professional	$80K
6	Data Engineer	$77K
7	Developer, AI	$75K
8	Data Scientist / ML Specialist	$73K
9	Back-end Developer	$68K
10	Full-stack Developer	$64K

Key insights:

Specialization pays: Infrastructure roles (SRE, Cloud) earn 30-50% more than general development roles
Management track: Engineering managers and executives top the list
Counter-intuitive finding: AI Developer ranks only 7th despite the AI hype, which suggests the market is still developing
Full-stack paradox: the largest group (18,260 respondents) has the lowest median of the ten roles listed above, which may indicate market saturation

Top 15 developer roles ranked by median salary. Note how specialized infrastructure roles (SRE, Cloud) outperform general development roles.

Experience vs Salary Relationship

Overall correlation: 0.38 (moderate, due to country variance)
Career growth pattern observed:
- Years 0-10: Steep growth ($25K → $78K, 3x increase)
- Years 10-20: Continued growth ($78K → $95K)
- Years 20+: Plateau effect (~$100-110K, role-dependent)
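Within-country correlations are of similar size (0.27 to 0.44 for the countries listed in the next section), so experience matters everywhere while country sets the salary level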
Median professional experience in dataset: 8 years

The career growth curve: rapid early growth followed by plateau effect after ~20 years.

Country-Experience Interaction

Within-country correlations between experience and salary:

Germany: 0.438 (highest - structured market)
India: 0.394 (experience matters)
USA: 0.319 (role/company matter more)
Canada: 0.299
UK: 0.271

Insight: The same career trajectory yields vastly different outcomes depending on geography. A junior developer in the USA ($65K) earns more than a senior developer in India ($45K after 25 years). This makes country a critical feature for the model.
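
A within-country correlation of this kind can be computed by grouping on country before correlating. The sketch below is illustrative only: the toy values and the `salary` column name are assumptions, not survey data or the notebook's actual code. `Country` and `YearsCodePro` are the survey column names.

```python
import pandas as pd

# Toy data (hypothetical values, not survey results).
df = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "YearsCodePro": [1, 3, 5, 10, 20, 1, 3, 5, 10, 20],
    "salary": [60_000, 75_000, 90_000, 120_000, 150_000,
               8_000, 12_000, 15_000, 25_000, 40_000],
})

# Pooled correlation: mixes the experience trend with the pay gap between countries.
overall_r = df["YearsCodePro"].corr(df["salary"])

# Within-country correlation: Pearson r computed separately for each country.
within_r = df.groupby("Country")[["YearsCodePro", "salary"]].apply(
    lambda g: g["YearsCodePro"].corr(g["salary"])
)

print(f"overall r = {overall_r:.3f}")
print(within_r.round(3))
```

In this toy example the pooled correlation is weaker than either within-country value because the salary gap between the two countries adds variance that experience cannot explain.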

The "geography is destiny" effect: same experience yields drastically different salaries across countries.

Technology Indicators (Linear Correlation with Salary)

Technology	Users	%	Correlation
AWS	9,894	43.5%	+0.139
Go	3,388	14.9%	+0.087
Rust	2,853	12.5%	+0.082
Copilot	8,203	36.0%	+0.060
Scala	669	2.9%	+0.058
Azure	5,825	25.6%	+0.047
Python	11,142	48.9%	+0.044
Docker	11,591	50.9%	-0.002
Kubernetes	4,180	18.4%	-0.004
ChatGPT	14,827	65.1%	-0.102

Insights:

AWS is the strongest single technology indicator - likely because AWS adoption correlates with established tech companies in higher-paying countries
Docker, Kubernetes, Terraform show ~0 linear correlation despite being valuable skills - they have become industry standards (commoditized)
ChatGPT usage is negatively correlated - consistent with junior developers relying more on AI tools than senior engineers
These features still provide value through non-linear interactions in tree-based models (Random Forest, XGBoost)
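
For a 0/1 technology flag, the linear correlation in the table is an ordinary Pearson correlation (also called point-biserial when one variable is binary). A minimal sketch, assuming a binary `uses_AWS` column like the one described under Feature Engineering and a hypothetical `salary` column with made-up values:

```python
import pandas as pd

# Toy data (hypothetical values, not survey results).
df = pd.DataFrame({
    "uses_AWS": [1, 1, 1, 0, 0, 0, 1, 0],
    "salary": [120_000, 95_000, 140_000, 60_000, 45_000, 80_000, 70_000, 30_000],
})

# Pearson correlation between the binary flag and salary.
r = df["uses_AWS"].corr(df["salary"])

# Group medians give a more direct view of the same relationship.
medians = df.groupby("uses_AWS")["salary"].median()

print(f"correlation = {r:.3f}")
print(medians)
```

A correlation near zero only rules out a linear relationship with the flag on its own; it says nothing about interactions with other features.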

Key Predictive Features Identified

YearsCodePro - Years of professional coding experience
Country - Geographic location (massive impact)
EdLevel - Education level (8 ordered categories)
DevType - Developer role type (34 categories - needs grouping)
OrgSize - Company size (10 ordered categories)
RemoteWork - Remote/Hybrid/In-person

🛠️ Methodology

Data Preprocessing

Filtered rows with valid salary data (65,437 → 23,435)
Removed extreme outliers (<$1K and >$500K), leaving 22,765 rows
Applied log transformation to target (handles right-skewed distribution)
Converted text-based numeric columns (YearsCode, YearsCodePro)
Median imputation for missing experience values
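
The steps above can be sketched as follows. This is not the notebook's code: the function name `preprocess`, the `log_salary` column, and the numeric values substituted for the text answers (0.5 and 51) are assumptions made for illustration. `log1p` is used because the Usage section inverts predictions with `expm1`.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Filter, transform, and impute following the steps listed above."""
    # 1. Keep rows with a salary, then drop extreme outliers.
    out = df.dropna(subset=["ConvertedCompYearly"]).copy()
    out = out[out["ConvertedCompYearly"].between(1_000, 500_000)]

    # 2. Log-transform the right-skewed target.
    out["log_salary"] = np.log1p(out["ConvertedCompYearly"])

    # 3. Convert text answers to numbers (replacement values are assumptions).
    years = out["YearsCodePro"].replace(
        {"Less than 1 year": 0.5, "More than 50 years": 51}
    )
    out["YearsCodePro"] = pd.to_numeric(years, errors="coerce")

    # 4. Median imputation for missing experience.
    out["YearsCodePro"] = out["YearsCodePro"].fillna(out["YearsCodePro"].median())
    return out

# Toy input (hypothetical values).
raw = pd.DataFrame({
    "ConvertedCompYearly": [50_000, None, 500, 2_000_000, 120_000, 80_000],
    "YearsCodePro": ["5", "10", "3", "20", "Less than 1 year", None],
})
clean = preprocess(raw)
print(clean)
```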

Feature Engineering

Ordinal Encoding: EdLevel (8 levels), OrgSize (10 sizes), Age (8 groups)
Country Grouping: 185 countries → 11 regions based on geography and economy
DevType Grouping: 34 roles → 7 broader categories
Multi-select handling:
- Created 5 binary indicators for Employment status
- Count features for technologies (num_languages, num_databases, etc.)
- Binary flags for high-value technologies (uses_AWS, uses_Python, etc.)
One-Hot Encoding: Applied to Region, DevCategory, RemoteWork, Industry
Final dataset: 22,765 samples × 68 features
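
A minimal sketch of the multi-select and ordinal steps, assuming the survey stores multi-select answers as semicolon-separated strings. The column names `LanguageHaveWorkedWith` and `PlatformHaveWorkedWith`, the helper `split_multi`, the `OrgSize_ord` column, and the three-label `org_order` list are assumptions for illustration; the real ordering covers all 10 company sizes.

```python
import pandas as pd

# Toy rows (hypothetical values).
df = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;Go;SQL", "JavaScript", None],
    "PlatformHaveWorkedWith": ["Amazon Web Services (AWS);Docker", None, "Microsoft Azure"],
    "OrgSize": ["2 to 9 employees", "100 to 499 employees", "10 to 19 employees"],
})

def split_multi(col: pd.Series) -> pd.Series:
    """Turn 'a;b;c' strings into lists; missing answers become empty lists."""
    return col.fillna("").apply(lambda s: [v for v in s.split(";") if v])

langs = split_multi(df["LanguageHaveWorkedWith"])
platforms = split_multi(df["PlatformHaveWorkedWith"])

# Count feature and binary flags.
df["num_languages"] = langs.apply(len)
df["uses_Python"] = langs.apply(lambda xs: int("Python" in xs))
df["uses_AWS"] = platforms.apply(lambda xs: int("Amazon Web Services (AWS)" in xs))

# Ordinal encoding: map each label to its rank in an explicit order.
org_order = ["2 to 9 employees", "10 to 19 employees", "100 to 499 employees"]
df["OrgSize_ord"] = df["OrgSize"].map({label: i for i, label in enumerate(org_order)})

print(df[["num_languages", "uses_Python", "uses_AWS", "OrgSize_ord"]])
```

Splitting into lists before testing membership avoids false matches from substring checks, for example "Java" inside "JavaScript".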

Models Trained

Regression: Linear Regression, Random Forest, XGBoost
Classification: Logistic Regression, Random Forest, XGBoost
Clustering: K-Means with K=4 (chosen via Silhouette analysis)

📈 Results

Regression Model Performance

Model	R² (log)	R² ($)	MAE ($)	RMSE ($)	Training Time
Linear Regression	0.5319	0.4333	30,917	49,592	<1s
Random Forest	0.5698	0.5121	28,005	46,018	30s
XGBoost (best)	0.5840	0.5326	27,513	45,039	2.6s

Best Model: XGBoost with R² = 0.5326 on the USD scale (explains about 53% of salary variance)
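
The table reports R² on both the log scale and the USD scale. The sketch below shows one way such metrics could be computed from log-scale predictions; the helper names `r2` and `report` and the toy numbers are assumptions, not the notebook's code.

```python
import numpy as np

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def report(y_log_true: np.ndarray, y_log_pred: np.ndarray) -> dict:
    """Score predictions on the log scale and after converting back to USD."""
    usd_true = np.expm1(y_log_true)
    usd_pred = np.expm1(y_log_pred)
    err = usd_true - usd_pred
    return {
        "r2_log": r2(y_log_true, y_log_pred),
        "r2_usd": r2(usd_true, usd_pred),
        "mae_usd": float(np.mean(np.abs(err))),
        "rmse_usd": float(np.sqrt(np.mean(err ** 2))),
    }

# Toy example (hypothetical salaries and predictions).
y_true = np.log1p(np.array([30_000.0, 60_000.0, 90_000.0, 150_000.0]))
y_pred = np.log1p(np.array([35_000.0, 55_000.0, 100_000.0, 120_000.0]))
metrics = report(y_true, y_pred)
print(metrics)
```

The two R² values differ because errors on high salaries dominate the USD scale, while the log scale weights relative errors more evenly.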

Feature Importance Analysis

Top features driving predictions (XGBoost):

Rank	Feature	Importance
1	Region_North_America	36.58%
2	Region_Western_Europe	8.89%
3	Region_Asia_Developing	6.86%
4	Region_Asia_Pacific_Developed	4.62%
5	YearsCodePro	3.28%

Feature Importance by Category

Category	Total Importance
🌍 Region (Geography)	67.0%
💻 Tech indicators	6.9%
⏰ Experience	5.7%
🏭 Industry	5.5%
💼 Employment status	4.9%
🏢 Other	3.4%
💼 Developer Category	3.1%
📊 Tech counts	1.9%
👤 Demographics	1.6%

Key insight: Geography is the dominant predictor (67% of total XGBoost feature importance), consistent with the EDA finding that location matters more than skills, experience, or role for salary. The same developer profile can see 5-10x salary differences across regions.
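
Category totals like these can be obtained by mapping each feature name to a category and summing. In the sketch below the five importance values are the top-5 rows from the table above, so the Region total covers only those rows and is lower than the full 67.0%. The mapping function `category_of` and its prefix rules are assumptions; with a fitted scikit-learn style XGBoost model, the full series would typically be built from `model.feature_importances_` and the saved feature names.

```python
import pandas as pd

# Top-5 importances (in percent) copied from the table above.
importance = pd.Series({
    "Region_North_America": 36.58,
    "Region_Western_Europe": 8.89,
    "Region_Asia_Developing": 6.86,
    "Region_Asia_Pacific_Developed": 4.62,
    "YearsCodePro": 3.28,
})

def category_of(feature: str) -> str:
    """Hypothetical prefix-based mapping from feature name to category."""
    if feature.startswith("Region_"):
        return "Region"
    if feature.startswith("uses_"):
        return "Tech indicators"
    if feature.startswith("num_"):
        return "Tech counts"
    if feature in {"YearsCodePro", "YearsCode"}:
        return "Experience"
    return "Other"

# Group by the category of each index label and sum the importances.
by_category = importance.groupby(category_of).sum().sort_values(ascending=False)
print(by_category)
```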

Top 20 most important features in XGBoost. Region_North_America alone accounts for 36.6% of total feature importance.

Classification Model Performance

Salary split into 3 equal-sized classes (tertiles, about 33% each):

Low: < $46,185
Medium: $46,185 - $91,719
High: > $91,719
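
Equal-sized tiers like these can be produced with quantile binning. A minimal sketch with `pandas.qcut` on made-up salaries; the thresholds it finds depend on the data, so they will not match the README's cut points unless it is run on the cleaned survey data.

```python
import pandas as pd

# Toy salaries (hypothetical values).
salary = pd.Series([12_000, 30_000, 46_000, 52_000, 65_000,
                    90_000, 95_000, 140_000, 210_000])

# q=3 splits at the 1/3 and 2/3 quantiles, giving three equal-sized bins.
tier = pd.qcut(salary, q=3, labels=["Low", "Medium", "High"])

print(tier.value_counts())
```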

Model	Accuracy
Logistic Regression	68.72%
Random Forest	69.38%
XGBoost (best)	70.39%

Best Classifier: XGBoost with 70.39% accuracy (vs 33% baseline)

Per-Class Performance (XGBoost)

Category	Accuracy	Precision	Recall	F1-Score
Low	75.77%	0.7602	0.7577	0.7589
High	74.14%	0.7595	0.7414	0.7503
Medium	61.57%	0.5999	0.6157	0.6077

Key insights:

Model excels at distinguishing extreme categories (Low/High)
Misclassifications between Low ↔ High are rare (~4%)
Medium category is hardest to classify (boundary cases)
Model tends to predict Medium when uncertain (conservative strategy)
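
The F1 scores in the per-class table follow from precision and recall as F1 = 2PR / (P + R). Note also that the per-class Accuracy column matches Recall (for example 75.77% and 0.7577). The short check below recomputes F1 from the table values; the helper name `f1` is an assumption for illustration.

```python
# Precision and recall per class, copied from the XGBoost table above.
per_class = {
    "Low": (0.7602, 0.7577),
    "High": (0.7595, 0.7414),
    "Medium": (0.5999, 0.6157),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: round(f1(p, r), 4) for name, (p, r) in per_class.items()}
print(f1_scores)
```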

XGBoost confusion matrix. The model rarely confuses Low with High (~4% error rate), but Medium is harder to classify.

Clustering Analysis (K-Means, K=4)

K-Means clustering identified 4 distinct developer personas:

Cluster	Persona	Size	Median Salary	Years Pro	Top Region
0	Mainstream Developer	45.1%	$58,375	6	Western Europe
1	Junior / Eastern Europe	10.2%	$42,962	6	Western/Eastern Europe
2	Modern Tech Worker	25.1%	$66,000	7	North America
3	Elite / Senior	19.6%	$105,258	22	North America

Key clustering insights:

ChatGPT usage is inversely related to seniority: 88% of Cluster 2 (Modern Tech Worker) uses it, versus only 44% of Cluster 3 (Elite / Senior)
Cluster 3 (Elite) stands out with 22+ years experience, North American location, and high salary - the "veteran developer" persona
Cluster 2 (Modern Tech Worker) represents AI-era developers using all modern tools (TypeScript, AWS, Copilot, ChatGPT) heavily
Silhouette scores are low (~0.04), meaning the clusters overlap heavily in the 68-dimensional feature space; the personas are best read as interpretable summaries rather than sharply separated groups

Elbow Method and Silhouette Score analysis used to determine optimal K=4.
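
A minimal sketch of silhouette-based selection of K, assuming scikit-learn is installed. It uses four well-separated synthetic blobs, so the scores come out far higher than the ~0.04 reported for the real survey features; the data, the range of K values, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: four tight blobs at the corners of a square.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

On real data the curve is usually much flatter, which is why the elbow plot and interpretability of the resulting clusters are often considered alongside the score.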

The 4 developer personas visualized in 2D using PCA. The two components explain only 11.7% of the variance, so the plot is a rough view, yet the clusters still show visible separation.

Salary distributions per cluster reveal the clear hierarchy: Elite/Senior cluster has dramatically higher salaries with tighter distribution.

🚀 Usage

Loading the Models

import pickle
from pathlib import Path

# The artifacts are listed under models/ in Project Structure.
# Set MODEL_DIR to Path(".") if the .pkl files sit next to your script.
MODEL_DIR = Path("models")

def load_pickle(name):
    """Load one pickled artifact from MODEL_DIR."""
    # Only unpickle files you trust: pickle can run arbitrary code on load.
    with open(MODEL_DIR / name, "rb") as f:
        return pickle.load(f)

reg_model = load_pickle("regression_model.pkl")       # predicts log-scale salary
cls_model = load_pickle("classification_model.pkl")   # predicts Low/Medium/High
kmeans_model = load_pickle("kmeans_model.pkl")        # assigns 1 of 4 personas

# Preprocessing tools
scaler = load_pickle("scaler.pkl")
label_encoder = load_pickle("label_encoder.pkl")
feature_names = load_pickle("feature_names.pkl")

Making Predictions

import numpy as np
import pandas as pd

# Prepare your features (must match feature_names order).
# X_new must have shape (n_samples, 68).
# Placeholder only: one all-zero row. Replace it with real preprocessed values.
X_new = pd.DataFrame(np.zeros((1, len(feature_names))), columns=feature_names)

# Regression prediction (returns log-scale salary)
log_salary_pred = reg_model.predict(X_new)
salary_usd = np.expm1(log_salary_pred)  # Convert back to USD

# Classification prediction
class_pred = cls_model.predict(X_new)
class_label = label_encoder.inverse_transform(class_pred)  # Low/Medium/High

# Clustering (which persona?): scale the features before K-Means
X_scaled = scaler.transform(X_new)
cluster = kmeans_model.predict(X_scaled)
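
Because the inputs must match the `feature_names` order, one way to build them from a partially specified record is to reindex a DataFrame against that list. This is a sketch under assumptions: the four-name `feature_names` list below is a hypothetical stand-in for the 68 names in `feature_names.pkl`, and filling missing columns with 0 is only reasonable for one-hot and binary columns.

```python
import pandas as pd

# Hypothetical stand-in for the list loaded from feature_names.pkl.
feature_names = ["YearsCodePro", "uses_AWS",
                 "Region_North_America", "Region_Western_Europe"]

# One developer, given in arbitrary column order and missing one one-hot column.
raw = pd.DataFrame([{"Region_North_America": 1, "YearsCodePro": 8, "uses_AWS": 1}])

# Reorder to the training layout; absent columns are created and filled with 0.
X_new = raw.reindex(columns=feature_names, fill_value=0)

print(X_new)
```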

📚 Project Structure

data_science_project/
├── StackOverflow_Salary_Prediction.ipynb  # Main notebook with full pipeline
├── README.md                                # This file
├── models/
│   ├── regression_model.pkl                 # XGBoost regressor (1.2 MB)
│   ├── classification_model.pkl             # XGBoost classifier (3.3 MB)
│   ├── kmeans_model.pkl                     # K-Means cluster model (92 KB)
│   ├── scaler.pkl                           # StandardScaler for preprocessing
│   ├── label_encoder.pkl                    # LabelEncoder for class names
│   └── feature_names.pkl                    # List of 68 feature names
└── images/
    ├── 01_salary_distribution.png           # Target variable analysis
    ├── 02_salary_by_country.png             # Country-level boxplot
    ├── 03_top_developer_roles.png           # Roles ranked by salary
    ├── 04_salary_vs_experience.png          # Career growth curve
    ├── 05_experience_by_country.png         # Country comparison
    ├── 06_feature_importance.png            # XGBoost top features
    ├── 07_confusion_matrix.png              # Classification results
    ├── 08_elbow_method.png                  # Optimal K selection
    ├── 09_clusters_pca.png                  # 2D cluster visualization
    └── 10_salary_by_cluster.png             # Salary per persona

👤 Author

Raz Sarusi

Data Science Course Project - Assignment #2

📅 Date

Project completed: May 2026
