Stack Overflow Salary Prediction - Developer Survey 2024
Project Demo
A 6-minute walkthrough covering the entire project: data exploration, feature engineering, model training, and key insights.
Project Overview
This project predicts annual developer compensation (salary) based on factors like experience, location, technologies, education, and AI tool adoption.
The data comes from the Stack Overflow Annual Developer Survey 2024, covering 65,437 developers worldwide.
Objectives
- Regression: Predict exact salary in USD
- Classification: Categorize developers into salary tiers (Low/Medium/High)
- Clustering: Discover natural developer segments
Dataset
- Source: Stack Overflow Annual Developer Survey 2024
- Size: 65,437 rows × 114 columns (raw)
- After cleaning: 22,765 rows × 68 features
- Target: ConvertedCompYearly (USD)
- Available on Kaggle: https://www.kaggle.com/datasets/berkayalan/stack-overflow-annual-developer-survey-2024
Key Findings (EDA)
Target Variable Analysis
- 23,435 valid salary responses (35.8% of dataset)
- Highly right-skewed distribution
- Range: $1 to $16,256,603 (extreme outliers exist)
- Median salary: $65,000
- Mean salary: $86,155
- 97.1% in realistic range ($1K-$500K)
- Decision: Apply log transformation + filter outliers
Salary distribution showing extreme right-skewness and the value of log transformation for modeling.
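The decision above (filter to the realistic range, then log-transform) can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the helper names are hypothetical, and `log1p` is assumed because the Usage section inverts predictions with `np.expm1`.

```python
import math

# Bounds quoted in this README: rows below $1K or above $500K are dropped.
LOW_USD, HIGH_USD = 1_000, 500_000

def clean_and_log(salaries):
    """Keep salaries inside the realistic range, then apply log1p."""
    kept = [s for s in salaries if s is not None and LOW_USD <= s <= HIGH_USD]
    return [math.log1p(s) for s in kept]

def to_usd(log_salary):
    """Invert log1p to get back to dollars."""
    return math.expm1(log_salary)

# Made-up sample that reuses the min, median, mean, and max quoted in the EDA.
raw = [1, 65_000, 86_155, 16_256_603, None]
logged = clean_and_log(raw)
print(len(logged))               # number of rows that survive the filter
print(round(to_usd(logged[0])))  # round trip back to dollars
```

Training on the log scale compresses the long right tail, so a handful of very high salaries no longer dominates the squared-error loss.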
Data Structure
- 100 categorical (object) columns
- 13 float columns
- 1 integer column (ResponseId)
- Most predictive features need conversion from text to numeric
Top Paying Countries (by median salary)
| Rank | Country | Median Salary | Sample Size |
|---|---|---|---|
| 1 | USA | $141,000 | 4,596 |
| 2 | Israel | $113,334 | 217 |
| 3 | Switzerland | $111,417 | 385 |
| 4 | Australia | $95,796 | 505 |
| 5 | Ireland | $91,295 | 120 |
| 6 | Denmark | $88,993 | 211 |
| 7 | Canada | $87,231 | 861 |
| 8 | UK | $84,038 | 1,376 |
Key insight: Geographic location is the most powerful predictor of salary. The same role can earn 5-10x more in the US/Israel/Switzerland compared to emerging economies.
Geographic Salary Variance
Boxplot analysis revealed:
- USA: Median $140K with high variance ($100K-$200K interquartile range), many high-end outliers reaching $500K+
- Western Europe (Germany, UK): Median $70-85K, moderate variance
- Eastern Europe (Poland, Ukraine): Median $35-55K, but with significant high-end outliers (likely remote workers for foreign companies)
- Emerging markets (India, Brazil): Median $15-25K, low variance
- Salary range from highest to lowest country median: ~10x difference
Salary distributions across the top 10 countries (by sample size). USA dominates both in median salary and variance.
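A country ranking like the one above comes from grouping salaries by country and taking the median of each group. The sketch below shows the idea with the standard library. The `min_n` cutoff is an assumption added for illustration (small groups give unstable medians), and the rows are made up.

```python
from collections import defaultdict
from statistics import median

def median_by_group(rows, min_n=2):
    """rows: iterable of (group, salary) pairs.

    Returns (group, median, n) tuples sorted by median, highest first.
    Groups with fewer than min_n rows are skipped.
    """
    buckets = defaultdict(list)
    for group, salary in rows:
        buckets[group].append(salary)
    summary = [(g, median(v), len(v)) for g, v in buckets.items() if len(v) >= min_n]
    return sorted(summary, key=lambda item: item[1], reverse=True)

# Made-up rows; "C" has a single respondent and is dropped by min_n.
rows = [("A", 140_000), ("A", 150_000), ("A", 90_000),
        ("B", 20_000), ("B", 30_000),
        ("C", 500_000)]
for group, med, n in median_by_group(rows):
    print(group, med, n)
```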
Top Paying Developer Roles
| Rank | Role | Median Salary |
|---|---|---|
| 1 | Senior Executive (C-Suite, VP) | $120K |
| 2 | Engineering Manager | $115K |
| 3 | Engineer, Site Reliability (SRE) | $98K |
| 4 | Cloud Infrastructure Engineer | $96K |
| 5 | Security Professional | $80K |
| 6 | Data Engineer | $77K |
| 7 | Developer, AI | $75K |
| 8 | Data Scientist / ML Specialist | $73K |
| 9 | Back-end Developer | $68K |
| 10 | Full-stack Developer | $64K |
Key insights:
- Specialization pays: Infrastructure roles (SRE, Cloud) earn 30-50% more than general development roles
- Management track: Engineering managers and executives top the list
- Counter-intuitive finding: AI Developer ranks 7th rather than near the top despite the AI hype, which may indicate that the market for this role is still developing
- Full-stack paradox: Largest group (18,260 respondents) but lowest median in top-15, suggesting market saturation
Top 15 developer roles ranked by median salary. Note how specialized infrastructure roles (SRE, Cloud) outperform general development roles.
Experience vs Salary Relationship
- Overall correlation: 0.38 (moderate, due to country variance)
- Career growth pattern observed:
  - Years 0-10: Steep growth ($25K → $78K, roughly a 3x increase)
  - Years 10-20: Continued growth ($78K → $95K)
  - Years 20+: Plateau effect (~$100-110K, role-dependent)
- Within-country correlations range from 0.27 to 0.44 in the five countries examined, so they are comparable to the overall figure rather than uniformly stronger
- Median professional experience in dataset: 8 years
The career growth curve: rapid early growth followed by plateau effect after ~20 years.
Country-Experience Interaction (Simpson's Paradox)
Within-country correlations between experience and salary:
- Germany: 0.438 (highest - structured market)
- India: 0.394 (experience matters)
- USA: 0.319 (role/company matter more)
- UK: 0.271
- Canada: 0.299
Insight: The same career trajectory yields vastly different outcomes depending on geography. A junior developer in the USA ($65K) earns more than a senior developer in India ($45K after 25 years), which makes country a critical feature for the model.
The "geography is destiny" effect: same experience yields drastically different salaries across countries.
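One way to check a country-experience interaction is to compute the experience-salary correlation once on the pooled data and once per country. The sketch below does this with a hand-written Pearson correlation. The two groups and all numbers are made up to show the mechanics and do not reproduce the survey values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical (years of experience, salary in $K) pairs for two pay levels.
data = {
    "high_pay": [(1, 70), (5, 110), (10, 130), (20, 150)],
    "low_pay": [(1, 8), (5, 15), (10, 25), (20, 45)],
}

pooled_x = [x for rows in data.values() for x, _ in rows]
pooled_y = [y for rows in data.values() for _, y in rows]
pooled_r = pearson(pooled_x, pooled_y)
within_r = {name: pearson(*zip(*rows)) for name, rows in data.items()}

print("pooled:", round(pooled_r, 2))
for name, r in within_r.items():
    print(name, round(r, 2))
```

In this toy data the gap between the two pay levels adds variance that experience cannot explain, so the pooled correlation is lower than either within-group value.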
Technology Indicators (Linear Correlation with Salary)
| Technology | Users | % | Correlation |
|---|---|---|---|
| AWS | 9,894 | 43.5% | +0.139 |
| Go | 3,388 | 14.9% | +0.087 |
| Rust | 2,853 | 12.5% | +0.082 |
| Copilot | 8,203 | 36.0% | +0.060 |
| Scala | 669 | 2.9% | +0.058 |
| Azure | 5,825 | 25.6% | +0.047 |
| Python | 11,142 | 48.9% | +0.044 |
| Kubernetes | 4,180 | 18.4% | -0.004 |
| Docker | 11,591 | 50.9% | -0.002 |
| ChatGPT | 14,827 | 65.1% | -0.102 |
Insights:
- AWS is the strongest single technology indicator - likely because AWS adoption correlates with established tech companies in higher-paying countries
- Docker, Kubernetes, Terraform show ~0 linear correlation despite being valuable skills - they have become industry standards (commoditized)
- ChatGPT usage is negatively correlated - consistent with junior developers relying more on AI tools than senior engineers
- These features still provide value through non-linear interactions in tree-based models (Random Forest, XGBoost)
Key Predictive Features Identified
- YearsCodePro - Years of professional coding experience
- Country - Geographic location (massive impact)
- EdLevel - Education level (8 ordered categories)
- DevType - Developer role type (34 categories - needs grouping)
- OrgSize - Company size (10 ordered categories)
- RemoteWork - Remote/Hybrid/In-person
Methodology
Data Preprocessing
- Filtered rows with valid salary data (65,437 → 23,435 valid responses → 22,765 after outlier removal)
- Removed extreme outliers (<$1K and >$500K)
- Applied log transformation to target (handles right-skewed distribution)
- Converted text-based numeric columns (YearsCode, YearsCodePro)
- Median imputation for missing experience values
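The text-to-numeric conversion and median imputation steps might look like the sketch below. The two string answers and the numbers they map to are assumptions for illustration, not values confirmed by the notebook, and the helper names are hypothetical.

```python
import math
from statistics import median

# Assumed encoding of the two non-numeric answers in YearsCode / YearsCodePro.
TEXT_TO_YEARS = {"Less than 1 year": 0.5, "More than 50 years": 51.0}

def parse_years(value):
    """Return years as a float, or None when the answer is missing or unparseable."""
    if value is None:
        return None
    text = str(value).strip()
    if text in TEXT_TO_YEARS:
        return TEXT_TO_YEARS[text]
    try:
        years = float(text)
    except ValueError:
        return None
    return None if math.isnan(years) else years

def impute_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

raw = ["3", "Less than 1 year", None, "12", "More than 50 years", "NA"]
years = impute_median([parse_years(v) for v in raw])
print(years)
```

Median imputation is used here instead of the mean because experience is right-skewed, so the median is less affected by a few very long careers.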
Feature Engineering
- Ordinal Encoding: EdLevel (8 levels), OrgSize (10 sizes), Age (8 groups)
- Country Grouping: 185 countries → 11 regions based on geography and economy
- DevType Grouping: 34 roles → 7 broader categories
- Multi-select handling:
  - Created 5 binary indicators for Employment status
  - Count features for technologies (num_languages, num_databases, etc.)
  - Binary flags for high-value technologies (uses_AWS, uses_Python, etc.)
- One-Hot Encoding: Applied to Region, DevCategory, RemoteWork, Industry
- Final dataset: 22,765 samples × 68 features
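The multi-select handling can be illustrated as follows. This is a sketch under stated assumptions: multi-select answers are taken to be semicolon-separated strings, and the column names `LanguageHaveWorkedWith` and `PlatformHaveWorkedWith` as well as the label `Amazon Web Services (AWS)` are assumed rather than taken from the notebook.

```python
def split_multi(value, sep=";"):
    """Split a multi-select answer into a list of cleaned items."""
    if not value:
        return []
    return [item.strip() for item in value.split(sep) if item.strip()]

def tech_features(row):
    """Derive count features and binary flags from one survey row (a dict)."""
    languages = split_multi(row.get("LanguageHaveWorkedWith"))
    platforms = split_multi(row.get("PlatformHaveWorkedWith"))
    return {
        "num_languages": len(languages),
        "uses_Python": int("Python" in languages),
        "uses_AWS": int("Amazon Web Services (AWS)" in platforms),
    }

# Made-up respondent with three languages and no platform answer.
row = {"LanguageHaveWorkedWith": "Python;Go;SQL", "PlatformHaveWorkedWith": None}
features = tech_features(row)
print(features)
```

Matching on whole items after splitting avoids false positives that substring checks would produce, for example "Java" matching inside "JavaScript".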
Models Trained
- Regression: Linear Regression, Random Forest, XGBoost
- Classification: Logistic Regression, Random Forest, XGBoost
- Clustering: K-Means with K=4 (chosen via Silhouette analysis)
Results
Regression Model Performance
| Model | R² (log) | R² ($) | MAE ($) | RMSE ($) | Training Time |
|---|---|---|---|---|---|
| Linear Regression | 0.5319 | 0.4333 | 30,917 | 49,592 | <1s |
| Random Forest | 0.5698 | 0.5121 | 28,005 | 46,018 | 30s |
| XGBoost (best) | 0.5840 | 0.5326 | 27,513 | 45,039 | 2.6s |
Best Model: XGBoost with R² = 0.5326 on the dollar scale (explains about 53% of salary variance)
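The table reports R² on both the log scale and the dollar scale. The sketch below shows one way such metrics can be computed from log-scale predictions, assuming the target was transformed with `log1p`. The function names and the three sample salaries are hypothetical.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def evaluate(log_true, log_pred):
    """Score log-scale predictions on both the log scale and the dollar scale."""
    usd_true = [math.expm1(v) for v in log_true]
    usd_pred = [math.expm1(v) for v in log_pred]
    errors = [t - p for t, p in zip(usd_true, usd_pred)]
    return {
        "r2_log": r2(log_true, log_pred),
        "r2_usd": r2(usd_true, usd_pred),
        "mae_usd": sum(abs(e) for e in errors) / len(errors),
        "rmse_usd": math.sqrt(sum(e * e for e in errors) / len(errors)),
    }

# Made-up true and predicted salaries, stored on the log scale.
log_true = [math.log1p(v) for v in (30_000, 65_000, 140_000)]
log_pred = [math.log1p(v) for v in (40_000, 60_000, 120_000)]
scores = evaluate(log_true, log_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

The two R² values differ because errors on high salaries weigh more after converting back to dollars, which is consistent with the dollar-scale R² being lower for every model in the table.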
Feature Importance Analysis
Top features driving predictions (XGBoost):
| Rank | Feature | Importance |
|---|---|---|
| 1 | Region_North_America | 36.58% |
| 2 | Region_Western_Europe | 8.89% |
| 3 | Region_Asia_Developing | 6.86% |
| 4 | Region_Asia_Pacific_Developed | 4.62% |
| 5 | YearsCodePro | 3.28% |
Feature Importance by Category
| Category | Total Importance |
|---|---|
| Region (Geography) | 67.0% |
| Tech indicators | 6.9% |
| Experience | 5.7% |
| Industry | 5.5% |
| Employment status | 4.9% |
| Other | 3.4% |
| Developer Category | 3.1% |
| Tech counts | 1.9% |
| Demographics | 1.6% |
Key insight: Geography is the dominant predictor (67%), confirming our EDA finding that location matters more than skills, experience, or role for salary determination. The same developer in different regions can have 5-10x salary differences.
Top 20 most important features in XGBoost. Region_North_America alone accounts for 36.6% of total feature importance.
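Category totals like those above can be obtained by summing per-feature importances that share a name prefix. The sketch below uses only the five features listed in the top-features table. The prefix-to-category mapping is an assumption made for illustration.

```python
from collections import defaultdict

def importance_by_prefix(importances, prefixes):
    """Sum importances per category; the first matching prefix wins."""
    totals = defaultdict(float)
    for feature, weight in importances.items():
        category = next(
            (label for prefix, label in prefixes if feature.startswith(prefix)),
            "Other",
        )
        totals[category] += weight
    return dict(totals)

# Importances (in percent) of the five top features from the table above.
top5 = {
    "Region_North_America": 36.58,
    "Region_Western_Europe": 8.89,
    "Region_Asia_Developing": 6.86,
    "Region_Asia_Pacific_Developed": 4.62,
    "YearsCodePro": 3.28,
}
prefixes = [("Region_", "Region"), ("Years", "Experience")]  # hypothetical mapping
totals = importance_by_prefix(top5, prefixes)
print({name: round(value, 2) for name, value in totals.items()})
```

With only these five features the Region total comes to 56.95%. The 67.0% reported above also includes the remaining region features.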
Classification Model Performance
Salary categorized into 3 classes (33%/33%/33%):
- Low: < $46,185
- Medium: $46,185 - $91,719
- High: > $91,719
| Model | Accuracy |
|---|---|
| Logistic Regression | 68.72% |
| Random Forest | 69.38% |
| XGBoost (best) | 70.39% |
Best Classifier: XGBoost with 70.39% accuracy (vs 33% baseline)
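The tier boundaries listed above translate directly into a labeling function. This is a minimal sketch with a hypothetical function name; the thresholds are the ones stated in this section.

```python
LOW_MAX_USD = 46_185   # below this: Low
HIGH_MIN_USD = 91_719  # above this: High

def salary_tier(usd):
    """Map an annual salary in USD to Low / Medium / High."""
    if usd < LOW_MAX_USD:
        return "Low"
    if usd > HIGH_MIN_USD:
        return "High"
    return "Medium"

for salary in (30_000, 46_185, 91_719, 120_000):
    print(salary, salary_tier(salary))
```

Because the three tiers are roughly equal in size, always guessing a single class is right about one time in three, which is the 33% baseline quoted above.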
Per-Class Performance (XGBoost)
| Category | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Low | 75.77% | 0.7602 | 0.7577 | 0.7589 |
| High | 74.14% | 0.7595 | 0.7414 | 0.7503 |
| Medium | 61.57% | 0.5999 | 0.6157 | 0.6077 |
Key insights:
- Model excels at distinguishing extreme categories (Low/High)
- Misclassifications between Low ↔ High are rare (~4%)
- Medium category is hardest to classify (boundary cases)
- Model tends to predict Medium when uncertain (conservative strategy)
XGBoost confusion matrix. The model rarely confuses Low with High (~4% error rate), but Medium is harder to classify.
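Per-class precision, recall, and F1 all follow from the confusion matrix. The sketch below uses a made-up 3×3 matrix, not the project's actual one. Note that the per-class "Accuracy" column in the table above equals the Recall column (for example 75.77% and 0.7577), which is what dividing the diagonal entry by the row total produces.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    report = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        actual = sum(matrix[k])                    # row total
        predicted = sum(row[k] for row in matrix)  # column total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

def overall_accuracy(matrix):
    """Share of samples on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

labels = ["Low", "Medium", "High"]
matrix = [[75, 21, 4],    # hypothetical counts, 100 samples per true class
          [19, 62, 19],
          [4, 22, 74]]
report = per_class_metrics(matrix, labels)
for label in labels:
    print(label, {name: round(value, 3) for name, value in report[label].items()})
print("accuracy", round(overall_accuracy(matrix), 3))
```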
Clustering Analysis (K-Means, K=4)
K-Means clustering identified 4 distinct developer personas:
| Cluster | Persona | Size | Median Salary | Years Pro | Top Region |
|---|---|---|---|---|---|
| 0 | Mainstream Developer | 45.1% | $58,375 | 6 | Western Europe |
| 1 | Junior / Eastern Europe | 10.2% | $42,962 | 6 | Western/Eastern Europe |
| 2 | Modern Tech Worker | 25.1% | $66,000 | 7 | North America |
| 3 | Elite / Senior | 19.6% | $105,258 | 22 | North America |
Key clustering insights:
- ChatGPT usage is inversely correlated with seniority: Cluster 2 (modern) uses it 88% of the time, while Cluster 3 (elite/senior) only 44%
- Cluster 3 (Elite) stands out with a median of 22 years of professional experience, North American location, and high salary - the "veteran developer" persona
- Cluster 2 (Modern Tech Worker) represents AI-era developers using all modern tools (TypeScript, AWS, Copilot, ChatGPT) heavily
- Silhouette scores are low (~0.04), which means the clusters overlap substantially in this high-dimensional feature space; they are still interpretable as descriptive personas
Elbow Method and Silhouette Score analysis used to determine optimal K=4.
4 developer personas visualized in 2D using PCA. Despite low variance explained (11.7%), the clusters show meaningful separation.
Salary distributions per cluster reveal the clear hierarchy: Elite/Senior cluster has dramatically higher salaries with tighter distribution.
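For readers unfamiliar with the silhouette score used to pick K, here is a small pure-Python version of the metric. It is a sketch for intuition only, not the implementation used in the notebook. Values near 1 mean tight, well-separated clusters, values near 0 mean overlapping clusters, and negative values mean points sit closer to another cluster than to their own.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]   # two obvious groups
good = silhouette_score(points, [0, 0, 1, 1])   # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])    # labels cut across the groups
print(round(good, 3), round(bad, 3))
```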
Usage
Loading the Models
import pickle

# The model files are listed under models/ in the project structure;
# adjust the paths below to match your working directory.

# Load regression model (predicts log-scale salary; convert to USD with np.expm1)
with open('regression_model.pkl', 'rb') as f:
    reg_model = pickle.load(f)

# Load classification model (predicts Low/Medium/High)
with open('classification_model.pkl', 'rb') as f:
    cls_model = pickle.load(f)

# Load clustering model (assigns to 1 of 4 personas)
with open('kmeans_model.pkl', 'rb') as f:
    kmeans_model = pickle.load(f)

# Load preprocessing tools
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
with open('label_encoder.pkl', 'rb') as f:
    label_encoder = pickle.load(f)
with open('feature_names.pkl', 'rb') as f:
    feature_names = pickle.load(f)
Making Predictions
import numpy as np
# Prepare your features (must match feature_names order)
# X_new must have shape (n_samples, 68)
# Regression prediction (returns log-scale salary)
log_salary_pred = reg_model.predict(X_new)
salary_usd = np.expm1(log_salary_pred) # Convert back to USD
# Classification prediction
class_pred = cls_model.predict(X_new)
class_label = label_encoder.inverse_transform(class_pred) # Low/Medium/High
# Clustering (which persona?)
X_scaled = scaler.transform(X_new)
cluster = kmeans_model.predict(X_scaled)
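Since the models expect the 68 features in the exact order stored in `feature_names.pkl`, a small helper can assemble each input row. The sketch below is hypothetical: it uses a three-name list in place of the real feature names and fills unspecified features with 0, which suits one-hot and binary flags but should be reviewed for numeric features.

```python
def build_row(values, names):
    """Order a dict of feature values to match names, defaulting to 0.0."""
    unknown = set(values) - set(names)
    if unknown:
        raise KeyError(f"Unknown features: {sorted(unknown)}")
    return [float(values.get(name, 0.0)) for name in names]

# Hypothetical short list standing in for the 68 names in feature_names.pkl.
demo_feature_names = ["YearsCodePro", "Region_North_America", "uses_AWS"]

# One respondent: 8 years of experience, uses AWS, region flag left at 0.
demo_X = [build_row({"YearsCodePro": 8, "uses_AWS": 1}, demo_feature_names)]
print(demo_X)
```

Rejecting unknown keys catches typos in feature names early, before a silently misaligned row reaches the model.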
Project Structure
data_science_project/
├── StackOverflow_Salary_Prediction.ipynb   # Main notebook with full pipeline
├── README.md                               # This file
├── models/
│   ├── regression_model.pkl       # XGBoost regressor (1.2 MB)
│   ├── classification_model.pkl   # XGBoost classifier (3.3 MB)
│   ├── kmeans_model.pkl           # K-Means cluster model (92 KB)
│   ├── scaler.pkl                 # StandardScaler for preprocessing
│   ├── label_encoder.pkl          # LabelEncoder for class names
│   └── feature_names.pkl          # List of 68 feature names
└── images/
    ├── 01_salary_distribution.png     # Target variable analysis
    ├── 02_salary_by_country.png       # Country-level boxplot
    ├── 03_top_developer_roles.png     # Roles ranked by salary
    ├── 04_salary_vs_experience.png    # Career growth curve
    ├── 05_experience_by_country.png   # Country comparison
    ├── 06_feature_importance.png      # XGBoost top features
    ├── 07_confusion_matrix.png        # Classification results
    ├── 08_elbow_method.png            # Optimal K selection
    ├── 09_clusters_pca.png            # 2D cluster visualization
    └── 10_salary_by_cluster.png       # Salary per persona
Author
Raz Sarusi
Data Science Course Project - Assignment #2
Date
Project completed: May 2026









