| --- |
| license: mit |
| tags: |
| - regression |
| - classification |
| - clustering |
| - xgboost |
| - random-forest |
| - salary-prediction |
| - developer-survey |
| - stack-overflow |
| language: |
| - en |
| --- |
| |
| # Stack Overflow Salary Prediction - Developer Survey 2024 |
|
|
| ## 🎬 Project Demo |
|
|
| [Watch the demo video here](/razsarusi/stackoverflow-salary-prediction/blob/main/./%D7%A1%D7%A8%D7%98%D7%95%D7%9F%20%D7%9E%D7%A9%D7%99%D7%9E%D7%94%202%20%D7%93%D7%90%D7%98%D7%94%20%D7%A1%D7%99%D7%99%D7%A0%D7%A1.mp4) |
|
|
| > 📺 A 6-minute walkthrough covering the entire project: data exploration, |
| > feature engineering, model training, and key insights. |
|
|
| ## Project Overview |
|
|
| This project predicts annual developer compensation (salary) based on factors |
| like experience, location, technologies, education, and AI tool adoption. |
|
|
| The data comes from the Stack Overflow Annual Developer Survey 2024, |
| covering 65,437 developers worldwide. |
|
|
| ## 🎯 Objectives |
|
|
| 1. **Regression**: Predict exact salary in USD |
| 2. **Classification**: Categorize developers into salary tiers (Low/Medium/High) |
| 3. **Clustering**: Discover natural developer segments |
|
|
|
|
| ## Dataset |
|
|
| - **Source**: Stack Overflow Annual Developer Survey 2024 |
| - **Size**: 65,437 rows × 114 columns (raw) |
| - **After cleaning**: 22,765 rows × 68 features |
| - **Target**: `ConvertedCompYearly` (USD) |
| - **Available on Kaggle**: <https://www.kaggle.com/datasets/berkayalan/stack-overflow-annual-developer-survey-2024> |
|
|
|
|
| ## Key Findings (EDA) |
|
|
| ### Target Variable Analysis |
| - 23,435 valid salary responses (35.8% of dataset) |
| - Highly right-skewed distribution |
| - Range: $1 to $16,256,603 (extreme outliers exist) |
| - Median salary: $65,000 |
| - Mean salary: $86,155 |
| - 97.1% in realistic range ($1K-$500K) |
| - **Decision:** Apply log transformation + filter outliers |
|
|
|  |
|
|
| *Salary distribution showing extreme right-skewness and the value of log transformation for modeling.* |
|
|
| ### Data Structure |
| - 100 categorical (object) columns |
| - 13 float columns |
| - 1 integer column (ResponseId) |
| - Most predictive features need conversion from text to numeric |
|
|
| ### Top Paying Countries (by median salary) |
|
|
| | Rank | Country | Median Salary | Sample Size | |
| |------|---------|--------------|-------------| |
| | 1 | USA | $141,000 | 4,596 | |
| | 2 | Israel | $113,334 | 217 | |
| | 3 | Switzerland | $111,417 | 385 | |
| | 4 | Australia | $95,796 | 505 | |
| | 5 | Ireland | $91,295 | 120 | |
| | 6 | Denmark | $88,993 | 211 | |
| | 7 | Canada | $87,231 | 861 | |
| | 8 | UK | $84,038 | 1,376 | |
|
|
| **Key insight**: Geographic location is the most powerful predictor of salary. |
| The same role can earn 5-10x more in the US/Israel/Switzerland compared to |
| emerging economies. |
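
A ranking like the one above can be reproduced with a grouped median. The sketch below is illustrative only: the toy values and the `min_responses` threshold are assumptions, and only the column names `Country` and `ConvertedCompYearly` come from this README.

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration).
df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "ConvertedCompYearly": [100_000, 140_000, 180_000, 30_000, 50_000, 900_000],
})

min_responses = 2  # hypothetical threshold to skip thinly sampled countries

# Median salary and response count per country.
stats = df.groupby("Country")["ConvertedCompYearly"].agg(
    median_salary="median", sample_size="count"
)

# Drop small samples, then rank by median salary.
ranking = stats[stats["sample_size"] >= min_responses].sort_values(
    "median_salary", ascending=False
)
print(ranking)
```

Using the median rather than the mean keeps a single extreme response (country "C" here) from dominating the ranking, and the sample-size filter removes it entirely.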
|
|
| ### Geographic Salary Variance |
| Boxplot analysis revealed: |
| - **USA**: Median $140K with high variance ($100K-$200K interquartile range), |
| many high-end outliers reaching $500K+ |
| - **Western Europe** (Germany, UK): Median $70-85K, moderate variance |
| - **Eastern Europe** (Poland, Ukraine): Median $35-55K, but with significant |
| high-end outliers (likely remote workers for foreign companies) |
| - **Emerging markets** (India, Brazil): Median $15-25K, low variance |
| - **Salary range from highest to lowest country median: ~10x difference** |
|
|
|  |
|
|
| *Salary distributions across the top 10 countries (by sample size). USA dominates both in median salary and variance.* |
|
|
| ### Top Paying Developer Roles |
|
|
| | Rank | Role | Median Salary | |
| |------|------|---------------| |
| | 1 | Senior Executive (C-Suite, VP) | $120K | |
| | 2 | Engineering Manager | $115K | |
| | 3 | Engineer, Site Reliability (SRE) | $98K | |
| | 4 | Cloud Infrastructure Engineer | $96K | |
| | 5 | Security Professional | $80K | |
| | 6 | Data Engineer | $77K | |
| | 7 | Developer, AI | $75K | |
| | 8 | Data Scientist / ML Specialist | $73K | |
| | 9 | Back-end Developer | $68K | |
| | 10 | Full-stack Developer | $64K | |
|
|
| **Key insights**: |
| - **Specialization pays**: Infrastructure roles (SRE, Cloud) earn 30-50% more |
| than general development roles |
| - **Management track**: Engineering managers and executives top the list |
| - **Counter-intuitive finding**: AI Developer ranks 7th, not at top despite |
| the AI hype - market still developing |
| - **Full-stack paradox**: Largest group (18,260 respondents) but the lowest median |
| among the ten roles listed above, possibly reflecting market saturation |
|
|
|  |
|
|
| *Top 15 developer roles ranked by median salary. Note how specialized infrastructure roles (SRE, Cloud) outperform general development roles.* |
|
|
| ### Experience vs Salary Relationship |
| - Overall correlation: **0.38** (moderate; country-level pay differences add variance) |
| - Career growth pattern observed: |
|   - Years 0-10: Steep growth ($25K → $78K, roughly 3x) |
|   - Years 10-20: Continued growth ($78K → $95K) |
|   - Years 20+: Plateau effect (~$100-110K, role-dependent) |
| - **Within-country correlations range from 0.27 to 0.44** in the five countries examined, so they are not uniformly stronger than the overall figure |
| - Median professional experience in dataset: 8 years |
|
|
|  |
|
|
| *The career growth curve: rapid early growth followed by plateau effect after ~20 years.* |
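
A growth curve of this kind can be approximated by bucketing experience and taking the median salary per bucket. This is a minimal sketch on made-up numbers; the bin edges and band labels are assumptions chosen to mirror the three phases listed above.

```python
import pandas as pd

# Made-up respondents; only the column names come from the survey.
df = pd.DataFrame({
    "YearsCodePro": [1, 4, 9, 12, 18, 25, 30],
    "ConvertedCompYearly": [30_000, 50_000, 80_000, 90_000, 100_000, 105_000, 110_000],
})

bins = [0, 10, 20, 60]              # assumed band edges, in years
labels = ["0-10", "10-20", "20+"]   # assumed band names

# Assign each respondent to an experience band (0 is included in the first band).
df["band"] = pd.cut(df["YearsCodePro"], bins=bins, labels=labels, include_lowest=True)

# Median salary per band gives a coarse version of the career curve.
curve = df.groupby("band", observed=True)["ConvertedCompYearly"].median()
print(curve)
```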
|
|
| ### Country-Experience Interaction (Simpson's Paradox) |
| Within-country correlations between experience and salary: |
| - Germany: 0.438 (highest - structured market) |
| - India: 0.394 (experience matters) |
| - USA: 0.319 (role/company matter more) |
| - UK: 0.271 |
| - Canada: 0.299 |
|
|
| **Insight**: The same career trajectory yields vastly different outcomes |
| based on geography. A junior developer in the USA ($65K) earns more than a |
| senior developer in India ($45K after 25 years). This makes country a |
| critical feature for the model. |
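
Per-country correlations like those listed above can be computed by grouping before correlating. The sketch below uses synthetic data with two hypothetical countries that share the same experience effect but differ in base pay; in this constructed case the pooled correlation comes out weaker than either within-country value, which is not necessarily what the real survey shows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = []

# Two hypothetical countries: same pay rise per year, very different base pay.
for country, base in [("High", 100_000), ("Low", 20_000)]:
    years = rng.uniform(0, 30, size=500)
    salary = base + 1_000 * years + rng.normal(0, 10_000, size=500)
    frames.append(pd.DataFrame({"Country": country, "YearsCodePro": years, "Salary": salary}))

df = pd.concat(frames, ignore_index=True)

# Pooled (overall) Pearson correlation across both countries.
overall = df["YearsCodePro"].corr(df["Salary"])

# Pearson correlation computed separately inside each country.
within = {
    country: group["YearsCodePro"].corr(group["Salary"])
    for country, group in df.groupby("Country")
}

print(f"overall: {overall:.3f}")
for country, value in within.items():
    print(f"{country}: {value:.3f}")
```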
|
|
|  |
|
|
| *The "geography is destiny" effect: same experience yields drastically different salaries across countries.* |
|
|
| ### Technology Indicators (Linear Correlation with Salary) |
|
|
| | Technology | Users | % of sample | Correlation | |
| |------------|-------|-------------|-------------| |
| | AWS | 9,894 | 43.5% | **+0.139** | |
| | Go | 3,388 | 14.9% | +0.087 | |
| | Rust | 2,853 | 12.5% | +0.082 | |
| | Copilot | 8,203 | 36.0% | +0.060 | |
| | Scala | 669 | 2.9% | +0.058 | |
| | Azure | 5,825 | 25.6% | +0.047 | |
| | Python | 11,142 | 48.9% | +0.044 | |
| | Kubernetes | 4,180 | 18.4% | -0.004 | |
| | Docker | 11,591 | 50.9% | -0.002 | |
| | **ChatGPT** | 14,827 | 65.1% | **-0.102** | |
|
|
| **Insights**: |
| - **AWS is the strongest single technology indicator** - likely because |
| AWS adoption correlates with established tech companies in higher-paying countries |
| - **Docker, Kubernetes, Terraform show ~0 linear correlation** despite |
| being valuable skills - they have become industry standards (commoditized) |
| - **ChatGPT usage is negatively correlated** - consistent with junior |
| developers relying more on AI tools than senior engineers |
| - These features still provide value through **non-linear interactions** |
| in tree-based models (Random Forest, XGBoost) |
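
For reference, a technology indicator and its linear correlation with salary might be computed as follows. This is a sketch, not the notebook's code: the column name `PlatformHaveWorkedWith`, the semicolon separator, and the exact answer string are assumptions about the survey export, and the README does not say whether the correlation was taken against raw or log salary.

```python
import pandas as pd

# Made-up rows; multi-select answers are assumed to be semicolon-separated.
df = pd.DataFrame({
    "PlatformHaveWorkedWith": [
        "Amazon Web Services (AWS);Microsoft Azure",
        "Microsoft Azure",
        None,
        "Amazon Web Services (AWS)",
    ],
    "ConvertedCompYearly": [150_000, 60_000, 40_000, 120_000],
})

# Split each answer into a list and test membership to build a 0/1 flag.
selected = df["PlatformHaveWorkedWith"].fillna("").str.split(";")
df["uses_AWS"] = selected.apply(lambda items: int("Amazon Web Services (AWS)" in items))

# Pearson correlation between a 0/1 flag and a numeric column
# (equivalent to the point-biserial correlation).
corr = df["uses_AWS"].corr(df["ConvertedCompYearly"])
print(df["uses_AWS"].tolist(), round(corr, 3))
```

A value near zero from this measure only rules out a linear, standalone relationship, which is why such flags can still matter inside tree-based models.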
|
|
| ### Key Predictive Features Identified |
| - **YearsCodePro** - Years of professional coding experience |
| - **Country** - Geographic location (massive impact) |
| - **EdLevel** - Education level (8 ordered categories) |
| - **DevType** - Developer role type (34 categories - needs grouping) |
| - **OrgSize** - Company size (10 ordered categories) |
| - **RemoteWork** - Remote/Hybrid/In-person |
|
|
|
|
| ## 🛠️ Methodology |
|
|
| ### Data Preprocessing |
| - Kept rows with a valid salary (65,437 → 23,435 rows) |
| - Removed extreme outliers (<$1K and >$500K), leaving 22,765 rows |
| - Applied log transformation to target (handles right-skewed distribution) |
| - Converted text-based numeric columns (YearsCode, YearsCodePro) |
| - Median imputation for missing experience values |
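
The preprocessing steps above can be sketched roughly as follows. The toy rows and the numeric values chosen for the text answers are assumptions, and `np.log1p` is assumed because the usage example later inverts predictions with `np.expm1`.

```python
import numpy as np
import pandas as pd

# Made-up raw rows: salaries with outliers and NaN, experience stored as text.
raw = pd.DataFrame({
    "ConvertedCompYearly": [500, 65_000, np.nan, 2_000_000, 120_000, 30_000, 90_000],
    "YearsCodePro": ["Less than 1 year", "8", None, "20", "More than 50 years", "3", None],
})

# 1. Keep salaries inside the $1K-$500K range (NaN fails the test and is dropped).
df = raw[raw["ConvertedCompYearly"].between(1_000, 500_000)].copy()

# 2. Convert text answers to numbers; the two replacement values are assumptions.
text_to_number = {"Less than 1 year": 0.5, "More than 50 years": 51}
years = df["YearsCodePro"].map(lambda value: text_to_number.get(value, value))
df["YearsCodePro"] = pd.to_numeric(years, errors="coerce")

# 3. Fill missing experience with the median of the remaining rows.
df["YearsCodePro"] = df["YearsCodePro"].fillna(df["YearsCodePro"].median())

# 4. Log-transform the right-skewed target.
df["log_salary"] = np.log1p(df["ConvertedCompYearly"])
print(df)
```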
|
|
| ### Feature Engineering |
| - **Ordinal Encoding**: EdLevel (8 levels), OrgSize (10 sizes), Age (8 groups) |
| - **Country Grouping**: 185 countries → 11 regions based on geography and economy |
| - **DevType Grouping**: 34 roles → 7 broader categories |
| - **Multi-select handling**: |
| - Created 5 binary indicators for Employment status |
| - Count features for technologies (num_languages, num_databases, etc.) |
| - Binary flags for high-value technologies (uses_AWS, uses_Python, etc.) |
| - **One-Hot Encoding**: Applied to Region, DevCategory, RemoteWork, Industry |
| - **Final dataset**: 22,765 samples × 68 features |
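
To make the three encoding styles concrete, here is a small sketch on made-up rows. The shortened education labels, their ordering, and the `LanguageHaveWorkedWith` column name are hypothetical; the `num_languages` and `Region_...` names follow the conventions mentioned in this README.

```python
import pandas as pd

# Made-up rows with one ordinal, one multi-select, and one nominal column.
df = pd.DataFrame({
    "EdLevel": ["Bachelor", "Master", "Secondary"],
    "LanguageHaveWorkedWith": ["Python;Go;SQL", "Rust", None],
    "Region": ["North_America", "Western_Europe", "North_America"],
})

# Ordinal encoding: an explicit mapping preserves the order of the levels.
edlevel_order = {"Secondary": 0, "Bachelor": 1, "Master": 2}  # hypothetical levels
df["EdLevel_ord"] = df["EdLevel"].map(edlevel_order)

# Count feature: number of items in a semicolon-separated multi-select answer.
languages = df["LanguageHaveWorkedWith"].fillna("")
df["num_languages"] = languages.apply(lambda s: len(s.split(";")) if s else 0)

# One-hot encoding: one 0/1 column per region, prefixed with the column name.
features = pd.get_dummies(
    df[["EdLevel_ord", "num_languages", "Region"]], columns=["Region"], dtype=int
)
print(features)
```

Ordinal codes suit columns with a natural order, while one-hot columns avoid implying an order among regions.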
|
|
| ### Models Trained |
| - **Regression**: Linear Regression, Random Forest, XGBoost |
| - **Classification**: Logistic Regression, Random Forest, XGBoost |
| - **Clustering**: K-Means with K=4 (chosen via Silhouette analysis) |
|
|
|
|
| ## Results |
|
|
| ### Regression Model Performance |
|
|
| | Model | R² (log) | R² ($) | MAE ($) | RMSE ($) | Training Time | |
| |-------|----------|--------|---------|----------|---------------| |
| | Linear Regression | 0.5319 | 0.4333 | 30,917 | 49,592 | <1s | |
| | Random Forest | 0.5698 | 0.5121 | 28,005 | 46,018 | 30s | |
| | **XGBoost (best)** | **0.5840** | **0.5326** | **27,513** | **45,039** | **2.6s** | |
|
|
| **Best Model: XGBoost** with R² = 0.5326 on the USD scale (explains about 53% of salary variance) |
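
Because the target is modeled on the log scale, R² can be computed either on the log values or after converting predictions back to USD, and the two can differ, as the table above shows. The sketch below illustrates the calculation on synthetic numbers; the distributions are assumptions and say nothing about the actual model.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
log_true = rng.normal(11.0, 0.8, size=2_000)             # synthetic log1p(salary)
log_pred = log_true + rng.normal(0.0, 0.5, size=2_000)   # synthetic prediction error

# Convert both back to USD before computing dollar-scale metrics.
usd_true, usd_pred = np.expm1(log_true), np.expm1(log_pred)

r2_log = r2(log_true, log_pred)
r2_usd = r2(usd_true, usd_pred)
mae_usd = np.mean(np.abs(usd_true - usd_pred))
rmse_usd = np.sqrt(np.mean((usd_true - usd_pred) ** 2))

print(f"R2 (log): {r2_log:.3f}  R2 ($): {r2_usd:.3f}")
print(f"MAE ($): {mae_usd:,.0f}  RMSE ($): {rmse_usd:,.0f}")
```

Errors on high salaries are amplified after the inverse transform, which is one reason dollar-scale metrics tend to look worse than log-scale ones.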
|
|
| ### Feature Importance Analysis |
|
|
| Top features driving predictions (XGBoost): |
| | Rank | Feature | Importance | |
| |------|---------|------------| |
| | 1 | Region_North_America | 36.58% | |
| | 2 | Region_Western_Europe | 8.89% | |
| | 3 | Region_Asia_Developing | 6.86% | |
| | 4 | Region_Asia_Pacific_Developed | 4.62% | |
| | 5 | YearsCodePro | 3.28% | |
| |
| ### Feature Importance by Category |
| |
| | Category | Total Importance | |
| |----------|------------------| |
| | **Region (Geography)** | **67.0%** | |
| | Tech indicators | 6.9% | |
| | Experience | 5.7% | |
| | Industry | 5.5% | |
| | Employment status | 4.9% | |
| | Other | 3.4% | |
| | Developer Category | 3.1% | |
| | Tech counts | 1.9% | |
| | Demographics | 1.6% | |
| |
| **Key insight**: Geography is the dominant predictor (67%), confirming our EDA finding |
| that location matters more than skills, experience, or role for salary determination. |
| The same developer in different regions can have 5-10x salary differences. |
| |
|  |
| |
| *Top 20 most important features in XGBoost. Region_North_America alone accounts for 36.6% of total feature importance.* |
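
Category totals like those in the table above can be obtained by mapping each feature name to a category and summing. In this sketch the importance values are invented and the prefix rules in `category_of` are a hypothetical convention; real values might come from a fitted model's `feature_importances_` attribute.

```python
import pandas as pd

# Invented importances that sum to 1.0 (not the model's real values).
importance = pd.Series({
    "Region_North_America": 0.40,
    "Region_Western_Europe": 0.10,
    "YearsCodePro": 0.18,
    "uses_AWS": 0.15,
    "uses_Python": 0.05,
    "num_languages": 0.12,
})

def category_of(feature: str) -> str:
    """Map a feature name to a category using assumed naming prefixes."""
    if feature.startswith("Region_"):
        return "Region"
    if feature.startswith("uses_"):
        return "Tech indicators"
    if feature.startswith("num_"):
        return "Tech counts"
    if feature.startswith("Years"):
        return "Experience"
    return "Other"

# Group by the mapped category (applied to the index) and sum the importances.
by_category = importance.groupby(category_of).sum().sort_values(ascending=False)
print(by_category)
```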
| |
| ### Classification Model Performance |
| |
| Salary categorized into 3 classes (33%/33%/33%): |
| - **Low**: < $46,185 |
| - **Medium**: $46,185 - $91,719 |
| - **High**: > $91,719 |
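
Equal-sized tiers like these can be produced with quantile binning. The README does not state how the thresholds were derived, so `pd.qcut` below is an assumed technique, shown on synthetic salaries.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
salary = pd.Series(np.expm1(rng.normal(11.0, 0.7, size=3_000)))  # synthetic salaries

# Split into three equal-frequency bins and keep the bin edges.
tiers, edges = pd.qcut(salary, q=3, labels=["Low", "Medium", "High"], retbins=True)

# Share of samples in each tier (about one third each by construction).
shares = tiers.value_counts(normalize=True)

print(shares)
print("Thresholds:", edges[1].round(), edges[2].round())
```

If the thresholds are computed on the training split only and then reused for the test split, the class labels do not leak information from held-out data.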
| |
| | Model | Accuracy | |
| |-------|----------| |
| | Logistic Regression | 68.72% | |
| | Random Forest | 69.38% | |
| | **XGBoost (best)** | **70.39%** | |
| |
| **Best Classifier: XGBoost** with 70.39% accuracy (vs 33% baseline) |
| |
| ### Per-Class Performance (XGBoost) |
| |
| | Category | Accuracy | Precision | Recall | F1-Score | |
| |----------|----------|-----------|--------|----------| |
| | Low | 75.77% | 0.7602 | 0.7577 | 0.7589 | |
| | High | 74.14% | 0.7595 | 0.7414 | 0.7503 | |
| | Medium | 61.57% | 0.5999 | 0.6157 | 0.6077 | |
| |
| **Key insights**: |
| - Model excels at distinguishing extreme categories (Low/High) |
| - Misclassifications between Low ↔ High are rare (~4%) |
| - Medium category is hardest to classify (boundary cases) |
| - Model tends to predict Medium when uncertain (conservative strategy) |
| |
|  |
| |
| *XGBoost confusion matrix. The model rarely confuses Low with High (~4% error rate), but Medium is harder to classify.* |
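
The per-class figures can be derived directly from a confusion matrix. Note that the per-class Accuracy column in the table above matches Recall, which is the diagonal divided by the row totals. The matrix below is hypothetical and is not the model's actual output.

```python
import numpy as np

labels = ["Low", "Medium", "High"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [80, 15, 5],
    [20, 60, 20],
    [10, 20, 70],
])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)       # correct / all samples of the true class
precision = true_pos / cm.sum(axis=0)    # correct / all samples predicted as the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()     # overall share of correct predictions

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```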
| |
| ### Clustering Analysis (K-Means, K=4) |
| |
| K-Means clustering identified 4 distinct developer personas: |
| |
| | Cluster | Persona | Size | Median Salary | Years Pro | Top Region | |
| |---------|---------|------|---------------|-----------|------------| |
| | 0 | **Mainstream Developer** | 45.1% | $58,375 | 6 | Western Europe | |
| | 1 | **Junior / Eastern Europe** | 10.2% | $42,962 | 6 | Western/Eastern Europe | |
| | 2 | **Modern Tech Worker** | 25.1% | $66,000 | 7 | North America | |
| | 3 | **Elite / Senior** | 19.6% | **$105,258** | **22** | North America | |
| |
| **Key clustering insights**: |
| - **ChatGPT usage is inversely related to seniority**: 88% of Cluster 2 (modern) |
| use it, versus only 44% of Cluster 3 (elite/senior) |
| - **Cluster 3 (Elite)** stands out with about 22 years of professional experience, a North American |
| location, and high salary - the "veteran developer" persona |
| - **Cluster 2 (Modern Tech Worker)** represents AI-era developers using all |
| modern tools (TypeScript, AWS, Copilot, ChatGPT) heavily |
| - Silhouette scores are low (~0.04) due to high-dimensional data, but clusters |
| remain interpretable and actionable |
| |
|  |
| |
| *Elbow Method and Silhouette Score analysis used to determine optimal K=4.* |
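
A silhouette-based search for K might look like the sketch below. It assumes scikit-learn and uses four well-separated synthetic blobs, so the scores are far higher than the ~0.04 reported for the real, high-dimensional survey features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Four synthetic 2D blobs, 100 points each, centered on the corners of a square.
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
X = np.vstack([center + rng.normal(0, 1.0, size=(100, 2)) for center in centers])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, cluster_labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```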
| |
|  |
| |
| *4 developer personas visualized in 2D using PCA. Despite low variance explained (11.7%), the clusters show meaningful separation.* |
| |
|  |
| |
| *Salary distributions per cluster reveal the clear hierarchy: Elite/Senior cluster has dramatically higher salaries with tighter distribution.* |
| |
| |
| ## Usage |
| |
| ### Loading the Models |
| ```python |
| import pickle |
| |
| # Paths assume the .pkl files are in the working directory; |
| # in the repository layout they live under models/. |
| |
| # Load regression model (predicts log-scale salary; convert back with np.expm1) |
| with open('regression_model.pkl', 'rb') as f: |
|     reg_model = pickle.load(f) |
| |
| # Load classification model (predicts Low/Medium/High) |
| with open('classification_model.pkl', 'rb') as f: |
|     cls_model = pickle.load(f) |
| |
| # Load clustering model (assigns to 1 of 4 personas) |
| with open('kmeans_model.pkl', 'rb') as f: |
|     kmeans_model = pickle.load(f) |
| |
| # Load preprocessing tools |
| with open('scaler.pkl', 'rb') as f: |
|     scaler = pickle.load(f) |
| |
| with open('label_encoder.pkl', 'rb') as f: |
|     label_encoder = pickle.load(f) |
| |
| # Ordered list of the 68 feature names expected by the models |
| with open('feature_names.pkl', 'rb') as f: |
|     feature_names = pickle.load(f) |
| ``` |
| |
| ### Making Predictions |
| ```python |
| import numpy as np |
|
|
| # Prepare your features (must match feature_names order) |
| # X_new must have shape (n_samples, 68) |
| |
| # Regression prediction (returns log-scale salary) |
| log_salary_pred = reg_model.predict(X_new) |
| salary_usd = np.expm1(log_salary_pred) # Convert back to USD |
|
|
| # Classification prediction |
| class_pred = cls_model.predict(X_new) |
| class_label = label_encoder.inverse_transform(class_pred) # Low/Medium/High |
| |
| # Clustering (which persona?) |
| X_scaled = scaler.transform(X_new) |
| cluster = kmeans_model.predict(X_scaled) |
| ``` |
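
The snippet above leaves `X_new` undefined. One way to build it, sketched here under assumptions, is to describe a respondent as a dictionary and reindex it to the stored feature order. The four names below are a hypothetical subset of the 68 entries in `feature_names.pkl`, and filling missing features with 0 is an assumption that suits one-hot and binary flags but may be misleading for numeric features.

```python
import pandas as pd

# Hypothetical subset of the names stored in feature_names.pkl.
feature_names = ["YearsCodePro", "uses_AWS", "Region_North_America", "Region_Western_Europe"]

# One respondent, described only by the features that apply.
respondent = {"YearsCodePro": 8, "Region_North_America": 1, "uses_AWS": 1}

# Reorder the columns to match feature_names and fill absent features with 0.
X_new = pd.DataFrame([respondent]).reindex(columns=feature_names, fill_value=0)
print(X_new)
```

The resulting frame has one row and the columns in the expected order, so it can be passed to `predict` in place of `X_new` above.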
| |
| ## Project Structure |
| |
| ``` |
| data_science_project/ |
| ├── StackOverflow_Salary_Prediction.ipynb   # Main notebook with full pipeline |
| ├── README.md                               # This file |
| ├── models/ |
| │   ├── regression_model.pkl                # XGBoost regressor (1.2 MB) |
| │   ├── classification_model.pkl            # XGBoost classifier (3.3 MB) |
| │   ├── kmeans_model.pkl                    # K-Means cluster model (92 KB) |
| │   ├── scaler.pkl                          # StandardScaler for preprocessing |
| │   ├── label_encoder.pkl                   # LabelEncoder for class names |
| │   └── feature_names.pkl                   # List of 68 feature names |
| └── images/ |
|     ├── 01_salary_distribution.png          # Target variable analysis |
|     ├── 02_salary_by_country.png            # Country-level boxplot |
|     ├── 03_top_developer_roles.png          # Roles ranked by salary |
|     ├── 04_salary_vs_experience.png         # Career growth curve |
|     ├── 05_experience_by_country.png        # Country comparison |
|     ├── 06_feature_importance.png           # XGBoost top features |
|     ├── 07_confusion_matrix.png             # Classification results |
|     ├── 08_elbow_method.png                 # Optimal K selection |
|     ├── 09_clusters_pca.png                 # 2D cluster visualization |
|     └── 10_salary_by_cluster.png            # Salary per persona |
| ``` |
| |
| ## 👤 Author |
|
|
| **Raz Sarusi** |
|
|
| *Data Science Course Project - Assignment #2* |
|
|
| ## 📅 Date |
|
|
| *Project completed: May 2026* |