	---
	license: mit
	tags:
	- regression
	- classification
	- clustering
	- xgboost
	- random-forest
	- salary-prediction
	- developer-survey
	- stack-overflow
	language:
	- en
	---

	# Stack Overflow Salary Prediction - Developer Survey 2024

	## 🎬 Project Demo

	[Watch the demo video here](/razsarusi/stackoverflow-salary-prediction/blob/main/./%D7%A1%D7%A8%D7%98%D7%95%D7%9F%20%D7%9E%D7%A9%D7%99%D7%9E%D7%94%202%20%D7%93%D7%90%D7%98%D7%94%20%D7%A1%D7%99%D7%99%D7%A0%D7%A1.mp4)

	> 📺 A 6-minute walkthrough covering the entire project: data exploration,
	> feature engineering, model training, and key insights.

	## 📊 Project Overview

	This project predicts annual developer compensation (salary) based on factors
	like experience, location, technologies, education, and AI tool adoption.

	The data comes from the Stack Overflow Annual Developer Survey 2024,
	covering 65,437 developers worldwide.

	## 🎯 Objectives

	1. Regression: Predict exact salary in USD
	2. Classification: Categorize developers into salary tiers (Low/Medium/High)
	3. Clustering: Discover natural developer segments


	## 📁 Dataset

	- Source: Stack Overflow Annual Developer Survey 2024
	- Size: 65,437 rows × 114 columns (raw)
	- After cleaning: 22,765 rows × 68 features
	- Target: `ConvertedCompYearly` (USD)
	- Available on Kaggle: <https://www.kaggle.com/datasets/berkayalan/stack-overflow-annual-developer-survey-2024>


	## 🔍 Key Findings (EDA)

	### Target Variable Analysis
	- 23,435 valid salary responses (35.8% of dataset)
	- Highly right-skewed distribution
	- Range: $1 to $16,256,603 (extreme outliers exist)
	- Median salary: $65,000
	- Mean salary: $86,155
	- 97.1% in realistic range ($1K-$500K)
	- Decision: Apply log transformation + filter outliers

	![Salary Distribution](images/01_salary_distribution.png)

	Salary distribution showing extreme right-skewness and the value of log transformation for modeling.

	### Data Structure
	- 100 categorical (object) columns
	- 13 float columns
	- 1 integer column (ResponseId)
	- Most predictive features need conversion from text to numeric

	### Top Paying Countries (by median salary)

	| Rank | Country | Median Salary | Sample Size |
	|------|---------|---------------|-------------|
	| 1 | USA | $141,000 | 4,596 |
	| 2 | Israel | $113,334 | 217 |
	| 3 | Switzerland | $111,417 | 385 |
	| 4 | Australia | $95,796 | 505 |
	| 5 | Ireland | $91,295 | 120 |
	| 6 | Denmark | $88,993 | 211 |
	| 7 | Canada | $87,231 | 861 |
	| 8 | UK | $84,038 | 1,376 |

	Key insight: Geographic location is the most powerful predictor of salary.
	The same role can earn 5-10x more in the US/Israel/Switzerland compared to
	emerging economies.

	### Geographic Salary Variance
	Boxplot analysis revealed:
	- USA: Median $140K with high variance ($100K-$200K interquartile range),
	many high-end outliers reaching $500K+
	- Western Europe (Germany, UK): Median $70-85K, moderate variance
	- Eastern Europe (Poland, Ukraine): Median $35-55K, but with significant
	high-end outliers (likely remote workers for foreign companies)
	- Emerging markets (India, Brazil): Median $15-25K, low variance
	- Salary range from highest to lowest country median: ~10x difference

	![Salary by Country](images/02_salary_by_country.png)

	Salary distributions across the top 10 countries (by sample size). USA dominates both in median salary and variance.

	### Top Paying Developer Roles

	| Rank | Role | Median Salary |
	|------|------|---------------|
	| 1 | Senior Executive (C-Suite, VP) | $120K |
	| 2 | Engineering Manager | $115K |
	| 3 | Engineer, Site Reliability (SRE) | $98K |
	| 4 | Cloud Infrastructure Engineer | $96K |
	| 5 | Security Professional | $80K |
	| 6 | Data Engineer | $77K |
	| 7 | Developer, AI | $75K |
	| 8 | Data Scientist / ML Specialist | $73K |
	| 9 | Back-end Developer | $68K |
	| 10 | Full-stack Developer | $64K |

	Key insights:
	- Specialization pays: Infrastructure roles (SRE, Cloud) earn 30-50% more
	than general development roles
	- Management track: Engineering managers and executives top the list
	- Counter-intuitive finding: AI Developer ranks 7th rather than at the top,
	despite the AI hype, which suggests the market is still developing
	- Full-stack paradox: Largest group (18,260 respondents) but lowest median
	in top-15, suggesting market saturation

	![Top Developer Roles](images/03_top_developer_roles.png)

	Top 15 developer roles ranked by median salary. Note how specialized infrastructure roles (SRE, Cloud) outperform general development roles.

	### Experience vs Salary Relationship
	- Overall correlation: 0.38 (moderate, due to country variance)
	- Career growth pattern observed:
	  - Years 0-10: Steep growth ($25K → $78K, 3x increase)
	  - Years 10-20: Continued growth ($78K → $95K)
	  - Years 20+: Plateau effect (~$100-110K, role-dependent)
	- Within-country correlations (0.27-0.44) are of similar magnitude to the overall correlation
	- Median professional experience in dataset: 8 years

	![Salary vs Experience](images/04_salary_vs_experience.png)

	The career growth curve: rapid early growth followed by plateau effect after ~20 years.

	### Country-Experience Interaction
	Within-country correlations between experience and salary:
	- Germany: 0.438 (highest - structured market)
	- India: 0.394 (experience matters)
	- USA: 0.319 (role/company matter more)
	- UK: 0.271
	- Canada: 0.299

	Insight: The same career trajectory yields vastly different outcomes
	depending on geography. A junior developer in the USA ($65K) earns more than
	a senior developer in India ($45K after 25 years). This makes country a
	critical feature for the model.
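The pooled and per-country figures above can be reproduced with a short pandas helper. This is a minimal sketch, not the notebook's actual code: the function name is hypothetical, and it assumes a cleaned frame that still carries the `Country`, `YearsCodePro` (numeric) and `ConvertedCompYearly` columns.

```python
import pandas as pd


def experience_salary_correlations(df, group_col="Country",
                                   x="YearsCodePro", y="ConvertedCompYearly"):
    """Return (pooled Pearson r, {group: within-group Pearson r})."""
    pooled = df[x].corr(df[y])
    within = {name: g[x].corr(g[y]) for name, g in df.groupby(group_col)}
    return pooled, within


# Toy data (made up): salary rises linearly with experience inside each
# country, but the two countries sit at very different salary levels.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "YearsCodePro": [0, 5, 10, 0, 5, 10],
    "ConvertedCompYearly": [60_000, 80_000, 100_000, 10_000, 15_000, 20_000],
})
pooled_r, within_r = experience_salary_correlations(toy)
```

In the toy frame each country has a perfect within-group correlation while the pooled value is far lower, which illustrates how country-level pay differences can distort a pooled correlation.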

	![Experience by Country](images/05_experience_by_country.png)

	The "geography is destiny" effect: same experience yields drastically different salaries across countries.

	### Technology Indicators (Linear Correlation with Salary)

	| Technology | Users | % | Correlation |
	|------------|-------|---|-------------|
	| AWS | 9,894 | 43.5% | +0.139 |
	| Go | 3,388 | 14.9% | +0.087 |
	| Rust | 2,853 | 12.5% | +0.082 |
	| Copilot | 8,203 | 36.0% | +0.060 |
	| Scala | 669 | 2.9% | +0.058 |
	| Azure | 5,825 | 25.6% | +0.047 |
	| Python | 11,142 | 48.9% | +0.044 |
	| Kubernetes | 4,180 | 18.4% | -0.004 |
	| Docker | 11,591 | 50.9% | -0.002 |
	| ChatGPT | 14,827 | 65.1% | -0.102 |

	Insights:
	- AWS is the strongest single technology indicator - likely because
	AWS adoption correlates with established tech companies in higher-paying countries
	- Docker, Kubernetes, Terraform show ~0 linear correlation despite
	being valuable skills - they have become industry standards (commoditized)
	- ChatGPT usage is negatively correlated - consistent with junior
	developers relying more on AI tools than senior engineers
	- These features still provide value through non-linear interactions
	in tree-based models (Random Forest, XGBoost)

	### Key Predictive Features Identified
	- YearsCodePro - Years of professional coding experience
	- Country - Geographic location (massive impact)
	- EdLevel - Education level (8 ordered categories)
	- DevType - Developer role type (34 categories - needs grouping)
	- OrgSize - Company size (10 ordered categories)
	- RemoteWork - Remote/Hybrid/In-person


	## 🛠️ Methodology

	### Data Preprocessing
	- Filtered rows with valid salary data (65,437 → 23,435)
	- Removed extreme outliers (<$1K and >$500K), leaving 22,765 rows
	- Applied log transformation to target (handles right-skewed distribution)
	- Converted text-based numeric columns (YearsCode, YearsCodePro)
	- Median imputation for missing experience values
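The preprocessing steps above could look roughly like the following. This is a sketch under assumptions rather than the notebook's code: the helper names are hypothetical, `log1p` is assumed because the Usage section inverts predictions with `np.expm1`, and the numeric values chosen for the two text answers ("Less than 1 year", "More than 50 years") are illustrative.

```python
import numpy as np
import pandas as pd

# Salary bounds taken from the README; values outside are dropped.
MIN_SALARY, MAX_SALARY = 1_000, 500_000


def years_to_number(value):
    """Map the survey's text answers for experience to floats."""
    if pd.isna(value):
        return np.nan
    if value == "Less than 1 year":
        return 0.5   # assumed value, not confirmed by the README
    if value == "More than 50 years":
        return 51.0  # assumed value, not confirmed by the README
    return float(value)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Keep only rows with a reported salary.
    out = df.dropna(subset=["ConvertedCompYearly"]).copy()
    # 2. Drop outliers outside [$1K, $500K] (bounds inclusive).
    in_range = out["ConvertedCompYearly"].between(MIN_SALARY, MAX_SALARY)
    out = out[in_range].copy()
    # 3. Log-transform the right-skewed target.
    out["log_salary"] = np.log1p(out["ConvertedCompYearly"])
    # 4. Convert text experience to numbers, then impute with the median.
    years = out["YearsCodePro"].map(years_to_number)
    out["YearsCodePro"] = years.fillna(years.median())
    return out
```

Note that the median here is computed after filtering; computing it on the training split only would avoid leaking information from the test set.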

	### Feature Engineering
	- Ordinal Encoding: EdLevel (8 levels), OrgSize (10 sizes), Age (8 groups)
	- Country Grouping: 185 countries → 11 regions based on geography and economy
	- DevType Grouping: 34 roles → 7 broader categories
	- Multi-select handling:
	  - Created 5 binary indicators for Employment status
	  - Count features for technologies (num_languages, num_databases, etc.)
	  - Binary flags for high-value technologies (uses_AWS, uses_Python, etc.)
	- One-Hot Encoding: Applied to Region, DevCategory, RemoteWork, Industry
	- Final dataset: 22,765 samples × 68 features
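As an illustration of the multi-select handling, the sketch below derives a count feature and binary flags from one semicolon-separated survey column. The function name and the exact option strings are assumptions made for the example; the `;` separator matches how the survey export stores multiple selections.

```python
import pandas as pd


def add_multiselect_features(df, column, count_name, flags):
    """Add a count feature and binary flags for a ';'-separated column.

    `flags` maps a new column name to the exact option text to look for.
    """
    out = df.copy()
    # Missing answers become empty lists, so they count as zero selections.
    choices = out[column].fillna("").map(
        lambda s: [c for c in s.split(";") if c]
    )
    out[count_name] = choices.map(len)
    for flag_name, option in flags.items():
        # Exact membership test, so "Java" does not match "JavaScript".
        out[flag_name] = choices.map(lambda items, o=option: int(o in items))
    return out


survey = pd.DataFrame(
    {"LanguageHaveWorkedWith": ["Python;Go;Rust", "JavaScript", None]}
)
features = add_multiselect_features(
    survey,
    "LanguageHaveWorkedWith",
    "num_languages",
    {"uses_Python": "Python", "uses_Go": "Go"},
)
```

Splitting into lists before testing membership is a deliberate choice: a plain substring search would flag "JavaScript" users as "Java" users.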

	### Models Trained
	- Regression: Linear Regression, Random Forest, XGBoost
	- Classification: Logistic Regression, Random Forest, XGBoost
	- Clustering: K-Means with K=4 (chosen via Silhouette analysis)


	## 📈 Results

	### Regression Model Performance

	| Model | R² (log) | R² ($) | MAE ($) | RMSE ($) | Training Time |
	|-------|----------|--------|---------|----------|---------------|
	| Linear Regression | 0.5319 | 0.4333 | 30,917 | 49,592 | <1s |
	| Random Forest | 0.5698 | 0.5121 | 28,005 | 46,018 | 30s |
	| XGBoost (best) | 0.5840 | 0.5326 | 27,513 | 45,039 | 2.6s |

	Best Model: XGBoost with R² = 0.5326 on the dollar scale (explains 53% of salary variance)
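The table reports R² on both the log scale and the dollar scale. A minimal sketch of how such a report can be computed is shown below; it assumes the target was transformed with `log1p` (consistent with the `np.expm1` call in the Usage section), and the function names are hypothetical.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def regression_report(log_true, log_pred):
    """Evaluate log-scale predictions on both the log and USD scales."""
    usd_true = np.expm1(log_true)  # undo log1p
    usd_pred = np.expm1(log_pred)
    err = usd_true - usd_pred
    return {
        "r2_log": r2_score(log_true, log_pred),
        "r2_usd": r2_score(usd_true, usd_pred),
        "mae_usd": float(np.mean(np.abs(err))),
        "rmse_usd": float(np.sqrt(np.mean(err ** 2))),
    }
```

Dollar-scale R² is usually lower than log-scale R² for skewed targets, because errors on the few very high salaries dominate the squared error after back-transformation.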

	### Feature Importance Analysis

	Top features driving predictions (XGBoost):

	| Rank | Feature | Importance |
	|------|---------|------------|
	| 1 | Region_North_America | 36.58% |
	| 2 | Region_Western_Europe | 8.89% |
	| 3 | Region_Asia_Developing | 6.86% |
	| 4 | Region_Asia_Pacific_Developed | 4.62% |
	| 5 | YearsCodePro | 3.28% |

	### Feature Importance by Category

	| Category | Total Importance |
	|----------|------------------|
	| 🌍 Region (Geography) | 67.0% |
	| 💻 Tech indicators | 6.9% |
	| ⏰ Experience | 5.7% |
	| 🏭 Industry | 5.5% |
	| 💼 Employment status | 4.9% |
	| 🏢 Other | 3.4% |
	| 💼 Developer Category | 3.1% |
	| 📊 Tech counts | 1.9% |
	| 👤 Demographics | 1.6% |

	Key insight: Geography is the dominant predictor (67%), confirming our EDA finding
	that location matters more than skills, experience, or role for salary determination.
	The same developer in different regions can have 5-10x salary differences.
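Category totals like those above come from summing per-feature importances by group. The sketch below shows one way to do this with name prefixes; the prefix-to-category mapping and the function name are assumptions, and only the five importances listed in the top-features table are used.

```python
from collections import defaultdict


def importance_by_category(importances, prefix_to_category, default="Other"):
    """Sum feature importances into categories keyed by name prefix."""
    totals = defaultdict(float)
    for feature, value in importances.items():
        category = next(
            (cat for prefix, cat in prefix_to_category.items()
             if feature.startswith(prefix)),
            default,
        )
        totals[category] += value
    return dict(totals)


# The five importances (in %) listed in the top-features table.
top5 = {
    "Region_North_America": 36.58,
    "Region_Western_Europe": 8.89,
    "Region_Asia_Developing": 6.86,
    "Region_Asia_Pacific_Developed": 4.62,
    "YearsCodePro": 3.28,
}
totals = importance_by_category(
    top5, {"Region_": "Region", "YearsCode": "Experience"}
)
```

The four listed region features alone sum to about 57 points; the remaining region columns would account for the rest of the 67% total.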

	![Feature Importance](images/06_feature_importance.png)

	Top 20 most important features in XGBoost. Region_North_America alone accounts for 36.6% of total feature importance.

	### Classification Model Performance

	Salary categorized into 3 classes (33%/33%/33%):
	- Low: < $46,185
	- Medium: $46,185 - $91,719
	- High: > $91,719
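The tier boundaries above translate directly into a small labelling function. This is an illustrative sketch (the function and constant names are hypothetical); the thresholds are the two cut points listed above.

```python
# Cut points from the README (USD).
LOW_MAX = 46_185
HIGH_MIN = 91_719


def salary_tier(salary_usd: float) -> str:
    """Return the salary tier for a yearly salary in USD."""
    if salary_usd < LOW_MAX:
        return "Low"
    if salary_usd > HIGH_MIN:
        return "High"
    return "Medium"  # both boundary values fall into Medium
```

For example, `salary_tier(65_000)` returns `"Medium"`, the tier containing the dataset's median salary.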

	| Model | Accuracy |
	|-------|----------|
	| Logistic Regression | 68.72% |
	| Random Forest | 69.38% |
	| XGBoost (best) | 70.39% |

	Best Classifier: XGBoost with 70.39% accuracy (vs 33% baseline)

	### Per-Class Performance (XGBoost)

	| Category | Accuracy | Precision | Recall | F1-Score |
	|----------|----------|-----------|--------|----------|
	| Low | 75.77% | 0.7602 | 0.7577 | 0.7589 |
	| High | 74.14% | 0.7595 | 0.7414 | 0.7503 |
	| Medium | 61.57% | 0.5999 | 0.6157 | 0.6077 |

	Key insights:
	- Model excels at distinguishing extreme categories (Low/High)
	- Misclassifications between Low ↔ High are rare (~4%)
	- Medium category is hardest to classify (boundary cases)
	- Model tends to predict Medium when uncertain (conservative strategy)
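The per-class figures can be derived from a confusion matrix as sketched below. The function names and the small example matrix are made up for illustration; only the precision and recall values quoted in the comments come from the table above, where the per-class Accuracy column matches Recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(confusion, labels):
    """confusion[i][j] = samples with true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1_score(precision, recall),
        }
    return metrics


# Hypothetical 3x3 matrix, rows = true class, columns = predicted class.
example = per_class_metrics(
    [[8, 2, 0], [1, 6, 3], [0, 2, 8]], ["Low", "Medium", "High"]
)

# Recomputing F1 from the table's precision/recall for the Low class:
low_f1 = f1_score(0.7602, 0.7577)
```

Recomputing F1 from the reported precision and recall reproduces the table's F1 column to four decimal places.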

	![Confusion Matrix](images/07_confusion_matrix.png)

	XGBoost confusion matrix. The model rarely confuses Low with High (~4% error rate), but Medium is harder to classify.

	### Clustering Analysis (K-Means, K=4)

	K-Means clustering identified 4 distinct developer personas:

	| Cluster | Persona | Size | Median Salary | Years Pro | Top Region |
	|---------|---------|------|---------------|-----------|------------|
	| 0 | Mainstream Developer | 45.1% | $58,375 | 6 | Western Europe |
	| 1 | Junior / Eastern Europe | 10.2% | $42,962 | 6 | Western/Eastern Europe |
	| 2 | Modern Tech Worker | 25.1% | $66,000 | 7 | North America |
	| 3 | Elite / Senior | 19.6% | $105,258 | 22 | North America |

	Key clustering insights:
	- ChatGPT usage is inversely related to seniority: adoption is 88% in
	Cluster 2 (modern) but only 44% in Cluster 3 (elite/senior)
	- Cluster 3 (Elite) stands out with a median of 22 years of experience, a
	North American location, and a high salary - the "veteran developer" persona
	- Cluster 2 (Modern Tech Worker) represents AI-era developers using all
	modern tools (TypeScript, AWS, Copilot, ChatGPT) heavily
	- Silhouette scores are low (~0.04), meaning the clusters overlap heavily in
	the high-dimensional feature space, but they remain interpretable and actionable

	![Elbow Method](images/08_elbow_method.png)

	Elbow Method and Silhouette Score analysis used to determine optimal K=4.
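A silhouette-based search over K might look like the sketch below. It uses scikit-learn's `KMeans` and `silhouette_score` on synthetic, well-separated blobs as a stand-in for the scaled survey features, so the scores are not comparable to the ~0.04 reported here; the helper name and the candidate range of K are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=0):
    """Fit K-Means for each candidate K and return {K: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in data: four tight 2-D blobs, far apart from each other.
rng = np.random.default_rng(0)
centers = np.array([[-10, -10], [-10, 10], [10, -10], [10, 10]])
X_demo = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)
```

On real survey features the inputs would first be standardized (the repository ships a `scaler.pkl` for this), since K-Means distances are sensitive to feature scale.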

	![PCA Clusters](images/09_clusters_pca.png)

	4 developer personas visualized in 2D using PCA. Despite low variance explained (11.7%), the clusters show meaningful separation.

	![Salary by Cluster](images/10_salary_by_cluster.png)

	Salary distributions per cluster reveal the clear hierarchy: Elite/Senior cluster has dramatically higher salaries with tighter distribution.


	## 🚀 Usage

	### Loading the Models
	```python
	import pickle

	# Load regression model (predicts log-scale salary; see Making Predictions)
	with open('regression_model.pkl', 'rb') as f:
	    reg_model = pickle.load(f)

	# Load classification model (predicts Low/Medium/High)
	with open('classification_model.pkl', 'rb') as f:
	    cls_model = pickle.load(f)

	# Load clustering model (assigns to 1 of 4 personas)
	with open('kmeans_model.pkl', 'rb') as f:
	    kmeans_model = pickle.load(f)

	# Load preprocessing tools
	with open('scaler.pkl', 'rb') as f:
	    scaler = pickle.load(f)

	with open('label_encoder.pkl', 'rb') as f:
	    label_encoder = pickle.load(f)

	with open('feature_names.pkl', 'rb') as f:
	    feature_names = pickle.load(f)
	```

	### Making Predictions
	```python
	import numpy as np

	# Prepare your features (must match feature_names order)
	# X_new must have shape (n_samples, 68)

	# Regression prediction (returns log-scale salary)
	log_salary_pred = reg_model.predict(X_new)
	salary_usd = np.expm1(log_salary_pred) # Convert back to USD

	# Classification prediction
	class_pred = cls_model.predict(X_new)
	class_label = label_encoder.inverse_transform(class_pred) # Low/Medium/High

	# Clustering (which persona?)
	X_scaled = scaler.transform(X_new)
	cluster = kmeans_model.predict(X_scaled)
	```
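The snippet above expects `X_new` to already exist. One way to build a single-row input in the right column order is sketched below; the helper name is hypothetical, and defaulting unspecified features to 0 is an assumption that only makes sense for binary flags, one-hot columns, and counts.

```python
import numpy as np


def build_feature_row(values, feature_names):
    """Return a (1, n_features) array ordered like `feature_names`.

    Features missing from `values` default to 0.0; unknown names raise.
    """
    unknown = set(values) - set(feature_names)
    if unknown:
        raise KeyError(f"Unknown features: {sorted(unknown)}")
    return np.array([[float(values.get(name, 0.0)) for name in feature_names]])


# Shortened, illustrative feature list; the real one has 68 entries.
demo_names = ["YearsCodePro", "uses_AWS", "Region_North_America"]
demo_row = build_feature_row(
    {"Region_North_America": 1, "YearsCodePro": 8}, demo_names
)
```

With the real `feature_names` list loaded from `feature_names.pkl`, the resulting array has shape `(1, 68)` and can be passed as `X_new`.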

	## 📚 Project Structure

	```
	data_science_project/
	├── StackOverflow_Salary_Prediction.ipynb   # Main notebook with full pipeline
	├── README.md                               # This file
	├── models/
	│   ├── regression_model.pkl                # XGBoost regressor (1.2 MB)
	│   ├── classification_model.pkl            # XGBoost classifier (3.3 MB)
	│   ├── kmeans_model.pkl                    # K-Means cluster model (92 KB)
	│   ├── scaler.pkl                          # StandardScaler for preprocessing
	│   ├── label_encoder.pkl                   # LabelEncoder for class names
	│   └── feature_names.pkl                   # List of 68 feature names
	└── images/
	    ├── 01_salary_distribution.png          # Target variable analysis
	    ├── 02_salary_by_country.png            # Country-level boxplot
	    ├── 03_top_developer_roles.png          # Roles ranked by salary
	    ├── 04_salary_vs_experience.png         # Career growth curve
	    ├── 05_experience_by_country.png        # Country comparison
	    ├── 06_feature_importance.png           # XGBoost top features
	    ├── 07_confusion_matrix.png             # Classification results
	    ├── 08_elbow_method.png                 # Optimal K selection
	    ├── 09_clusters_pca.png                 # 2D cluster visualization
	    └── 10_salary_by_cluster.png            # Salary per persona
	```

	## 👤 Author

	Raz Sarusi

	Data Science Course Project - Assignment #2

	## 📅 Date

	Project completed: May 2026