	---
	tags:
	- regression
	- classification
	- clustering
	- tabular
	- linkedin
	- job-postings
	- sklearn
	- random-forest
	- decision-tree
	- kmeans
	- shap
	license: mit
	---

	# 📊 LinkedIn Job Posting Engagement Analysis

	> Which LinkedIn job posting characteristics predict candidate engagement (views) — and how well can engagement be predicted or classified using only posting-level features?

	Personal motivation: As an entrepreneur, I want to know which job posting features attract candidates, because that knowledge bears directly on future hiring decisions.

	---

	## 📹 Presentation Video

	[Watch the presentation on Loom](https://www.loom.com/share/c7d9b89a54234f699204b16a9a313c7d)

	---

	## 🚀 Interactive Dashboard

	👉 [Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)

	| Tab | Description |
	|---|---|
	| 🎯 Engagement Predictor | Enter posting details → get predicted views + High/Normal classification in real time |
	| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
	| ℹ️ About | Feature groups, model details, limitations |

	---

	## 📋 Dataset at a Glance

	| Property | Value |
	|---|---|
	| Source | [LinkedIn Job Postings — arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
	| Original size | 123,850 rows × 49 columns |
	| Working sample | 30,000 rows · `random_state=42` |
	| After join with companies | 30,000 rows × 40 columns |
	| After cleaning | 29,572 rows × 51 columns (in `df_model`) |
	| Train / Test split | 23,657 / 5,915 (80/20, `random_state=42`) |
	| Regression target | `log_views = log1p(views)` — log-transformed to handle right skew |
	| Classification target | `high_engagement` — top 25% of training views (threshold derived from training set only) |

	---

	## ⚠️ Scope & Limitations

	> LinkedIn's algorithm, sponsored status, and company follower counts likely drive most of the variance in views, and none of them is observable in this dataset. Models use posting-level features only. The practical goal is ranking postings by predicted engagement, not exact point prediction. Results show associations, not causal relationships.

	---

	## 🗂️ Repository Files

	| File | Description |
	|---|---|
	| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus |
	| `linkedin_regression_model.pkl` | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) |
	| `linkedin_classification_model.pkl` | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") |
	| `regression_model_results.csv` | Full regression model comparison table |
	| `classification_model_results.csv` | Full classification model comparison table |

	---

	## 🧹 Data Cleaning Pipeline

	7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:

	```
	Step 1 — Reproducible sampling
	123,850 rows → sample(n=30,000, random_state=42)
	Joined with companies.csv on company_id (left join, rows preserved)
	Result: 30,000 rows × 40 columns

	Step 2 — Duplicate & missing target removal
	Removed duplicate rows
	Dropped rows where views is NaN or negative
	Result: 29,572 usable rows

	Step 3 — Date parsing
	listed_time, original_listed_time, expiry, closed_time → parsed to datetime
	Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend

	Step 4 — Missing value analysis & column dropping
	Threshold: >70% missing → drop
	Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
	remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)

	Step 5 — Leakage columns excluded
	expiry, applies → removed (post-publication outcomes)
	views → kept as target only, never as a feature

	Step 6 — Salary imputation strategy
	has_salary_info = 1 if salary present, else 0
	salary_midpoint computed from min/max salary where available
	Missing salary → imputed inside sklearn Pipeline on training data only

	Step 7 — Log transformation of target
	Raw views: mean=14.9, std=98.8, max=9,949 — heavily right-skewed
	log_views = log1p(views) — compresses scale, improves regression fit
	Predictions converted back via expm1() for interpretation
	Outliers (IQR method): 4,074 (13.8%) — kept, not removed
	```
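
The sampling, join, filtering, and log-transform steps above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the notebook's code: the function names `clean_postings` and `iqr_outlier_mask`, the `company_size` column, and the toy values are assumptions, and date parsing and salary handling are left out.

```python
import numpy as np
import pandas as pd


def clean_postings(postings, companies, n, missing_threshold=0.70, seed=42):
    """Sample, join, filter, and log-transform postings (illustrative only)."""
    # Step 1: reproducible sample, then a left join that keeps every posting.
    sample = postings.sample(n=n, random_state=seed)
    df = sample.merge(companies, on="company_id", how="left")
    # Step 2: drop duplicate rows and rows with a missing or negative target.
    df = df.drop_duplicates()
    df = df[df["views"].notna() & (df["views"] >= 0)]
    # Step 4: drop columns whose missing share exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.drop(columns=missing_share[missing_share > missing_threshold].index)
    # Step 7: log-transform the target; np.expm1 inverts it.
    return df.assign(log_views=np.log1p(df["views"]))


def iqr_outlier_mask(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)


# Toy stand-ins for the postings and companies tables (hypothetical values).
postings = pd.DataFrame({
    "company_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "views": [0, 3, 5, np.nan, 8, 12, 40, 900],
    "closed_time": [np.nan] * 7 + [1.7e12],
})
companies = pd.DataFrame({"company_id": [1, 2, 3], "company_size": [5, 2, 7]})

df = clean_postings(postings, companies, n=len(postings))
print(df.shape, int(iqr_outlier_mask(df["views"]).sum()))
```

Flagged outliers are only counted here, mirroring the choice above to keep them in the data.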

	---

	## 🔍 EDA — 5 Research Questions

	> Note on notebook ordering: Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.

	---

	### 💰 Q2 — Salary Transparency vs Views

	```
	No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views  (70.1% of postings)
	Has salary info  ████████████████████████░  ~21 avg views  (29.9% of postings)
	                                            +74.3% lift ✓
	```

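	> Only 8,562 of 29,572 postings (29.9%) disclose salary. Postings that disclose salary average 74.3% more views than those that do not. This makes salary disclosure the highest-leverage, lowest-cost recruiter action identified in this analysis, although the relationship is an association, not a causal estimate.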
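
For reference, a lift figure of this kind is the relative difference between the two group means. The sketch below shows the arithmetic on toy numbers; `mean_view_lift` is a hypothetical helper name, and the toy values are chosen only so the group means land near the rounded averages in the chart.

```python
import pandas as pd


def mean_view_lift(df, flag_col="has_salary_info", target_col="views"):
    """Relative difference in mean views between flagged and unflagged rows."""
    means = df.groupby(flag_col)[target_col].mean()
    return (means.loc[1] - means.loc[0]) / means.loc[0]


# Toy data (hypothetical): group means are 12 without salary and 21 with it.
toy = pd.DataFrame({
    "has_salary_info": [0, 0, 0, 1, 1],
    "views": [10, 12, 14, 20, 22],
})
lift = mean_view_lift(toy)
print(f"{lift:.1%}")
```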

	---

	### 📝 Q3 — Description Length vs Views

	```
	< 100 words    ██████░░░░░░░░░░░░░░  ~8 avg views   — signals incomplete posting
	100–250 words  █████████░░░░░░░░░░░  ~13 avg views
	250–500 words  ████████████████████  ~24 avg views  PEAK ★ — sweet spot
	500–750 words  ████████████████░░░░  ~18 avg views
	> 1000 words   ███████░░░░░░░░░░░░░  ~10 avg views  — overwhelms candidates
	```

	> Non-linear relationship confirmed. Sweet spot: 250–500 words. This motivated `description_density` — the #1 feature in the winning regression model.
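
A bucketed comparison like the one above can be produced with `pd.cut`. The following is a sketch under assumptions: the bucket edges follow the chart, the 750-1000 bucket is added here only to make the bins contiguous, and `views_by_length_bucket` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Bucket edges follow the chart; the 750-1000 bin is an assumption.
EDGES = [0, 100, 250, 500, 750, 1000, np.inf]
LABELS = ["<100", "100-250", "250-500", "500-750", "750-1000", ">=1000"]


def views_by_length_bucket(df):
    """Mean views and posting count per description-length bucket."""
    buckets = pd.cut(
        df["description_word_count"], bins=EDGES, labels=LABELS, right=False
    )
    return df.groupby(buckets, observed=False)["views"].agg(["mean", "count"])


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "description_word_count": [50, 100, 300, 400, 1200],
    "views": [8, 13, 20, 28, 10],
})
summary = views_by_length_bucket(toy)
print(summary)
```

Reporting the count next to the mean makes it easier to spot buckets whose averages rest on few postings.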

	---

	### 📅 Q4 — Day of Week vs Views

	```
	Monday     ████████████████████  39 avg views  ★ best day (n=1,837)
	Tuesday    █████████████████░░░  25 avg views
	Wednesday  ████████████████░░░░  22 avg views
	Thursday   ███████████████░░░░░  18 avg views
	Friday     ███░░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
	Saturday   ████████████░░░░░░░░  28 avg views  (weekend — n=2,116 total, noisier)
	Sunday     ████████████░░░░░░░░  28 avg views  (weekend — noisier)
	```

	> Counterintuitive finding: weekend postings average more views (~28) than Tuesday through Friday postings, but the weekend sample is small (2,116 observations in total), so these estimates are unreliable. Monday is the clear best weekday at 39 average views. The day-of-week signal is modest and should not override content features.

	---

	### 💼 Q1 — Work Type vs Views

	```
	Contract    ████████████████████  29.97 avg views  median: 7.0
	Internship  █████████████████░░░  25.71 avg views  median: 5.0
	Full-time   ████████░░░░░░░░░░░░  13.70 avg views  median: 4.0  ← 80% of volume
	Other       ███████░░░░░░░░░░░░░  11.27 avg views  median: 4.0
	Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  median: 4.0
	```

	> Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.

	---

	### 🎓 Q5 — Seniority Level vs Views

	```
	Entry-level   ████████████████████  18 avg views  n=792
	Senior-level  ████████████░░░░░░░░  16 avg views  n=3,577
	Other/Mid     ██████████░░░░░░░░░░  15 avg views  n=25,203

	Entry vs Senior:  +12.4% more views
	Entry vs Other:   +18.9% more views
	```

	> A likely explanation is a supply-side effect: more candidates qualify for junior roles, so the candidate pool is larger. Under this reading, `is_entry_role` carries predictive signal because it proxies for candidate pool size, not intrinsic desirability.

	---

	### 🔥 Feature Correlation with log(views+1)

	```
	Feature                   Corr   Direction  Note
	─────────────────────────────────────────────────────────────────────
	desc_salary_interaction   +0.18  ↑ views    strongest single predictor
	has_salary_info           +0.14  ↑ views    salary transparency
	salary_log                +0.12  ↑ views    salary level
	description_density       +0.10  ↑ views    content quality
	description_word_count    +0.08  ↑ views    description length
	is_software_role          +0.08  ↑ views    tech role demand
	is_data_role              +0.07  ↑ views    data role demand
	is_entry_role             +0.06  ↑ views    larger candidate pool
	posting_weekend           -0.04  ↓ views    small negative signal
	is_senior_role            -0.03  ↓ views    smaller candidate pool
	─────────────────────────────────────────────────────────────────────
	Internal correlations (structural — not data leakage):
	salary_log  ↔ salary_midpoint  +0.96  log transform of same variable
	desc_wc     ↔ desc_density     +0.55  density uses length in formula
	is_software ↔ is_data          +0.35  often co-occur in job titles
	is_senior   ↔ is_entry         -0.28  mutually exclusive by construction

	> Most features show weak linear correlation — no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations.

	### 🌡️ Correlation Heatmap (feature-to-feature + target)

	```
	                            log   desc    has    sal   desc    is_    is_    is_   post    is_
	                          views   dens    sal    log     wc   soft   data   entr   wknd    snr
	──────────────────────────────────────────────────────────────────────────────────────────────
	log_views              │   1.00   0.10   0.14   0.12   0.08   0.08   0.07   0.06  -0.04  -0.03
	description_density    │   0.10   1.00   0.02   0.04   0.55   0.01   0.01  -0.01   0.00   0.00
	has_salary_info        │   0.14   0.02   1.00   0.72   0.03   0.06   0.07  -0.03  -0.01  -0.02
	salary_log             │   0.12   0.04   0.72   1.00   0.04   0.05   0.06  -0.02  -0.01  -0.01
	description_word_count │   0.08   0.55   0.03   0.04   1.00   0.01   0.01  -0.01   0.00   0.00
	is_software_role       │   0.08   0.01   0.06   0.05   0.01   1.00   0.35  -0.08   0.00  -0.05
	is_data_role           │   0.07   0.01   0.07   0.06   0.01   0.35   1.00  -0.06   0.00  -0.04
	is_entry_role          │   0.06  -0.01  -0.03  -0.02  -0.01  -0.08  -0.06   1.00   0.01  -0.28
	posting_weekend        │  -0.04   0.00  -0.01  -0.01   0.00   0.00   0.00   0.01   1.00   0.00
	is_senior_role         │  -0.03   0.00  -0.02  -0.01   0.00  -0.05  -0.04  -0.28   0.00   1.00
	──────────────────────────────────────────────────────────────────────────────────────────────
	Key structural correlations:
	salary_log  ↔ has_salary_info  +0.72  same underlying signal, different form
	desc_wc     ↔ desc_density     +0.55  density formula uses word count
	is_software ↔ is_data          +0.35  frequently co-occur in job titles
	is_entry    ↔ is_senior        -0.28  mutually exclusive flags

	> The heatmap shows no serious multicollinearity: the highest inter-feature correlation shown (salary_log ↔ has_salary_info at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with log_views are weak, which motivates the move to non-linear tree-based models.
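
A ranking of features by their Pearson correlation with the target, as in the tables above, can be computed as sketched below. `target_correlations` is a hypothetical helper name and the toy frame is illustrative; the notebook may compute its heatmap differently.

```python
import numpy as np
import pandas as pd


def target_correlations(df, target="log_views"):
    """Pearson correlation of each numeric column with the target,
    ordered by absolute strength."""
    corr = df.corr(numeric_only=True)
    return corr[target].drop(target).sort_values(key=np.abs, ascending=False)


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "log_views": [1.0, 2.0, 3.0, 4.0, 6.0],
    "description_word_count": [1, 2, 3, 4, 5],
    "posting_weekend": [1, -1, 1, -1, 1],
})
ranked = target_correlations(toy)
print(ranked)
```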

	---

	## ⚙️ Feature Engineering — 30 Total Features (Including 6 Cluster Dummies)

	| Group | Features |
	|---|---|
	| Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
	| Text structure | `description_density` ★, `title_desc_ratio` |
	| Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
	| Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
	| Interactions | `desc_salary_interaction` ★, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
	| Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |

	Missing value strategy:
	- Columns with >70% missing → dropped
	- Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN imputed inside sklearn Pipeline on training data only
	- Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
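
Placing the imputer inside a scikit-learn `Pipeline` means the median is learned from the training rows only and then reused on the test rows. The sketch below illustrates this on toy arrays; the `LinearRegression` estimator and the values are assumptions for illustration, not the notebook's exact pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy matrices (hypothetical): columns are word count and salary midpoint.
X_train = np.array([
    [100.0, 50_000.0],
    [300.0, np.nan],
    [500.0, 70_000.0],
    [700.0, 90_000.0],
])
y_train = np.array([1.0, 2.0, 3.0, 2.5])
X_test = np.array([[400.0, np.nan]])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fitted on training rows only
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)

# The median learned from the training salaries fills the test-row gap.
learned_median = model.named_steps["impute"].statistics_[1]
prediction = model.predict(X_test)
print(learned_median, prediction)
```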

	---

	## 🔵 Clustering — KMeans k=6

	Features used for clustering (12 total, leakage-checked):
	`title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`

	Methods used to select k:
	1. Elbow method — inconclusive, no sharp elbow
	2. KMeans silhouette scores on full training matrix
	3. Cluster-size stability table
	4. Interactive K-Means widget (visualization aid — uses sample)
	5. Hierarchical clustering dendrogram (Ward linkage, 300 obs)
	6. Agglomerative clustering comparison (k=2–10)

	```
	Silhouette scores by k (full training matrix):

	k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
	k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
	k=4   ████████████████░░░░  0.312  ← strong BUT largest=72% of data
	k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
	k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★ smallest: 583 (2.5%)
	k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
	k=8+                               singleton clusters appeared

	Why NOT k=10 (highest score): singleton cluster (1 observation)
	Why NOT k=4 (strong score): largest cluster = 72% — not meaningful separation
	Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
	```
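
The selection logic above weighs the silhouette score against cluster sizes. A minimal sketch of that comparison follows, run on synthetic blobs rather than the posting features; the helper name `evaluate_k` and the standardization step are assumptions, since the README does not state whether features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def evaluate_k(X, k_values, seed=42):
    """Silhouette score plus cluster-size diagnostics for each candidate k."""
    X_scaled = StandardScaler().fit_transform(X)  # scaling is an assumption
    rows = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_scaled)
        sizes = np.bincount(labels, minlength=k)
        rows.append({
            "k": k,
            "silhouette": float(silhouette_score(X_scaled, labels)),
            "smallest": int(sizes.min()),
            "largest_share": float(sizes.max() / len(labels)),
        })
    return rows


# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

results = evaluate_k(X, k_values=[2, 3, 4])
best = max(results, key=lambda row: row["silhouette"])
print(best)
```

Keeping `smallest` and `largest_share` next to the score makes singleton clusters and one dominant cluster visible in the same table.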

	Cluster profiles at k=6 (training set n=23,657):

	| Cluster | Label | Size | Share | Key Signal |
	|---|---|---|---|---|
	| 0 | Manager-focused | 4,571 | 19% | `is_manager_role=1.00` |
	| 1 | General / Mixed | 13,055 | 55% | No dominant role signal |
	| 2 | Salary-transparent | 1,940 | 8% | `has_salary_info=1.00` |
	| 3 | Data roles | 1,451 | 6% | `is_data_role=1.00` |
	| 4 | Software roles | 2,057 | 9% | `is_software_role=1.00` |
	| 5 | Entry / low salary | 583 | 2% | Smallest cluster |

	Official final silhouette score: 0.290 (full training matrix)

	Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
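
One way to turn cluster labels into dummy features without leaking test information is to fit KMeans on the training rows and only predict on the test rows. The sketch below assumes that approach; `add_cluster_dummies` is a hypothetical helper, the toy data uses k=3 instead of 6, and the notebook's own implementation may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans


def add_cluster_dummies(X_train, X_test, cluster_cols, k=6, seed=42):
    """Fit KMeans on training rows, then append cluster_0..cluster_{k-1}
    dummy columns to both the training and the test frame."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_train[cluster_cols])
    dtype = pd.CategoricalDtype(categories=list(range(k)))
    frames = []
    for X in (X_train, X_test):
        labels = pd.Series(km.predict(X[cluster_cols]), index=X.index).astype(dtype)
        # A fixed category list guarantees the same columns in both frames.
        dummies = pd.get_dummies(labels, prefix="cluster", dtype=int)
        frames.append(pd.concat([X, dummies], axis=1))
    return frames[0], frames[1]


# Toy data (hypothetical values), clustered with k=3 for brevity.
cols = ["salary_log", "description_density"]
X_train = pd.DataFrame({
    "salary_log": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2],
    "description_density": [1.0, 1.1, 0.9, 4.0, 4.1, 3.9, 8.0, 8.1, 7.9],
})
X_test = pd.DataFrame(
    {"salary_log": [0.05, 9.9], "description_density": [1.0, 8.0]},
    index=[100, 101],
)
train_fe, test_fe = add_cluster_dummies(X_train, X_test, cols, k=3)
print(test_fe)
```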

	---

	## 📈 Regression — Predicting `log1p(views)`

	### Baseline

	```
	Mean Baseline (predict training mean for all observations):
	RMSE_log = 0.8708 R² = -0.0002 ← floor every model must beat
	MAE_views ≈ 10.64

	Baseline Linear Regression (20 features, no clustering):
	RMSE_log = 0.8425 R² = 0.0639
	```
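
The mean baseline can be reproduced with `DummyRegressor`. The sketch below uses toy targets and a hypothetical helper name, `mean_baseline_scores`. It also shows why the baseline R² can dip slightly below zero: the constant prediction is the training mean, which generally differs a little from the test mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score


def mean_baseline_scores(y_train_log, y_test_log):
    """RMSE and R² on the log scale when always predicting the training mean."""
    baseline = DummyRegressor(strategy="mean")
    # DummyRegressor ignores the features, so a placeholder column is enough.
    baseline.fit(np.zeros((len(y_train_log), 1)), y_train_log)
    pred = baseline.predict(np.zeros((len(y_test_log), 1)))
    rmse = float(np.sqrt(mean_squared_error(y_test_log, pred)))
    return rmse, float(r2_score(y_test_log, pred))


# Toy targets (hypothetical view counts, log-transformed).
y_train_log = np.log1p(np.array([0.0, 3.0, 5.0, 8.0]))
y_test_log = np.log1p(np.array([1.0, 4.0, 20.0]))
rmse_log, r2 = mean_baseline_scores(y_train_log, y_test_log)
print(round(rmse_log, 4), round(r2, 4))
```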

	### Full Model Comparison (after feature engineering + clustering)

	| Model | RMSE_log ↓ | R² ↑ | Notes |
	|---|---|---|---|
	| Random Forest (Tuned) ★ | 0.8347 | 0.0811 | RandomizedSearchCV winner |
	| Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints |
	| Gradient Boosting | 0.8370 | 0.0770 | — |
	| Linear Regression + Features | 0.8420 | 0.0640 | — |
	| RidgeCV | 0.8420 | 0.0640 | — |
	| Lasso Regression | 0.8430 | 0.0640 | — |
	| PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance |
	| Mean Baseline | 0.8708 | -0.0002 | Floor |

	Key lessons:
	- Unrestricted RF → train R²=0.854, test R²=0.003 (massive overfit). Fixed by `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features` constraints.
	- 3-fold CV mean RMSE_log: 0.8747 (±0.0125) — stable across folds
	- Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
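
A constrained search over the four parameters named above might look like the sketch below. The candidate values, `n_estimators`, `n_iter`, and the helper name `tune_forest` are all assumptions chosen for a quick illustration on synthetic data; the README does not list the grid the notebook searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space: every option limits tree growth in some way.
param_distributions = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.5],
}


def tune_forest(X, y, n_iter=5, seed=42):
    """Randomized search over constrained forests, scored by RMSE with 3-fold CV."""
    search = RandomizedSearchCV(
        RandomForestRegressor(n_estimators=50, random_state=seed),
        param_distributions,
        n_iter=n_iter,
        cv=3,
        scoring="neg_root_mean_squared_error",
        random_state=seed,
    )
    return search.fit(X, y)


# Synthetic data: one informative feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] + rng.normal(scale=0.5, size=120)

search = tune_forest(X, y)
print(search.best_params_, -search.best_score_)
```

Comparing train and test scores of the selected model is the check that exposes the overfitting described in the first bullet.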

	### Top Feature Importances (RF Tuned)

	```
	description_density            ████████████  #1  — content quality proxy
	description_length             ██████████░░  #2  — raw description size
	description_word_count         ████████░░░░  #3  — word count
	title-description interaction  ████████░░░░  #4  — combined text signal
	is_software_role               ██████░░░░░░  #5  — tech role demand
	is_data_role                   █████░░░░░░░  #6  — data role demand
	salary_log / has_salary_info   ████░░░░░░░░  #7+ — salary signals
	```

	> `desc_salary_interaction` ranks #2 in SHAP analysis but further down in Gini importance — both agree on description quality and salary as top drivers.

	### Why R² = 0.081 Is Acceptable

	```
	R² = 0.081 → model explains ~8% of variance in log(views+1)

	✓ Beats mean baseline (R²≈0) — real posting-level signal captured
	✓ Social engagement inherently noisy — platform factors dominate
	✓ ~92% of variance is unexplained, likely from unobserved sources (algorithm, followers, ads)
	✓ Practical use = ranking postings, not forecasting exact counts
	```

	---

	## 🟠 Classification — High Engagement vs. Normal

	```
	Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
	Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
	Feature matrix: X_clf uses 24 features (see notebook cell 207)
	Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
	```
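
Deriving the threshold from the training views alone, and then applying it unchanged to the test views, can be sketched as follows. `make_high_engagement` is a hypothetical helper name and the arrays are toy values.

```python
import numpy as np


def make_high_engagement(train_views, test_views, quantile=0.75):
    """Label views at or above the training-set quantile as high engagement."""
    threshold = np.quantile(train_views, quantile)  # training data only
    y_train = (train_views >= threshold).astype(int)
    y_test = (test_views >= threshold).astype(int)  # same threshold, no refit
    return threshold, y_train, y_test


# Toy view counts (hypothetical values).
train_views = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
test_views = np.array([5, 6, 100])
threshold, y_train, y_test = make_high_engagement(train_views, test_views)
print(threshold, y_train.sum(), y_test)
```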

	### Model Comparison

	| Model | F1 (Class 1) | Recall (Class 1) | Notes |
	|---|---|---|---|
	| Decision Tree ★ | HIGHEST | HIGHEST | max_depth=8, class_weight="balanced" |
	| Logistic Regression | near-best | high | Close to DT in F1 |
	| Random Forest | moderate | lower | Lowest FP count |
	| Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 |

	5-fold CV F1: 0.4424 ± 0.0152 — stable, no lucky split
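
A cross-validated F1 of this form can be obtained with `cross_val_score`. The sketch below uses the tree settings from the table but runs on synthetic data from `make_classification` with a roughly 75/25 class split, so its scores are not comparable with the figure above; the stratified, shuffled folds are also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_clf with about 25% positives (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)

tree = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="f1" reports F1 for the positive class (Class 1).
f1_scores = cross_val_score(tree, X, y, cv=cv, scoring="f1")
print(f"{f1_scores.mean():.4f} ± {f1_scores.std():.4f}")
```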

	### Error Cost Analysis

	```
	FN (missed high-engagement) = most costly error
	→ Company fails to prioritize, promote, or learn from a strong posting

	FP (false alarm) = also costly
	→ Recruiter wastes time and budget on a posting that won't perform
	```

	Decision Tree minimises FN (catches most high-engagement postings) but produces more FP.
	Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.
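
The FN and FP counts behind this trade-off come straight from the confusion matrix. A small sketch on toy labels follows; `error_counts` is a hypothetical helper name.

```python
from sklearn.metrics import confusion_matrix


def error_counts(y_true, y_pred):
    """False negatives, false positives, recall, and precision for Class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "FN": int(fn),
        "FP": int(fp),
        "recall": float(tp / (tp + fn)),
        "precision": float(tp / (tp + fp)),
    }


# Toy labels (hypothetical): 3 high-engagement postings, 5 normal ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
counts = error_counts(y_true, y_pred)
print(counts)
```

A model tuned for recall lowers `FN` at the price of a higher `FP`, which is the Decision Tree versus Random Forest pattern described above.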

	---

	## 💡 Business Insights

	1. Salary transparency is the single highest-leverage action identified: postings that disclose salary average 74.3% more views, at no direct cost to the recruiter. Fewer than 30% of postings disclose salary today.
	2. Description structure matters: `description_density` ranked #1 in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
	3. Tech roles attract disproportionate engagement: `is_software_role` and `is_data_role` carry real signal beyond salary.
	4. Work type is associated with engagement: contract roles lead, but full-time dominates volume (80%).
	5. Platform factors dominate: R²≈0.08 is expected and acceptable. Model value is in ranking postings, not exact prediction.

	---

	## 🎁 Bonus Work

	### 🧠 SHAP Explainability

	```
	SHAP mean |value| — RF Tuned regression (test observations)

	description_density      ████████████  strongest positive impact ↑
	desc_salary_interaction  ██████████░░  salary × description synergy ↑
	salary_log               ████████░░░░  salary level ↑
	has_salary_info          ██████░░░░░░  disclosed → more views ↑
	posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓
	```

	`desc_salary_interaction` ranks #2 in SHAP but lower in Gini importance, which suggests it captures a genuine non-linear interaction that neither feature captures alone.

	### 📊 Feature Importance: Regression vs Classification

	```
	                          Regression RF   Classification DT
	description_density       #1              #2
	desc_salary_interaction   #2 (SHAP)       varies
	salary_log                #7+             varies
	is_entry_role             lower           rises in classification
	is_data_role              #6              varies
	──────────────────────────────────────────────────────────
	Agreement:  description quality + salary dominate both models
	Divergence: seniority/role flags matter more for threshold-crossing
	            (classification) than for predicting exact counts (regression)
	```

	### 🔬 Additional Extras

	- Interactive K-Means Widget — explore different k values visually (notebook cell 4.11)
	- Hierarchical Clustering Dendrogram — Ward linkage, 300 obs sample (cell 4.12)
	- Agglomerative Clustering Diagnostic — k=2–10 comparison (cell 4.13)
	- Outlier Robustness Test — views capped at 99th percentile: RMSE_log 0.8147 vs 0.8347 uncapped
	- 3-fold CV for regression — mean RMSE_log 0.8747 ± 0.0125

	---

	## 🛠️ How to Use the Models

	```python
	import pickle

	import numpy as np

	# Load the two pickled models shipped in this repository.
	with open("linkedin_regression_model.pkl", "rb") as f:
	    reg_model = pickle.load(f)
	with open("linkedin_classification_model.pkl", "rb") as f:
	    clf_model = pickle.load(f)

	# X_test_fe and X_clf are the feature matrices produced by the notebook pipeline.
	# Regression: predict log(views+1), then convert back to a raw view estimate.
	log_views_pred = reg_model.predict(X_test_fe)
	views_pred = np.expm1(log_views_pred)

	# Classification: predict the high-engagement label (0 = Normal, 1 = High).
	label = clf_model.predict(X_clf)
	```

	> Regression model expects 30-column `X_test_fe` (including cluster dummies).
	> Classification model expects 24-column `X_clf` (see notebook cell 207).
	> Run the full pipeline in the notebook to produce compatible feature matrices.
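
Before calling `predict`, it can help to confirm that a feature matrix has the column count the model was fitted with. The sketch below relies on the `n_features_in_` attribute that fitted scikit-learn estimators expose; `check_feature_count` is a hypothetical helper, and the toy model stands in for the pickled classifier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def check_feature_count(model, X):
    """Raise a clear error if X does not match the model's fitted column count."""
    expected = getattr(model, "n_features_in_", None)
    actual = X.shape[1]
    if expected is not None and expected != actual:
        raise ValueError(f"model expects {expected} columns, got {actual}")
    return actual


# Toy stand-in for the classifier, fitted on 24 columns (hypothetical data).
toy_model = DecisionTreeClassifier(max_depth=8, class_weight="balanced")
toy_model.fit(np.zeros((4, 24)), [0, 1, 0, 1])

print(check_feature_count(toy_model, np.zeros((2, 24))))
```

A matching column count does not guarantee matching column order, so the notebook pipeline remains the reliable way to build `X_test_fe` and `X_clf`.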

	---

	Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)