Instructions to use MichaelYitzchak/Linkedin_Job_Engagement with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Scikit-learn
How to use MichaelYitzchak/Linkedin_Job_Engagement with Scikit-learn:
    from huggingface_hub import hf_hub_download
    import joblib

    model = joblib.load(
        hf_hub_download("MichaelYitzchak/Linkedin_Job_Engagement", "sklearn_model.joblib")
    )
    # only load pickle files from sources you trust
    # read more about it here https://skops.readthedocs.io/en/stable/persistence.html

- Notebooks
- Google Colab
- Kaggle
Delete README.md
README.md
DELETED
@@ -1,465 +0,0 @@
---
tags:
- regression
- classification
- clustering
- tabular
- linkedin
- job-postings
- sklearn
license: mit
---

# 📊 LinkedIn Job Posting Engagement Analysis

> **Which LinkedIn job posting characteristics predict candidate engagement (views) — and how well can engagement be predicted or classified using only posting-level features?**

---

## 📹 Presentation Video

<video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>

---

## 📋 Dataset at a Glance

| Property | Value |
|---|---|
| **Source** | [LinkedIn Job Postings — arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
| **Original size** | 123,850 rows × 49 columns |
| **Working sample** | 30,000 rows · `random_state=42` |
| **Target (regression)** | `log_views = log(views + 1)` — log-transformed to handle right skew |
| **Target (classification)** | `high_engagement` — top 25% of views, threshold from training data only |

---

## ⚠️ Scope & Limitations

> Platform-level signals — LinkedIn's algorithm, sponsored status, company follower counts — drive the **majority of view variance** and are **not observable** in this dataset. Models built here use only posting-level features (content, salary, timing, role type). The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships.

---

## 🗂️ Repository Files

| File | Description |
|---|---|
| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Features → Clustering → Regression → Classification → Bonus |
| `linkedin_regression_model.pkl` | Winning model: Random Forest (Tuned) |
| `linkedin_classification_model.pkl` | Winning model: Decision Tree |
| `regression_model_results.csv` | Full regression model comparison |
| `classification_model_results.csv` | Full classification model comparison |

---

## 🧹 Data Cleaning Pipeline

```
Step 1 — Reproducible sampling
  123,850 rows → sample(n=30,000, random_state=42)
  reset_index(drop=True) for clean alignment

Step 2 — Duplicate & missing target removal
  Dropped duplicate rows
  Dropped rows where views is NaN (no target = unusable)

Step 3 — High-missingness columns dropped
  Threshold: >70% missing → drop
  Exception: salary, title, description retained for feature engineering

Step 4 — Leakage columns excluded
  applies, closed_time → removed
  These are post-publication outcomes unavailable at posting creation time

Step 5 — Salary imputation strategy
  has_salary_info = 1 if salary present, else 0
  salary_midpoint → median imputed INSIDE pipeline (training data only)
  Prevents data leakage from test set into imputation

Step 6 — Categorical columns filled
  formatted_work_type, formatted_experience_level → "Unknown" where missing
  Then one-hot encoded before modeling

Step 7 — Log transformation of target
  Raw views: heavily right-skewed (mean >> median, max >> 99th pct)
  log_views = log(views + 1) — compresses scale, improves regression fit
  Predictions converted back via expm1() for interpretation
```
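
Steps 5 and 7 above can be sketched without third-party dependencies. This is a minimal illustration, not the notebook's code: rows are assumed to be plain dicts, and the helper names `fit_salary_median` and `prepare_row` and the toy values are hypothetical. It shows the flag-then-impute pattern with the median taken from training rows only, plus the log transform and its inverse.

```python
import math
import statistics


def fit_salary_median(train_rows):
    """Median of observed salary midpoints, computed on training rows only."""
    observed = [r["salary_midpoint"] for r in train_rows if r["salary_midpoint"] is not None]
    return statistics.median(observed)


def prepare_row(row, train_median):
    """Add has_salary_info, impute a missing midpoint, and build the log target."""
    has_salary = row["salary_midpoint"] is not None
    return {
        "has_salary_info": int(has_salary),
        "salary_midpoint": row["salary_midpoint"] if has_salary else train_median,
        "log_views": math.log1p(row["views"]),  # log(views + 1)
    }


# Toy data (hypothetical values, for illustration only).
train = [
    {"salary_midpoint": 60000.0, "views": 120},
    {"salary_midpoint": None, "views": 15},
    {"salary_midpoint": 90000.0, "views": 400},
]
test = [{"salary_midpoint": None, "views": 0}]

median = fit_salary_median(train)  # the test rows never influence this value
prepared_test = [prepare_row(r, median) for r in test]

# Back-transform a value from the log scale to raw views.
views_back = math.expm1(prepared_test[0]["log_views"])
```

Fitting the median on `train` and only applying it to `test` is what keeps test-set information out of the imputation.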

---

## 🔍 EDA — 5 Questions + Correlation Heatmap

### Q1 — Does salary transparency increase views?

```
No salary info   ████████████░░░░░░░░░░░░░░░░░░░   ~180 avg views
Has salary info  ████████████████████████████████  ~340 avg views
                                                   +89% lift ✓
```

> Fewer than half of postings disclose salary. Highest-leverage, lowest-cost recruiter action. Effect holds across work types — not a proxy for role category.

---

### Q2 — Does description length affect engagement?

```
< 100 words      ██████░░░░░░░░░░░░░░  2.1 mean log_views — signals incomplete
100–250 words    █████████░░░░░░░░░░░  2.8
250–500 words    ████████████████████  3.6 ← PEAK (sweet spot)
500–750 words    ████████████████░░░░  3.3
750–1000 words   ████████████░░░░░░░░  3.0
> 1000 words     ███████░░░░░░░░░░░░░  2.5 — overwhelms candidates
```

> Non-linear. Sweet spot 250–500 words. Motivated `description_density` (words per character) — the #1 feature in the winning regression model.

---

### Q3 — Does posting day of week matter?

```
Tuesday     ████████████████████  245 avg views ★ best weekday
Wednesday   ██████████████████░░  235 avg views
Monday      █████████████████░░░  220 avg views
Thursday    ████████████████░░░░  215 avg views
Friday      ████████████░░░░░░░░  205 avg views
Saturday    ███████░░░░░░░░░░░░░  148 avg views ← weekend — noisier
Sunday      ██████░░░░░░░░░░░░░░  132 avg views ← weekend — noisier
```

> Weekday posts outperform weekends. Candidates browse during business hours. Weekend volume is much smaller — estimates are noisier. Day-of-week effect real but modest vs content features.
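
The two timing features used later (`posting_dayofweek`, `posting_weekend`) could be derived along these lines. This is a sketch under stated assumptions: the posting time is taken to be a Unix epoch in milliseconds, interpreted in UTC, and the day index follows Python's `weekday()` convention (Monday = 0). None of these details are confirmed by the README, and `timing_features` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def timing_features(listed_time_ms):
    """Derive posting_dayofweek (Monday=0) and posting_weekend from epoch milliseconds."""
    posted = datetime.fromtimestamp(listed_time_ms / 1000, tz=timezone.utc)
    dayofweek = posted.weekday()
    return {"posting_dayofweek": dayofweek, "posting_weekend": int(dayofweek >= 5)}


# Build example timestamps from known calendar dates instead of hard-coding epochs.
tuesday_ms = int(datetime(2024, 4, 16, 12, tzinfo=timezone.utc).timestamp() * 1000)
saturday_ms = int(datetime(2024, 4, 20, 12, tzinfo=timezone.utc).timestamp() * 1000)

tuesday = timing_features(tuesday_ms)
saturday = timing_features(saturday_ms)
```

If the notebook uses local time rather than UTC, postings made near midnight would land on a different day, so the timezone choice is worth checking against the actual pipeline.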

---

### Q4 — Do entry-level roles attract more views?

```
Entry-level    ████████████████████  ~290 avg views ★
Other / Mid    ████████████░░░░░░░░  ~210 avg views
Senior-level   ████████░░░░░░░░░░░░  ~175 avg views
```

> Supply-side effect — more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for candidate pool size, not posting quality.

---

### Q5 — Does work type affect engagement?

```
Contract     ████████████████████  ~310 avg views ★
Internship   ████████████████░░░░  ~275 avg views
Part-time    █████████████░░░░░░░  ~235 avg views
Full-time    ███████████░░░░░░░░░  ~205 avg views
Temporary    █████████░░░░░░░░░░░  ~185 avg views
```

> Contract and internship roles attract more active job-seekers. Work type is a useful predictor but correlates with other features — not a standalone causal explanation.

---

### 🔥 Feature Correlation with log(views+1)

```
Feature                    Corr    Direction   Note
─────────────────────────────────────────────────────────────────────
desc_salary_interaction    +0.18   ↑ views     strongest predictor
has_salary_info            +0.14   ↑ views     salary transparency
salary_log                 +0.12   ↑ views     salary level
description_density        +0.10   ↑ views     content quality
description_word_count     +0.08   ↑ views     description length
is_software_role           +0.08   ↑ views     tech role demand
is_data_role               +0.07   ↑ views     data role demand
is_entry_role              +0.06   ↑ views     larger pool
posting_weekend            -0.04   ↓ views     weekend underperforms
is_senior_role             -0.03   ↓ views     smaller pool
─────────────────────────────────────────────────────────────────────
Internal correlations (structural — expected):
  salary_log ↔ salary_midpoint   +0.96   log transform of same variable
  desc_wc ↔ desc_density         +0.55   density uses length in denominator
  is_software ↔ is_data          +0.35   often co-occur in job titles
  is_senior ↔ is_entry           -0.28   mutually exclusive by construction
─────────────────────────────────────────────────────────────────────
```

> Most individual features show weak linear correlation — motivating tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and combinations.
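
For reference, each value in the `Corr` column is a Pearson correlation between one feature and `log(views + 1)`. The sketch below computes that statistic in plain Python on toy data. The numbers are hypothetical and will not reproduce the table; the notebook presumably uses a library routine instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): a binary feature against the log-transformed target.
has_salary_info = [0, 0, 1, 1, 0, 1]
views = [10, 200, 150, 30, 60, 500]
log_views = [math.log1p(v) for v in views]

r = pearson(has_salary_info, log_views)
```

Pearson correlation only measures linear association, which is why weak values here do not rule out the non-linear effects seen in Q2.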

---

## ⚙️ Feature Engineering — 30 Features, 8 Groups

| Group | Features |
|---|---|
| Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
| Text structure | `description_density`, `title_desc_ratio` |
| Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
| Timing | `posting_dayofweek`, `posting_weekend` |
| Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role` |
| Interactions | `desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
| Clustering | `cluster_0` `cluster_1` `cluster_2` `cluster_3` `cluster_4` `cluster_5` |
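
A few of these features can be illustrated in plain Python. The sketch below is an assumption-heavy approximation, not the notebook's implementation: the keyword lists, the use of `log1p` for `salary_log`, the midpoint and range formulas, and the function name `engineer_features` are all hypothetical. Only `description_density` (words per character) follows a definition stated in this README.

```python
import math

# Hypothetical keyword lists; the notebook's actual lists are not given in the README.
ROLE_KEYWORDS = {
    "is_senior_role": ("senior", "lead", "principal"),
    "is_entry_role": ("entry", "junior", "intern"),
    "is_software_role": ("software", "developer"),
    "is_data_role": ("data",),
}


def engineer_features(title, description, min_salary=None, max_salary=None):
    """Build a subset of the posting-level features from raw fields."""
    title_lower = title.lower()
    words = description.split()
    features = {
        "title_length": len(title),
        "title_word_count": len(title.split()),
        "description_length": len(description),
        "description_word_count": len(words),
        # Words per character, as defined in the EDA section.
        "description_density": len(words) / len(description) if description else 0.0,
        "has_salary_info": int(min_salary is not None and max_salary is not None),
    }
    if features["has_salary_info"]:
        midpoint = (min_salary + max_salary) / 2
        features["salary_midpoint"] = midpoint
        features["salary_range"] = max_salary - min_salary
        features["salary_log"] = math.log1p(midpoint)  # assumed form of the log transform
    # Simple substring match on the title (assumed matching rule).
    for name, keywords in ROLE_KEYWORDS.items():
        features[name] = int(any(k in title_lower for k in keywords))
    return features


example = engineer_features("Senior Data Engineer", "Build data pipelines", 100000, 140000)
no_salary = engineer_features("Intern", "Help")
```

When salary is missing, the salary columns are left out here so that a later imputation step (fit on training data) can fill them. Substring matching is crude: `"intern"` would also match "International", so a real implementation may want word-boundary matching.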

---

## 🔵 Clustering — KMeans k=6

```
Evaluation: elbow method + silhouette scores for k=2 through k=10
Selected k: 6
Rejected:   k=7, k=8 — produced near-singleton clusters (outlier isolation)
Silhouette: 0.289 (weak-to-moderate separation — expected for overlapping job types)
Leakage:    cluster preprocessor fit on TRAINING DATA ONLY

Silhouette score by k (approximate):
  k=2  ████████████████████  0.38  (too coarse)
  k=4  ████████████████░░░░  0.31
  k=6  ████████████░░░░░░░░  0.289 ← selected (best size + score balance)
  k=8  ██████░░░░░░░░░░░░░░  0.21  (singleton clusters)

Cluster size distribution:
  Cluster 0 — General / Mixed          ████████████  ~28%
  Cluster 1 — High-Salary Specialist   ███████       ~18%
  Cluster 2 — Tech & Software          █████████     ~22%
  Cluster 3 — Entry-Level / Volume     █████         ~12%
  Cluster 4 — Contract & Flexible      ████          ~10%
  Cluster 5 — Senior Leadership        ████          ~10%

PCA 2D projection: Cluster 2 (Tech) and Cluster 5 (Senior) show clearest
separation. Clusters 0 and 3 overlap — consistent with silhouette 0.289.
Two PCs explain ~35–45% of total variance.
```

Cluster labels one-hot encoded as 6 dummy features, added to both models. Including clusters improved regression RMSE and classification F1 over models without them.
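
The one-hot step described above can be sketched as follows. The labels in the example are hypothetical; in the actual pipeline they would come from the KMeans model fit on training data, and `one_hot_clusters` is an illustrative helper rather than the notebook's function.

```python
def one_hot_clusters(labels, n_clusters=6):
    """Turn integer cluster labels into cluster_0 ... cluster_{k-1} dummy columns."""
    rows = []
    for label in labels:
        if not 0 <= label < n_clusters:
            raise ValueError(f"unexpected cluster label: {label}")
        rows.append({f"cluster_{k}": int(k == label) for k in range(n_clusters)})
    return rows


# Hypothetical cluster assignments for three postings.
dummies = one_hot_clusters([2, 5, 0])
```

Fixing `n_clusters` up front guarantees that every row has all six columns, even when a particular batch contains no postings from some cluster.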

---

## 📈 Regression — Predicting `log(views + 1)`

### Baseline model first

```
Mean Baseline (predict training mean for all):
  RMSE_log = 0.871
  R² ≈ 0.000

This is the minimum bar every model must beat.
```
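
To make the baseline concrete, the sketch below computes RMSE and R² for a predictor that always outputs the training mean. The toy target values are hypothetical and do not reproduce the 0.871 figure. The functions are standard textbook definitions, written out here so the example has no dependencies.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy log_views values (hypothetical).
train_log_views = [2.0, 3.0, 4.0, 5.0]
test_log_views = [2.5, 3.5, 6.0]

# The baseline predicts the TRAINING mean for every test posting.
train_mean = sum(train_log_views) / len(train_log_views)
baseline_pred = [train_mean] * len(test_log_views)

baseline_rmse = rmse(test_log_views, baseline_pred)
baseline_r2 = r2(test_log_views, baseline_pred)
```

On the training set this baseline has R² of exactly 0. On a test set it is at most 0, and close to 0 when the train and test means are similar, which is why the table reports R² ≈ 0.000.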

### Full model comparison

```
Model                       RMSE_log ↓        R² ↑     Improvement
──────────────────────────────────────────────────────────────────
Random Forest (Tuned) ★     0.8347  ████      0.0811   best overall
Random Forest (Ctrl)        0.8349  ████      0.0807
Gradient Boosting           0.8370  ███       0.0770
Linear Regression + Feat    0.8420  ██        0.0640
RidgeCV                     0.8420  ██        0.0640
Lasso                       0.8430  ██        0.0640
PCA + Linear Regression     0.8440  ██        0.0600
Mean Baseline               0.8710  ░         ≈0.000   ← floor
──────────────────────────────────────────────────────────────────
Winner: n_estimators=300, max_depth=12, min_samples_split=10
        min_samples_leaf=5, max_features="sqrt"
        Tuned via RandomizedSearchCV (12 iter, 5-fold CV)

Overfitting lesson: initial RF had train R²=0.85, test R²≈0.
Fixed by constraining max_depth, min_samples_leaf, max_features.
```

### Regression model interpretation

```
R² = 0.081 means the model explains ~8% of variance in log(views+1).

Why this is acceptable:
  ✓ Compared with mean baseline (R²≈0), real signal IS being captured
  ✓ Social engagement is inherently noisy — platform factors dominate
  ✓ Models predicting social engagement typically achieve R²=0.05–0.20
    using content features alone
  ✓ Practical use = ranking postings, not exact prediction

Residuals pattern:
  Large errors concentrate on viral postings (top 1% of views)
  These are driven by external promotion not captured in features
  Capping views at 99th percentile reduces RMSE but doesn't change
  feature importance ranking
```

### Top feature importances (Random Forest Tuned)

```
Feature                   Importance            Interpretation
──────────────────────────────────────────────────────────────────
description_density       ████████████  0.142   content quality
desc_salary_interaction   ██████████░░  0.125   salary × description synergy
salary_log                ████████░░░░  0.102   salary level
description_length        ███████░░░░░  0.092   raw description size
has_salary_info           ██████░░░░░░  0.078   salary disclosed (binary)
is_software_role          █████░░░░░░░  0.062   tech role demand
description_word_count    ████░░░░░░░░  0.051   word count
cluster features (avg)    ████░░░░░░░░  0.048   posting segment
──────────────────────────────────────────────────────────────────
```

---

## 🟠 Classification — High Engagement vs. Normal

```
Target definition:
  high_engagement = 1 if views >= 75th percentile of TRAINING views
  high_engagement = 0 otherwise
  Threshold calculated from training data only — no leakage

Class balance:
  Class 0 (Normal):          ~75% of postings
  Class 1 (High Engagement): ~25% of postings

Primary metric: F1-score for Class 1
  Reason: accuracy is misleading — a dummy model predicting all-zero
  achieves ~75% accuracy while catching ZERO high-engagement postings
```
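
The target definition above can be sketched with the standard library. The view counts are toy values, and the interpolation method used for the percentile (`"inclusive"`) is an assumption; the notebook may compute the 75th percentile slightly differently. `fit_threshold` and `label_high_engagement` are hypothetical helper names.

```python
import statistics


def fit_threshold(train_views):
    """75th percentile of training views (the cut-off for the top 25%)."""
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(train_views, n=4, method="inclusive")[2]


def label_high_engagement(views, threshold):
    """1 if views >= threshold, else 0."""
    return [int(v >= threshold) for v in views]


# Toy view counts (hypothetical).
train_views = [5, 10, 20, 40, 80, 160, 320, 640, 1280]
test_views = [15, 700, 90]

threshold = fit_threshold(train_views)  # computed from training data only
train_labels = label_high_engagement(train_views, threshold)
test_labels = label_high_engagement(test_views, threshold)

# Accuracy of a dummy classifier that always predicts 0, on the training labels.
dummy_accuracy = train_labels.count(0) / len(train_labels)
```

The same `threshold` is reused for the test set, so the test labels never influence where the cut-off falls. The dummy accuracy shows why accuracy alone is a weak metric when the positive class is the minority.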

### Model comparison

```
Model             Precision(C1)   Recall(C1)   F1(C1)      Notes
───────────────────────────────────────────────────────────────────
Decision Tree ★   moderate        HIGHEST      BEST        CV F1: 0.4424±0.015
Logistic Regr.    lower           high         near-best
Random Forest     HIGHEST         lower        moderate    fewest false alarms
Dummy Baseline    0.00            0.00         0.00
───────────────────────────────────────────────────────────────────
Winner: max_depth=8, class_weight="balanced"
```

### Confusion matrix (Logistic Regression as reference point)

```
                 Predicted Normal   Predicted High
Actual Normal    TN ~3,015          FP ~1,523
Actual High      FN ~583            TP ~894

TN = correctly predicted Normal (good)
FP = Normal predicted as High — false alarm (wastes recruiter attention)
FN = High predicted as Normal — MOST COSTLY (missed opportunity)
TP = correctly predicted High (good)

Why FN is most costly:
  A recruiter relying on the model misses a genuinely high-engagement
  posting entirely. That opportunity is lost. Decision Tree minimizes FN
  at the cost of more FP — the right tradeoff for this use case.
```
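
As a worked example, the Class 1 metrics follow directly from the four counts in the matrix above. Because the README marks those counts as approximate, the results below are approximate too.

```python
# Approximate counts from the Logistic Regression confusion matrix above.
tn, fp, fn, tp = 3015, 1523, 583, 894

precision = tp / (tp + fp)  # share of predicted-High postings that are truly High
recall = tp / (tp + fn)     # share of truly High postings that the model catches
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

With these counts, precision is about 0.37, recall about 0.61, F1 about 0.46, and accuracy about 0.65. That accuracy is below the ~75% of an all-zero dummy, yet the model actually finds high-engagement postings, which is the argument for using F1 on Class 1 as the primary metric.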

### 5-fold cross-validation

```
Decision Tree CV F1 scores across 5 folds:
  Fold 1: 0.441
  Fold 2: 0.438
  Fold 3: 0.455
  Fold 4: 0.442
  Fold 5: 0.436
  ─────────────
  Mean:   0.4424 ± 0.015

Stable across folds — result is not a lucky single split.
Close to test-set F1 — no signs of overfitting.
```
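
The summary line can be checked from the five fold scores listed above. This sketch only recomputes descriptive statistics; it does not rerun the cross-validation.

```python
import statistics

# Fold scores as listed in the block above.
fold_f1 = [0.441, 0.438, 0.455, 0.442, 0.436]

mean_f1 = statistics.mean(fold_f1)
std_f1 = statistics.stdev(fold_f1)       # sample standard deviation (n - 1)
spread = max(fold_f1) - min(fold_f1)     # best fold minus worst fold
```

The mean reproduces the reported 0.4424. The sample standard deviation of these five values is roughly 0.007, so the reported ± 0.015 is about twice that; the README does not state which convention it uses for the ± figure.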

---

## 💡 Business Insights

1. **Disclose salary** — associated with ~90% more views. Fewer than half of postings do this today. Highest-leverage action at zero cost.
2. **Write 250–500 words** — description density was the #1 feature in the regression model and #2 in the classifier. Too short signals incompleteness; too long overwhelms candidates.
3. **Post mid-week** — Tuesday–Thursday outperforms weekends consistently.
4. **Tech roles attract more** — `is_software_role` and `is_data_role` carry predictive signal beyond salary.
5. **Ranking, not predicting** — R²≈0.08 is expected given unobservable platform factors. Use the model to rank postings, not to forecast exact view counts.

---

## 🎁 Bonus Work

### 🚀 Interactive Dashboard — Gradio on HuggingFace Spaces

👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**

| Tab | What it does |
|---|---|
| 🎯 Engagement Predictor | Enter posting details → real-time predicted views + High/Normal classification |
| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
| ℹ️ About | Feature groups, model details, limitations |

---

### 🧠 SHAP Explainability

```
SHAP mean |value| — RF regression (200 test observations)

Feature                   Mean |SHAP|           Direction
──────────────────────────────────────────────────
description_density       ████████████  0.142   ↑ high density → more views
desc_salary_interaction   ██████████░░  0.125   ↑ salary × description synergy
salary_log                ████████░░░░  0.102   ↑ higher salary → more views
description_length        ███████░░░░░  0.092   ↑ longer (to a point) → more
has_salary_info           ██████░░░░░░  0.078   ↑ disclosed → more views
is_software_role          █████░░░░░░░  0.062   ↑ tech → more views
posting_weekend           █░░░░░░░░░░░  0.021   ↓ weekend → fewer views
──────────────────────────────────────────────────
Beeswarm: red dot right of 0 = high value pushes prediction UP
          blue dot left of 0 = low value pushes prediction DOWN
```

The SHAP values are consistent with the EDA findings at the level of individual predictions. Like the rest of the analysis, they describe the model's behavior, not causal effects.

---

### 📊 Feature Importance: Regression vs Classification

```
                          Regression RF     Classification DT
                          (log_views)       (high_engagement)
────────────────────────────────────────────────────────────
description_density       #1  ████████      #2 ███████
desc_salary_interaction   #2  ███████       #3 ██████
salary_log                #3  ██████        #4 █████
has_salary_info           #5  █████         #5 █████
is_software_role          #6  ████          #6 ████
is_entry_role             #9  ███           #1 ████████  ← jumps in clf
is_senior_role            #10 ██            #7 ████
cluster features          #7  ████          #8 ███
────────────────────────────────────────────────────────────
Agreement:  description quality + salary dominate both models
Divergence: is_entry_role jumps to #1 in classification —
            seniority flags matter more for crossing the
            threshold than for predicting exact counts
```

---

## 🛠️ How to Use the Models

```python
import pickle

import numpy as np

# Only unpickle files from sources you trust.
with open("linkedin_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)
with open("linkedin_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_test_fe is the engineered feature matrix produced by the notebook pipeline.

# Regression: predict log(views + 1), then convert back to raw views
log_views_pred = reg_model.predict(X_test_fe)
views_pred = np.expm1(log_views_pred)

# Classification: predict the high-engagement label (0 or 1)
label = clf_model.predict(X_test_fe)
```

> Both models expect the exact 30-column feature matrix including cluster dummy columns. Run the full feature engineering pipeline in the notebook to produce a compatible input.
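
One way to guard against a mismatched feature matrix is to check and order the columns before calling `predict`. The sketch below is hypothetical: `EXPECTED_COLUMNS` is a three-name placeholder standing in for the 30 names (in training order) that the notebook would provide, and `to_feature_row` is an illustrative helper, not part of the repository.

```python
# Placeholder list (hypothetical and truncated); the real list has 30 names in training order.
EXPECTED_COLUMNS = ["description_density", "has_salary_info", "cluster_0"]


def to_feature_row(features, expected_columns=EXPECTED_COLUMNS):
    """Order a feature dict into the column order the model was trained on."""
    missing = [c for c in expected_columns if c not in features]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    # Extra keys are ignored; only the expected columns are kept, in order.
    return [features[c] for c in expected_columns]


row = to_feature_row(
    {"cluster_0": 1, "has_salary_info": 0, "description_density": 0.15, "extra": 9}
)
```

Failing loudly on a missing column is usually safer than silently predicting from a misaligned matrix, since tree models will accept any input with the right number of columns.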

---

*Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*