MichaelYitzchak
/

Linkedin_Job_Engagement

@@ -1,465 +0,0 @@
----
-tags:
-  - regression
-  - classification
-  - clustering
-  - tabular
-  - linkedin
-  - job-postings
-  - sklearn
-license: mit
----
-# 📊 LinkedIn Job Posting Engagement Analysis
-> **Which LinkedIn job posting characteristics predict candidate engagement (views) — and how well can engagement be predicted or classified using only posting-level features?**
----
-## 📹 Presentation Video
-<video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>
----
-## 📋 Dataset at a Glance
-| Property | Value |
-|---|---|
-| **Source** | [LinkedIn Job Postings — arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
-| **Original size** | 123,850 rows × 49 columns |
-| **Working sample** | 30,000 rows · `random_state=42` |
-| **Target (regression)** | `log_views = log(views + 1)` — log-transformed to handle right skew |
-| **Target (classification)** | `high_engagement` — top 25% of views, threshold from training data only |
----
-## ⚠️ Scope & Limitations
-> Platform-level signals — LinkedIn's algorithm, sponsored status, company follower counts — drive the **majority of view variance** and are **not observable** in this dataset. Models built here use only posting-level features (content, salary, timing, role type). The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships.
----
-## 🗂️ Repository Files
-| File | Description |
-|---|---|
-| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Features → Clustering → Regression → Classification → Bonus |
-| `linkedin_regression_model.pkl` | Winning model: Random Forest (Tuned) |
-| `linkedin_classification_model.pkl` | Winning model: Decision Tree |
-| `regression_model_results.csv` | Full regression model comparison |
-| `classification_model_results.csv` | Full classification model comparison |
----
-## 🧹 Data Cleaning Pipeline
-```
-Step 1 — Reproducible sampling
-        123,850 rows → sample(n=30,000, random_state=42)
-        reset_index(drop=True) for clean alignment
-Step 2 — Duplicate & missing target removal
-        Dropped duplicate rows
-        Dropped rows where views is NaN (no target = unusable)
-Step 3 — High-missingness columns dropped
-        Threshold: >70% missing → drop
-        Exception: salary, title, description retained for feature engineering
-Step 4 — Leakage columns excluded
-        applies, closed_time → removed
-        These are post-publication outcomes unavailable at posting creation time
-Step 5 — Salary imputation strategy
-        has_salary_info = 1 if salary present, else 0
-        salary_midpoint → median imputed INSIDE pipeline (training data only)
-        Prevents data leakage from test set into imputation
-Step 6 — Categorical columns filled
-        formatted_work_type, formatted_experience_level → "Unknown" where missing
-        Then one-hot encoded before modeling
-Step 7 — Log transformation of target
-        Raw views: heavily right-skewed (mean >> median, max >> 99th pct)
-        log_views = log(views + 1) — compresses scale, improves regression fit
-        Predictions converted back via expm1() for interpretation
-```
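Steps 5 and 7 can be illustrated with a small standard-library sketch. The helper names (`fit_salary_median`, `impute`) and the toy numbers are hypothetical, and the notebook does this inside its modeling pipeline rather than by hand. The sketch only shows that the fill value is learned from training rows and reused on test rows, and that `log1p` and `expm1` invert each other.

```python
import math
import statistics


def fit_salary_median(train_salaries):
    """Median of the observed (non-missing) training salaries."""
    observed = [s for s in train_salaries if s is not None]
    return statistics.median(observed)


def impute(salaries, fill_value):
    """Replace missing salaries (None) with a previously fitted value."""
    return [fill_value if s is None else s for s in salaries]


# Toy data (hypothetical): None marks a posting without salary information.
train_salary = [50_000, None, 70_000, 90_000]
test_salary = [None, 120_000]

median = fit_salary_median(train_salary)    # learned from training rows only
train_filled = impute(train_salary, median)
test_filled = impute(test_salary, median)   # test rows reuse the training median

# Target transform: log(views + 1) for fitting, expm1 to get back to views.
views = [0, 9, 99]
log_views = [math.log1p(v) for v in views]
recovered = [math.expm1(x) for x in log_views]
```

Fitting the median on the full dataset instead would let test-set salaries influence the training features, which is the leakage Step 5 avoids.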
----
-## 🔍 EDA — 5 Questions + Correlation Heatmap
-### Q1 — Does salary transparency increase views?
-```
-No salary info  ████████████░░░░░░░░░░░░░░░░░░░  ~180 avg views
-Has salary info ████████████████████████████████  ~340 avg views
-                                                   +89% lift ✓
-```
-> Fewer than half of postings disclose salary, which makes disclosure the highest-leverage, lowest-cost action available to recruiters. The lift holds across work types, so it does not appear to be a proxy for role category.
----
-### Q2 — Does description length affect engagement?
-```
-< 100 words    ██████░░░░░░░░░░░░░░  2.1 mean log_views  — signals incomplete
-100–250 words  █████████░░░░░░░░░░░  2.8
-250–500 words  ████████████████████  3.6  ← PEAK (sweet spot)
-500–750 words  ████████████████░░░░  3.3
-750–1000 words ████████████░░░░░░░░  3.0
-> 1000 words   ███████░░░░░░░░░░░░░  2.5  — overwhelms candidates
-```
-> Non-linear. Sweet spot 250–500 words. Motivated `description_density` (words per character) — the #1 feature in the winning regression model.
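A minimal sketch of the density feature, assuming it is the whitespace-separated word count divided by the character count. The README says only "words per character", so the function name, the tokenization, and the handling of empty text are assumptions.

```python
def description_density(description: str) -> float:
    """Words per character; returns 0.0 for empty text to avoid division by zero."""
    n_chars = len(description)
    if n_chars == 0:
        return 0.0
    n_words = len(description.split())
    return n_words / n_chars


# Short words give a higher density than long words for the same word count.
plain = description_density("we build data tools")
jargon = description_density("synergistic organizational transformation initiatives")
```

Under this definition the feature is roughly the inverse of average word length (including spaces), so it reflects how the description is written rather than how long it is.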
----
-### Q3 — Does posting day of week matter?
-```
-Tuesday   ████████████████████  245 avg views  ★ best weekday
-Wednesday ██████████████████░░  235 avg views
-Monday    █████████████████░░░  220 avg views
-Thursday  ████████████████░░░░  215 avg views
-Friday    ████████████░░░░░░░░  205 avg views
-Saturday  ███████░░░░░░░░░░░░░  148 avg views  ← weekend — noisier
-Sunday    ██████░░░░░░░░░░░░░░  132 avg views  ← weekend — noisier
-```
-> Weekday posts outperform weekend posts, plausibly because candidates browse during business hours. Weekend volume is much smaller, so those estimates are noisier. The day-of-week effect is real but modest compared with content features.
----
-### Q4 — Do entry-level roles attract more views?
-```
-Entry-level  ████████████████████  ~290 avg views  ★
-Other / Mid  ████████████░░░░░░░░  ~210 avg views
-Senior-level ████████░░░░░░░░░░░░  ~175 avg views
-```
-> This is likely a supply-side effect: more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for candidate pool size, not posting quality.
----
-### Q5 — Does work type affect engagement?
-```
-Contract    ████████████████████  ~310 avg views  ★
-Internship  ████████████████░░░░  ~275 avg views
-Part-time   █████████████░░░░░░░  ~235 avg views
-Full-time   ███████████░░░░░░░░░  ~205 avg views
-Temporary   █████████░░░░░░░░░░░  ~185 avg views
-```
-> Contract and internship roles attract more active job-seekers. Work type is a useful predictor but correlates with other features — not a standalone causal explanation.
----
-### 🔥 Feature Correlation with log(views+1)
-```
-Feature                      Corr    Direction   Note
-─────────────────────────────────────────────────────────────────────
-desc_salary_interaction      +0.18   ↑ views     strongest correlation
-has_salary_info              +0.14   ↑ views     salary transparency
-salary_log                   +0.12   ↑ views     salary level
-description_density          +0.10   ↑ views     content quality
-description_word_count       +0.08   ↑ views     description length
-is_software_role             +0.08   ↑ views     tech role demand
-is_data_role                 +0.07   ↑ views     data role demand
-is_entry_role                +0.06   ↑ views     larger pool
-posting_weekend              -0.04   ↓ views     weekend underperforms
-is_senior_role               -0.03   ↓ views     smaller pool
-─────────────────────────────────────────────────────────────────────
-Internal correlations (structural — expected):
-salary_log ↔ salary_midpoint  +0.96  log transform of same variable
-desc_wc ↔ desc_density        +0.55  density uses length in denominator
-is_software ↔ is_data         +0.35  often co-occur in job titles
-is_senior ↔ is_entry          -0.28  mutually exclusive by construction
-─────────────────────────────────────────────────────────────────────
-```
-> Most individual features show weak linear correlation — motivating tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and combinations.
----
-## ⚙️ Feature Engineering — 30 Features, 8 Groups
-| Group | Features |
-|---|---|
-| Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
-| Text structure | `description_density`, `title_desc_ratio` |
-| Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
-| Timing | `posting_dayofweek`, `posting_weekend` |
-| Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role` |
-| Interactions | `desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
-| Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
----
-## 🔵 Clustering — KMeans k=6
-```
-Evaluation:   elbow method + silhouette scores for k=2 through k=10
-Selected k:   6
-Rejected:     k=7, k=8 — produced near-singleton clusters (outlier isolation)
-Silhouette:   0.289  (weak-to-moderate separation — expected for overlapping job types)
-Leakage:      cluster preprocessor fit on TRAINING DATA ONLY
-Silhouette score by k (approximate):
-  k=2   ████████████████████  0.38  (too coarse)
-  k=4   ████████████████░░░░  0.31
-  k=6   ████████████░░░░░░░░  0.289 ← selected (best size + score balance)
-  k=8   ██████░░░░░░░░░░░░░░  0.21  (singleton clusters)
-Cluster size distribution:
-  Cluster 0 — General / Mixed          ████████████  ~28%
-  Cluster 1 — High-Salary Specialist   ███████       ~18%
-  Cluster 2 — Tech & Software          █████████     ~22%
-  Cluster 3 — Entry-Level / Volume     █████         ~12%
-  Cluster 4 — Contract & Flexible      ████          ~10%
-  Cluster 5 — Senior Leadership        ████          ~10%
-PCA 2D projection: Cluster 2 (Tech) and Cluster 5 (Senior) show clearest
-separation. Clusters 0 and 3 overlap — consistent with silhouette 0.289.
-Two PCs explain ~35–45% of total variance.
-```
-Cluster labels one-hot encoded as 6 dummy features, added to both models. Including clusters improved regression RMSE and classification F1 over models without them.
----
-## 📈 Regression — Predicting `log(views + 1)`
-### Baseline model first
-```
-Mean Baseline (predict training mean for all):
-  RMSE_log = 0.871
-  R²       ≈ 0.000
-This is the minimum bar every model must beat.
-```
-### Full model comparison
-```
-Model                        RMSE_log ↓    R² ↑     Improvement
-──────────────────────────────────────────────────────────────────
-Random Forest (Tuned)  ★     0.8347   ████  0.0811   best overall
-Random Forest (Ctrl)         0.8349   ████  0.0807
-Gradient Boosting            0.8370   ███   0.0770
-Linear Regression + Feat     0.8420   ██    0.0640
-RidgeCV                      0.8420   ██    0.0640
-Lasso                        0.8430   ██    0.0640
-PCA + Linear Regression      0.8440   ██    0.0600
-Mean Baseline                0.8710   ░     ≈0.000   ← floor
-──────────────────────────────────────────────────────────────────
-Winner: n_estimators=300, max_depth=12, min_samples_split=10
-        min_samples_leaf=5, max_features="sqrt"
-        Tuned via RandomizedSearchCV (12 iter, 5-fold CV)
-Overfitting lesson: initial RF had train R²=0.85, test R²≈0.
-Fixed by constraining max_depth, min_samples_leaf, max_features.
-```
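The R² column can be sanity-checked against the RMSE column. If the baseline RMSE is close to the standard deviation of the test target, then R² is approximately 1 − (RMSE_model / RMSE_baseline)². That is an assumption: the baseline here predicts the training mean, so the match is only approximate. The function name below is illustrative; the numbers are the reported ones.

```python
def r2_from_rmse(rmse_model: float, rmse_baseline: float) -> float:
    """Approximate R² as 1 - (RMSE_model / RMSE_baseline)^2."""
    return 1.0 - (rmse_model / rmse_baseline) ** 2


RMSE_BASELINE = 0.871  # mean baseline from the table above

# (reported RMSE_log, reported R²) for two rows of the comparison table
reported = {
    "Random Forest (Tuned)": (0.8347, 0.0811),
    "Gradient Boosting": (0.8370, 0.0770),
}

for name, (rmse, r2_reported) in reported.items():
    r2_approx = r2_from_rmse(rmse, RMSE_BASELINE)
    print(f"{name}: approx R² = {r2_approx:.4f}, reported R² = {r2_reported:.4f}")
```

Both approximations land within about 0.001 of the reported values, so the two columns are mutually consistent.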
-### Regression model interpretation
-```
-R² = 0.081 means the model explains ~8% of variance in log(views+1).
-Why this is acceptable:
-  ✓ Compared with mean baseline (R²≈0), real signal IS being captured
-  ✓ Social engagement is inherently noisy — platform factors dominate
-  ✓ Models predicting social engagement typically achieve R²=0.05–0.20
-    using content features alone
-  ✓ Practical use = ranking postings, not exact prediction
-Residuals pattern:
-  Large errors concentrate on viral postings (top 1% of views)
-  These are driven by external promotion not captured in features
-  Capping views at 99th percentile reduces RMSE but doesn't change
-  feature importance ranking
-```
-### Top feature importances (Random Forest Tuned)
-```
-Feature                      Importance   Interpretation
-──────────────────────────────────────────────────────────────────
-description_density          ████████████  0.142  content quality
-desc_salary_interaction      ██████████░░  0.125  salary × description synergy
-salary_log                   ████████░░░░  0.102  salary level
-description_length           ███████░░░░░  0.092  raw description size
-has_salary_info              ██████░░░░░░  0.078  salary disclosed (binary)
-is_software_role             █████░░░░░░░  0.062  tech role demand
-description_word_count       ████░░░░░░░░  0.051  word count
-cluster features (avg)       ████░░░░░░░░  0.048  posting segment
-──────────────────────────────────────────────────────────────────
-```
----
-## 🟠 Classification — High Engagement vs. Normal
-```
-Target definition:
-  high_engagement = 1  if views >= 75th percentile of TRAINING views
-  high_engagement = 0  otherwise
-  Threshold calculated from training data only — no leakage
-Class balance:
-  Class 0 (Normal):          ~75% of postings
-  Class 1 (High Engagement): ~25% of postings
-Primary metric: F1-score for Class 1
-  Reason: accuracy is misleading — a dummy model predicting all-zero
-  achieves ~75% accuracy while catching ZERO high-engagement postings
-```
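The training-only threshold can be sketched with the standard library. The function names and toy view counts are hypothetical, and `method="inclusive"` is chosen here because it matches linear-interpolation percentiles; the notebook may compute the percentile differently.

```python
import statistics


def fit_threshold(train_views):
    """75th percentile of the training views (third quartile cut point)."""
    return statistics.quantiles(train_views, n=4, method="inclusive")[2]


def label_high_engagement(views, threshold):
    """1 if views >= threshold, else 0."""
    return [1 if v >= threshold else 0 for v in views]


# Toy data (hypothetical)
train_views = [10, 20, 30, 40, 50, 60, 70, 80, 1000]
test_views = [5, 65, 500]

threshold = fit_threshold(train_views)              # uses training rows only
y_train = label_high_engagement(train_views, threshold)
y_test = label_high_engagement(test_views, threshold)  # same threshold, not refit
```

Recomputing the percentile on the test set would make the label definition depend on test data, which is what "threshold from training data only" rules out.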
-### Model comparison
-```
-Model                  Precision(C1)  Recall(C1)  F1(C1)   Notes
-───────────────────────────────────────────────────────────────────
-Decision Tree     ★    moderate       HIGHEST     BEST     CV F1: 0.4424±0.015
-Logistic Regr.         lower          high        near-best
-Random Forest          HIGHEST        lower       moderate  fewest false alarms
-Dummy Baseline         0.00           0.00        0.00
-───────────────────────────────────────────────────────────────────
-Winner: max_depth=8, class_weight="balanced"
-```
-### Confusion matrix (Logistic Regression as reference point)
-```
-                     Predicted Normal    Predicted High
-Actual Normal            TN ~3,015          FP ~1,523
-Actual High              FN   ~583          TP   ~894
-TN = correctly predicted Normal (good)
-FP = Normal predicted as High — false alarm (wastes recruiter attention)
-FN = High predicted as Normal — MOST COSTLY (missed opportunity)
-TP = correctly predicted High (good)
-Why FN is most costly:
-  A recruiter relying on the model misses a genuinely high-engagement
-  posting entirely. That opportunity is lost. Decision Tree minimizes FN
-  at the cost of more FP — the right tradeoff for this use case.
-```
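The matrix above implies the usual Class 1 metrics. The sketch below derives them from the approximate counts shown (variable names are illustrative) and compares accuracy with the all-zero dummy.

```python
# Approximate counts from the Logistic Regression confusion matrix above
tn, fp = 3015, 1523
fn, tp = 583, 894

total = tn + fp + fn + tp

precision = tp / (tp + fp)            # share of predicted High that is truly High
recall = tp / (tp + fn)               # share of truly High that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# A dummy model that always predicts Normal gets every Normal row right
dummy_accuracy = (tn + fp) / total

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} dummy_accuracy={dummy_accuracy:.3f}")
```

With these counts, precision is about 0.37, recall about 0.61, and F1 about 0.46. Accuracy is about 0.65, which is lower than the dummy's roughly 0.75 even though the dummy catches no high-engagement postings. That is why F1 for Class 1 is the primary metric.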
-### 5-fold cross-validation
-```
-Decision Tree CV F1 scores across 5 folds:
-  Fold 1: 0.441
-  Fold 2: 0.438
-  Fold 3: 0.455
-  Fold 4: 0.442
-  Fold 5: 0.436
-  ─────────────
-  Mean:   0.4424  ±  0.015
-Stable across folds — result is not a lucky single split.
-Close to test-set F1 — no signs of overfitting.
-```
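The fold scores can be summarized directly. The README does not say how the ± value was computed; the sketch below shows that the reported 0.015 is close to twice the sample standard deviation of the five folds, which is one common convention and only an inference here.

```python
import statistics

# Per-fold F1 scores as listed above
fold_f1 = [0.441, 0.438, 0.455, 0.442, 0.436]

mean_f1 = statistics.mean(fold_f1)
sample_std = statistics.stdev(fold_f1)        # n - 1 in the denominator
population_std = statistics.pstdev(fold_f1)   # n in the denominator

print(f"mean F1          = {mean_f1:.4f}")
print(f"sample std       = {sample_std:.4f}")
print(f"population std   = {population_std:.4f}")
print(f"2 x sample std   = {2 * sample_std:.4f}")
```

Whichever convention was used, the spread is small relative to the mean, which supports the stability claim.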
----
-## 💡 Business Insights
-1. **Disclose salary**: associated with ~90% more views (~340 vs. ~180 on average). Fewer than half of postings do this today, making it the highest-leverage action at essentially zero cost.
-2. **Write 250–500 words**: this range had the highest mean `log_views`, and `description_density` ranked #1 in the regression model and #2 in the classification model. Descriptions that are too short signal incompleteness; those that are too long overwhelm candidates.
-3. **Post on weekdays, ideally mid-week**: Tuesday and Wednesday had the highest average views, and every weekday outperformed Saturday and Sunday.
-4. **Tech roles attract more views**: `is_software_role` and `is_data_role` carry predictive signal beyond salary.
-5. **Rank, don't forecast**: R²≈0.08 is expected given unobservable platform factors. Use the model to rank postings, not to forecast exact view counts.
----
-## 🎁 Bonus Work
-### 🚀 Interactive Dashboard — Gradio on HuggingFace Spaces
-👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**
-| Tab | What it does |
-|---|---|
-| 🎯 Engagement Predictor | Enter posting details → real-time predicted views + High/Normal classification |
-| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
-| ℹ️ About | Feature groups, model details, limitations |
----
-### 🧠 SHAP Explainability
-```
-SHAP mean |value| — RF regression (200 test observations)
-Feature                   Mean |SHAP|   Direction
-──────────────────────────────────────────────────
-description_density       ████████████  0.142  ↑ high density → more views
-desc_salary_interaction   ██████████░░  0.125  ↑ salary × description synergy
-salary_log                ████████░░░░  0.102  ↑ higher salary → more views
-description_length        ███████░░░░░  0.092  ↑ longer (to a point) → more
-has_salary_info           ██████░░░░░░  0.078  ↑ disclosed → more views
-is_software_role          █████░░░░░░░  0.062  ↑ tech → more views
-posting_weekend           █░░░░░░░░░░░  0.021  ↓ weekend → fewer views
-──────────────────────────────────────────────────
-Beeswarm: red dot right of 0 = high value pushes prediction UP
-          blue dot left of 0  = low value pushes prediction DOWN
-```
-SHAP attributions agree with the EDA findings at the level of individual predictions. They describe how the model uses each feature, not causal effects.
----
-### 📊 Feature Importance: Regression vs Classification
-```
-                        Regression RF    Classification DT
-                        (log_views)      (high_engagement)
-────────────────────────────────────────────────────────────
-description_density     #1  ████████     #2  ███████
-desc_salary_interaction #2  ███████      #3  ██████
-salary_log              #3  ██████       #4  █████
-has_salary_info         #5  █████        #5  █████
-is_software_role        #6  ████         #6  ████
-is_entry_role           #9  ███          #1  ████████  ← jumps in clf
-is_senior_role          #10 ██           #7  ████
-cluster features        #7  ████         #8  ███
-────────────────────────────────────────────────────────────
-Agreement: description quality + salary dominate both models
-Divergence: is_entry_role jumps to #1 in classification —
-            seniority flags matter more for crossing the
-            threshold than for predicting exact counts
-```
----
-## 🛠️ How to Use the Models
-```python
-import pickle
-import numpy as np
-# Load the fitted models. Only unpickle files from a source you trust.
-with open("linkedin_regression_model.pkl", "rb") as f:
-    reg_model = pickle.load(f)
-with open("linkedin_classification_model.pkl", "rb") as f:
-    clf_model = pickle.load(f)
-# X_test_fe: the engineered feature matrix produced by the notebook pipeline
-# Regression: predict log(views + 1), then invert the transform to get views
-log_views_pred = reg_model.predict(X_test_fe)
-views_pred = np.expm1(log_views_pred)
-# Classification: predict the high-engagement label (0 or 1)
-label = clf_model.predict(X_test_fe)
-```
-> Both models expect the exact 30-column feature matrix including cluster dummy columns. Run the full feature engineering pipeline in the notebook to produce a compatible input.
----
-*Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*