---
tags:
  - regression
  - classification
  - clustering
  - tabular
  - linkedin
  - job-postings
  - sklearn
  - random-forest
  - decision-tree
  - kmeans
  - shap
license: mit
---

# 📊 LinkedIn Job Posting Engagement Analysis

> **Which LinkedIn job posting characteristics predict candidate engagement (views) — and how well can engagement be predicted or classified using only posting-level features?**

**Personal motivation:** As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions.

---

## 📹 Presentation Video

👉 **[Watch the presentation on Loom](https://www.loom.com/share/c7d9b89a54234f699204b16a9a313c7d)**

---

## 🚀 Interactive Dashboard

👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**

| Tab | Description |
|---|---|
| 🎯 Engagement Predictor | Enter posting details → get predicted views + High/Normal classification in real time |
| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
| ℹ️ About | Feature groups, model details, limitations |

---

## 📋 Dataset at a Glance

| Property | Value |
|---|---|
| **Source** | [LinkedIn Job Postings — arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
| **Original size** | 123,850 rows × 49 columns |
| **Working sample** | 30,000 rows · `random_state=42` |
| **After join with companies** | 30,000 rows × 40 columns |
| **After cleaning** | 29,572 rows × 51 columns (in `df_model`) |
| **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
| **Regression target** | `log_views = log1p(views)` — log-transformed to handle right skew |
| **Classification target** | `high_engagement` — top 25% of training views (threshold derived from training set only) |

---

## ⚠️ Scope & Limitations

> LinkedIn's algorithm, sponsored status, and company follower counts likely drive **most of the view variance** and are **not observed** in this dataset. Models use posting-level features only. The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships.

---

## 🗂️ Repository Files

| File | Description |
|---|---|
| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus |
| `linkedin_regression_model.pkl` | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) |
| `linkedin_classification_model.pkl` | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") |
| `regression_model_results.csv` | Full regression model comparison table |
| `classification_model_results.csv` | Full classification model comparison table |

---

## 🧹 Data Cleaning Pipeline

**7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:**

```
Step 1 — Reproducible sampling
        123,850 rows → sample(n=30,000, random_state=42)
        Joined with companies.csv on company_id (left join, rows preserved)
        Result: 30,000 rows × 40 columns

Step 2 — Duplicate & missing target removal
        Removed duplicate rows
        Dropped rows where views is NaN or negative
        Result: 29,572 usable rows

Step 3 — Date parsing
        listed_time, original_listed_time, expiry, closed_time → parsed to datetime
        Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend

Step 4 — Missing value analysis & column dropping
        Threshold: >70% missing → drop
        Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
                 remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)

Step 5 — Leakage columns excluded
        expiry, applies → removed (post-publication outcomes)
        views → kept as target only, never as a feature

Step 6 — Salary imputation strategy
        has_salary_info = 1 if salary present, else 0
        salary_midpoint computed from min/max salary where available
        Missing salary → imputed inside sklearn Pipeline on training data only

Step 7 — Log transformation of target
        Raw views: mean=14.9, std=98.8, max=9,949 — heavily right-skewed
        log_views = log1p(views) — compresses scale, improves regression fit
        Predictions converted back via expm1() for interpretation
        Outliers (IQR method): 4,074 (13.8%) — kept, not removed
```
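
Steps 6 and 7 can be sketched on a toy frame as follows. This is a minimal illustration, assuming the raw columns are named `min_salary`, `max_salary`, and `views` as in the step descriptions and that "salary present" means at least one of the two bounds is filled. The values are invented and the notebook's actual code may differ.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values only, not taken from the dataset)
df = pd.DataFrame({
    "min_salary": [50_000.0, np.nan, 80_000.0],
    "max_salary": [70_000.0, np.nan, 120_000.0],
    "views": [3, 0, 250],
})

# Step 6: flag salary disclosure; the midpoint stays NaN where salary is
# missing so that it can be imputed later inside the Pipeline
salary_cols = ["min_salary", "max_salary"]
df["has_salary_info"] = df[salary_cols].notna().any(axis=1).astype(int)
df["salary_midpoint"] = df[salary_cols].mean(axis=1)

# Step 7: compress the right-skewed target, then invert it for interpretation
df["log_views"] = np.log1p(df["views"])
views_back = np.expm1(df["log_views"])
```

`log1p` is used rather than a plain log so that postings with zero views stay defined, and `expm1` is its exact inverse.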

---

## 🔍 EDA — 5 Research Questions

> **Note on notebook ordering:** Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.

---

### 💰 Q2 — Salary Transparency vs Views

```
No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views   (70.1% of postings)
Has salary info  ████████████████████████░  ~21 avg views   (29.9% of postings)
                                             +74.3% lift ✓
```

> Only **8,562 of 29,572 postings (29.9%)** disclose salary. Transparent postings attract **74.3% more views** on average. This is the highest-leverage, lowest-cost recruiter action available.
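
The lift figure is a relative difference in group means. A minimal sketch, assuming a frame with `has_salary_info` and `views` columns; the helper name `salary_lift` and the toy values are hypothetical, chosen so the group means match the rounded ~12 and ~21 shown above.

```python
import pandas as pd

def salary_lift(df: pd.DataFrame) -> float:
    """Relative difference in mean views: disclosed vs. undisclosed salary."""
    means = df.groupby("has_salary_info")["views"].mean()
    return means.loc[1] / means.loc[0] - 1.0

# Toy data whose group means are 12 (no salary) and 21 (salary disclosed)
toy = pd.DataFrame({
    "has_salary_info": [0, 0, 1, 1],
    "views": [10, 14, 20, 22],
})
lift = salary_lift(toy)
```

With the rounded means the lift comes out at 75%; the reported 74.3% presumably reflects the unrounded group means.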

---

### 📝 Q3 — Description Length vs Views

```
< 100 words    ██████░░░░░░░░░░░░░░  ~8 avg views   — signals incomplete posting
100–250 words  █████████░░░░░░░░░░░  ~13 avg views
250–500 words  ████████████████████  ~24 avg views  PEAK ★ — sweet spot
500–750 words  ████████████████░░░░  ~18 avg views
> 1000 words   ███████░░░░░░░░░░░░░  ~10 avg views  — overwhelms candidates
```

> Non-linear relationship confirmed. Sweet spot: **250–500 words**. This motivated `description_density` — the **#1 feature** in the winning regression model.
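
The binned comparison above could be reproduced along these lines. This is a sketch on invented values; the bin edges follow the chart, except that the 750-1000 bin is not listed there and is included here as an assumption.

```python
import numpy as np
import pandas as pd

edges = [0, 100, 250, 500, 750, 1000, np.inf]
labels = ["<100", "100-250", "250-500", "500-750", "750-1000", ">1000"]

# Toy postings (illustrative values only)
toy = pd.DataFrame({
    "description_word_count": [40, 180, 300, 420, 600, 1500],
    "views": [8, 13, 20, 28, 18, 10],
})

# Left-closed bins: [0, 100), [100, 250), ...
toy["length_bin"] = pd.cut(
    toy["description_word_count"], bins=edges, labels=labels, right=False
)
avg_views = toy.groupby("length_bin", observed=True)["views"].mean()
peak_bin = avg_views.idxmax()
```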

---

### 📅 Q4 — Day of Week vs Views

```
Monday    ████████████████████  39 avg views  ★ best day  (n=1,837)
Tuesday   █████████████░░░░░░░  25 avg views
Wednesday ███████████░░░░░░░░░  22 avg views
Thursday  █████████░░░░░░░░░░░  18 avg views
Friday    ████░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
Saturday  ██████████████░░░░░░  28 avg views  (weekend, n=2,116 total, noisier)
Sunday    ██████████████░░░░░░  28 avg views  (weekend, noisier)
```

> **Counterintuitive finding:** Weekend postings average ~28 views, above every weekday except Monday, but the weekend sample is small (2,116 postings in total), so these estimates are noisy. Monday is the best weekday at 39 average views, though on a similarly small sample (n=1,837). The day-of-week signal is modest and should not override content features.

---

### 💼 Q1 — Work Type vs Views

```
Contract    ████████████████████  29.97 avg views  median: 7.0
Internship  █████████████████░░░  25.71 avg views  median: 5.0
Full-time   ████████░░░░░░░░░░░░  13.70 avg views  median: 4.0  ← 80% of volume
Other       ███████░░░░░░░░░░░░░  11.27 avg views  median: 4.0
Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  median: 4.0
```

> Contract and Internship roles show the highest engagement. However, **Full-time dominates volume** (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.

---

### 🎓 Q5 — Seniority Level vs Views

```
Entry-level  ████████████████████  18 avg views  n=792
Senior-level ████████████░░░░░░░░  16 avg views  n=3,577
Other/Mid    ██████████░░░░░░░░░░  15 avg views  n=25,203

Entry vs Senior: +12.4% more views
Entry vs Other:  +18.9% more views
```

> Likely a supply-side effect: more candidates qualify for junior roles, so the pool is larger. `is_entry_role` plausibly carries predictive signal because it proxies for **candidate pool size**, not intrinsic desirability.

---

### 🔥 Feature Correlation with log(views+1)

```
Feature                      Corr    Direction   Note
─────────────────────────────────────────────────────────────────────
desc_salary_interaction      +0.18   ↑ views     strongest single predictor
has_salary_info              +0.14   ↑ views     salary transparency
salary_log                   +0.12   ↑ views     salary level
description_density          +0.10   ↑ views     content quality
description_word_count       +0.08   ↑ views     description length
is_software_role             +0.08   ↑ views     tech role demand
is_data_role                 +0.07   ↑ views     data role demand
is_entry_role                +0.06   ↑ views     larger candidate pool
posting_weekend              -0.04   ↓ views     small negative signal
is_senior_role               -0.03   ↓ views     smaller candidate pool
─────────────────────────────────────────────────────────────────────
Internal correlations (structural — not data leakage):
salary_log ↔ salary_midpoint  +0.96  log transform of same variable
desc_wc ↔ desc_density        +0.55  density uses length in formula
is_software ↔ is_data         +0.35  often co-occur in job titles
is_senior ↔ is_entry          -0.28  mutually exclusive by construction
```

> Most features show **weak linear correlation** — no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations.

### 🌡️ Correlation Heatmap (feature-to-feature + target)

```
                          log   desc  has   sal   desc  is_   is_   is_   post  is_
                          views dens  sal   log   wc    soft  data  entr  wknd  snr
──────────────────────────────────────────────────────────────────────────────────────
log_views              │  1.00  0.10  0.14  0.12  0.08  0.08  0.07  0.06 -0.04 -0.03
description_density    │  0.10  1.00  0.02  0.04  0.55  0.01  0.01 -0.01  0.00  0.00
has_salary_info        │  0.14  0.02  1.00  0.72  0.03  0.06  0.07 -0.03 -0.01 -0.02
salary_log             │  0.12  0.04  0.72  1.00  0.04  0.05  0.06 -0.02 -0.01 -0.01
description_word_count │  0.08  0.55  0.03  0.04  1.00  0.01  0.01 -0.01  0.00  0.00
is_software_role       │  0.08  0.01  0.06  0.05  0.01  1.00  0.35 -0.08  0.00 -0.05
is_data_role           │  0.07  0.01  0.07  0.06  0.01  0.35  1.00 -0.06  0.00 -0.04
is_entry_role          │  0.06 -0.01 -0.03 -0.02 -0.01 -0.08 -0.06  1.00  0.01 -0.28
posting_weekend        │ -0.04  0.00 -0.01 -0.01  0.00  0.00  0.00  0.01  1.00  0.00
is_senior_role         │ -0.03  0.00 -0.02 -0.01  0.00 -0.05 -0.04 -0.28  0.00  1.00
──────────────────────────────────────────────────────────────────────────────────────
Key structural correlations:
  salary_log ↔ has_salary_info  +0.72  same underlying signal, different form
  desc_wc    ↔ desc_density     +0.55  density formula uses word count
  is_software ↔ is_data         +0.35  frequently co-occur in job titles
  is_entry   ↔ is_senior        -0.28  mutually exclusive flags
```

> The heatmap shows no problematic multicollinearity among the features plotted: the highest inter-feature correlation shown (salary_log ↔ has_salary_info at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with log_views are weak, which motivates the move to non-linear tree-based models.
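
A correlation check of this kind can be sketched with pandas. The data below is synthetic and the 0.7 cutoff for flagging a feature pair is an arbitrary illustrative choice, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins: salary_log is zero unless salary is disclosed
toy = pd.DataFrame({"has_salary_info": rng.integers(0, 2, n)})
toy["salary_log"] = toy["has_salary_info"] * rng.normal(11.0, 0.5, n)
toy["log_views"] = 0.3 * toy["has_salary_info"] + rng.normal(1.5, 0.9, n)

corr = toy.corr()  # Pearson correlation by default
target_corr = corr["log_views"].drop("log_views").sort_values(ascending=False)

# Flag strongly related feature pairs, excluding the target itself
features = corr.drop(index="log_views", columns="log_views")
high_pairs = [
    (a, b, round(features.loc[a, b], 2))
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.7
]
```

A flagged pair such as the salary flag and the salary level is structural: one column is a function of the other, so tree-based models can use either without harm.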

---

## ⚙️ Feature Engineering (30 Features, Including 6 Cluster Dummies)

| Group | Features |
|---|---|
| Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
| Text structure | `description_density` ★, `title_desc_ratio` |
| Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
| Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
| Interactions | `desc_salary_interaction` ★, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
| Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |

**Missing value strategy:**
- Columns with >70% missing → dropped
- Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN imputed inside sklearn Pipeline on training data only
- Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
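
Putting the imputer inside the Pipeline is what keeps the imputation leakage-free: the median is learned when the pipeline is fitted on training rows and then reused unchanged at prediction time. A minimal sketch on a toy matrix, with `LinearRegression` standing in for whichever estimator follows:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data: a single feature with a missing value in each split
X_train = np.array([[1.0], [2.0], [np.nan], [100.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[np.nan]])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# The median comes from the observed training values only (1, 2, 100)
train_median = pipe.named_steps["impute"].statistics_[0]
pred = pipe.predict(X_test)  # the test NaN is filled with the training median
```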

---

## 🔵 Clustering — KMeans k=6

**Features used for clustering (12 total, leakage-checked):**
`title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`

**Methods used to select k:**
1. Elbow method — inconclusive, no sharp elbow
2. KMeans silhouette scores on full training matrix
3. Cluster-size stability table
4. Interactive K-Means widget (visualization aid — uses sample)
5. Hierarchical clustering dendrogram (Ward linkage, 300 obs)
6. Agglomerative clustering comparison (k=2–10)

```
Silhouette scores by k (full training matrix):

  k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
  k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
  k=4   ████████████████░░░░  0.312  ← strong BUT largest=72% of data
  k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
  k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★  smallest: 583 (2.5%)
  k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
  k=8+                               singleton clusters appeared

Why NOT k=10 (highest score): singleton cluster (1 observation)
Why NOT k=4 (strong score):   largest cluster = 72% — not meaningful separation
Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
```
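
The selection logic above (prefer a high silhouette score, but reject k values that produce tiny or dominant clusters) can be sketched as follows. The data is synthetic (`make_blobs`), and the cutoffs `MIN_SIZE` and `MAX_SHARE` are hypothetical; the notebook's decision also relied on interpretability, which code cannot capture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sizes = np.bincount(labels, minlength=k)
    rows.append({
        "k": k,
        "silhouette": silhouette_score(X_scaled, labels),
        "smallest": int(sizes.min()),
        "largest_share": sizes.max() / len(labels),
    })

# Keep only k values without tiny or dominant clusters, then take the best score
MIN_SIZE, MAX_SHARE = 5, 0.70  # illustrative cutoffs, not the notebook's
eligible = [
    r for r in rows
    if r["smallest"] >= MIN_SIZE and r["largest_share"] <= MAX_SHARE
]
best = max(eligible, key=lambda r: r["silhouette"])
```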

**Cluster profiles at k=6 (training set n=23,657):**

| Cluster | Label | Size | Share | Key Signal |
|---|---|---|---|---|
| 0 | Manager-focused | 4,571 | 19% | `is_manager_role=1.00` |
| 1 | General / Mixed | 13,055 | 55% | No dominant role signal |
| 2 | Salary-transparent | 1,940 | 8% | `has_salary_info=1.00` |
| 3 | Data roles | 1,451 | 6% | `is_data_role=1.00` |
| 4 | Software roles | 2,057 | 9% | `is_software_role=1.00` |
| 5 | Entry / low salary | 583 | 2% | Smallest cluster |

**Official final silhouette score: 0.290** (full training matrix)

Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
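
One way to build the six dummy columns is shown below. It is a sketch with made-up labels; fixing the category set up front is an assumption that guarantees all six `cluster_*` columns exist even when a split contains no member of some cluster.

```python
import numpy as np
import pandas as pd

K = 6
# Hypothetical cluster assignments for five postings
cluster_labels = np.array([1, 0, 5, 1, 3])

# Declare all K categories so that absent clusters still get a column
cat = pd.Series(pd.Categorical(cluster_labels, categories=list(range(K))))
cluster_dummies = pd.get_dummies(cat, prefix="cluster").astype(int)
```

In a train/test setting the labels for test rows would come from a KMeans model fitted on the training rows, so the dummy columns carry no information from the test set.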

---

## 📈 Regression — Predicting `log1p(views)`

### Baseline

```
Mean Baseline (predict training mean for all observations):
  RMSE_log = 0.8708   R² = -0.0002   ← floor every model must beat
  MAE_views ≈ 10.64

Baseline Linear Regression (20 features, no clustering):
  RMSE_log = 0.8425   R² = 0.0639
```
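
The mean baseline can be reproduced with `DummyRegressor`. The sketch below uses synthetic Poisson counts rather than the real views. It also shows why the baseline's test R² is slightly negative instead of exactly zero: the training mean is never a better constant predictor than the test set's own mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
y_train = np.log1p(rng.poisson(5, size=800))
y_test = np.log1p(rng.poisson(5, size=200))
X_train = np.zeros((800, 1))  # features are ignored by the dummy model
X_test = np.zeros((200, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

rmse_log = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)  # at or slightly below zero on held-out data
```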

### Full Model Comparison (after feature engineering + clustering)

| Model | RMSE_log ↓ | R² ↑ | Notes |
|---|---|---|---|
| **Random Forest (Tuned) ★** | **0.8347** | **0.0811** | RandomizedSearchCV winner |
| Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints |
| Gradient Boosting | 0.8370 | 0.0770 | — |
| Linear Regression + Features | 0.8420 | 0.0640 | — |
| RidgeCV | 0.8420 | 0.0640 | — |
| Lasso Regression | 0.8430 | 0.0640 | — |
| PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance |
| Mean Baseline | 0.8708 | -0.0002 | Floor |

**Key lessons:**
- Unrestricted RF → train R²=0.854, test R²=0.003 (massive overfit). Fixed by `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features` constraints.
- 3-fold CV mean RMSE_log: 0.8747 (±0.0125) — stable across folds
- Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
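
The overfitting fix in the first lesson amounts to searching over constrained tree settings. A minimal sketch with `RandomizedSearchCV` on synthetic weak-signal data follows; the parameter grid, `n_iter`, and `n_estimators` are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=300)  # weak signal, heavy noise

# Constraints that limit tree depth and leaf size to curb overfitting
param_distributions = {
    "max_depth": [4, 6, 8],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
best_rmse = -search.best_score_  # the scorer is negated, so flip the sign
```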

### Top Feature Importances (RF Tuned)

```
description_density          ████████████  #1 — content quality proxy
description_length           ██████████░░  #2 — raw description size
description_word_count       ████████░░░░  #3 — word count
title-description interaction████████░░░░  #4 — combined text signal
is_software_role             ██████░░░░░░  #5 — tech role demand
is_data_role                 █████░░░░░░░  #6 — data role demand
salary_log / has_salary_info ████░░░░░░░░  #7+ — salary signals
```

> `desc_salary_interaction` ranks #2 in SHAP analysis but further down in Gini importance — both agree on description quality and salary as top drivers.

### Why R² = 0.081 Is Acceptable

```
R² = 0.081 → model explains ~8% of variance in log(views+1)

✓ Beats mean baseline (R²≈0) — real posting-level signal captured
✓ Social engagement inherently noisy — platform factors dominate
✓ ~92% of variance is unexplained; platform factors (algorithm, followers, ads) are the likely source
✓ Practical use = ranking postings, not forecasting exact counts
```

---

## 🟠 Classification — High Engagement vs. Normal

```
Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
Feature matrix: X_clf uses 24 features (see notebook cell 207)
Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
```
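
Deriving the threshold from the training split only can be sketched as below, on synthetic counts. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
views_train = rng.poisson(5, size=1000)
views_test = rng.poisson(5, size=250)

# The threshold comes from the training split only and is reused on the test split
threshold = np.percentile(views_train, 75)
y_train_clf = (views_train >= threshold).astype(int)
y_test_clf = (views_test >= threshold).astype(int)
```

Because views are integer counts, ties at the threshold can push the positive class somewhat above 25%.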

### Model Comparison

| Model | F1 (Class 1) | Recall (Class 1) | Notes |
|---|---|---|---|
| **Decision Tree ★** | **HIGHEST** | **HIGHEST** | max_depth=8, class_weight="balanced" |
| Logistic Regression | near-best | high | Close to DT in F1 |
| Random Forest | moderate | lower | Lowest FP count |
| Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 |

**5-fold CV F1: 0.4424 ± 0.0152** — stable, no lucky split
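
A cross-validated F1 comparison against the dummy baseline might look like the sketch below. It uses synthetic imbalanced data from `make_classification` and mirrors only the stated tree settings (`max_depth=8`, `class_weight="balanced"`); scores on this toy data are not comparable to the ones reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly a 75/25 class split
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)

tree = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=42)
dummy = DummyClassifier(strategy="most_frequent")  # always predicts Class 0

# scoring="f1" reports F1 for the positive class (Class 1)
tree_f1 = cross_val_score(tree, X, y, cv=5, scoring="f1")
dummy_f1 = cross_val_score(dummy, X, y, cv=5, scoring="f1")
```

The dummy model never predicts Class 1, so its F1 is zero in every fold, which is why accuracy alone would flatter it.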

### Error Cost Analysis

```
FN (missed high-engagement) = most costly error
  → Company fails to prioritize, promote, or learn from a strong posting

FP (false alarm) = also costly
  → Recruiter wastes time and budget on a posting that won't perform
```

Decision Tree minimises FN (catches most high-engagement postings) but produces more FP.
Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.

---

## 💡 Business Insights

1. **Salary transparency is the single highest-leverage action**: postings that disclose salary average 74.3% more views, and disclosure costs nothing. Fewer than 30% of postings disclose salary today.
2. **Description structure matters**: `description_density` was the #1 feature in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
3. **Tech roles attract disproportionate engagement** — `is_software_role` and `is_data_role` carry real signal beyond salary.
4. **Work type is associated with engagement** — contract roles lead, but full-time dominates volume (80%).
5. **Platform factors dominate** — R²≈0.08 is expected and acceptable. Model value is in **ranking** postings, not exact prediction.

---

## 🎁 Bonus Work

### 🧠 SHAP Explainability

```
SHAP mean |value| — RF Tuned regression (test observations)

description_density      ████████████  strongest positive impact ↑
desc_salary_interaction  ██████████░░  salary × description synergy ↑
salary_log               ████████░░░░  salary level ↑
has_salary_info          ██████░░░░░░  disclosed → more views ↑
posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓
```

`desc_salary_interaction` ranks #2 in SHAP but lower in Gini importance, which suggests it captures a salary × description interaction that neither feature conveys alone.
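
The "mean |value|" ranking is an aggregation over a per-observation SHAP matrix. The sketch below shows only that aggregation step on a small hypothetical matrix (rows are observations, columns are features); it does not compute SHAP values, and the numbers are invented.

```python
import numpy as np

feature_names = ["description_density", "salary_log", "posting_weekend"]

# Hypothetical SHAP matrix: one row per observation, one column per feature
shap_values = np.array([
    [0.30, 0.10, -0.02],
    [-0.20, 0.15, -0.01],
    [0.25, -0.05, 0.03],
])

mean_abs = np.abs(shap_values).mean(axis=0)  # global importance per feature
mean_signed = shap_values.mean(axis=0)       # average direction of the effect
ranking = [feature_names[i] for i in np.argsort(mean_abs)[::-1]]
```

Taking the absolute value first matters: a feature that pushes predictions up for some postings and down for others would otherwise average out to near zero.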

### 📊 Feature Importance: Regression vs Classification

```
                        Regression RF    Classification DT
description_density     #1               #2
desc_salary_interaction #2 (SHAP)        varies
salary_log              #7+              varies
is_entry_role           lower            rises in classification
is_data_role            #6               varies
──────────────────────────────────────────────────────────
Agreement:  description quality + salary dominate both models
Divergence: seniority/role flags matter more for threshold-crossing
            (classification) than for predicting exact counts (regression)
```

### 🔬 Additional Extras

- **Interactive K-Means Widget** — explore different k values visually (notebook cell 4.11)
- **Hierarchical Clustering Dendrogram** — Ward linkage, 300 obs sample (cell 4.12)
- **Agglomerative Clustering Diagnostic** — k=2–10 comparison (cell 4.13)
- **Outlier Robustness Test** — views capped at 99th percentile: RMSE_log 0.8147 vs 0.8347 uncapped
- **3-fold CV for regression** — mean RMSE_log 0.8747 ± 0.0125

---

## 🛠️ How to Use the Models

```python
import pickle

import numpy as np

# Load the fitted models (only unpickle files from a source you trust)
with open("linkedin_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)
with open("linkedin_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_test_fe and X_clf are the feature matrices built by the notebook pipeline;
# they are not defined in this snippet.

# Regression: predict log1p(views), then convert back to a raw view estimate
log_views_pred = reg_model.predict(X_test_fe)
views_pred = np.expm1(log_views_pred)

# Classification: predict the high-engagement label (0 = Normal, 1 = High)
label = clf_model.predict(X_clf)
```

> Regression model expects **30-column** `X_test_fe` (including cluster dummies).
> Classification model expects **24-column** `X_clf` (see notebook cell 207).
> Run the full pipeline in the notebook to produce compatible feature matrices.

---

*Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*