---
tags:
  - regression
  - classification
  - clustering
  - tabular
  - linkedin
  - job-postings
  - sklearn
license: mit
---

# 📊 LinkedIn Job Posting Engagement Analysis

> **Which LinkedIn job posting characteristics predict candidate engagement (views), and how well can engagement be predicted or classified using only posting-level features?**

**Personal motivation:** As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions.

---

## 📹 Presentation Video

<video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>

---

## 📋 Dataset at a Glance

| Property | Value |
|---|---|
| **Source** | [LinkedIn Job Postings: arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
| **Original size** | 123,850 rows × 49 columns |
| **Working sample** | 30,000 rows · `random_state=42` |
| **After join with companies** | 30,000 rows × 40 columns |
| **After cleaning** | 29,572 rows × 51 columns (in `df_model`) |
| **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
| **Regression target** | `log_views = log1p(views)`, log-transformed to handle right skew |
| **Classification target** | `high_engagement`, the top 25% of training views (threshold from training only) |

---

## ⚠️ Scope & Limitations

> LinkedIn's algorithm, sponsored status, and company follower counts likely drive the **majority of view variance** and are **unobservable** in this dataset. The models use posting-level features only. The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships.

---

## 🗂️ Repository Files

| File | Description |
|---|---|
| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Features → Clustering → Regression → Classification → Bonus |
| `linkedin_regression_model.pkl` | Winning model: Random Forest (Tuned) |
| `linkedin_classification_model.pkl` | Winning model: Decision Tree |
| `regression_model_results.csv` | Full regression model comparison |
| `classification_model_results.csv` | Full classification model comparison |

---

## 🧹 Data Cleaning Pipeline

```
Step 1: Reproducible sampling
        123,850 rows → sample(n=30,000, random_state=42)
        Joined with companies.csv on company_id (left join, rows preserved)
        Result: 30,000 rows × 40 columns

Step 2: Duplicate & missing target removal
        Removed duplicate rows
        Dropped rows where views is NaN or negative
        Result: 29,572 usable rows

Step 3: Date parsing
        listed_time, original_listed_time, expiry, closed_time → parsed to datetime
        Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend

Step 4: Missing value analysis & column dropping
        Threshold: >70% missing → drop
        Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
                 remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
        Protected columns: salary fields used for feature engineering (Step 6)
                 before being dropped

Step 5: Leakage columns excluded
        expiry, applies → removed (post-publication outcomes)
        views → kept as target only, not as feature

Step 6: Salary imputation strategy
        has_salary_info = 1 if salary present, else 0
        salary_midpoint computed from min/max salary where available
        Missing salary → imputed inside sklearn Pipeline on training data only

Step 7: Log transformation of target
        Raw views: mean=14.9, std=98.8, max=9,949 (heavily right-skewed)
        log_views = log1p(views) compresses the scale and improves regression fit
        Predictions converted back via expm1() for interpretation
        Outliers (IQR method): 4,074 outliers (13.8%), kept rather than removed
```
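The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic frame, not the notebook's code: the function name `clean_postings`, the `protected` argument, and the demo data are assumptions. In the real pipeline the frame would first be drawn with `df.sample(n=30_000, random_state=42)`.

```python
import numpy as np
import pandas as pd

def clean_postings(df: pd.DataFrame,
                   protected=("min_salary", "max_salary"),
                   missing_threshold: float = 0.70) -> pd.DataFrame:
    """Drop duplicates, unusable targets and sparse columns, then add log_views."""
    out = df.drop_duplicates()
    # Keep only rows with a usable target: views present and non-negative.
    out = out[out["views"].notna() & (out["views"] >= 0)].copy()
    # Drop columns above the missing-value threshold unless they are protected.
    missing_share = out.isna().mean()
    to_drop = [c for c in out.columns
               if missing_share[c] > missing_threshold and c not in protected]
    out = out.drop(columns=to_drop)
    # Log-transform the skewed target; np.expm1 inverts it later.
    out["log_views"] = np.log1p(out["views"])
    return out

# Synthetic stand-in for the sampled postings table (hypothetical values).
raw = pd.DataFrame({
    "views": [3, 3, np.nan, -1, 120, 0, 7],
    "closed_time": [np.nan] * 7,
    "min_salary": [np.nan, np.nan, np.nan, np.nan, 50_000.0, np.nan, np.nan],
    "title": ["a", "a", "b", "c", "d", "e", "f"],
})
cleaned = clean_postings(raw)
print(cleaned)
```

Here `closed_time` is removed because it is entirely missing, while `min_salary` survives despite being 75% missing because it is listed as protected.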

---

## 🔍 EDA: 5 Questions + Correlation Heatmap

**Note:** The notebook numbers the EDA questions differently from the order used here: Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Findings are presented below in order of impact.

### Salary Transparency vs Views (Notebook Q2)

```
No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views   (70.1% of postings)
Has salary info  ████████████████████████░  ~21 avg views   (29.9% of postings)
                                             +74.3% lift ✓
```

> Only 8,562 of 29,572 postings (29.9%) disclose salary. **74.3% more views** for transparent postings. Highest-leverage, lowest-cost recruiter action.

---

### Description Length vs Views (Notebook Q3)

```
< 100 words    ██████░░░░░░░░░░░░░░  low: signals incomplete posting
100–250 words  █████████░░░░░░░░░░░  medium
250–500 words  ████████████████████  PEAK ★: sweet spot
500–750 words  ████████████████░░░░  high
> 1000 words   ███████░░░░░░░░░░░░░  drop-off: overwhelms candidates
```

> Non-linear relationship confirmed. Sweet spot: **250–500 words**. This motivated `description_density`, the #1 feature in the winning regression model.

---

### Day of Week vs Views (Notebook Q4)

```
Monday    ████████████████████  39 avg views  ★ best day (n=1,837)
Tuesday   █████████████████░░░  (weekday)
Wednesday ████████████████░░░░  (weekday)
Thursday  ███████████████░░░░░  (weekday)
Friday    ███░░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
Saturday  ████████████░░░░░░░░  (weekend, noisier; n=2,116 total)
Sunday    ████████████░░░░░░░░  (weekend, noisier)

Weekend average: 28 views vs Weekday average: 22 views
Note: Weekend sample is much smaller (2,116 total), so estimates are noisier.
Weekday postings averaged 21.8% LOWER views than weekend in this dataset.
```

> **Counterintuitive finding:** Weekend postings showed higher average views than weekdays in this sample, BUT weekend volume is very small (2,116 obs) making these estimates unreliable. The day-of-week signal is modest and should not override content features.

---

### Work Type vs Views (Notebook Q1)

```
Contract    ████████████████████  29.97 avg views  7.0 median
Internship  █████████████████░░░  25.71 avg views  5.0 median
Full-time   ████████░░░░░░░░░░░░  13.70 avg views  4.0 median
Other       ███████░░░░░░░░░░░░░  11.27 avg views  4.0 median
Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  4.0 median
```

> Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings). Work type is a useful feature but should not be interpreted as causal.

---

### Seniority Level vs Views (Notebook Q5)

```
Entry-level  ████████████████████  18 avg views  n=792
Senior-level ████████████░░░░░░░░  16 avg views  n=3,577
Other/Mid    ██████████░░░░░░░░░░  15 avg views  n=25,203

Entry vs Senior: +12.4% more views
Entry vs Other:  +18.9% more views
```

> A plausible explanation is a supply-side effect: more candidates qualify for junior roles, so the pool is larger. The entry-level advantage is modest (+12.4% vs senior). `is_entry_role` likely carries predictive signal because it proxies for candidate pool size.

---

### 🔥 Feature Correlation with log(views+1)

```
Feature                      Corr    Direction   Note
─────────────────────────────────────────────────────────────────────
desc_salary_interaction      +0.18   ↑ views     strongest linear correlate
has_salary_info              +0.14   ↑ views     salary transparency
salary_log                   +0.12   ↑ views     salary level
description_density          +0.10   ↑ views     content quality
description_word_count       +0.08   ↑ views     description length
is_software_role             +0.08   ↑ views     tech role demand
is_data_role                 +0.07   ↑ views     data role demand
is_entry_role                +0.06   ↑ views     larger candidate pool
posting_weekend              -0.04   ↓ views     (small negative)
is_senior_role               -0.03   ↓ views     smaller candidate pool
─────────────────────────────────────────────────────────────────────
Internal correlations (structural):
salary_log ↔ salary_midpoint  +0.96  log transform of same variable
desc_wc ↔ desc_density        +0.55  density uses length in formula
is_software ↔ is_data         +0.35  often co-occur in job titles
is_senior ↔ is_entry          -0.28  mutually exclusive by construction
─────────────────────────────────────────────────────────────────────
```

> Most features show **weak linear correlation**, and no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting), which capture non-linear interactions and feature combinations.

---

## ⚙️ Feature Engineering: 30 Total Features

**Note:** The notebook creates 20 engineered features before clustering and later adds 6 cluster dummy columns. The final feature matrix has 30 columns (X_train_fe shape: 23,657 × 30).

| Group | Features |
|---|---|
| Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
| Text structure | `description_density`, `title_desc_ratio` |
| Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
| Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
| Interactions | `desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
| Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
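
A few of the features in the table can be sketched as below. The formulas are assumptions for illustration (midpoint as the mean of min and max salary, range as their difference, `salary_log` as `log1p` of the midpoint), and the keyword list behind `is_senior_role` is hypothetical; the notebook's exact definitions may differ.

```python
import numpy as np
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add text-length, salary and one keyword feature (assumed formulas)."""
    out = df.copy()
    title = out["title"].fillna("")
    desc = out["description"].fillna("")
    # Text length features: characters and whitespace-separated words.
    out["title_length"] = title.str.len()
    out["title_word_count"] = title.str.split().str.len()
    out["description_length"] = desc.str.len()
    out["description_word_count"] = desc.str.split().str.len()
    # Salary features (assumed definitions).
    out["salary_midpoint"] = (out["min_salary"] + out["max_salary"]) / 2
    out["salary_range"] = out["max_salary"] - out["min_salary"]
    out["has_salary_info"] = out["salary_midpoint"].notna().astype(int)
    out["salary_log"] = np.log1p(out["salary_midpoint"])  # NaN is left for the imputer
    # Role keyword flag; this keyword list is hypothetical.
    out["is_senior_role"] = title.str.contains(
        r"\b(?:senior|sr|lead|principal)\b", case=False, regex=True
    ).astype(int)
    return out

demo = pd.DataFrame({
    "title": ["Senior Data Engineer", "Sales Associate"],
    "description": ["Build pipelines for analytics teams", None],
    "min_salary": [100_000.0, np.nan],
    "max_salary": [140_000.0, np.nan],
})
features = add_basic_features(demo)
print(features.T)
```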

**Missing value strategy:**
- Columns with >70% missing → dropped (closed_time, skills_desc, med_salary, remote_allowed, applies, salary min/max, compensation fields)
- Salary → `has_salary_info` flag + `salary_midpoint` computed where possible; remaining salary NaN imputed inside sklearn Pipeline on training data only
- Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
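
Imputing inside a Pipeline means the medians are learned from the training rows only and then reused on the test rows. A minimal sketch on synthetic data (the data, the `LinearRegression` estimator, and the step names are assumptions, not the notebook's setup):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)
# Blank out roughly 30% of one column to mimic missing salary values.
X[rng.random(200) < 0.3, 1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("reg", LinearRegression()),
])
model.fit(X_train, y_train)        # medians are learned from X_train only
test_pred = model.predict(X_test)  # test gaps are filled with training medians

train_median = np.nanmedian(X_train[:, 1])
learned_median = model.named_steps["impute"].statistics_[1]
print(train_median, learned_median)
```

Because the imputer is a pipeline step, cross-validation refits it on each training fold, so no statistic from held-out rows leaks into the fill values.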

---

## 🔵 Clustering: KMeans k=6

**Clustering features used (12 total, leakage-checked):**
`title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`

**Methods used to select k:**
1. Elbow method (inertia, k=2–10): inconclusive, no sharp elbow
2. K-Means silhouette scores on full training matrix
3. Cluster-size stability table (smallest/largest cluster per k)
4. Interactive K-Means widget (visualization aid only; uses a sample)
5. Hierarchical clustering dendrogram (Ward linkage, 300 obs sample)
6. Agglomerative Clustering diagnostic comparison (k=2–10 on sample)

```
Chart 1: Actual silhouette scores by k (full training matrix)

  k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
  k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
  k=4   ████████████████░░░░  0.312  ← strong score BUT largest=72%
  k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
  k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★ smallest: 583 (2.5%)
  k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
  k=8   █████████████░░░░░░░  0.315  singleton cluster appeared
  k=9   █████████████░░░░░░░  0.314  singleton cluster appeared
  k=10  ██████████████░░░░░░  0.350  singleton cluster appeared

Why NOT k=10 (highest score): singleton cluster (1 observation)
Why NOT k=4 (strong score):   largest cluster = 72% of observations
Why k=6: no singletons, stable sizes, silhouette 0.290, interpretable profiles

Note: Elbow method was inconclusive (inertia 255,430 at k=2 → 98,508 at k=10,
no sharp elbow). Agglomerative diagnostic was best at k=2 (score 0.467 on sample),
which is too coarse. k=6 selected as a practical compromise across all methods.

Chart 2: Actual cluster sizes at k=6 (training set n=23,657)

  Cluster 0: Manager-focused     ████████████           4,571  (19%)  is_manager_role=1.00
  Cluster 1: General / Mixed     ████████████████████  13,055  (55%)  no dominant role signal
  Cluster 2: Salary-transparent  ████                   1,940   (8%)  has_salary_info=1.00
  Cluster 3: Data roles          ███                    1,451   (6%)  is_data_role=1.00
  Cluster 4: Software roles      █████                  2,057   (9%)  is_software_role=1.00
  Cluster 5: Entry / low salary  ██                       583   (2%)  smallest cluster

Official final silhouette score: 0.290 (full training matrix)
```

Cluster labels are one-hot encoded as 6 dummy features. Including them improved both regression RMSE and classification F1 over models without them.
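
The selection logic described above (compare silhouette scores, reject k values with tiny or dominant clusters, then one-hot encode the labels) might look like the sketch below. The synthetic blobs, the size thresholds (at least 10 points, at most 60% share), and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_raw, _ = make_blobs(n_samples=600, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X_raw)

rows = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    sizes = np.bincount(km.labels_, minlength=k)
    rows.append({
        "k": k,
        "silhouette": silhouette_score(X_scaled, km.labels_),
        "smallest": int(sizes.min()),
        "largest_share": sizes.max() / len(X_scaled),
    })
report = pd.DataFrame(rows)

# Hypothetical stability rule: no tiny clusters, no single dominant cluster.
ok = report[(report["smallest"] >= 10) & (report["largest_share"] <= 0.6)]
if ok.empty:  # fall back to silhouette alone if nothing passes the rule
    ok = report
chosen_k = int(ok.sort_values("silhouette", ascending=False)["k"].iloc[0])

final = KMeans(n_clusters=chosen_k, n_init=10, random_state=42).fit(X_scaled)
cluster_dummies = pd.get_dummies(
    pd.Series(final.labels_), prefix="cluster").astype(int)
print(report.round(3))
print(chosen_k, cluster_dummies.columns.tolist())
```

For held-out rows, the labels would come from `final.predict(...)` on data scaled with the training scaler, so that the test set never influences the cluster centers.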

---

## 📈 Regression: Predicting `log1p(views)`

### Baseline

```
Mean Baseline (predict training mean for all observations):
  RMSE_log = 0.8708   R² = -0.0002   ← floor every model must beat
  MAE_views ≈ 10.64

Baseline Linear Regression (20 features, no clustering):
  RMSE_log = 0.8425   R² = 0.0639
  MAE_views ≈ 10.54
```
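
A mean baseline like the one above can be reproduced with `DummyRegressor`. The sketch uses synthetic log-normal view counts (an assumption, not the real data) and shows why its test R² sits at or slightly below zero: a constant prediction equal to the training mean can never beat the test mean on the test set.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
# Synthetic right-skewed view counts standing in for the real target.
views_train = rng.lognormal(mean=1.5, sigma=1.0, size=800).round()
views_test = rng.lognormal(mean=1.5, sigma=1.0, size=200).round()
y_train, y_test = np.log1p(views_train), np.log1p(views_test)

# Features are ignored by the dummy model, so placeholders are enough.
baseline = DummyRegressor(strategy="mean").fit(np.zeros((800, 1)), y_train)
pred = baseline.predict(np.zeros((200, 1)))

rmse_log = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"RMSE_log={rmse_log:.4f}  R2={r2:.4f}")
```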

### Full model comparison (after feature engineering + clustering)

```
Model                        RMSE_log ↓    R² ↑
─────────────────────────────────────────────────────
Random Forest (Tuned)  ★     0.8347        0.0811
Random Forest (Ctrl)         0.8349        0.0807
Gradient Boosting            0.8370        0.0770
Linear Regression + Feat     0.8420        0.0640
RidgeCV                      0.8420        0.0640
Lasso Regression             0.8430        0.0640
PCA + Linear Regression      0.8440        0.0600
Mean Baseline                0.8708       -0.0002
─────────────────────────────────────────────────────
Winner: RandomizedSearchCV tuned RF
Improvement over manually controlled RF: 0.0002 RMSE_log (practically negligible)
3-fold CV mean RMSE_log: 0.8747 (±0.0125), stable across folds
Overfitting lesson: unrestricted RF → train R²=0.854, test R²=0.003
Fixed by: max_depth, min_samples_split, min_samples_leaf, max_features constraints
Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
```
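
The overfitting fix listed above amounts to searching over constrained tree settings. A minimal sketch with `RandomizedSearchCV` follows; the parameter ranges, `n_estimators=30`, `n_iter=4`, and the synthetic data are illustrative assumptions, not the notebook's actual search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=300)  # weak signal, heavy noise

# Constraints that stop trees from memorising noise (illustrative ranges).
param_distributions = {
    "max_depth": [4, 6, 8],
    "min_samples_split": [10, 20],
    "min_samples_leaf": [5, 10],
    "max_features": ["sqrt", 0.5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
cv_rmse = -search.best_score_   # scorer is negated RMSE, so flip the sign
best_params = search.best_params_
print(best_params, round(cv_rmse, 4))
```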

### Top feature importances (RF Tuned)

```
description_density           ████████████  #1: content quality
description_length            ██████████░░  #2: raw description size
description_word_count        ████████░░░░  #3: word count
title-description interaction ████████░░░░  #4: combined signal
is_software_role              ██████░░░░░░  #5: tech role demand
is_data_role                  █████░░░░░░░  #6: data role demand
salary_log / has_salary_info  ████░░░░░░░░  #7+: salary signals
```

> **Note:** desc_salary_interaction ranked #2 in SHAP analysis but further down in Gini importance. Both agree on description quality and salary as top drivers.

### Regression interpretation

```
R² = 0.081 → model explains ~8% of variance in log(views+1)

Why acceptable:
  ✓ Beats mean baseline (R²≈0): real posting-level signal captured
  ✓ Social engagement is inherently noisy; platform factors likely dominate
  ✓ ~92% of variance is unexplained by posting-level features
    (plausible sources: algorithm, followers, ads)
  ✓ Practical use = ranking postings, not forecasting exact counts

PCA + Linear: reduced to 15 components (96.3% variance preserved), no improvement
Gradient Boosting marginally worse than RF: non-linear models help, but only modestly
```

---

## 🟠 Classification: High Engagement vs. Normal

```
Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
Feature matrix: X_clf uses 24 features (not the full 30; see notebook cell 207)
Training: 23,657 obs | Test: 5,915 obs
Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
```
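
Deriving the threshold from training views only, then reusing it for the test set, can be sketched like this. The view counts are synthetic and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic right-skewed view counts standing in for the real split.
views_train = rng.lognormal(1.5, 1.0, size=800).round()
views_test = rng.lognormal(1.5, 1.0, size=200).round()

# The cut-off is learned from the training views only...
threshold = np.percentile(views_train, 75)
y_train_clf = (views_train >= threshold).astype(int)
# ...and the same cut-off is reused for the test views, so no test
# information shapes the label definition.
y_test_clf = (views_test >= threshold).astype(int)

print(threshold, y_train_clf.mean(), y_test_clf.mean())
```

Because view counts are integers with many ties, the positive share in training can land slightly above 25%.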

### Model comparison

```
Model                  F1 (C1)    Recall (C1)   Notes
──────────────────────────────────────────────────────────────
Decision Tree     ★    HIGHEST    HIGHEST       lowest FN count
Logistic Regr.         near-best  high          close to DT
Random Forest          moderate   lower         lowest FP count
Dummy Baseline         0.00       0.00          always predicts Class 0
──────────────────────────────────────────────────────────────
Winner: Decision Tree (max_depth=8, class_weight="balanced")
5-fold CV F1: 0.4424 ± 0.0152 (stable, no lucky split)
```
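
A sketch of the winning configuration against the dummy baseline, on synthetic data with a similar 75/25 class balance. The dataset, the stratified split, and the variable names are assumptions; only `max_depth=8` and `class_weight="balanced"` come from the results above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem: roughly 75% class 0, 25% class 1.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.75, 0.25], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

tree = DecisionTreeClassifier(max_depth=8, class_weight="balanced",
                              random_state=42)
# F1 for the positive class, averaged over 5 folds of the training set.
cv_f1 = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1")

tree.fit(X_train, y_train)
tree_f1 = f1_score(y_test, tree.predict(X_test))

# Always predicting the majority class yields F1 = 0 for Class 1.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
dummy_f1 = f1_score(y_test, dummy.predict(X_test), zero_division=0)

print(f"CV F1: {cv_f1.mean():.4f} +/- {cv_f1.std():.4f}")
print(f"Test F1: tree={tree_f1:.4f}, dummy={dummy_f1:.4f}")
```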

### Confusion matrix (all models, from notebook)

```
Decision Tree:     lowest FN (catches most high-engagement), most false positives
Random Forest:     lowest FP (fewest false alarms), misses most high-engagement
Logistic Regr.:    between the two, close to DT in F1

FN (missed high-engagement) = most costly error:
  Company fails to prioritize, promote, or learn from a valuable listing.
FP (false alarm) = also costly:
  Recruiters waste attention on postings that are not actually strong.
```

---

## 💡 Business Insights (from notebook cell 242)

1. **Salary transparency is associated with higher engagement**: 74.3% more views. Fewer than 30% of postings disclose salary today.
2. **Description structure matters**: density ranked #1 in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
3. **Tech roles attract more engagement**: software and data role flags carry signal beyond salary.
4. **Work type is associated with engagement**: contract roles lead, but full-time dominates volume.
5. **Platform factors likely dominate**: R²≈0.08 is expected. Model value is in ranking, not exact prediction.

---

## 🎁 Bonus Work

### 🚀 Interactive Dashboard

👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**

| Tab | Description |
|---|---|
| 🎯 Engagement Predictor | Real-time predicted views + High/Normal classification |
| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
| ℹ️ About | Feature groups, model details, limitations |

### 🧠 SHAP Explainability

```
SHAP mean |value|: RF Tuned regression (test observations)

description_density      ████████████  strongest ↑
desc_salary_interaction  ██████████░░  salary × description synergy ↑
salary_log               ████████░░░░  salary level ↑
has_salary_info          ██████░░░░░░  disclosed → more views ↑
posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓

Key finding: desc_salary_interaction ranks #2 in SHAP but lower in Gini,
which suggests it captures a non-linear interaction beyond the individual features.
```

### 📊 Feature Importance: Regression vs Classification

```
                        Regression RF    Classification DT
description_density     #1               #2
desc_salary_interaction varies           varies
salary_log              #7+              varies
is_entry_role           lower            rises in classification
is_data_role            #6               varies
─────────────────────────────────────────────────────────
Agreement: description quality + salary dominate both models
Divergence: seniority/role flags matter more for threshold-crossing
            (classification) than for predicting exact counts (regression)
```

### 🔬 Additional Bonus Items

- **Interactive K-Means Widget**: explore different k values visually in the notebook (cell 4.11)
- **Hierarchical Clustering Dendrogram**: Ward linkage, 300 obs sample (cell 4.12)
- **Agglomerative Clustering Diagnostic**: k=2–10 comparison (cell 4.13)
- **Outlier Robustness Test**: views capped at 99th percentile, RMSE_log 0.8147 vs 0.8347 uncapped
- **3-fold CV for regression**: mean RMSE_log 0.8747 ± 0.0125

---

## 🛠️ How to Use the Models

```python
import pickle

import numpy as np

# Load the fitted models shipped in this repository.
with open("linkedin_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)
with open("linkedin_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_test_fe (30 columns) and X_clf (24 columns) are produced by the
# notebook's feature pipeline; they are not defined in this snippet.

# Regression: predict log1p(views), then convert back to the views scale
log_views_pred = reg_model.predict(X_test_fe)
views_pred = np.expm1(log_views_pred)

# Classification: predict the high-engagement label (0 or 1)
label = clf_model.predict(X_clf)
```

> Regression model expects 30-column X_test_fe (with cluster dummies). Classification model expects 24-column X_clf. Run the full pipeline in the notebook to produce compatible inputs.
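
Since the two models expect different column counts, a small guard can catch a mismatched matrix before prediction. The helper below is a hypothetical sketch: `predict_views` is not part of the repository, and a stand-in `LinearRegression` replaces the pickled model so the example runs on its own. It relies on the `n_features_in_` attribute that fitted scikit-learn estimators expose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def predict_views(model, X) -> np.ndarray:
    """Check the column count, predict log1p(views), return views."""
    expected = getattr(model, "n_features_in_", None)
    if expected is not None and X.shape[1] != expected:
        raise ValueError(
            f"model expects {expected} columns, got {X.shape[1]}")
    return np.expm1(model.predict(X))

# Stand-in for the pickled regression model, fitted on 30 random columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 30))
stand_in = LinearRegression().fit(X_demo, np.log1p(rng.poisson(5, size=50)))

views_pred = predict_views(stand_in, X_demo[:5])

# Passing a 24-column matrix to the 30-column model is rejected early.
try:
    predict_views(stand_in, X_demo[:, :24])
    rejected = False
except ValueError:
    rejected = True
print(views_pred, rejected)
```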

---

*Assignment 2: Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*