MichaelYitzchak committed on
Commit 802bb89 · verified · 1 Parent(s): 17077e4

Delete README.md

Files changed (1)
  1. README.md +0 -465
README.md DELETED
@@ -1,465 +0,0 @@
1
- ---
2
- tags:
3
- - regression
4
- - classification
5
- - clustering
6
- - tabular
7
- - linkedin
8
- - job-postings
9
- - sklearn
10
- license: mit
11
- ---
12
-
13
- # 📊 LinkedIn Job Posting Engagement Analysis
14
-
15
- > **Which LinkedIn job posting characteristics predict candidate engagement (views) — and how well can engagement be predicted or classified using only posting-level features?**
16
-
17
- ---
18
-
19
- ## 📹 Presentation Video
20
-
21
- <video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>
22
-
23
- ---
24
-
25
- ## 📋 Dataset at a Glance
26
-
27
- | Property | Value |
28
- |---|---|
29
- | **Source** | [LinkedIn Job Postings — arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
30
- | **Original size** | 123,850 rows × 49 columns |
31
- | **Working sample** | 30,000 rows · `random_state=42` |
32
- | **Target (regression)** | `log_views = log(views + 1)` — log-transformed to handle right skew |
33
- | **Target (classification)** | `high_engagement` — top 25% of views, threshold from training data only |
34
-
35
- ---
36
-
37
- ## ⚠️ Scope & Limitations
38
-
39
- > Platform-level signals — LinkedIn's algorithm, sponsored status, company follower counts — drive the **majority of view variance** and are **not observable** in this dataset. Models built here use only posting-level features (content, salary, timing, role type). The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships.
40
-
41
- ---
42
-
43
- ## 🗂️ Repository Files
44
-
45
- | File | Description |
46
- |---|---|
47
- | `notebook.ipynb` | Full pipeline: Cleaning → EDA → Features → Clustering → Regression → Classification → Bonus |
48
- | `linkedin_regression_model.pkl` | Winning model: Random Forest (Tuned) |
49
- | `linkedin_classification_model.pkl` | Winning model: Decision Tree |
50
- | `regression_model_results.csv` | Full regression model comparison |
51
- | `classification_model_results.csv` | Full classification model comparison |
52
-
53
- ---
54
-
55
- ## 🧹 Data Cleaning Pipeline
56
-
57
- ```
58
- Step 1 — Reproducible sampling
59
- 123,850 rows → sample(n=30,000, random_state=42)
60
- reset_index(drop=True) for clean alignment
61
-
62
- Step 2 — Duplicate & missing target removal
63
- Dropped duplicate rows
64
- Dropped rows where views is NaN (no target = unusable)
65
-
66
- Step 3 — High-missingness columns dropped
67
- Threshold: >70% missing → drop
68
- Exception: salary, title, description retained for feature engineering
69
-
70
- Step 4 — Leakage columns excluded
71
- applies, closed_time → removed
72
- These are post-publication outcomes unavailable at posting creation time
73
-
74
- Step 5 — Salary imputation strategy
75
- has_salary_info = 1 if salary present, else 0
76
- salary_midpoint → median imputed INSIDE pipeline (training data only)
77
- Prevents data leakage from test set into imputation
78
-
79
- Step 6 — Categorical columns filled
80
- formatted_work_type, formatted_experience_level → "Unknown" where missing
81
- Then one-hot encoded before modeling
82
-
83
- Step 7 — Log transformation of target
84
- Raw views: heavily right-skewed (mean >> median, max >> 99th pct)
85
- log_views = log(views + 1) — compresses scale, improves regression fit
86
- Predictions converted back via expm1() for interpretation
87
- ```
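Steps 5 and 7 above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the helper names `fit_salary_median` and `prepare_row` and the toy rows are hypothetical, and the real pipeline presumably relies on pandas and scikit-learn transformers instead.

```python
import math
from statistics import median


def fit_salary_median(train_rows):
    """Median of the salaries present in the TRAINING rows only."""
    present = [r["salary_midpoint"] for r in train_rows
               if r["salary_midpoint"] is not None]
    return median(present)


def prepare_row(row, salary_median):
    """Add has_salary_info, impute salary_midpoint, log-transform views."""
    has_salary = 1 if row["salary_midpoint"] is not None else 0
    salary = row["salary_midpoint"] if has_salary else salary_median
    return {
        "has_salary_info": has_salary,
        "salary_midpoint": salary,
        "log_views": math.log1p(row["views"]),  # log(views + 1)
    }


# Toy rows (hypothetical values, for illustration only).
train = [
    {"salary_midpoint": 50_000, "views": 10},
    {"salary_midpoint": None, "views": 3},
    {"salary_midpoint": 90_000, "views": 250},
]
test = [{"salary_midpoint": None, "views": 40}]

# The imputation value is learned from the training rows and reused on test.
salary_median = fit_salary_median(train)
prepared_test = [prepare_row(r, salary_median) for r in test]

# Invert the target transform for interpretation.
views_back = math.expm1(prepared_test[0]["log_views"])
```

Because the median is computed before the test rows are touched, no information from the test set reaches the imputation step, which is the point of Step 5.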
88
-
89
- ---
90
-
91
- ## 🔍 EDA — 5 Questions + Correlation Heatmap
92
-
93
- ### Q1 — Does salary transparency increase views?
94
-
95
- ```
96
- No salary info ████████████░░░░░░░░░░░░░░░░░░░ ~180 avg views
97
- Has salary info ████████████████████████████████ ~340 avg views
98
- +89% lift ✓
99
- ```
100
-
101
- > Fewer than half of postings disclose salary. Highest-leverage, lowest-cost recruiter action. Effect holds across work types — not a proxy for role category.
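The lift figure follows directly from the two approximate averages in the chart. A small worked check, where `relative_lift` is a hypothetical helper name:

```python
def relative_lift(baseline, treated):
    """Relative increase of `treated` over `baseline`, as a fraction."""
    return (treated - baseline) / baseline


# Approximate average views read off the chart above.
lift = relative_lift(baseline=180, treated=340)
print(f"{lift:.0%}")  # prints 89%
```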
102
-
103
- ---
104
-
105
- ### Q2 — Does description length affect engagement?
106
-
107
- ```
108
- < 100 words ██████░░░░░░░░░░░░░░ 2.1 mean log_views — signals incomplete
109
- 100–250 words █████████░░░░░░░░░░░ 2.8
110
- 250–500 words ████████████████████ 3.6 ← PEAK (sweet spot)
111
- 500–750 words ████████████████░░░░ 3.3
112
- 750–1000 words ████████████░░░░░░░░ 3.0
113
- > 1000 words ███████░░░░░░░░░░░░░ 2.5 — overwhelms candidates
114
- ```
115
-
116
- > Non-linear. Sweet spot 250–500 words. Motivated `description_density` (words per character) — the #1 feature in the winning regression model.
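As a rough sketch of the `description_density` feature, taking the "words per character" definition literally. The exact tokenisation in the notebook is not shown here, so whitespace splitting and counting every character (including spaces) are assumptions:

```python
def description_density(text):
    """Words per character; 0.0 for empty text to avoid division by zero."""
    n_chars = len(text)
    if n_chars == 0:
        return 0.0
    # Assumption: words are whitespace-separated tokens.
    return len(text.split()) / n_chars


example = description_density("data analyst")  # 2 words / 12 characters
```

Under this definition, text made of short words scores higher than text made of long words at the same character length.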
117
-
118
- ---
119
-
120
- ### Q3 — Does posting day of week matter?
121
-
122
- ```
123
- Tuesday ████████████████████ 245 avg views ★ best weekday
124
- Wednesday ██████████████████░░ 235 avg views
125
- Monday █████████████████░░░ 220 avg views
126
- Thursday ████████████████░░░░ 215 avg views
127
- Friday ████████████░░░░░░░░ 205 avg views
128
- Saturday ███████░░░░░░░░░░░░░ 148 avg views ← weekend — noisier
129
- Sunday ██████░░░░░░░░░░░░░░ 132 avg views ← weekend — noisier
130
- ```
131
-
132
- > Weekday posts outperform weekends. Candidates browse during business hours. Weekend volume is much smaller — estimates are noisier. Day-of-week effect real but modest vs content features.
133
-
134
- ---
135
-
136
- ### Q4 — Do entry-level roles attract more views?
137
-
138
- ```
139
- Entry-level ████████████████████ ~290 avg views ★
140
- Other / Mid ████████████░░░░░░░░ ~210 avg views
141
- Senior-level ████████░░░░░░░░░░░░ ~175 avg views
142
- ```
143
-
144
- > Supply-side effect — more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for candidate pool size, not posting quality.
145
-
146
- ---
147
-
148
- ### Q5 — Does work type affect engagement?
149
-
150
- ```
151
- Contract ████████████████████ ~310 avg views ★
152
- Internship ████████████████░░░░ ~275 avg views
153
- Part-time █████████████░░░░░░░ ~235 avg views
154
- Full-time ███████████░░░░░░░░░ ~205 avg views
155
- Temporary █████████░░░░░░░░░░░ ~185 avg views
156
- ```
157
-
158
- > Contract and internship roles attract more active job-seekers. Work type is a useful predictor but correlates with other features — not a standalone causal explanation.
159
-
160
- ---
161
-
162
- ### 🔥 Feature Correlation with log(views+1)
163
-
164
- ```
165
- Feature Corr Direction Note
166
- ─────────────────────────────────────────────────────────────────────
167
- desc_salary_interaction +0.18 ↑ views strongest predictor
168
- has_salary_info +0.14 ↑ views salary transparency
169
- salary_log +0.12 ↑ views salary level
170
- description_density +0.10 ↑ views content quality
171
- description_word_count +0.08 ↑ views description length
172
- is_software_role +0.08 ↑ views tech role demand
173
- is_data_role +0.07 ↑ views data role demand
174
- is_entry_role +0.06 ↑ views larger pool
175
- posting_weekend -0.04 ↓ views weekend underperforms
176
- is_senior_role -0.03 ↓ views smaller pool
177
- ─────────────────────────────────────────────────────────────────────
178
- Internal correlations (structural — expected):
179
- salary_log ↔ salary_midpoint +0.96 log transform of same variable
180
- desc_wc ↔ desc_density +0.55 density uses length in denominator
181
- is_software ↔ is_data +0.35 often co-occur in job titles
182
- is_senior ↔ is_entry -0.28 mutually exclusive by construction
183
- ─────────────────────────────────────────────────────────────────────
184
- ```
185
-
186
- > Most individual features show weak linear correlation — motivating tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and combinations.
187
-
188
- ---
189
-
190
- ## ⚙️ Feature Engineering — 30 Features, 8 Groups
191
-
192
- | Group | Features |
193
- |---|---|
194
- | Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
195
- | Text structure | `description_density`, `title_desc_ratio` |
196
- | Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
197
- | Timing | `posting_dayofweek`, `posting_weekend` |
198
- | Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role` |
199
- | Interactions | `desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
200
- | Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
201
-
202
- ---
203
-
204
- ## 🔵 Clustering — KMeans k=6
205
-
206
- ```
207
- Evaluation: elbow method + silhouette scores for k=2 through k=10
208
- Selected k: 6
209
- Rejected: k=7, k=8 — produced near-singleton clusters (outlier isolation)
210
- Silhouette: 0.289 (weak-to-moderate separation — expected for overlapping job types)
211
- Leakage: cluster preprocessor fit on TRAINING DATA ONLY
212
-
213
- Silhouette score by k (approximate):
214
- k=2 ████████████████████ 0.38 (too coarse)
215
- k=4 ████████████████░░░░ 0.31
216
- k=6 ████████████░░░░░░░░ 0.289 ← selected (best size + score balance)
217
- k=8 ██████░░░░░░░░░░░░░░ 0.21 (singleton clusters)
218
-
219
- Cluster size distribution:
220
- Cluster 0 — General / Mixed ████████████ ~28%
221
- Cluster 1 — High-Salary Specialist ███████ ~18%
222
- Cluster 2 — Tech & Software █████████ ~22%
223
- Cluster 3 — Entry-Level / Volume █████ ~12%
224
- Cluster 4 — Contract & Flexible ████ ~10%
225
- Cluster 5 — Senior Leadership ████ ~10%
226
-
227
- PCA 2D projection: Cluster 2 (Tech) and Cluster 5 (Senior) show clearest
228
- separation. Clusters 0 and 3 overlap — consistent with silhouette 0.289.
229
- Two PCs explain ~35–45% of total variance.
230
- ```
231
-
232
- Cluster labels one-hot encoded as 6 dummy features, added to both models. Including clusters improved regression RMSE and classification F1 over models without them.
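A dependency-free sketch of how an integer cluster label could be expanded into the six `cluster_*` dummy columns. The helper name `one_hot_cluster` is hypothetical; the notebook may use a pandas or scikit-learn encoder instead:

```python
N_CLUSTERS = 6  # k selected in the clustering step


def one_hot_cluster(label, n_clusters=N_CLUSTERS):
    """Map an integer cluster label to cluster_0 .. cluster_5 indicator columns."""
    if not 0 <= label < n_clusters:
        raise ValueError(f"cluster label {label} outside 0..{n_clusters - 1}")
    return {f"cluster_{i}": int(i == label) for i in range(n_clusters)}


dummies = one_hot_cluster(2)
```

Exactly one of the six columns is 1 for each posting, so the column order must match the order used when the models were trained.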
233
-
234
- ---
235
-
236
- ## 📈 Regression — Predicting `log(views + 1)`
237
-
238
- ### Baseline model first
239
-
240
- ```
241
- Mean Baseline (predict training mean for all):
242
- RMSE_log = 0.871
243
- R² ≈ 0.000
244
-
245
- This is the minimum bar every model must beat.
246
- ```
247
-
248
- ### Full model comparison
249
-
250
- ```
251
- Model RMSE_log ↓ R² ↑ Improvement
252
- ──────────────────────────────────────────────────────────────────
253
- Random Forest (Tuned) ★ 0.8347 ████ 0.0811 best overall
254
- Random Forest (Ctrl) 0.8349 ████ 0.0807
255
- Gradient Boosting 0.8370 ███ 0.0770
256
- Linear Regression + Feat 0.8420 ██ 0.0640
257
- RidgeCV 0.8420 ██ 0.0640
258
- Lasso 0.8430 ██ 0.0640
259
- PCA + Linear Regression 0.8440 ██ 0.0600
260
- Mean Baseline 0.8710 ░ ≈0.000 ← floor
261
- ──────────────────────────────────────────────────────────────────
262
- Winner: n_estimators=300, max_depth=12, min_samples_split=10
263
- min_samples_leaf=5, max_features="sqrt"
264
- Tuned via RandomizedSearchCV (12 iter, 5-fold CV)
265
-
266
- Overfitting lesson: initial RF had train R²=0.85, test R²≈0.
267
- Fixed by constraining max_depth, min_samples_leaf, max_features.
268
- ```
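The RMSE and R² columns can be cross-checked against each other. When the baseline predicts a constant close to the test mean, its squared RMSE approximates the variance of the target, so R² ≈ 1 − (RMSE_model / RMSE_baseline)². The sketch below applies that approximation to figures from the table; it is only a consistency check, since the baseline here uses the training mean rather than the test mean:

```python
BASELINE_RMSE = 0.871  # mean baseline, from the table above


def approx_r2(model_rmse, baseline_rmse=BASELINE_RMSE):
    """Approximate R^2 from the ratio of model RMSE to mean-baseline RMSE."""
    return 1.0 - (model_rmse / baseline_rmse) ** 2


# (RMSE_log, reported R^2) pairs copied from the comparison table.
reported = {
    "Random Forest (Tuned)": (0.8347, 0.0811),
    "Gradient Boosting": (0.8370, 0.0770),
    "PCA + Linear Regression": (0.8440, 0.0600),
}

for name, (rmse, r2) in reported.items():
    print(f"{name}: reported R2={r2:.4f}, approx R2={approx_r2(rmse):.4f}")
```

The approximated values land within a few thousandths of the reported ones, which suggests the two columns are mutually consistent.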
269
-
270
- ### Regression model interpretation
271
-
272
- ```
273
- R² = 0.081 means the model explains ~8% of variance in log(views+1).
274
-
275
- Why this is acceptable:
276
- ✓ Compared with mean baseline (R²≈0), real signal IS being captured
277
- ✓ Social engagement is inherently noisy — platform factors dominate
278
- ✓ Models predicting social engagement typically achieve R²=0.05–0.20
279
- using content features alone
280
- ✓ Practical use = ranking postings, not exact prediction
281
-
282
- Residuals pattern:
283
- Large errors concentrate on viral postings (top 1% of views)
284
- These are driven by external promotion not captured in features
285
- Capping views at 99th percentile reduces RMSE but doesn't change
286
- feature importance ranking
287
- ```
288
-
289
- ### Top feature importances (Random Forest Tuned)
290
-
291
- ```
292
- Feature Importance Interpretation
293
- ──────────────────────────────────────────────────────────────────
294
- description_density ████████████ 0.142 content quality
295
- desc_salary_interaction ██████████░░ 0.125 salary × description synergy
296
- salary_log ████████░░░░ 0.102 salary level
297
- description_length ███████░░░░░ 0.092 raw description size
298
- has_salary_info ██████░░░░░░ 0.078 salary disclosed (binary)
299
- is_software_role █████░░░░░░░ 0.062 tech role demand
300
- description_word_count ████░░░░░░░░ 0.051 word count
301
- cluster features (avg) ████░░░░░░░░ 0.048 posting segment
302
- ──────────────────────────────────────────────────────────────────
303
- ```
304
-
305
- ---
306
-
307
- ## 🟠 Classification — High Engagement vs. Normal
308
-
309
- ```
310
- Target definition:
311
- high_engagement = 1 if views >= 75th percentile of TRAINING views
312
- high_engagement = 0 otherwise
313
- Threshold calculated from training data only — no leakage
314
-
315
- Class balance:
316
- Class 0 (Normal): ~75% of postings
317
- Class 1 (High Engagement): ~25% of postings
318
-
319
- Primary metric: F1-score for Class 1
320
- Reason: accuracy is misleading — a dummy model predicting all-zero
321
- achieves ~75% accuracy while catching ZERO high-engagement postings
322
- ```
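The target definition above can be sketched as follows. The function names and toy view counts are hypothetical, and the percentile uses linear interpolation via `statistics.quantiles(..., method="inclusive")`, which may differ slightly from whatever the notebook calls:

```python
from statistics import quantiles


def fit_threshold(train_views):
    """75th percentile of the TRAINING views (linear interpolation)."""
    return quantiles(train_views, n=4, method="inclusive")[2]


def label_high_engagement(views, threshold):
    """1 if views >= threshold, else 0."""
    return [1 if v >= threshold else 0 for v in views]


# Toy values, for illustration only.
train_views = [10, 20, 30, 40, 50, 60, 70, 80]
test_views = [5, 62, 63, 200]

threshold = fit_threshold(train_views)  # learned from training data only
test_labels = label_high_engagement(test_views, threshold)
```

The test set is labelled with the threshold learned from training data, so its own view distribution never influences the cut-off.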
323
-
324
- ### Model comparison
325
-
326
- ```
327
- Model Precision(C1) Recall(C1) F1(C1) Notes
328
- ───────────────────────────────────────────────────────────────────
329
- Decision Tree ★ moderate HIGHEST BEST CV F1: 0.4424±0.015
330
- Logistic Regr. lower high near-best
331
- Random Forest HIGHEST lower moderate fewest false alarms
332
- Dummy Baseline 0.00 0.00 0.00
333
- ───────────────────────────────────────────────────────────────────
334
- Winner: max_depth=8, class_weight="balanced"
335
- ```
336
-
337
- ### Confusion matrix (Logistic Regression as reference point)
338
-
339
- ```
340
- Predicted Normal Predicted High
341
- Actual Normal TN ~3,015 FP ~1,523
342
- Actual High FN ~583 TP ~894
343
-
344
- TN = correctly predicted Normal (good)
345
- FP = Normal predicted as High — false alarm (wastes recruiter attention)
346
- FN = High predicted as Normal — MOST COSTLY (missed opportunity)
347
- TP = correctly predicted High (good)
348
-
349
- Why FN is most costly:
350
- A recruiter relying on the model misses a genuinely high-engagement
351
- posting entirely. That opportunity is lost. Decision Tree minimizes FN
352
- at the cost of more FP — the right tradeoff for this use case.
353
- ```
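The Class 1 metrics implied by the matrix above can be recomputed from the four counts. The counts are approximate in the source, so the outputs are indicative rather than exact; `class1_metrics` is a hypothetical helper name:

```python
def class1_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive (High Engagement) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Approximate counts from the Logistic Regression confusion matrix above.
precision, recall, f1 = class1_metrics(tn=3015, fp=1523, fn=583, tp=894)
accuracy = (3015 + 894) / (3015 + 1523 + 583 + 894)

# An all-zero dummy model on the same split: every positive becomes a miss.
dummy = class1_metrics(tn=3015 + 1523, fp=0, fn=583 + 894, tp=0)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} acc={accuracy:.3f}")
```

On these counts recall is around 0.61 while precision is around 0.37, which illustrates the tradeoff described above: fewer missed high-engagement postings at the price of more false alarms.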
354
-
355
- ### 5-fold cross-validation
356
-
357
- ```
358
- Decision Tree CV F1 scores across 5 folds:
359
- Fold 1: 0.441
360
- Fold 2: 0.438
361
- Fold 3: 0.455
362
- Fold 4: 0.442
363
- Fold 5: 0.436
364
- ─────────────
365
- Mean: 0.4424 ± 0.015
366
-
367
- Stable across folds — result is not a lucky single split.
368
- Close to test-set F1 — no signs of overfitting.
369
- ```
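The fold summary can be reproduced from the five scores listed above. A small sketch using only the standard library:

```python
from statistics import mean, stdev

# Decision Tree F1 per fold, copied from the listing above.
fold_f1 = [0.441, 0.438, 0.455, 0.442, 0.436]

cv_mean = mean(fold_f1)
cv_std = stdev(fold_f1)  # sample standard deviation (n - 1 in the denominator)

print(f"mean={cv_mean:.4f}, sample std={cv_std:.4f}")
```

The mean matches the reported 0.4424. The sample standard deviation of these five values comes out near 0.007, so the reported ± 0.015 may correspond to roughly two standard deviations; the notebook is the authority on which convention was used.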
370
-
371
- ---
372
-
373
- ## 💡 Business Insights
374
-
375
- 1. **Disclose salary** — associated with ~90% more views. Fewer than half of postings do this today. Highest-leverage action at zero cost.
376
- 2. **Write 250–500 words** — `description_density` ranked #1 in the regression model and #2 in the classification model. Too short signals incompleteness; too long overwhelms candidates.
377
- 3. **Post mid-week** — Tuesday–Thursday outperforms weekends consistently.
378
- 4. **Tech roles attract more** — `is_software_role` and `is_data_role` carry predictive signal beyond salary.
379
- 5. **Ranking, not predicting** — R²≈0.08 is expected given unobservable platform factors. Use the model to rank postings, not to forecast exact view counts.
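To illustrate the ranking use case from the last point, the sketch below orders postings by predicted `log_views`. The posting IDs and scores are hypothetical stand-ins for model output:

```python
import math

# Hypothetical predicted log_views for three postings.
predicted = {"posting_a": 3.1, "posting_b": 4.2, "posting_c": 2.7}

# Highest predicted engagement first.
ranked = sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)
ranking = [posting_id for posting_id, _ in ranked]

for rank, (posting_id, log_views) in enumerate(ranked, start=1):
    # expm1 converts back to the views scale; treat it as indicative only.
    print(rank, posting_id, f"~{math.expm1(log_views):.0f} views")
```

Only the order is used for decisions here; the back-transformed view counts are shown for context and should not be read as forecasts.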
380
-
381
- ---
382
-
383
- ## 🎁 Bonus Work
384
-
385
- ### 🚀 Interactive Dashboard — Gradio on HuggingFace Spaces
386
-
387
- 👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**
388
-
389
- | Tab | What it does |
390
- |---|---|
391
- | 🎯 Engagement Predictor | Enter posting details → real-time predicted views + High/Normal classification |
392
- | 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
393
- | ℹ️ About | Feature groups, model details, limitations |
394
-
395
- ---
396
-
397
- ### 🧠 SHAP Explainability
398
-
399
- ```
400
- SHAP mean |value| — RF regression (200 test observations)
401
-
402
- Feature Mean |SHAP| Direction
403
- ──────────────────────────────────────────────────
404
- description_density ████████████ 0.142 ↑ high density → more views
405
- desc_salary_interaction ██████████░░ 0.125 ↑ salary × description synergy
406
- salary_log ████████░░░░ 0.102 ↑ higher salary → more views
407
- description_length ███████░░░░░ 0.092 ↑ longer (to a point) → more
408
- has_salary_info ██████░░░░░░ 0.078 ↑ disclosed → more views
409
- is_software_role █████░░░░░░░ 0.062 ↑ tech → more views
410
- posting_weekend █░░░░░░░░░░░ 0.021 ↓ weekend → fewer views
411
- ──────────────────────────────────────────────────
412
- Beeswarm: red dot right of 0 = high value pushes prediction UP
413
- blue dot left of 0 = low value pushes prediction DOWN
414
- ```
415
-
416
- SHAP attributions are consistent with the EDA findings at the level of individual predictions.
417
-
418
- ---
419
-
420
- ### 📊 Feature Importance: Regression vs Classification
421
-
422
- ```
423
- Regression RF Classification DT
424
- (log_views) (high_engagement)
425
- ────────────────────────────────────────────────────────────
426
- description_density #1 ████████ #2 ███████
427
- desc_salary_interaction #2 ███████ #3 ██████
428
- salary_log #3 ██████ #4 █████
429
- has_salary_info #5 █████ #5 █████
430
- is_software_role #6 ████ #6 ████
431
- is_entry_role #9 ███ #1 ████████ ← jumps in clf
432
- is_senior_role #10 ██ #7 ████
433
- cluster features #7 ████ #8 ███
434
- ────────────────────────────────────────────────────────────
435
- Agreement: description quality + salary dominate both models
436
- Divergence: is_entry_role jumps to #1 in classification —
437
- seniority flags matter more for crossing the
438
- threshold than for predicting exact counts
439
- ```
440
-
441
- ---
442
-
443
- ## 🛠️ How to Use the Models
444
-
445
- ```python
446
- import pickle, numpy as np
447
-
448
- with open("linkedin_regression_model.pkl", "rb") as f:
449
-     reg_model = pickle.load(f)
450
- with open("linkedin_classification_model.pkl", "rb") as f:
451
-     clf_model = pickle.load(f)
452
-
453
- # Regression — predict log(views+1), convert back
454
- log_views_pred = reg_model.predict(X_test_fe)
455
- views_pred = np.expm1(log_views_pred)
456
-
457
- # Classification — predict high-engagement label (0 or 1)
458
- label = clf_model.predict(X_test_fe)
459
- ```
460
-
461
- > Both models expect the exact 30-column feature matrix including cluster dummy columns. Run the full feature engineering pipeline in the notebook to produce a compatible input.
462
-
463
- ---
464
-
465
- *Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*