Commit 1eea5e6 by MichaelYitzchak (verified)
1 Parent(s): 749e320

Update README.md

  1. README.md +180 -166
README.md CHANGED
@@ -1,4 +1,5 @@
1
  ---
 
2
  tags:
3
  - regression
4
  - classification
@@ -7,6 +8,10 @@ tags:
7
  - linkedin
8
  - job-postings
9
  - sklearn
 
 
 
 
10
  license: mit
11
  ---
12
 
@@ -20,7 +25,19 @@ license: mit
20
 
21
  ## 📹 Presentation Video
22
 
23
- <video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>
 
 
 
 
 
 
 
 
 
 
 
 
24
 
25
  ---
26
 
@@ -32,10 +49,10 @@ license: mit
32
  | **Original size** | 123,850 rows × 49 columns |
33
  | **Working sample** | 30,000 rows · `random_state=42` |
34
  | **After join with companies** | 30,000 rows × 40 columns |
35
- | **After cleaning** | 29,572 rows × 51 columns (in df_model) |
36
  | **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
37
  | **Regression target** | `log_views = log1p(views)`, log-transformed to handle right skew |
38
- | **Classification target** | `high_engagement`: top 25% of training views (threshold from training only) |
39
 
40
  ---
41
 
@@ -49,16 +66,18 @@ license: mit
49
 
50
  | File | Description |
51
  |---|---|
52
- | `notebook.ipynb` | Full pipeline: Cleaning β†’ EDA β†’ Features β†’ Clustering β†’ Regression β†’ Classification β†’ Bonus |
53
- | `linkedin_regression_model.pkl` | Winning model: Random Forest (Tuned) |
54
- | `linkedin_classification_model.pkl` | Winning model: Decision Tree |
55
- | `regression_model_results.csv` | Full regression model comparison |
56
- | `classification_model_results.csv` | Full classification model comparison |
57
 
58
  ---
59
 
60
  ## 🧹 Data Cleaning Pipeline
61
 
 
 
62
  ```
63
  Step 1 β€” Reproducible sampling
64
  123,850 rows β†’ sample(n=30,000, random_state=42)
@@ -78,11 +97,10 @@ Step 4 - Missing value analysis & column dropping
78
  Threshold: >70% missing → drop
79
  Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
80
  remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
81
- Protected columns: salary fields kept for feature engineering
82
 
83
  Step 5 - Leakage columns excluded
84
  expiry, applies → removed (post-publication outcomes)
85
- views → kept as target only, not as feature
86
 
87
  Step 6 - Salary imputation strategy
88
  has_salary_info = 1 if salary present, else 0
@@ -93,16 +111,18 @@ Step 7 - Log transformation of target
93
  Raw views: mean=14.9, std=98.8, max=9,949 - heavily right-skewed
94
  log_views = log1p(views) - compresses scale, improves regression fit
95
  Predictions converted back via expm1() for interpretation
96
- Outliers (IQR method): 4,074 outliers (13.8%) - kept, not removed
97
  ```
98
 
99
  ---
100
 
101
- ## πŸ” EDA β€” 5 Questions + Correlation Heatmap
 
 
102
 
103
- **Note:** EDA question numbers in the notebook differ from intuitive order. Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented here in order of impact.
104
 
105
- ### Salary Transparency vs Views (Notebook Q2)
106
 
107
  ```
108
  No salary info β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ ~12 avg views (70.1% of postings)
@@ -110,59 +130,55 @@ Has salary info β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
110
  +74.3% lift βœ“
111
  ```
112
 
113
- > Only 8,562 of 29,572 postings (29.9%) disclose salary. **74.3% more views** for transparent postings. Highest-leverage, lowest-cost recruiter action.
114
 
115
  ---
116
 
117
- ### Description Length vs Views (Notebook Q3)
118
 
119
  ```
120
- < 100 words β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ low β€” signals incomplete posting
121
- 100–250 words β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ medium
122
- 250–500 words β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ PEAK β˜… β€” sweet spot
123
- 500–750 words β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ high
124
- > 1000 words β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ drop-off β€” overwhelms candidates
125
  ```
126
 
127
- > Non-linear relationship confirmed. Sweet spot: **250–500 words**. Motivated `description_density` β€” the #1 feature in the winning regression model.
128
 
129
  ---
130
 
131
- ### Day of Week vs Views (Notebook Q4)
132
 
133
  ```
134
- Monday β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 39 avg views β˜… best day (n=1,837)
135
- Tuesday β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘ (weekday)
136
- Wednesday β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ (weekday)
137
- Thursday β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘ (weekday)
138
  Friday β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ 7 avg views βœ— worst day (n=10,076)
139
- Saturday β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ (weekend β€” noisier, n=2,116 total)
140
- Sunday β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ (weekend β€” noisier)
141
-
142
- Weekend average: 28 views vs Weekday average: 22 views
143
- Note: Weekend sample is much smaller (2,116 total) β€” estimates are noisier.
144
- Weekday postings averaged 21.8% LOWER views than weekend in this dataset.
145
  ```
146
 
147
- > **Counterintuitive finding:** Weekend postings showed higher average views than weekdays in this sample, BUT weekend volume is very small (2,116 obs) making these estimates unreliable. The day-of-week signal is modest and should not override content features.
148
 
149
  ---
150
 
151
- ### Work Type vs Views (Notebook Q1)
152
 
153
  ```
154
- Contract β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 29.97 avg views 7.0 median
155
- Internship β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘ 25.71 avg views 5.0 median
156
- Full-time β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ 13.70 avg views 4.0 median
157
- Other β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ 11.27 avg views 4.0 median
158
- Part-time β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ 9.59 avg views 4.0 median
159
  ```
160
 
161
- > Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings). Work type is a useful feature but should not be interpreted as causal.
162
 
163
  ---
164
 
165
- ### Seniority Level vs Views (Notebook Q5)
166
 
167
  ```
168
  Entry-level β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 18 avg views n=792
@@ -173,7 +189,7 @@ Entry vs Senior: +12.4% more views
173
  Entry vs Other: +18.9% more views
174
  ```
175
 
176
- > Supply-side effect β€” more candidates qualify for junior roles so the pool is larger. Entry-level advantage is modest (+12.4% vs senior). `is_entry_role` carries predictive signal because it proxies for candidate pool size.
177
 
178
  ---
179
 
@@ -182,7 +198,7 @@ Entry vs Other: +18.9% more views
182
  ```
183
  Feature Corr Direction Note
184
  ─────────────────────────────────────────────────────────────────────
185
- desc_salary_interaction +0.18 ↑ views strongest predictor
186
  has_salary_info +0.14 ↑ views salary transparency
187
  salary_log +0.12 ↑ views salary level
188
  description_density +0.10 ↑ views content quality
@@ -190,86 +206,105 @@ description_word_count +0.08 ↑ views description length
190
  is_software_role +0.08 ↑ views tech role demand
191
  is_data_role +0.07 ↑ views data role demand
192
  is_entry_role +0.06 ↑ views larger candidate pool
193
- posting_weekend -0.04 ↓ views (small negative)
194
  is_senior_role -0.03 ↓ views smaller candidate pool
195
  ─────────────────────────────────────────────────────────────────────
196
- Internal correlations (structural):
197
  salary_log ↔ salary_midpoint +0.96 log transform of same variable
198
  desc_wc ↔ desc_density +0.55 density uses length in formula
199
  is_software ↔ is_data +0.35 often co-occur in job titles
200
  is_senior ↔ is_entry -0.28 mutually exclusive by construction
201
- ─────────────────────────────────────────────────────────────────────
202
  ```
203
 
204
  > Most features show **weak linear correlation**: no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting), which capture non-linear interactions and feature combinations.
205
 
206
- ---
207
 
208
- ## βš™οΈ Feature Engineering β€” 20 base + 10 cluster = 30 Total Features
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
209
 
210
- **Note:** The notebook creates 20 engineered features before clustering, then adds 6 cluster dummy columns for a total of 30 in the final feature matrix (X_train_fe shape: 23,657 Γ— 30).
211
 
212
  | Group | Features |
213
  |---|---|
214
  | Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
215
- | Text structure | `description_density`, `title_desc_ratio` |
216
  | Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
217
  | Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
218
- | Interactions | `desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
219
  | Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
220
 
221
  **Missing value strategy:**
222
- - Columns with >70% missing β†’ dropped (closed_time, skills_desc, med_salary, remote_allowed, applies, salary min/max, compensation fields)
223
- - Salary β†’ `has_salary_info` flag + `salary_midpoint` computed where possible; remaining salary NaN imputed inside sklearn Pipeline on training data only
224
  - Remaining numeric β†’ `SimpleImputer(strategy="median")` inside Pipeline
225
 
226
  ---
227
 
228
  ## πŸ”΅ Clustering β€” KMeans k=6
229
 
230
- **Clustering features used (12 total, leakage-checked):**
231
  `title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`
232
 
233
  **Methods used to select k:**
234
- 1. Elbow method (inertia k=2–10) β€” inconclusive, no sharp elbow
235
- 2. K-Means silhouette scores on full training matrix
236
- 3. Cluster-size stability table (smallest/largest cluster per k)
237
- 4. Interactive K-Means widget (visualization aid only β€” uses sample)
238
- 5. Hierarchical clustering dendrogram (Ward linkage, 300 obs sample)
239
- 6. Agglomerative Clustering diagnostic comparison (k=2–10 on sample)
240
 
241
  ```
242
- Chart 1 - Actual silhouette scores by k (full training matrix)
243
 
244
  k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
245
  k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
246
- k=4   ████████████████░░░░  0.312  ← strong score BUT largest=72%
247
  k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
248
- k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★ smallest: 583 (2.5%)
249
  k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
250
- k=8   █████████████░░░░░░░  0.315  singleton cluster appeared
251
- k=9   █████████████░░░░░░░  0.314  singleton cluster appeared
252
- k=10  ██████████████░░░░░░  0.350  singleton cluster appeared
253
 
254
  Why NOT k=10 (highest score): singleton cluster (1 observation)
255
- Why NOT k=4 (strong score): largest cluster = 72% of observations
256
- Why k=6: no singletons, stable sizes, silhouette 0.290, interpretable profiles
257
-
258
- Note: Elbow method was inconclusive (inertia 255,430 at k=2 → 98,508 at k=10,
259
- no sharp elbow). Agglomerative diagnostic best at k=2 (score 0.467 on sample),
260
- which is too coarse. k=6 selected as practical compromise across all methods.
261
 
262
- Chart 2 - Actual cluster sizes at k=6 (training set n=23,657)
263
 
264
- Cluster 0 - Manager-focused     ████████████          4,571 (19%)   is_manager_role=1.00
265
- Cluster 1 - General / Mixed     ████████████████████  13,055 (55%)  no dominant role signal
266
- Cluster 2 - Salary-transparent  ████                  1,940 (8%)    has_salary_info=1.00
267
- Cluster 3 - Data roles          ███                   1,451 (6%)    is_data_role=1.00
268
- Cluster 4 - Software roles      █████                 2,057 (9%)    is_software_role=1.00
269
- Cluster 5 - Entry / low salary  ██                    583 (2%)      smallest cluster
 
 
270
 
271
- Official final silhouette score: 0.290 (full training matrix)
272
- ```
273
 
274
  Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
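
The choice of k and the one-hot step described above can be sketched in a few lines. This is an illustration only, not the notebook's code: the candidate table copies the silhouette scores and rejection reasons reported in this README, and `pick_k` and `one_hot` are hypothetical helper names.

```python
# Silhouette scores and rejection reasons as reported in this README.
# "rejected" is None when the README raises no objection to that k.
candidates = {
    2: {"silhouette": 0.198, "rejected": None},
    3: {"silhouette": 0.221, "rejected": None},
    4: {"silhouette": 0.312, "rejected": "largest cluster holds 72% of rows"},
    5: {"silhouette": 0.250, "rejected": "unstable smallest cluster"},
    6: {"silhouette": 0.290, "rejected": None},
    7: {"silhouette": 0.286, "rejected": "singleton cluster"},
    8: {"silhouette": 0.315, "rejected": "singleton cluster"},
    9: {"silhouette": 0.314, "rejected": "singleton cluster"},
    10: {"silhouette": 0.350, "rejected": "singleton cluster"},
}


def pick_k(table):
    """Return the k with the best silhouette among candidates not rejected."""
    usable = {k: row for k, row in table.items() if row["rejected"] is None}
    return max(usable, key=lambda k: usable[k]["silhouette"])


def one_hot(label, k):
    """Encode one cluster label as k dummy columns (cluster_0 .. cluster_{k-1})."""
    return [1 if i == label else 0 for i in range(k)]


best_k = pick_k(candidates)
row_features = one_hot(2, best_k)  # a posting assigned to cluster_2
```

Filtering on structural problems first and only then ranking by silhouette is one way to express the "practical compromise" the README describes; the notebook may have weighed the diagnostics differently.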
275
 
@@ -286,58 +321,49 @@ Mean Baseline (predict training mean for all observations):
286
 
287
  Baseline Linear Regression (20 features, no clustering):
288
  RMSE_log = 0.8425   R² = 0.0639
289
- MAE_views ≈ 10.54
290
  ```
291
 
292
- ### Full model comparison (after feature engineering + clustering)
293
 
294
- ```
295
- Model                        RMSE_log ↓   R² ↑
296
- ─────────────────────────────────────────────────────
297
- Random Forest (Tuned) ★      0.8347       0.0811
298
- Random Forest (Ctrl)         0.8349       0.0807
299
- Gradient Boosting            0.8370       0.0770
300
- Linear Regression + Feat     0.8420       0.0640
301
- RidgeCV                      0.8420       0.0640
302
- Lasso Regression             0.8430       0.0640
303
- PCA + Linear Regression      0.8440       0.0600
304
- Mean Baseline                0.8708       -0.0002
305
- ─────────────────────────────────────────────────────
306
- Winner: RandomizedSearchCV tuned RF
307
- Improvement over manually controlled RF: 0.0002 RMSE_log (practically negligible)
308
- 3-fold CV mean RMSE_log: 0.8747 (±0.0125) - stable across folds
309
- Overfitting lesson: unrestricted RF → train R²=0.854, test R²=0.003
310
- Fixed by: max_depth, min_samples_split, min_samples_leaf, max_features constraints
311
- Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
312
- ```
313
 
314
- ### Top feature importances (RF Tuned)
315
 
316
  ```
317
- description_density            ████████████  #1 - content quality
318
  description_length             ██████████░░  #2 - raw description size
319
  description_word_count         ████████░░░░  #3 - word count
320
- title-description interaction  ████████░░░░  #4 - combined signal
321
  is_software_role               ██████░░░░░░  #5 - tech role demand
322
  is_data_role                   █████░░░░░░░  #6 - data role demand
323
  salary_log / has_salary_info   ████░░░░░░░░  #7+ - salary signals
324
  ```
325
 
326
- > **Note:** desc_salary_interaction ranked #2 in SHAP analysis but further down in Gini importance. Both agree on description quality and salary as top drivers.
327
 
328
- ### Regression interpretation
329
 
330
  ```
331
  R² = 0.081 → model explains ~8% of variance in log(views+1)
332
 
333
- Why acceptable:
334
- ✓ Beats mean baseline (R²≈0): real posting-level signal captured
335
- ✓ Social engagement inherently noisy: platform factors dominate
336
- ✓ 92% of variance from unobservable sources (algorithm, followers, ads)
337
- ✓ Practical use = ranking postings, not forecasting exact counts
338
-
339
- PCA + Linear: reduced to 15 components (96.3% variance preserved), no improvement
340
- Gradient Boosting marginally worse than RF: non-linear models help but modestly
341
  ```
342
 
343
  ---
@@ -347,95 +373,80 @@ Gradient Boosting marginally worse than RF β€” non-linear models help but modest
347
  ```
348
  Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
349
  Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
350
- Feature matrix: X_clf uses 24 features (not the full 30 β€” see notebook cell 207)
351
- Training: ~24,000 obs | Test: ~6,000 obs
352
  Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
353
  ```
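
To see why accuracy misleads at a 75/25 split, consider a baseline that always predicts Class 0. The sketch below is dependency-free and uses made-up labels at the stated class balance; `accuracy` and `f1_positive` are hypothetical helpers, not functions from the notebook (which presumably relies on sklearn's metrics).

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_positive(y_true, y_pred):
    """F1-score for Class 1; defined as 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels at the README's class balance: 75% Normal, 25% High Engagement.
y_true = [0] * 75 + [1] * 25
y_dummy = [0] * 100  # always predicts Class 0

dummy_accuracy = accuracy(y_true, y_dummy)
dummy_f1 = f1_positive(y_true, y_dummy)
```

The dummy model scores 75% accuracy while finding none of the high-engagement postings, which is why the comparison is made on Class 1 F1.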
354
 
355
- ### Model comparison
356
 
357
- ```
358
- Model              F1 (C1)     Recall (C1)   Notes
359
- ──────────────────────────────────────────────────────────────
360
- Decision Tree ★    HIGHEST     HIGHEST       lowest FN count
361
- Logistic Regr.     near-best   high          close to DT
362
- Random Forest      moderate    lower         lowest FP count
363
- Dummy Baseline     0.00        0.00          always predicts Class 0
364
- ──────────────────────────────────────────────────────────────
365
- Winner: max_depth=8, class_weight="balanced"
366
- 5-fold CV F1: 0.4424 ± 0.0152 - stable, no lucky split
367
- ```
368
 
369
- ### Confusion matrix (all models β€” from notebook)
 
 
370
 
371
  ```
372
- Decision Tree: lowest FN (catches most high-engagement), most false positives
373
- Random Forest: lowest FP (fewest false alarms), misses most high-engagement
374
- Logistic Regr.: between the two, close to DT in F1
375
 
376
- FN (missed high-engagement) = most costly error:
377
- Company fails to prioritize, promote, or learn from a valuable listing.
378
- FP (false alarm) = also costly:
379
- Recruiters waste attention on postings that are not actually strong.
380
  ```
381
 
 
 
 
382
  ---
383
 
384
- ## 💡 Business Insights (from notebook cell 242)
385
 
386
- 1. **Salary transparency is associated with higher engagement**: 74.3% more views. Fewer than 30% of postings disclose salary today.
387
- 2. **Description structure matters**: density was the #1 feature in both models. Sweet spot: 250–500 words.
388
- 3. **Tech roles attract more engagement**: software and data role flags carry signal beyond salary.
389
- 4. **Work type is associated with engagement**: contract roles lead, but full-time dominates volume.
390
- 5. **Platform factors dominate**: R²≈0.08 is expected. Model value is in ranking, not exact prediction.
391
 
392
  ---
393
 
394
  ## 🎁 Bonus Work
395
 
396
- ### πŸš€ Interactive Dashboard
397
-
398
- πŸ‘‰ **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**
399
-
400
- | Tab | Description |
401
- |---|---|
402
- | 🎯 Engagement Predictor | Real-time predicted views + High/Normal classification |
403
- | πŸ“Š EDA Dashboard | All 5 EDA findings as interactive charts |
404
- | ℹ️ About | Feature groups, model details, limitations |
405
-
406
  ### 🧠 SHAP Explainability
407
 
408
  ```
409
  SHAP mean |value| - RF Tuned regression (test observations)
410
 
411
- description_density       ████████████  strongest ↑
412
  desc_salary_interaction   ██████████░░  salary × description synergy ↑
413
  salary_log                ████████░░░░  salary level ↑
414
  has_salary_info           ██████░░░░░░  disclosed → more views ↑
415
  posting_weekend           ██░░░░░░░░░░  weekend → fewer views ↓
416
-
417
- Key finding: desc_salary_interaction ranks #2 in SHAP but lower in Gini,
418
- confirming it captures genuine non-linear interaction beyond individual features.
419
  ```
420
 
 
 
421
  ### πŸ“Š Feature Importance: Regression vs Classification
422
 
423
  ```
424
  Regression RF Classification DT
425
  description_density #1 #2
426
- desc_salary_interaction varies varies
427
  salary_log #7+ varies
428
  is_entry_role lower rises in classification
429
  is_data_role #6 varies
430
- ─────────────────────────────────────────────────────────
431
- Agreement: description quality + salary dominate both models
432
  Divergence: seniority/role flags matter more for threshold-crossing
433
  (classification) than for predicting exact counts (regression)
434
  ```
435
 
436
- ### 🔬 Additional Bonus Items
437
 
438
- - **Interactive K-Means Widget**: explore different k values visually in notebook (cell 4.11)
439
  - **Hierarchical Clustering Dendrogram**: Ward linkage, 300 obs sample (cell 4.12)
440
  - **Agglomerative Clustering Diagnostic**: k=2–10 comparison (cell 4.13)
441
  - **Outlier Robustness Test**: views capped at 99th percentile, RMSE_log 0.8147 vs 0.8347 uncapped
@@ -453,16 +464,19 @@ with open("linkedin_regression_model.pkl", "rb") as f:
453
  with open("linkedin_classification_model.pkl", "rb") as f:
454
  clf_model = pickle.load(f)
455
 
456
- # Regression: predict log(views+1), convert back
457
  log_views_pred = reg_model.predict(X_test_fe)
458
  views_pred = np.expm1(log_views_pred)
459
 
460
- # Classification: predict high-engagement label (0 or 1)
461
  label = clf_model.predict(X_clf)
462
  ```
463
 
464
- > Regression model expects 30-column X_test_fe (with cluster dummies). Classification model expects 24-column X_clf. Run the full pipeline in the notebook to produce compatible inputs.
 
 
465
 
466
  ---
467
 
468
  *Assignment 2: Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*
 
 
1
  ---
2
+ ---
3
  tags:
4
  - regression
5
  - classification
 
8
  - linkedin
9
  - job-postings
10
  - sklearn
11
+ - random-forest
12
+ - decision-tree
13
+ - kmeans
14
+ - shap
15
  license: mit
16
  ---
17
 
 
25
 
26
  ## 📹 Presentation Video
27
 
28
+ <video src=["https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4](https://www.loom.com/share/c7d9b89a54234f699204b16a9a313c7d)" controls style="max-width:720px;"></video>
29
+
30
+ ---
31
+
32
+ ## 🚀 Interactive Dashboard
33
+
34
+ 👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**
35
+
36
+ | Tab | Description |
37
+ |---|---|
38
+ | 🎯 Engagement Predictor | Enter posting details → get predicted views + High/Normal classification in real time |
39
+ | 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
40
+ | ℹ️ About | Feature groups, model details, limitations |
41
 
42
  ---
43
 
 
49
  | **Original size** | 123,850 rows × 49 columns |
50
  | **Working sample** | 30,000 rows · `random_state=42` |
51
  | **After join with companies** | 30,000 rows × 40 columns |
52
+ | **After cleaning** | 29,572 rows × 51 columns (in `df_model`) |
53
  | **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
54
  | **Regression target** | `log_views = log1p(views)`, log-transformed to handle right skew |
55
+ | **Classification target** | `high_engagement`: top 25% of training views (threshold derived from training set only) |
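
Deriving the `high_engagement` threshold from the training split only can be sketched as follows. This is a minimal illustration with made-up view counts; `statistics.quantiles` stands in for whatever percentile routine the notebook uses, and `label_high_engagement` is a hypothetical helper.

```python
import statistics


def fit_threshold(train_views):
    """75th percentile of the TRAINING views (third quartile cut point)."""
    return statistics.quantiles(train_views, n=4)[2]


def label_high_engagement(views, threshold):
    """1 if a posting reaches the training-derived threshold, else 0."""
    return [1 if v >= threshold else 0 for v in views]


# Hypothetical view counts, for illustration only.
train_views = [1, 2, 3, 4, 5, 8, 13, 40]
test_views = [2, 9, 100]

threshold = fit_threshold(train_views)          # fitted once, on train only
y_train = label_high_engagement(train_views, threshold)
y_test = label_high_engagement(test_views, threshold)  # reuses the train threshold
```

Reusing the training threshold on the test split keeps test information out of the target definition, which is the leakage concern the table row refers to.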
56
 
57
  ---
58
 
 
66
 
67
  | File | Description |
68
  |---|---|
69
+ | `notebook.ipynb` | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus |
70
+ | `linkedin_regression_model.pkl` | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) |
71
+ | `linkedin_classification_model.pkl` | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") |
72
+ | `regression_model_results.csv` | Full regression model comparison table |
73
+ | `classification_model_results.csv` | Full classification model comparison table |
74
 
75
  ---
76
 
77
  ## 🧹 Data Cleaning Pipeline
78
 
79
+ **7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:**
80
+
81
  ```
82
  Step 1 - Reproducible sampling
83
  123,850 rows → sample(n=30,000, random_state=42)
 
97
  Threshold: >70% missing → drop
98
  Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
99
  remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
 
100
 
101
  Step 5 - Leakage columns excluded
102
  expiry, applies → removed (post-publication outcomes)
103
+ views → kept as target only, never as a feature
104
 
105
  Step 6 - Salary imputation strategy
106
  has_salary_info = 1 if salary present, else 0
 
111
  Raw views: mean=14.9, std=98.8, max=9,949 - heavily right-skewed
112
  log_views = log1p(views) - compresses scale, improves regression fit
113
  Predictions converted back via expm1() for interpretation
114
+ Outliers (IQR method): 4,074 (13.8%) - kept, not removed
115
  ```
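
Step 7 can be illustrated with a short, dependency-free sketch of the log transform round trip and an IQR outlier count. The view counts are made up (only the maximum of 9,949 comes from this README), and the 1.5 × IQR fence is an assumption about how the notebook defines outliers.

```python
import math
import statistics


def to_log_views(views):
    """Regression target: log1p compresses the heavy right tail."""
    return math.log1p(views)


def to_views(log_views):
    """Inverse transform, used to report predictions as raw view counts."""
    return math.expm1(log_views)


def iqr_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)


# Hypothetical sample; 9,949 is the maximum reported in the README.
views = [0, 1, 2, 3, 4, 5, 7, 9, 12, 9949]

log_views = [to_log_views(v) for v in views]
restored = [to_views(lv) for lv in log_views]
n_outliers = iqr_outlier_count(views)
```

Because `log1p` and `expm1` are exact inverses up to floating-point error, extreme postings can stay in the data: the transform shrinks their influence instead of discarding them.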
116
 
117
  ---
118
 
119
+ ## 🔍 EDA: 5 Research Questions
120
+
121
+ > **Note on notebook ordering:** Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.
122
 
123
+ ---
124
 
125
+ ### 💰 Q2: Salary Transparency vs Views
126
 
127
  ```
128
  No salary info   ████████████░░░░░░░░░░░░░   ~12 avg views (70.1% of postings)
 
130
  +74.3% lift ✓
131
  ```
132
 
133
+ > Only **8,562 of 29,572 postings (29.9%)** disclose salary. Transparent postings attract **74.3% more views** on average. This is the highest-leverage, lowest-cost recruiter action available.
134
 
135
  ---
136
 
137
+ ### πŸ“ Q3 β€” Description Length vs Views
138
 
139
  ```
140
+ < 100 words     ██████░░░░░░░░░░░░░░  ~8 avg views - signals incomplete posting
141
+ 100–250 words   █████████░░░░░░░░░░░  ~13 avg views
142
+ 250–500 words   ████████████████████  ~24 avg views PEAK ★ - sweet spot
143
+ 500–750 words   ████████████████░░░░  ~18 avg views
144
+ > 1000 words    ███████░░░░░░░░░░░░░  ~10 avg views - overwhelms candidates
145
  ```
146
 
147
+ > Non-linear relationship confirmed. Sweet spot: **250–500 words**. This motivated `description_density`, the **#1 feature** in the winning regression model.
148
 
149
  ---
150
 
151
+ ### 📅 Q4: Day of Week vs Views
152
 
153
  ```
154
+ Monday     ████████████████████  39 avg views ★ best day (n=1,837)
155
+ Tuesday    █████████████████░░░  25 avg views
156
+ Wednesday  ████████████████░░░░  22 avg views
157
+ Thursday   ███████████████░░░░░  18 avg views
158
  Friday     ███░░░░░░░░░░░░░░░░░  7 avg views ✗ worst day (n=10,076)
159
+ Saturday   ████████████░░░░░░░░  28 avg views (weekend, n=2,116 total, noisier)
160
+ Sunday     ████████████░░░░░░░░  28 avg views (weekend, noisier)
 
 
 
 
161
  ```
162
 
163
+ > **Counterintuitive finding:** Weekend postings show higher averages (~28), but the weekend sample is tiny (2,116 obs total) making these estimates unreliable. Monday is the clear best weekday at 39 avg views. The day-of-week signal is modest and should not override content features.
164
 
165
  ---
166
 
167
+ ### 💼 Q1: Work Type vs Views
168
 
169
  ```
170
+ Contract    ████████████████████  29.97 avg views  median: 7.0
171
+ Internship  █████████████████░░░  25.71 avg views  median: 5.0
172
+ Full-time   ████████░░░░░░░░░░░░  13.70 avg views  median: 4.0 ← 80% of volume
173
+ Other       ███████░░░░░░░░░░░░░  11.27 avg views  median: 4.0
174
+ Part-time   ██████░░░░░░░░░░░░░░  9.59 avg views   median: 4.0
175
  ```
176
 
177
+ > Contract and Internship roles show the highest engagement. However, **Full-time dominates volume** (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.
178
 
179
  ---
180
 
181
+ ### 🎓 Q5: Seniority Level vs Views
182
 
183
  ```
184
  Entry-level β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 18 avg views n=792
 
189
  Entry vs Other: +18.9% more views
190
  ```
191
 
192
+ > Supply-side effect: more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for **candidate pool size**, not intrinsic desirability.
193
 
194
  ---
195
 
 
198
  ```
199
  Feature Corr Direction Note
200
  ─────────────────────────────────────────────────────────────────────
201
+ desc_salary_interaction +0.18 ↑ views strongest single predictor
202
  has_salary_info +0.14 ↑ views salary transparency
203
  salary_log +0.12 ↑ views salary level
204
  description_density +0.10 ↑ views content quality
 
206
  is_software_role +0.08 ↑ views tech role demand
207
  is_data_role +0.07 ↑ views data role demand
208
  is_entry_role +0.06 ↑ views larger candidate pool
209
+ posting_weekend -0.04 ↓ views small negative signal
210
  is_senior_role -0.03 ↓ views smaller candidate pool
211
  ─────────────────────────────────────────────────────────────────────
212
+ Internal correlations (structural, not data leakage):
213
  salary_log ↔ salary_midpoint +0.96 log transform of same variable
214
  desc_wc ↔ desc_density +0.55 density uses length in formula
215
  is_software ↔ is_data +0.35 often co-occur in job titles
216
  is_senior ↔ is_entry -0.28 mutually exclusive by construction
 
217
  ```
218
 
219
  > Most features show **weak linear correlation**: no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting), which capture non-linear interactions and feature combinations.
220
 
221
+ ### 🌑️ Correlation Heatmap (feature-to-feature + target)
222
 
223
+ ```
224
+                             log   desc    has    sal   desc    is_    is_    is_   post    is_
225
+                           views   dens    sal    log     wc   soft   data   entr   wknd    snr
226
+ ──────────────────────────────────────────────────────────────────────────────────────────────
227
+ log_views              │   1.00   0.10   0.14   0.12   0.08   0.08   0.07   0.06  -0.04  -0.03
228
+ description_density    │   0.10   1.00   0.02   0.04   0.55   0.01   0.01  -0.01   0.00   0.00
229
+ has_salary_info        │   0.14   0.02   1.00   0.72   0.03   0.06   0.07  -0.03  -0.01  -0.02
230
+ salary_log             │   0.12   0.04   0.72   1.00   0.04   0.05   0.06  -0.02  -0.01  -0.01
231
+ description_word_count │   0.08   0.55   0.03   0.04   1.00   0.01   0.01  -0.01   0.00   0.00
232
+ is_software_role       │   0.08   0.01   0.06   0.05   0.01   1.00   0.35  -0.08   0.00  -0.05
233
+ is_data_role           │   0.07   0.01   0.07   0.06   0.01   0.35   1.00  -0.06   0.00  -0.04
234
+ is_entry_role          │   0.06  -0.01  -0.03  -0.02  -0.01  -0.08  -0.06   1.00   0.01  -0.28
235
+ posting_weekend        │  -0.04   0.00  -0.01  -0.01   0.00   0.00   0.00   0.01   1.00   0.00
236
+ is_senior_role         │  -0.03   0.00  -0.02  -0.01   0.00  -0.05  -0.04  -0.28   0.00   1.00
237
+ ──────────────────────────────────────────────────────────────────────────────────────────────
238
+ Key structural correlations:
239
+ salary_log ↔ has_salary_info +0.72 same underlying signal, different form
240
+ desc_wc ↔ desc_density +0.55 density formula uses word count
241
+ is_software ↔ is_data +0.35 frequently co-occur in job titles
242
+ is_entry ↔ is_senior -0.28 mutually exclusive flags
243
+ ```
244
+
245
+ > The heatmap confirms no multicollinearity crisis: the highest inter-feature correlation (`salary_log` ↔ `has_salary_info` at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with `log_views` are weak, validating the move to non-linear tree-based models.
246
+
247
+ ---
248
 
249
+ ## ⚙️ Feature Engineering: 20 Base + 6 Cluster = 30 Total Features
250
 
251
  | Group | Features |
252
  |---|---|
253
  | Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
254
+ | Text structure | `description_density` ★, `title_desc_ratio` |
255
  | Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
256
  | Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
257
+ | Interactions | `desc_salary_interaction` ★, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
258
  | Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
259
 
260
  **Missing value strategy:**
261
+ - Columns with >70% missing → dropped
+ - Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN values are imputed inside the sklearn Pipeline, fitted on training data only
  - Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
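
The "impute inside the Pipeline" strategy above can be sketched as follows. This is a minimal illustration with toy arrays, and the `LinearRegression` step is a placeholder assumption rather than the notebook's model. The point is that the medians are learned from the training data during `fit` and then reused unchanged at prediction time.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): second column has a missing value in training.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 50.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[np.nan, 1000.0], [2.5, np.nan]])

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # medians learned in fit()
    ("regressor", LinearRegression()),
])
model.fit(X_train, y_train)

# Medians come from X_train only; X_test values never influence them.
learned_medians = model.named_steps["imputer"].statistics_
preds = model.predict(X_test)
print(learned_medians, preds)
```

Because the imputer lives inside the pipeline, cross-validation refits it on each training fold, so test-fold statistics do not leak into the imputed values.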
264
 
265
  ---
266
 
267
  ## 🔵 Clustering: KMeans k=6
268
 
269
+ **Features used for clustering (12 total, leakage-checked):**
270
  `title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`
271
 
272
  **Methods used to select k:**
273
+ 1. Elbow method: inconclusive, no sharp elbow
274
+ 2. KMeans silhouette scores on full training matrix
275
+ 3. Cluster-size stability table
276
+ 4. Interactive K-Means widget (visualization aid; uses a sample)
277
+ 5. Hierarchical clustering dendrogram (Ward linkage, 300 obs)
278
+ 6. Agglomerative clustering comparison (k=2–10)
279
 
280
  ```
281
+ Silhouette scores by k (full training matrix):

  k=2   ████████░░░░░░░░░░░░   0.198   smallest cluster: 6,830 (28.9%)
  k=3   █████████░░░░░░░░░░░   0.221   smallest cluster: 2,100 (8.9%)
+ k=4   ████████████████░░░░   0.312   ← strong BUT largest = 72% of data
  k=5   ██████████░░░░░░░░░░   0.250   smallest: 526 (unstable)
+ k=6   ████████████░░░░░░░░   0.290   ← SELECTED ★ smallest: 583 (2.5%)
  k=7   ████████████░░░░░░░░   0.286   singleton cluster appeared
+ k=8+                                 singleton clusters appeared

  Why NOT k=10 (highest score): singleton cluster (1 observation)
+ Why NOT k=4 (strong score): largest cluster = 72%, not a meaningful separation
+ Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
294
+ ```
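
The selection logic above (silhouette score combined with cluster-size checks) might be sketched like this. The data is synthetic (`make_blobs`), and the filtering rule with its 70% cap on the largest cluster is a hypothetical encoding of the README's reasoning, not the notebook's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training matrix (hypothetical data).
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sizes = np.bincount(labels, minlength=k)
    rows.append({
        "k": k,
        "silhouette": float(silhouette_score(X_scaled, labels)),
        "smallest": int(sizes.min()),
        "largest_share": float(sizes.max() / sizes.sum()),
    })

for r in rows:
    print(f"k={r['k']}  silhouette={r['silhouette']:.3f}  "
          f"smallest={r['smallest']}  largest_share={r['largest_share']:.0%}")

# Assumed rule: discard k values with singleton clusters or one dominant
# cluster, then take the best silhouette among the remaining candidates.
candidates = [r for r in rows if r["smallest"] > 1 and r["largest_share"] < 0.70]
best = max(candidates, key=lambda r: r["silhouette"])
print("selected k:", best["k"])
```

Combining the score with a size table is what lets a slightly lower silhouette (k=6 in the README) win over a higher one that comes with a degenerate partition.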
 
 
 
295
 
296
+ **Cluster profiles at k=6 (training set n=23,657):**
297
 
298
+ | Cluster | Label | Size | Share | Key Signal |
299
+ |---|---|---|---|---|
300
+ | 0 | Manager-focused | 4,571 | 19% | `is_manager_role=1.00` |
301
+ | 1 | General / Mixed | 13,055 | 55% | No dominant role signal |
302
+ | 2 | Salary-transparent | 1,940 | 8% | `has_salary_info=1.00` |
303
+ | 3 | Data roles | 1,451 | 6% | `is_data_role=1.00` |
304
+ | 4 | Software roles | 2,057 | 9% | `is_software_role=1.00` |
305
+ | 5 | Entry / low salary | 583 | 2% | Smallest cluster |
306
 
307
+ **Official final silhouette score: 0.290** (full training matrix)
 
308
 
309
  Cluster labels are one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
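
One way to build such dummy columns is shown below, as a sketch under stated assumptions: the matrices are random placeholders, `cluster_dummies` is a hypothetical helper, and KMeans is fitted on the training split only so that test rows are merely assigned to existing clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))  # placeholder clustering features
X_test = rng.normal(size=(50, 3))

# Fit on training data only; the test set never shapes the centroids.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X_train)

def cluster_dummies(X, model):
    """Return one 0/1 column per cluster, even for clusters absent from X."""
    labels = pd.Series(
        pd.Categorical(model.predict(X), categories=range(model.n_clusters))
    )
    return pd.get_dummies(labels, prefix="cluster").astype(int)

train_dummies = cluster_dummies(X_train, kmeans)
test_dummies = cluster_dummies(X_test, kmeans)
print(train_dummies.head())
```

Using a fixed category list keeps the train and test matrices at the same six columns (`cluster_0` to `cluster_5`) even if a small cluster happens to be empty in the test split.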
310
 
 
321
 
322
  Baseline Linear Regression (20 features, no clustering):
323
  RMSE_log = 0.8425   R² = 0.0639
 
324
  ```
325
 
326
+ ### Full Model Comparison (after feature engineering + clustering)
327
 
328
+ | Model | RMSE_log ↓ | R² ↑ | Notes |
+ |---|---|---|---|
+ | **Random Forest (Tuned) ★** | **0.8347** | **0.0811** | RandomizedSearchCV winner |
+ | Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints |
+ | Gradient Boosting | 0.8370 | 0.0770 | - |
+ | Linear Regression + Features | 0.8420 | 0.0640 | - |
+ | RidgeCV | 0.8420 | 0.0640 | - |
+ | Lasso Regression | 0.8430 | 0.0640 | - |
+ | PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance |
+ | Mean Baseline | 0.8708 | -0.0002 | Floor |
338
+
339
+ **Key lessons:**
340
+ - Unrestricted RF → train R²=0.854, test R²=0.003 (massive overfit). Fixed by constraining `max_depth`, `min_samples_split`, `min_samples_leaf`, and `max_features`.
+ - 3-fold CV mean RMSE_log: 0.8747 (±0.0125), stable across folds
+ - Outlier robustness test: capping views at the 99th percentile → RMSE_log 0.8147, R²=0.0812
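
The overfitting pattern in the first lesson can be reproduced on synthetic data. The sketch below assumes a weak-signal target and uses illustrative hyperparameter values; these are not the tuned values from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1500, 8))
# Weak signal buried in noise, loosely mimicking a low-R² target.
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=1500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def train_test_r2(model):
    """Fit the model and return (train R², test R²)."""
    model.fit(X_train, y_train)
    return (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )

unrestricted = RandomForestRegressor(n_estimators=100, random_state=42)
constrained = RandomForestRegressor(
    n_estimators=100,
    max_depth=6,            # illustrative values, not the tuned ones
    min_samples_split=20,
    min_samples_leaf=10,
    max_features="sqrt",
    random_state=42,
)

r2_unrestricted = train_test_r2(unrestricted)
r2_constrained = train_test_r2(constrained)
gap_unrestricted = r2_unrestricted[0] - r2_unrestricted[1]
gap_constrained = r2_constrained[0] - r2_constrained[1]
print("unrestricted train/test R²:", r2_unrestricted)
print("constrained  train/test R²:", r2_constrained)
```

Fully grown trees memorise noise in the training rows, so the train/test gap is large; limiting depth and leaf size trades training fit for a smaller gap.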
 
 
 
 
343
 
344
+ ### Top Feature Importances (RF Tuned)
345
 
346
  ```
347
+ description_density            ████████████   #1  - content quality proxy
  description_length             ██████████░░   #2  - raw description size
  description_word_count         ████████░░░░   #3  - word count
+ title-description interaction  ████████░░░░   #4  - combined text signal
  is_software_role               ██████░░░░░░   #5  - tech role demand
  is_data_role                   █████░░░░░░░   #6  - data role demand
  salary_log / has_salary_info   ████░░░░░░░░   #7+ - salary signals
354
  ```
355
 
356
+ > `desc_salary_interaction` ranks #2 in SHAP analysis but further down in Gini importance. Both methods agree on description quality and salary as top drivers.
357
 
358
+ ### Why R² = 0.081 Is Acceptable
359
 
360
  ```
361
  R² = 0.081 → model explains ~8% of variance in log(views+1)

+ ✓ Beats mean baseline (R²≈0): real posting-level signal captured
+ ✓ Social engagement is inherently noisy: platform factors dominate
+ ✓ ~92% of variance is unexplained, plausibly from unobserved sources (algorithm, followers, ads)
+ ✓ Practical use = ranking postings, not forecasting exact counts
 
 
 
 
367
  ```
368
 
369
  ---
 
373
  ```
374
  Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
375
  Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
376
+ Feature matrix: X_clf uses 24 features (see notebook cell 207)
 
377
  Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
378
  ```
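
The target definition above can be sketched with NumPy alone. The view counts here are simulated from a log-normal distribution as a stand-in for the real data; the key step is that the 75th-percentile threshold is computed from the training split and then applied unchanged to the test split.

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated right-skewed view counts (hypothetical stand-in data).
views_train = rng.lognormal(mean=3.0, sigma=1.0, size=8000)
views_test = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

# Threshold comes from the training views only, so no test information leaks.
threshold = np.percentile(views_train, 75)

y_train = (views_train >= threshold).astype(int)
y_test = (views_test >= threshold).astype(int)

print(f"threshold={threshold:.1f}")
print(f"train positive rate={y_train.mean():.3f}")
print(f"test positive rate={y_test.mean():.3f}")
```

The training positive rate is 25% by construction, while the test rate only lands near 25% if the two splits share a distribution, which is a useful sanity check in itself.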
379
 
380
+ ### Model Comparison
381
 
382
+ | Model | F1 (Class 1) | Recall (Class 1) | Notes |
383
+ |---|---|---|---|
384
+ | **Decision Tree ★** | **HIGHEST** | **HIGHEST** | max_depth=8, class_weight="balanced" |
385
+ | Logistic Regression | near-best | high | Close to DT in F1 |
386
+ | Random Forest | moderate | lower | Lowest FP count |
387
+ | Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 |
 
 
 
 
 
388
 
389
+ **5-fold CV F1: 0.4424 ± 0.0152**, stable across folds, so the result is not an artifact of a lucky split
390
+
391
+ ### Error Cost Analysis
392
 
393
  ```
394
+ FN (missed high-engagement) = most costly error
+   → Company fails to prioritize, promote, or learn from a strong posting

+ FP (false alarm) = also costly
+   → Recruiter wastes time and budget on a posting that won't perform
 
 
399
  ```
400
 
401
+ Decision Tree minimises FN (catches most high-engagement postings) but produces more FP.
402
+ Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.
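
The FN/FP trade-off described above can be read off a confusion matrix. The labels and predictions below are made-up toy arrays (not model output), chosen so that one set of predictions is recall-oriented and the other is conservative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Toy ground truth: 1 = high engagement (hypothetical values).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Recall-oriented predictions: few misses, more false alarms.
y_pred_recall = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])
# Conservative predictions: few false alarms, more misses.
y_pred_conservative = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

def error_profile(y_true, y_pred):
    """Summarise false positives, false negatives, recall and F1 for class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "FP": int(fp),
        "FN": int(fn),
        "recall": float(recall_score(y_true, y_pred)),
        "f1": float(f1_score(y_true, y_pred)),
    }

profile_recall = error_profile(y_true, y_pred_recall)
profile_conservative = error_profile(y_true, y_pred_conservative)
print(profile_recall)
print(profile_conservative)
```

Which profile is preferable depends on the relative cost assigned to a missed high-engagement posting versus a false alarm, which is the judgment the error cost analysis makes explicit.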
403
+
404
  ---
405
 
406
+ ## 💡 Business Insights
407
 
408
+ 1. **Salary transparency is the single highest-leverage action**: postings that disclose salary receive 74.3% more views, at no extra cost. Fewer than 30% of postings disclose salary today.
+ 2. **Description structure matters**: `description_density` ranked #1 in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
+ 3. **Tech roles attract disproportionate engagement**: `is_software_role` and `is_data_role` carry real signal beyond salary.
+ 4. **Work type is associated with engagement**: contract roles lead, but full-time dominates volume (80%).
+ 5. **Platform factors dominate**: R²≈0.08 is expected and acceptable. Model value is in **ranking** postings, not exact prediction.
413
 
414
  ---
415
 
416
  ## 🎁 Bonus Work
417
 
 
 
 
 
 
 
 
 
 
 
418
  ### 🧠 SHAP Explainability
419
 
420
  ```
421
  SHAP mean |value|: RF Tuned regression (test observations)

+ description_density       ████████████   strongest positive impact ↑
  desc_salary_interaction   ██████████░░   salary × description synergy ↑
  salary_log                ████████░░░░   salary level ↑
  has_salary_info           ██████░░░░░░   disclosed → more views ↑
  posting_weekend           ██░░░░░░░░░░   weekend → fewer views ↓
 
 
 
428
  ```
429
 
430
+ `desc_salary_interaction` ranks #2 in SHAP but lower in Gini, which suggests it captures a genuine non-linear interaction that neither feature captures alone.
431
+
432
  ### 📊 Feature Importance: Regression vs Classification
433
 
434
  ```
435
                            Regression RF   Classification DT
  description_density       #1              #2
+ desc_salary_interaction   #2 (SHAP)       varies
  salary_log                #7+             varies
  is_entry_role             lower           rises in classification
  is_data_role              #6              varies
+ ──────────────────────────────────────────────────────────
+ Agreement:  description quality + salary dominate both models
  Divergence: seniority/role flags matter more for threshold-crossing
              (classification) than for predicting exact counts (regression)
445
  ```
446
 
447
+ ### 🔬 Additional Extras
448
 
449
+ - **Interactive K-Means Widget**: explore different k values visually (notebook cell 4.11)
  - **Hierarchical Clustering Dendrogram**: Ward linkage, 300 obs sample (cell 4.12)
  - **Agglomerative Clustering Diagnostic**: k=2–10 comparison (cell 4.13)
  - **Outlier Robustness Test**: views capped at 99th percentile give RMSE_log 0.8147 vs 0.8347 uncapped
 
464
  with open("linkedin_classification_model.pkl", "rb") as f:
465
  clf_model = pickle.load(f)
466
 
467
+ # Regression: predict log(views+1), then convert back to a raw view estimate
468
  log_views_pred = reg_model.predict(X_test_fe)
469
  views_pred = np.expm1(log_views_pred)
470
 
471
+ # Classification: predict high-engagement label (0 = Normal, 1 = High)
472
  label = clf_model.predict(X_clf)
473
  ```
474
 
475
+ > Regression model expects **30-column** `X_test_fe` (including cluster dummies).
476
+ > Classification model expects **24-column** `X_clf` (see notebook cell 207).
477
+ > Run the full pipeline in the notebook to produce compatible feature matrices.
478
 
479
  ---
480
 
481
  *Assignment 2: Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*
482
+