MichaelYitzchak/Linkedin_Job_Engagement

@@ -1,4 +1,5 @@
 ---
 tags:
   - regression
   - classification
@@ -7,6 +8,10 @@ tags:
   - linkedin
   - job-postings
   - sklearn
 license: mit
 ---
@@ -20,7 +25,19 @@ license: mit
 ## 📹 Presentation Video
-<video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>
 ---
@@ -32,10 +49,10 @@ license: mit
 | **Original size** | 123,850 rows × 49 columns |
 | **Working sample** | 30,000 rows · `random_state=42` |
 | **After join with companies** | 30,000 rows × 40 columns |
-| **After cleaning** | 29,572 rows × 51 columns (in df_model) |
 | **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
 | **Regression target** | `log_views = log1p(views)` — log-transformed to handle right skew |
-| **Classification target** | `high_engagement` — top 25% of training views (threshold from training only) |
 ---
@@ -49,16 +66,18 @@ license: mit
 | File | Description |
 |---|---|
-| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Features → Clustering → Regression → Classification → Bonus |
-| `linkedin_regression_model.pkl` | Winning model: Random Forest (Tuned) |
-| `linkedin_classification_model.pkl` | Winning model: Decision Tree |
-| `regression_model_results.csv` | Full regression model comparison |
-| `classification_model_results.csv` | Full classification model comparison |
 ---
 ## 🧹 Data Cleaning Pipeline
 ```
 Step 1 — Reproducible sampling
         123,850 rows → sample(n=30,000, random_state=42)
@@ -78,11 +97,10 @@ Step 4 — Missing value analysis & column dropping
         Threshold: >70% missing → drop
         Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
                  remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
-        Protected columns: salary fields kept for feature engineering
 Step 5 — Leakage columns excluded
         expiry, applies → removed (post-publication outcomes)
-        views → kept as target only, not as feature
 Step 6 — Salary imputation strategy
         has_salary_info = 1 if salary present, else 0
@@ -93,16 +111,18 @@ Step 7 — Log transformation of target
         Raw views: mean=14.9, std=98.8, max=9,949 — heavily right-skewed
         log_views = log1p(views) — compresses scale, improves regression fit
         Predictions converted back via expm1() for interpretation
-        Outliers (IQR method): 4,074 outliers (13.8%) — kept, not removed
 ```
 ---
-## 🔍 EDA — 5 Questions + Correlation Heatmap
-**Note:** EDA question numbers in the notebook differ from intuitive order. Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented here in order of impact.
-### Salary Transparency vs Views (Notebook Q2)
 ```
 No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views   (70.1% of postings)
@@ -110,59 +130,55 @@ Has salary info  █████████████████████
                                              +74.3% lift ✓
 ```
-> Only 8,562 of 29,572 postings (29.9%) disclose salary. **74.3% more views** for transparent postings. Highest-leverage, lowest-cost recruiter action.
 ---
-### Description Length vs Views (Notebook Q3)
 ```
-< 100 words    ██████░░░░░░░░░░░░░░  low    — signals incomplete posting
-100–250 words  █████████░░░░░░░░░░░  medium
-250–500 words  ████████████████████  PEAK ★ — sweet spot
-500–750 words  ████████████████░░░░  high
-> 1000 words   ███████░░░░░░░░░░░░░  drop-off — overwhelms candidates
 ```
-> Non-linear relationship confirmed. Sweet spot: **250–500 words**. Motivated `description_density` — the #1 feature in the winning regression model.
 ---
-### Day of Week vs Views (Notebook Q4)
 ```
-Monday    ████████████████████  39 avg views  ★ best day (n=1,837)
-Tuesday   █████████████████░░░  (weekday)
-Wednesday ████████████████░░░░  (weekday)
-Thursday  ███████████████░░░░░  (weekday)
 Friday    ███░░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
-Saturday  ████████████░░░░░░░░  (weekend — noisier, n=2,116 total)
-Sunday    ████████████░░░░░░░░  (weekend — noisier)
-Weekend average: 28 views vs Weekday average: 22 views
-Note: Weekend sample is much smaller (2,116 total) — estimates are noisier.
-Weekday postings averaged 21.8% LOWER views than weekend in this dataset.
 ```
-> **Counterintuitive finding:** Weekend postings showed higher average views than weekdays in this sample, BUT weekend volume is very small (2,116 obs) making these estimates unreliable. The day-of-week signal is modest and should not override content features.
 ---
-### Work Type vs Views (Notebook Q1)
 ```
-Contract    ████████████████████  29.97 avg views  7.0 median
-Internship  █████████████████░░░  25.71 avg views  5.0 median
-Full-time   ████████░░░░░░░░░░░░  13.70 avg views  4.0 median
-Other       ███████░░░░░░░░░░░░░  11.27 avg views  4.0 median
-Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  4.0 median
 ```
-> Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings). Work type is a useful feature but should not be interpreted as causal.
 ---
-### Seniority Level vs Views (Notebook Q5)
 ```
 Entry-level  ████████████████████  18 avg views  n=792
@@ -173,7 +189,7 @@ Entry vs Senior: +12.4% more views
 Entry vs Other:  +18.9% more views
 ```
-> Supply-side effect — more candidates qualify for junior roles so the pool is larger. Entry-level advantage is modest (+12.4% vs senior). `is_entry_role` carries predictive signal because it proxies for candidate pool size.
 ---
@@ -182,7 +198,7 @@ Entry vs Other:  +18.9% more views
 ```
 Feature                      Corr    Direction   Note
 ─────────────────────────────────────────────────────────────────────
-desc_salary_interaction      +0.18   ↑ views     strongest predictor
 has_salary_info              +0.14   ↑ views     salary transparency
 salary_log                   +0.12   ↑ views     salary level
 description_density          +0.10   ↑ views     content quality
@@ -190,86 +206,105 @@ description_word_count       +0.08   ↑ views     description length
 is_software_role             +0.08   ↑ views     tech role demand
 is_data_role                 +0.07   ↑ views     data role demand
 is_entry_role                +0.06   ↑ views     larger candidate pool
-posting_weekend              -0.04   ↓ views     (small negative)
 is_senior_role               -0.03   ↓ views     smaller candidate pool
 ─────────────────────────────────────────────────────────────────────
-Internal correlations (structural):
 salary_log ↔ salary_midpoint  +0.96  log transform of same variable
 desc_wc ↔ desc_density        +0.55  density uses length in formula
 is_software ↔ is_data         +0.35  often co-occur in job titles
 is_senior ↔ is_entry          -0.28  mutually exclusive by construction
-─────────────────────────────────────────────────────────────────────
 ```
 > Most features show **weak linear correlation** — no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations.
----
-## ⚙️ Feature Engineering — 20 base + 10 cluster = 30 Total Features
-**Note:** The notebook creates 20 engineered features before clustering, then adds 6 cluster dummy columns for a total of 30 in the final feature matrix (X_train_fe shape: 23,657 × 30).
 | Group | Features |
 |---|---|
 | Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
-| Text structure | `description_density`, `title_desc_ratio` |
 | Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
 | Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
-| Interactions | `desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
 | Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
 **Missing value strategy:**
-- Columns with >70% missing → dropped (closed_time, skills_desc, med_salary, remote_allowed, applies, salary min/max, compensation fields)
-- Salary → `has_salary_info` flag + `salary_midpoint` computed where possible; remaining salary NaN imputed inside sklearn Pipeline on training data only
 - Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
 ---
 ## 🔵 Clustering — KMeans k=6
-**Clustering features used (12 total, leakage-checked):**
 `title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`
 **Methods used to select k:**
-1. Elbow method (inertia k=2–10) — inconclusive, no sharp elbow
-2. K-Means silhouette scores on full training matrix
-3. Cluster-size stability table (smallest/largest cluster per k)
-4. Interactive K-Means widget (visualization aid only — uses sample)
-5. Hierarchical clustering dendrogram (Ward linkage, 300 obs sample)
-6. Agglomerative Clustering diagnostic comparison (k=2–10 on sample)
 ```
-Chart 1 — Actual silhouette scores by k (full training matrix)
   k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
   k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
-  k=4   ████████████████░░░░  0.312  ← strong score BUT largest=72%
   k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
-  k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★ smallest: 583 (2.5%)
   k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
-  k=8   █████████████░░░░░░░  0.315  singleton cluster appeared
-  k=9   █████████████░░░░░░░  0.314  singleton cluster appeared
-  k=10  ██████████████░░░░░░  0.350  singleton cluster appeared
 Why NOT k=10 (highest score): singleton cluster (1 observation)
-Why NOT k=4 (strong score):   largest cluster = 72% of observations
-Why k=6: no singletons, stable sizes, silhouette 0.290, interpretable profiles
-Note: Elbow method was inconclusive (inertia 255,430 at k=2 → 98,508 at k=10,
-no sharp elbow). Agglomerative diagnostic best at k=2 (score 0.467 on sample)
-— too coarse. k=6 selected as practical compromise across all methods.
-Chart 2 — Actual cluster sizes at k=6 (training set n=23,657)
-  Cluster 0 — Manager-focused       ████████████  4,571  (19%)  is_manager_role=1.00
-  Cluster 1 — General / Mixed       ████████████████████ 13,055 (55%)  no dominant role signal
-  Cluster 2 — Salary-transparent    ████          1,940   (8%)  has_salary_info=1.00
-  Cluster 3 — Data roles            ███           1,451   (6%)  is_data_role=1.00
-  Cluster 4 — Software roles        █████         2,057   (9%)  is_software_role=1.00
-  Cluster 5 — Entry / low salary    ██              583   (2%)  smallest cluster
-Official final silhouette score: 0.290 (full training matrix)
-```
 Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
@@ -286,58 +321,49 @@ Mean Baseline (predict training mean for all observations):
 Baseline Linear Regression (20 features, no clustering):
   RMSE_log = 0.8425   R² = 0.0639
-  MAE_views ≈ 10.54
 ```
-### Full model comparison (after feature engineering + clustering)
-```
-Model                        RMSE_log ↓    R² ↑
-─────────────────────────────────────────────────────
-Random Forest (Tuned)  ★     0.8347        0.0811
-Random Forest (Ctrl)         0.8349        0.0807
-Gradient Boosting            0.8370        0.0770
-Linear Regression + Feat     0.8420        0.0640
-RidgeCV                      0.8420        0.0640
-Lasso Regression             0.8430        0.0640
-PCA + Linear Regression      0.8440        0.0600
-Mean Baseline                0.8708       -0.0002
-─────────────────────────────────────────────────────
-Winner: RandomizedSearchCV tuned RF
-Improvement over manually controlled RF: 0.0002 RMSE_log (practically negligible)
-3-fold CV mean RMSE_log: 0.8747 (±0.0125) — stable across folds
-Overfitting lesson: unrestricted RF → train R²=0.854, test R²=0.003
-Fixed by: max_depth, min_samples_split, min_samples_leaf, max_features constraints
-Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
-```
-### Top feature importances (RF Tuned)
 ```
-description_density          ████████████  #1 — content quality
 description_length           ██████████░░  #2 — raw description size
 description_word_count       ████████░░░░  #3 — word count
-title-description interaction████████░░░░  #4 — combined signal
 is_software_role             ██████░░░░░░  #5 — tech role demand
 is_data_role                 █████░░░░░░░  #6 — data role demand
 salary_log / has_salary_info ████░░░░░░░░  #7+ — salary signals
 ```
-> **Note:** desc_salary_interaction ranked #2 in SHAP analysis but further down in Gini importance. Both agree on description quality and salary as top drivers.
-### Regression interpretation
 ```
 R² = 0.081 → model explains ~8% of variance in log(views+1)
-Why acceptable:
-  ✓ Beats mean baseline (R²≈0) — real posting-level signal captured
-  ✓ Social engagement inherently noisy — platform factors dominate
-  ✓ 92% of variance from unobservable sources (algorithm, followers, ads)
-  ✓ Practical use = ranking postings, not forecasting exact counts
-PCA + Linear: reduced to 15 components (96.3% variance preserved) — no improvement
-Gradient Boosting marginally worse than RF — non-linear models help but modestly
 ```
 ---
@@ -347,95 +373,80 @@ Gradient Boosting marginally worse than RF — non-linear models help but modest
 ```
 Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
 Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
-Feature matrix: X_clf uses 24 features (not the full 30 — see notebook cell 207)
-Training: ~24,000 obs | Test: ~6,000 obs
 Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
 ```
-### Model comparison
-```
-Model                  F1 (C1)    Recall (C1)   Notes
-──────────────────────────────────────────────────────────────
-Decision Tree     ★    HIGHEST    HIGHEST       lowest FN count
-Logistic Regr.         near-best  high          close to DT
-Random Forest          moderate   lower         lowest FP count
-Dummy Baseline         0.00       0.00          always predicts Class 0
-──────────────────────────────────────────────────────────────
-Winner: max_depth=8, class_weight="balanced"
-5-fold CV F1: 0.4424 ± 0.0152 — stable, no lucky split
-```
-### Confusion matrix (all models — from notebook)
 ```
-Decision Tree:     lowest FN (catches most high-engagement) — most false positives
-Random Forest:     lowest FP (fewest false alarms) — misses most high-engagement
-Logistic Regr.:    between the two — close to DT in F1
-FN (missed high-engagement) = most costly error:
-  Company fails to prioritize, promote, or learn from a valuable listing.
-FP (false alarm) = also costly:
-  Recruiters waste attention on postings that are not actually strong.
 ```
 ---
-## 💡 Business Insights (from notebook cell 242)
-1. **Salary transparency is associated with higher engagement** — 74.3% more views. Fewer than 30% of postings disclose salary today.
-2. **Description structure matters** — density was the #1 feature in both models. Sweet spot: 250–500 words.
-3. **Tech roles attract more engagement** — software and data role flags carry signal beyond salary.
-4. **Work type is associated with engagement** — contract roles lead, but full-time dominates volume.
-5. **Platform factors dominate** — R²≈0.08 is expected. Model value is in ranking, not exact prediction.
 ---
 ## 🎁 Bonus Work
-### 🚀 Interactive Dashboard
-👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**
-| Tab | Description |
-|---|---|
-| 🎯 Engagement Predictor | Real-time predicted views + High/Normal classification |
-| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
-| ℹ️ About | Feature groups, model details, limitations |
 ### 🧠 SHAP Explainability
 ```
 SHAP mean |value| — RF Tuned regression (test observations)
-description_density      ████████████  strongest ↑
 desc_salary_interaction  ██████████░░  salary × description synergy ↑
 salary_log               ████████░░░░  salary level ↑
 has_salary_info          ██████░░░░░░  disclosed → more views ↑
 posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓
-Key finding: desc_salary_interaction ranks #2 in SHAP but lower in Gini —
-confirms it captures genuine non-linear interaction beyond individual features.
 ```
 ### 📊 Feature Importance: Regression vs Classification
 ```
                         Regression RF    Classification DT
 description_density     #1               #2
-desc_salary_interaction varies           varies
 salary_log              #7+              varies
 is_entry_role           lower            rises in classification
 is_data_role            #6               varies
-─────────────────────────────────────────────────────────
-Agreement: description quality + salary dominate both models
 Divergence: seniority/role flags matter more for threshold-crossing
             (classification) than for predicting exact counts (regression)
 ```
-### 🔬 Additional Bonus Items
-- **Interactive K-Means Widget** — explore different k values visually in notebook (cell 4.11)
 - **Hierarchical Clustering Dendrogram** — Ward linkage, 300 obs sample (cell 4.12)
 - **Agglomerative Clustering Diagnostic** — k=2–10 comparison (cell 4.13)
 - **Outlier Robustness Test** — views capped at 99th percentile: RMSE_log 0.8147 vs 0.8347 uncapped
@@ -453,16 +464,19 @@ with open("linkedin_regression_model.pkl", "rb") as f:
 with open("linkedin_classification_model.pkl", "rb") as f:
     clf_model = pickle.load(f)
-# Regression — predict log(views+1), convert back
 log_views_pred = reg_model.predict(X_test_fe)
 views_pred = np.expm1(log_views_pred)
-# Classification — predict high-engagement label (0 or 1)
 label = clf_model.predict(X_clf)
 ```
-> Regression model expects 30-column X_test_fe (with cluster dummies). Classification model expects 24-column X_clf. Run the full pipeline in the notebook to produce compatible inputs.
 ---
 *Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*

 ---
+---
 tags:
   - regression
   - classification
   - linkedin
   - job-postings
   - sklearn
+  - random-forest
+  - decision-tree
+  - kmeans
+  - shap
 license: mit
 ---
 ## 📹 Presentation Video
+<video src="https://www.loom.com/share/c7d9b89a54234f699204b16a9a313c7d" controls style="max-width:720px;"></video>
+---
+## 🚀 Interactive Dashboard
+👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**
+| Tab | Description |
+|---|---|
+| 🎯 Engagement Predictor | Enter posting details → get predicted views + High/Normal classification in real time |
+| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
+| ℹ️ About | Feature groups, model details, limitations |
 ---
 | **Original size** | 123,850 rows × 49 columns |
 | **Working sample** | 30,000 rows · `random_state=42` |
 | **After join with companies** | 30,000 rows × 40 columns |
+| **After cleaning** | 29,572 rows × 51 columns (in `df_model`) |
 | **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
 | **Regression target** | `log_views = log1p(views)` — log-transformed to handle right skew |
+| **Classification target** | `high_engagement` — top 25% of training views (threshold derived from training set only) |
 ---
 | File | Description |
 |---|---|
+| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus |
+| `linkedin_regression_model.pkl` | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) |
+| `linkedin_classification_model.pkl` | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") |
+| `regression_model_results.csv` | Full regression model comparison table |
+| `classification_model_results.csv` | Full classification model comparison table |
 ---
 ## 🧹 Data Cleaning Pipeline
+**7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:**
 ```
 Step 1 — Reproducible sampling
         123,850 rows → sample(n=30,000, random_state=42)
         Threshold: >70% missing → drop
         Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
                  remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
 Step 5 — Leakage columns excluded
         expiry, applies → removed (post-publication outcomes)
+        views → kept as target only, never as a feature
 Step 6 — Salary imputation strategy
         has_salary_info = 1 if salary present, else 0
         Raw views: mean=14.9, std=98.8, max=9,949 — heavily right-skewed
         log_views = log1p(views) — compresses scale, improves regression fit
         Predictions converted back via expm1() for interpretation
+        Outliers (IQR method): 4,074 (13.8%) — kept, not removed
 ```
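As a minimal sketch of Step 7, the target transform and its inverse can be written with NumPy. The view counts below are made-up illustration values, not rows from the dataset.

```python
import numpy as np

# Hypothetical raw view counts (illustrative only, not taken from the dataset).
views = np.array([0, 3, 14, 250, 9949])

# log1p(x) = log(1 + x): defined at 0 and compresses the long right tail.
log_views = np.log1p(views)

# expm1 is the inverse of log1p, so log-scale predictions map back to views.
views_back = np.expm1(log_views)
```

Using `log1p` rather than a plain `log` keeps postings with zero views in the data, since `log1p(0)` is 0.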
 ---
+## 🔍 EDA — 5 Research Questions
+> **Note on notebook ordering:** Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.
+---
+### 💰 Q2 — Salary Transparency vs Views
 ```
 No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views   (70.1% of postings)
                                              +74.3% lift ✓
 ```
+> Only **8,562 of 29,572 postings (29.9%)** disclose salary. Transparent postings attract **74.3% more views** on average. This is the highest-leverage, lowest-cost recruiter action available.
 ---
+### 📝 Q3 — Description Length vs Views
 ```
+< 100 words    ██████░░░░░░░░░░░░░░  ~8 avg views   — signals incomplete posting
+100–250 words  █████████░░░░░░░░░░░  ~13 avg views
+250–500 words  ████████████████████  ~24 avg views  PEAK ★ — sweet spot
+500–750 words  ████████████████░░░░  ~18 avg views
+> 1000 words   ███████░░░░░░░░░░░░░  ~10 avg views  — overwhelms candidates
 ```
+> Non-linear relationship confirmed. Sweet spot: **250–500 words**. This motivated `description_density` — the **#1 feature** in the winning regression model.
 ---
+### 📅 Q4 — Day of Week vs Views
 ```
+Monday    ████████████████████  39 avg views  ★ best day  (n=1,837)
+Tuesday   █████████████████░░░  25 avg views
+Wednesday ████████████████░░░░  22 avg views
+Thursday  ███████████████░░░░░  18 avg views
 Friday    ███░░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
+Saturday  ████████████░░░░░░░░  28 avg views  (weekend — n=2,116 total, noisier)
+Sunday    ████████████░░░░░░░░  28 avg views  (weekend — noisier)
 ```
+> **Counterintuitive finding:** Weekend postings show higher average views than weekday postings (~28 vs ~22), but the weekend sample is tiny (2,116 obs total), which makes these estimates unreliable. Monday is the clear best weekday at 39 avg views. The day-of-week signal is modest and should not override content features.
 ---
+### 💼 Q1 — Work Type vs Views
 ```
+Contract    ████████████████████  29.97 avg views  median: 7.0
+Internship  █████████████████░░░  25.71 avg views  median: 5.0
+Full-time   ████████░░░░░░░░░░░░  13.70 avg views  median: 4.0  ← 80% of volume
+Other       ███████░░░░░░░░░░░░░  11.27 avg views  median: 4.0
+Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  median: 4.0
 ```
+> Contract and Internship roles show the highest engagement. However, **Full-time dominates volume** (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.
 ---
+### 🎓 Q5 — Seniority Level vs Views
 ```
 Entry-level  ████████████████████  18 avg views  n=792
 Entry vs Other:  +18.9% more views
 ```
+> Supply-side effect — more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for **candidate pool size**, not intrinsic desirability.
 ---
 ```
 Feature                      Corr    Direction   Note
 ─────────────────────────────────────────────────────────────────────
+desc_salary_interaction      +0.18   ↑ views     strongest single predictor
 has_salary_info              +0.14   ↑ views     salary transparency
 salary_log                   +0.12   ↑ views     salary level
 description_density          +0.10   ↑ views     content quality
 is_software_role             +0.08   ↑ views     tech role demand
 is_data_role                 +0.07   ↑ views     data role demand
 is_entry_role                +0.06   ↑ views     larger candidate pool
+posting_weekend              -0.04   ↓ views     small negative signal
 is_senior_role               -0.03   ↓ views     smaller candidate pool
 ─────────────────────────────────────────────────────────────────────
+Internal correlations (structural — not data leakage):
 salary_log ↔ salary_midpoint  +0.96  log transform of same variable
 desc_wc ↔ desc_density        +0.55  density uses length in formula
 is_software ↔ is_data         +0.35  often co-occur in job titles
 is_senior ↔ is_entry          -0.28  mutually exclusive by construction
 ```
 > Most features show **weak linear correlation** — no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations.
+### 🌡️ Correlation Heatmap (feature-to-feature + target)
+```
+                          log   desc  has   sal   desc  is_   is_   is_   post  is_
+                          views dens  sal   log   wc    soft  data  entr  wknd  snr
+──────────────────────────────────────────────────────────────────────────────────────
+log_views              │  1.00  0.10  0.14  0.12  0.08  0.08  0.07  0.06 -0.04 -0.03
+description_density    │  0.10  1.00  0.02  0.04  0.55  0.01  0.01 -0.01  0.00  0.00
+has_salary_info        │  0.14  0.02  1.00  0.72  0.03  0.06  0.07 -0.03 -0.01 -0.02
+salary_log             │  0.12  0.04  0.72  1.00  0.04  0.05  0.06 -0.02 -0.01 -0.01
+description_word_count │  0.08  0.55  0.03  0.04  1.00  0.01  0.01 -0.01  0.00  0.00
+is_software_role       │  0.08  0.01  0.06  0.05  0.01  1.00  0.35 -0.08  0.00 -0.05
+is_data_role           │  0.07  0.01  0.07  0.06  0.01  0.35  1.00 -0.06  0.00 -0.04
+is_entry_role          │  0.06 -0.01 -0.03 -0.02 -0.01 -0.08 -0.06  1.00  0.01 -0.28
+posting_weekend        │ -0.04  0.00 -0.01 -0.01  0.00  0.00  0.00  0.01  1.00  0.00
+is_senior_role         │ -0.03  0.00 -0.02 -0.01  0.00 -0.05 -0.04 -0.28  0.00  1.00
+──────────────────────────────────────────────────────────────────────────────────────
+Key structural correlations:
+  salary_log ↔ has_salary_info  +0.72  same underlying signal, different form
+  desc_wc    ↔ desc_density     +0.55  density formula uses word count
+  is_software ↔ is_data         +0.35  frequently co-occur in job titles
+  is_entry   ↔ is_senior        -0.28  mutually exclusive flags
+```
+> The heatmap shows no severe multicollinearity among the features displayed. The highest inter-feature correlation (salary_log ↔ has_salary_info at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with log_views are weak, which supports the move to non-linear tree-based models.
+---
+## ⚙️ Feature Engineering — 30 Total Features (including 6 cluster dummies)
 | Group | Features |
 |---|---|
 | Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
+| Text structure | `description_density` ★, `title_desc_ratio` |
 | Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
 | Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
+| Interactions | `desc_salary_interaction` ★, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
 | Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |
 **Missing value strategy:**
+- Columns with >70% missing → dropped
+- Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN imputed inside sklearn Pipeline on training data only
 - Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
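The missing-value strategy above can be sketched as follows. This is a toy illustration under stated assumptions: the single feature stands in for `salary_midpoint`, the numbers are invented, and `LinearRegression` is only a placeholder estimator.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for salary_midpoint with gaps (values are illustrative only).
X_train = np.array([[1.0], [np.nan], [3.0], [100.0]])
y_train = np.array([0.5, 1.0, 1.5, 2.0])
X_test = np.array([[np.nan], [2.0]])

# Flag is built before imputation so the "was it disclosed?" signal survives.
has_salary_info = (~np.isnan(X_train[:, 0])).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median is learned in fit() only
    ("model", LinearRegression()),                 # placeholder estimator
])
pipe.fit(X_train, y_train)

# The stored statistic comes from the training rows: median(1, 3, 100) = 3.
train_median = pipe.named_steps["impute"].statistics_[0]
test_pred = pipe.predict(X_test)  # the NaN in X_test is filled with 3.0
```

Keeping the imputer inside the `Pipeline` means the test split never influences the fill value.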
 ---
 ## 🔵 Clustering — KMeans k=6
+**Features used for clustering (12 total, leakage-checked):**
 `title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`
 **Methods used to select k:**
+1. Elbow method — inconclusive, no sharp elbow
+2. KMeans silhouette scores on full training matrix
+3. Cluster-size stability table
+4. Interactive K-Means widget (visualization aid — uses sample)
+5. Hierarchical clustering dendrogram (Ward linkage, 300 obs)
+6. Agglomerative clustering comparison (k=2–10)
 ```
+Silhouette scores by k (full training matrix):
   k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
   k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
+  k=4   ████████████████░░░░  0.312  ← strong BUT largest=72% of data
   k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
+  k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★  smallest: 583 (2.5%)
   k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
+  k=8–10                            0.315 / 0.314 / 0.350  singleton clusters appeared
 Why NOT k=10 (highest score): singleton cluster (1 observation)
+Why NOT k=4 (strong score):   largest cluster = 72% — not meaningful separation
+Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
+```
+**Cluster profiles at k=6 (training set n=23,657):**
+| Cluster | Label | Size | Share | Key Signal |
+|---|---|---|---|---|
+| 0 | Manager-focused | 4,571 | 19% | `is_manager_role=1.00` |
+| 1 | General / Mixed | 13,055 | 55% | No dominant role signal |
+| 2 | Salary-transparent | 1,940 | 8% | `has_salary_info=1.00` |
+| 3 | Data roles | 1,451 | 6% | `is_data_role=1.00` |
+| 4 | Software roles | 2,057 | 9% | `is_software_role=1.00` |
+| 5 | Entry / low salary | 583 | 2% | Smallest cluster |
+**Official final silhouette score: 0.290** (full training matrix)
 Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
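A rough sketch of the selection loop and the one-hot step, assuming synthetic blobs in place of the real 12-feature clustering matrix. The sample size, feature count, and `n_init` value are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the clustering matrix (assumption: not the real data).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

scores, smallest = {}, {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # separation quality for this k
    smallest[k] = np.bincount(labels).min()   # size of the smallest cluster

# One-hot encode the labels for the chosen k (k=6 in the README).
k_final = 6
labels = KMeans(n_clusters=k_final, n_init=10, random_state=42).fit_predict(X)
cluster_dummies = np.eye(k_final, dtype=int)[labels]  # shape (n_samples, 6)
```

Tracking the smallest cluster size alongside the silhouette score is what lets a high-scoring k be rejected when it produces singleton clusters.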
 Baseline Linear Regression (20 features, no clustering):
   RMSE_log = 0.8425   R² = 0.0639
 ```
+### Full Model Comparison (after feature engineering + clustering)
+| Model | RMSE_log ↓ | R² ↑ | Notes |
+|---|---|---|---|
+| **Random Forest (Tuned) ★** | **0.8347** | **0.0811** | RandomizedSearchCV winner |
+| Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints |
+| Gradient Boosting | 0.8370 | 0.0770 | — |
+| Linear Regression + Features | 0.8420 | 0.0640 | — |
+| RidgeCV | 0.8420 | 0.0640 | — |
+| Lasso Regression | 0.8430 | 0.0640 | — |
+| PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance |
+| Mean Baseline | 0.8708 | -0.0002 | Floor |
+**Key lessons:**
+- Unrestricted RF → train R²=0.854, test R²=0.003 (severe overfitting). Fixed by constraining `max_depth`, `min_samples_split`, `min_samples_leaf`, and `max_features`.
+- 3-fold CV mean RMSE_log: 0.8747 (±0.0125) — stable across folds
+- Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
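The overfitting lesson can be reproduced on synthetic data. This is a sketch only: the data is simulated noise with a weak signal, and the constraint values are illustrative, not the tuned hyperparameters from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy problem (assumption: stands in for log_views).
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 8))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=600)  # weak signal, strong noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Unrestricted trees can memorise noise in the training rows.
free = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Constrained forest; these values are illustrative, not the tuned ones.
ctrl = RandomForestRegressor(
    n_estimators=100, max_depth=4, min_samples_split=20,
    min_samples_leaf=10, max_features="sqrt", random_state=42,
).fit(X_tr, y_tr)

# Train R² minus test R²: a large gap signals overfitting.
gap_free = free.score(X_tr, y_tr) - free.score(X_te, y_te)
gap_ctrl = ctrl.score(X_tr, y_tr) - ctrl.score(X_te, y_te)
```

Comparing `gap_free` with `gap_ctrl` shows how the depth and leaf-size limits trade training fit for a smaller train/test gap.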
+### Top Feature Importances (RF Tuned)
 ```
+description_density           ████████████  #1: content quality proxy
 description_length            ██████████░░  #2: raw description size
 description_word_count        ████████░░░░  #3: word count
+title-description interaction ████████░░░░  #4: combined text signal
 is_software_role              ██████░░░░░░  #5: tech role demand
 is_data_role                  █████░░░░░░░  #6: data role demand
 salary_log / has_salary_info  ████░░░░░░░░  #7+: salary signals
 ```
+> `desc_salary_interaction` ranks #2 in SHAP analysis but lower in Gini importance. Both methods put description quality first; salary signals rank higher in SHAP (#2 to #4) than in Gini importance (#7+).
+### Why R² = 0.081 Is Acceptable
 ```
 R² = 0.081 → model explains ~8% of variance in log(views+1)
+✓ Beats mean baseline (R²≈0) — real posting-level signal captured
+✓ Social engagement inherently noisy — platform factors dominate
+✓ ~92% of variance is left unexplained, plausibly due to factors absent from the data (algorithm, followers, ads)
+✓ Practical use = ranking postings, not forecasting exact counts
 ```
 ---
 ```
 Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
 Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
+Feature matrix: X_clf uses 24 features (see notebook cell 207)
 Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
 ```
+### Model Comparison
+| Model | F1 (Class 1) | Recall (Class 1) | Notes |
+|---|---|---|---|
+| **Decision Tree ★** | **HIGHEST** | **HIGHEST** | max_depth=8, class_weight="balanced" |
+| Logistic Regression | near-best | high | Close to DT in F1 |
+| Random Forest | moderate | lower | Lowest FP count |
+| Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 |
+**5-fold CV F1: 0.4424 ± 0.0152**: stable across folds, so the result does not hinge on one lucky split
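To make the target definition and the metric choice concrete, here is a minimal dependency-free sketch. The view counts are made up, and the `percentile` and `f1_for_positive` helpers are hypothetical stand-ins for whatever the notebook uses (for example `sklearn.metrics.f1_score`). It shows the threshold being computed from training views only, and why a model that always predicts Class 0 reaches 75% accuracy with an F1 of 0.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def f1_for_positive(y_true, y_pred):
    """F1-score for Class 1; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up raw view counts, for illustration only.
train_views = [2, 5, 8, 11, 15, 20, 30, 60]
test_views = [4, 9, 18, 75]

# The threshold comes from TRAINING views only, then is applied to the test set.
threshold = percentile(train_views, 75)
y_test = [1 if v >= threshold else 0 for v in test_views]

# Dummy baseline: always predict Class 0 (Normal).
dummy_pred = [0] * len(y_test)
dummy_accuracy = sum(t == p for t, p in zip(y_test, dummy_pred)) / len(y_test)
dummy_f1 = f1_for_positive(y_test, dummy_pred)

print(f"threshold={threshold}, accuracy={dummy_accuracy:.2f}, F1={dummy_f1:.2f}")
```

Reusing the training threshold on the test set keeps test information out of the label definition, which matches the "TRAINING views" wording in the target description.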
+### Error Cost Analysis
 ```
+FN (missed high-engagement) = most costly error
+  → Company fails to prioritize, promote, or learn from a strong posting
+FP (false alarm) = also costly
+  → Recruiter wastes time and budget on a posting that won't perform
 ```
+Decision Tree minimises FN (catches most high-engagement postings) but produces more FP.
+Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.
 ---
+## 💡 Business Insights
+1. **Salary transparency is the single highest-leverage action**: postings that disclose salary are associated with 74.3% more views, at no cost. Fewer than 30% of postings disclose salary today.
+2. **Description structure matters**: `description_density` ranked #1 in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
+3. **Tech roles attract disproportionate engagement**: `is_software_role` and `is_data_role` carry real signal beyond salary.
+4. **Work type is associated with engagement**: contract roles lead on engagement, but full-time postings dominate volume (80%).
+5. **Platform factors dominate**: R²≈0.08 is expected and acceptable. The model's value lies in **ranking** postings, not in predicting exact view counts.
 ---
 ## 🎁 Bonus Work
 ### 🧠 SHAP Explainability
 ```
 SHAP mean |value| — RF Tuned regression (test observations)
+description_density      ████████████  strongest positive impact ↑
 desc_salary_interaction  ██████████░░  salary × description synergy ↑
 salary_log               ████████░░░░  salary level ↑
 has_salary_info          ██████░░░░░░  disclosed → more views ↑
 posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓
 ```
+`desc_salary_interaction` ranks #2 in SHAP but lower in Gini importance, which is consistent with it capturing a non-linear interaction that neither the description features nor the salary features express alone.
 ### 📊 Feature Importance: Regression vs Classification
 ```
                         Regression RF    Classification DT
 description_density     #1               #2
+desc_salary_interaction #2 (SHAP)        varies
 salary_log              #7+              varies
 is_entry_role           lower            rises in classification
 is_data_role            #6               varies
+──────────────────────────────────────────────────────────
+Agreement:  description quality + salary dominate both models
 Divergence: seniority/role flags matter more for threshold-crossing
             (classification) than for predicting exact counts (regression)
 ```
+### 🔬 Additional Extras
+- **Interactive K-Means Widget** — explore different k values visually (notebook cell 4.11)
 - **Hierarchical Clustering Dendrogram** — Ward linkage, 300 obs sample (cell 4.12)
 - **Agglomerative Clustering Diagnostic** — k=2–10 comparison (cell 4.13)
 - **Outlier Robustness Test** — views capped at 99th percentile: RMSE_log 0.8147 vs 0.8347 uncapped
 with open("linkedin_classification_model.pkl", "rb") as f:
     clf_model = pickle.load(f)
+# Regression — predict log(views+1), convert back to raw view estimate
 log_views_pred = reg_model.predict(X_test_fe)
 views_pred = np.expm1(log_views_pred)
+# Classification — predict high-engagement label (0 = Normal, 1 = High)
 label = clf_model.predict(X_clf)
 ```
+> The regression model expects a **30-column** `X_test_fe` (including cluster dummies).
+> The classification model expects a **24-column** `X_clf` (see notebook cell 207).
+> Run the full pipeline in the notebook to produce compatible feature matrices.
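A small guard before calling `predict` can catch a mismatched feature matrix early. The sketch below is illustrative: `check_column_count` and `log_views_to_views` are hypothetical helpers, not functions from the notebook, and the matrices and predictions are placeholders. It also shows the back-transform from `log_views` to raw views using the standard library.

```python
import math

EXPECTED_REG_COLUMNS = 30  # regression matrix width stated in this README
EXPECTED_CLF_COLUMNS = 24  # classification matrix width stated in this README


def check_column_count(rows, expected, name):
    """Hypothetical helper: fail early if any row has the wrong width."""
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"{name}: row {i} has {len(row)} columns, expected {expected}"
            )


def log_views_to_views(log_views_pred):
    """Invert log1p: views = exp(log_views) - 1."""
    return [math.expm1(v) for v in log_views_pred]


# Placeholder matrix with the expected width (all zeros, for illustration).
fake_x_test_fe = [[0.0] * EXPECTED_REG_COLUMNS for _ in range(3)]
check_column_count(fake_x_test_fe, EXPECTED_REG_COLUMNS, "X_test_fe")

# Placeholder predictions in log space, then converted back to raw views.
fake_log_preds = [math.log1p(10), math.log1p(250)]
views = log_views_to_views(fake_log_preds)
print([round(v) for v in views])
```

Because the model is fit in log space, the back-transformed value is better read as a typical view count for ranking purposes than as an exact forecast.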
 ---
 *Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*