---
tags:
- regression
- classification
- clustering
- tabular
- linkedin
- job-postings
- sklearn
- random-forest
- decision-tree
- kmeans
- shap
license: mit
---

# 📊 LinkedIn Job Posting Engagement Analysis

> **Which LinkedIn job posting characteristics predict candidate engagement (views) – and how well can engagement be predicted or classified using only posting-level features?**

**Personal motivation:** As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions.

---

## 📹 Presentation Video

---

## 🚀 Interactive Dashboard

👉 **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)**

| Tab | Description |
|---|---|
| 🎯 Engagement Predictor | Enter posting details → get predicted views + High/Normal classification in real time |
| 📊 EDA Dashboard | All 5 EDA findings as interactive charts |
| ℹ️ About | Feature groups, model details, limitations |

---

## 📋 Dataset at a Glance

| Property | Value |
|---|---|
| **Source** | [LinkedIn Job Postings – arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) |
| **Original size** | 123,850 rows × 49 columns |
| **Working sample** | 30,000 rows · `random_state=42` |
| **After join with companies** | 30,000 rows × 40 columns |
| **After cleaning** | 29,572 rows × 51 columns (in `df_model`) |
| **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) |
| **Regression target** | `log_views = log1p(views)` – log-transformed to handle right skew |
| **Classification target** | `high_engagement` – top 25% of training views (threshold derived from training set only) |

---

## ⚠️ Scope & Limitations

> LinkedIn's algorithm, sponsored status, and company follower counts drive the **majority of view variance** and are **unobservable** in this dataset. Models use posting-level features only. The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships.

---

## 🗂️ Repository Files

| File | Description |
|---|---|
| `notebook.ipynb` | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus |
| `linkedin_regression_model.pkl` | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) |
| `linkedin_classification_model.pkl` | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") |
| `regression_model_results.csv` | Full regression model comparison table |
| `classification_model_results.csv` | Full classification model comparison table |

---

## 🧹 Data Cleaning Pipeline

**7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:**

```
Step 1 – Reproducible sampling
  123,850 rows → sample(n=30,000, random_state=42)
  Joined with companies.csv on company_id (left join, rows preserved)
  Result: 30,000 rows × 40 columns

Step 2 – Duplicate & missing target removal
  Removed duplicate rows
  Dropped rows where views is NaN or negative
  Result: 29,572 usable rows

Step 3 – Date parsing
  listed_time, original_listed_time, expiry, closed_time → parsed to datetime
  Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend

Step 4 – Missing value analysis & column dropping
  Threshold: >70% missing → drop
  Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
           remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)

Step 5 – Leakage columns excluded
  expiry, applies → removed (post-publication outcomes)
  views → kept as target only, never as a feature

Step 6 – Salary imputation strategy
  has_salary_info = 1 if salary present, else 0
  salary_midpoint computed from min/max salary where available
  Missing salary → imputed inside sklearn Pipeline on training data only

Step 7 – Log transformation of target
  Raw views: mean=14.9, std=98.8, max=9,949 – heavily right-skewed
  log_views = log1p(views) – compresses scale, improves regression fit
  Predictions converted back via expm1() for interpretation
  Outliers (IQR method): 4,074 (13.8%) – kept, not removed
```

---

## 🔍 EDA – 5 Research Questions

> **Note on notebook ordering:** Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.

---

### 💰 Q2 – Salary Transparency vs Views

```
No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views  (70.1% of postings)
Has salary info  ████████████████████████░  ~21 avg views  (29.9% of postings)
                                            +74.3% lift ✓
```

> Only **8,562 of 29,572 postings (29.9%)** disclose salary. Transparent postings attract **74.3% more views** on average. This is the highest-leverage, lowest-cost recruiter action available.

---

### 📝 Q3 – Description Length vs Views

```
< 100 words     ██████░░░░░░░░░░░░░░   ~8 avg views  – signals incomplete posting
100–250 words   █████████░░░░░░░░░░░  ~13 avg views
250–500 words   ████████████████████  ~24 avg views  PEAK ★ – sweet spot
500–750 words   ████████████████░░░░  ~18 avg views
> 1000 words    ███████░░░░░░░░░░░░░  ~10 avg views  – overwhelms candidates
```

> Non-linear relationship confirmed. Sweet spot: **250–500 words**. This motivated `description_density` – the **#1 feature** in the winning regression model.
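The word-count buckets above can be reproduced with a simple binning step. The following is a minimal sketch on made-up data, not the notebook's actual code: the helper name `views_by_length_bucket`, the bin edges, and the assumption that the frame has `description_word_count` and `views` columns are all illustrative.

```python
import numpy as np
import pandas as pd


def views_by_length_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Mean/median views and row count per description-length bucket."""
    # Assumed bin edges, chosen to mirror the buckets in the chart above.
    edges = [0, 100, 250, 500, 750, 1000, np.inf]
    labels = ["<100", "100-250", "250-500", "500-750", "750-1000", ">1000"]
    # right=False makes each bucket left-closed: [0, 100), [100, 250), ...
    buckets = pd.cut(
        df["description_word_count"], bins=edges, labels=labels, right=False
    )
    # observed=False keeps empty buckets in the output (count 0, mean NaN).
    return df.groupby(buckets, observed=False)["views"].agg(
        ["mean", "median", "count"]
    )


# Tiny synthetic example (made-up numbers, for illustration only).
demo = pd.DataFrame(
    {
        "description_word_count": [50, 120, 300, 450, 600, 1200],
        "views": [8, 13, 20, 28, 18, 10],
    }
)
summary = views_by_length_bucket(demo)
print(summary)
```

Reporting the median and the count next to the mean helps show whether a bucket's average is driven by a handful of high-view postings.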
---

### 📅 Q4 – Day of Week vs Views

```
Monday     ████████████████████  39 avg views  ★ best day (n=1,837)
Tuesday    █████████████████░░░  25 avg views
Wednesday  ████████████████░░░░  22 avg views
Thursday   ███████████████░░░░░  18 avg views
Friday     ███░░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
Saturday   ████████████░░░░░░░░  28 avg views  (weekend – n=2,116 total, noisier)
Sunday     ████████████░░░░░░░░  28 avg views  (weekend – noisier)
```

> **Counterintuitive finding:** Weekend postings show higher averages (~28), but the weekend sample is small (2,116 obs total), making these estimates unreliable. Monday is the best weekday at 39 avg views, though its sample is also small (n=1,837). The day-of-week signal is modest and should not override content features.

---

### 💼 Q1 – Work Type vs Views

```
Contract    ████████████████████  29.97 avg views   median: 7.0
Internship  █████████████████░░░  25.71 avg views   median: 5.0
Full-time   ████████░░░░░░░░░░░░  13.70 avg views   median: 4.0  ← 80% of volume
Other       ███████░░░░░░░░░░░░░  11.27 avg views   median: 4.0
Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views   median: 4.0
```

> Contract and Internship roles show the highest engagement. However, **Full-time dominates volume** (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.
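The large gap between mean and median in the work-type chart is what right-skewed view counts look like: a few high-view postings lift the mean while the median barely moves. The sketch below illustrates this on made-up numbers and also shows the `log1p` / `expm1` round trip used for the regression target. The column name `work_type` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one viral Contract posting (80 views) skews its group mean.
demo = pd.DataFrame(
    {
        "work_type": ["Contract"] * 3 + ["Full-time"] * 3,
        "views": [3, 7, 80, 2, 4, 6],
    }
)
demo["log_views"] = np.log1p(demo["views"])  # same transform as the target

by_type = demo.groupby("work_type").agg(
    mean_views=("views", "mean"),
    median_views=("views", "median"),
    mean_log_views=("log_views", "mean"),
    n=("views", "count"),
)
# Back-transforming the mean of log1p(views) gives a skew-resistant
# "typical" value that sits much closer to the median than the raw mean.
by_type["typical_views"] = np.expm1(by_type["mean_log_views"])
print(by_type.round(2))
```

For the skewed Contract group the back-transformed value lands well below the raw mean, which is one reason the models are fit on `log_views` rather than raw views.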
---

### 🎓 Q5 – Seniority Level vs Views

```
Entry-level   ████████████████████  18 avg views  n=792
Senior-level  ████████████░░░░░░░░  16 avg views  n=3,577
Other/Mid     ██████████░░░░░░░░░░  15 avg views  n=25,203

Entry vs Senior: +12.4% more views
Entry vs Other:  +18.9% more views
```

> Supply-side effect – more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for **candidate pool size**, not intrinsic desirability.

---

### 🔥 Feature Correlation with log(views+1)

```
Feature                    Corr    Direction   Note
─────────────────────────────────────────────────────────────────────
desc_salary_interaction    +0.18   ↑ views     strongest single predictor
has_salary_info            +0.14   ↑ views     salary transparency
salary_log                 +0.12   ↑ views     salary level
description_density        +0.10   ↑ views     content quality
description_word_count     +0.08   ↑ views     description length
is_software_role           +0.08   ↑ views     tech role demand
is_data_role               +0.07   ↑ views     data role demand
is_entry_role              +0.06   ↑ views     larger candidate pool
posting_weekend            -0.04   ↓ views     small negative signal
is_senior_role             -0.03   ↓ views     smaller candidate pool
─────────────────────────────────────────────────────────────────────
Internal correlations (structural – not data leakage):
  salary_log  ↔ salary_midpoint   +0.96   log transform of same variable
  desc_wc     ↔ desc_density      +0.55   density uses length in formula
  is_software ↔ is_data           +0.35   often co-occur in job titles
  is_senior   ↔ is_entry          -0.28   mutually exclusive by construction
```

> Most features show **weak linear correlation** – no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations.
### 🌡️ Correlation Heatmap (feature-to-feature + target)

```
                            log   desc    has    sal   desc    is_    is_    is_   post    is_
                          views   dens    sal    log     wc   soft   data   entr   wknd    snr
──────────────────────────────────────────────────────────────────────────────────────────────
log_views              │   1.00   0.10   0.14   0.12   0.08   0.08   0.07   0.06  -0.04  -0.03
description_density    │   0.10   1.00   0.02   0.04   0.55   0.01   0.01  -0.01   0.00   0.00
has_salary_info        │   0.14   0.02   1.00   0.72   0.03   0.06   0.07  -0.03  -0.01  -0.02
salary_log             │   0.12   0.04   0.72   1.00   0.04   0.05   0.06  -0.02  -0.01  -0.01
description_word_count │   0.08   0.55   0.03   0.04   1.00   0.01   0.01  -0.01   0.00   0.00
is_software_role       │   0.08   0.01   0.06   0.05   0.01   1.00   0.35  -0.08   0.00  -0.05
is_data_role           │   0.07   0.01   0.07   0.06   0.01   0.35   1.00  -0.06   0.00  -0.04
is_entry_role          │   0.06  -0.01  -0.03  -0.02  -0.01  -0.08  -0.06   1.00   0.01  -0.28
posting_weekend        │  -0.04   0.00  -0.01  -0.01   0.00   0.00   0.00   0.01   1.00   0.00
is_senior_role         │  -0.03   0.00  -0.02  -0.01   0.00  -0.05  -0.04  -0.28   0.00   1.00
──────────────────────────────────────────────────────────────────────────────────────────────
Key structural correlations:
  salary_log  ↔ has_salary_info   +0.72   same underlying signal, different form
  desc_wc     ↔ desc_density      +0.55   density formula uses word count
  is_software ↔ is_data           +0.35   frequently co-occur in job titles
  is_entry    ↔ is_senior         -0.28   mutually exclusive flags
```

> The heatmap confirms no multicollinearity crisis – the highest inter-feature correlation (salary_log ↔ has_salary_info at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with log_views are weak, validating the move to non-linear tree-based models.
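The two readings taken from the heatmap, each feature's correlation with the target and the strongest feature-to-feature pair, can be computed directly from a Pearson correlation matrix. The sketch below uses synthetic data. The way `salary_log` is generated (zero when no salary is disclosed) is an assumption made only so the example has a structural correlation to find; it is not a description of the notebook's imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
has_salary = rng.integers(0, 2, size=n)

# Synthetic frame; values are invented for illustration.
df = pd.DataFrame(
    {
        "has_salary_info": has_salary,
        # Assumption: zero when salary is missing, so it tracks has_salary_info.
        "salary_log": has_salary * rng.normal(11.0, 0.5, size=n),
        "description_word_count": rng.integers(50, 1000, size=n),
        "log_views": 0.3 * has_salary + rng.normal(0.0, 1.0, size=n),
    }
)

corr = df.corr()  # Pearson by default

# 1) Correlation of each feature with the target, strongest first.
target_corr = corr["log_views"].drop("log_views").sort_values(ascending=False)

# 2) Strongest feature-to-feature pair: keep only the upper triangle so that
#    each pair appears once and the diagonal of 1.0 values is excluded.
features = corr.drop(index="log_views", columns="log_views")
upper = np.triu(np.ones(features.shape, dtype=bool), k=1)
pairs = features.where(upper).stack()
strongest_pair = pairs.abs().idxmax()

print(target_corr.round(2))
print(strongest_pair, round(pairs[strongest_pair], 2))
```

A high value for the second check is a prompt to confirm that the pair is structurally related, as with `salary_log` and `has_salary_info` above, rather than a reason to drop a column automatically.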
---

## ⚙️ Feature Engineering – 30 Total Features (incl. 6 Cluster Dummies)

| Group | Features |
|---|---|
| Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` |
| Text structure | `description_density` ★, `title_desc_ratio` |
| Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` |
| Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` |
| Interactions | `desc_salary_interaction` ★, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` |
| Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` |

**Missing value strategy:**
- Columns with >70% missing → dropped
- Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN imputed inside sklearn Pipeline on training data only
- Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline

---

## 🔵 Clustering – KMeans k=6

**Features used for clustering (12 total, leakage-checked):**
`title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`

**Methods used to select k:**
1. Elbow method – inconclusive, no sharp elbow
2. KMeans silhouette scores on full training matrix
3. Cluster-size stability table
4. Interactive K-Means widget (visualization aid – uses sample)
5. Hierarchical clustering dendrogram (Ward linkage, 300 obs)
6. Agglomerative clustering comparison (k=2–10)

```
Silhouette scores by k (full training matrix):

k=2   ████████░░░░░░░░░░░░  0.198   smallest cluster: 6,830 (28.9%)
k=3   █████████░░░░░░░░░░░  0.221   smallest cluster: 2,100 (8.9%)
k=4   ████████████████░░░░  0.312   ← strong BUT largest=72% of data
k=5   ██████████░░░░░░░░░░  0.250   smallest: 526 (unstable)
k=6   ████████████░░░░░░░░  0.290   ← SELECTED ★ smallest: 583 (2.5%)
k=7   ████████████░░░░░░░░  0.286   singleton cluster appeared
k=8+  singleton clusters appeared

Why NOT k=10 (highest score): singleton cluster (1 observation)
Why NOT k=4 (strong score):   largest cluster = 72% – not meaningful separation
Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
```

**Cluster profiles at k=6 (training set n=23,657):**

| Cluster | Label | Size | Share | Key Signal |
|---|---|---|---|---|
| 0 | Manager-focused | 4,571 | 19% | `is_manager_role=1.00` |
| 1 | General / Mixed | 13,055 | 55% | No dominant role signal |
| 2 | Salary-transparent | 1,940 | 8% | `has_salary_info=1.00` |
| 3 | Data roles | 1,451 | 6% | `is_data_role=1.00` |
| 4 | Software roles | 2,057 | 9% | `is_software_role=1.00` |
| 5 | Entry / low salary | 583 | 2% | Smallest cluster |

**Official final silhouette score: 0.290** (full training matrix)

Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
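The k-selection logic above weighs the silhouette score against cluster-size stability instead of taking the highest score outright. A minimal sketch of that idea follows, using synthetic blobs in place of the 12 clustering features. The thresholds `MIN_CLUSTER_SIZE` and `MAX_LARGEST_SHARE` are hypothetical values picked for this example; the notebook's actual criteria were applied by inspection and may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled clustering matrix.
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sizes = np.bincount(labels, minlength=k)
    rows.append(
        {
            "k": k,
            "silhouette": float(silhouette_score(X_scaled, labels)),
            "smallest": int(sizes.min()),
            "largest_share": float(sizes.max() / sizes.sum()),
        }
    )

# Hypothetical stability rules: no tiny clusters, no single dominant cluster.
MIN_CLUSTER_SIZE = 30
MAX_LARGEST_SHARE = 0.60
eligible = [
    r
    for r in rows
    if r["smallest"] >= MIN_CLUSTER_SIZE and r["largest_share"] <= MAX_LARGEST_SHARE
]
# Fall back to all candidates if the rules filter everything out.
best = max(eligible or rows, key=lambda r: r["silhouette"])

for r in rows:
    print(r)
print("selected k =", best["k"])
```

Filtering on cluster sizes before comparing scores matches the reasoning in the chart, where k=4 was rejected for one dominant cluster and larger k values for singletons despite competitive silhouette scores.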
---

## 📈 Regression – Predicting `log1p(views)`

### Baseline

```
Mean Baseline (predict training mean for all observations):
  RMSE_log = 0.8708
  R² = -0.0002   ← floor every model must beat
  MAE_views ≈ 10.64

Baseline Linear Regression (20 features, no clustering):
  RMSE_log = 0.8425
  R² = 0.0639
```

### Full Model Comparison (after feature engineering + clustering)

| Model | RMSE_log ↓ | R² ↑ | Notes |
|---|---|---|---|
| **Random Forest (Tuned) ★** | **0.8347** | **0.0811** | RandomizedSearchCV winner |
| Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints |
| Gradient Boosting | 0.8370 | 0.0770 | – |
| Linear Regression + Features | 0.8420 | 0.0640 | – |
| RidgeCV | 0.8420 | 0.0640 | – |
| Lasso Regression | 0.8430 | 0.0640 | – |
| PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance |
| Mean Baseline | 0.8708 | -0.0002 | Floor |

**Key lessons:**
- Unrestricted RF → train R²=0.854, test R²=0.003 (massive overfit). Fixed by `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features` constraints.
- 3-fold CV mean RMSE_log: 0.8747 (±0.0125) – stable across folds
- Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812

### Top Feature Importances (RF Tuned)

```
description_density            ████████████  #1 – content quality proxy
description_length             ██████████░░  #2 – raw description size
description_word_count         ████████░░░░  #3 – word count
title-description interaction  ████████░░░░  #4 – combined text signal
is_software_role               ██████░░░░░░  #5 – tech role demand
is_data_role                   █████░░░░░░░  #6 – data role demand
salary_log / has_salary_info   ████░░░░░░░░  #7+ – salary signals
```

> `desc_salary_interaction` ranks #2 in SHAP analysis but further down in Gini importance – both agree on description quality and salary as top drivers.

### Why R² = 0.081 Is Acceptable

```
R² = 0.081 → model explains ~8% of variance in log(views+1)

✓ Beats mean baseline (R²≈0) – real posting-level signal captured
✓ Social engagement inherently noisy – platform factors dominate
✓ ~92% of variance remains unexplained, attributed to factors absent
  from the dataset (algorithm, followers, ads)
✓ Practical use = ranking postings, not forecasting exact counts
```

---

## 🟠 Classification – High Engagement vs. Normal

```
Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
Feature matrix: X_clf uses 24 features (see notebook cell 207)
Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
```

### Model Comparison

| Model | F1 (Class 1) | Recall (Class 1) | Notes |
|---|---|---|---|
| **Decision Tree ★** | **HIGHEST** | **HIGHEST** | max_depth=8, class_weight="balanced" |
| Logistic Regression | near-best | high | Close to DT in F1 |
| Random Forest | moderate | lower | Lowest FP count |
| Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 |

**5-fold CV F1: 0.4424 ± 0.0152** – stable, no lucky split

### Error Cost Analysis

```
FN (missed high-engagement) = most costly error
  → Company fails to prioritize, promote, or learn from a strong posting

FP (false alarm) = also costly
  → Recruiter wastes time and budget on a posting that won't perform
```

Decision Tree minimises FN (catches most high-engagement postings) but produces more FP. Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.

---

## 💡 Business Insights

1. **Salary transparency is the single highest-leverage action** – 74.3% more views for free. Fewer than 30% of postings disclose salary today.
2. **Description structure matters** – `description_density` ranked #1 in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
3. **Tech roles attract disproportionate engagement** – `is_software_role` and `is_data_role` carry real signal beyond salary.
4. **Work type is associated with engagement** – contract roles lead, but full-time dominates volume (80%).
5. **Platform factors dominate** – R²≈0.08 is expected and acceptable. Model value is in **ranking** postings, not exact prediction.
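The classification setup described earlier (a 75th-percentile threshold derived from training views only, a depth-limited Decision Tree with balanced class weights, and F1 on Class 1 compared against a dummy baseline) can be sketched as follows. The data here is synthetic and the feature matrix is random noise plus one informative column, so the printed scores say nothing about the real models. Only `max_depth=8` and `class_weight="balanced"` are taken from this README.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5))
# Synthetic right-skewed "views", loosely driven by the first feature.
raw = np.expm1(1.5 + 0.8 * X[:, 0] + rng.normal(0.0, 1.0, size=n))
views = np.round(raw).clip(min=0)

X_train, X_test, v_train, v_test = train_test_split(
    X, views, test_size=0.2, random_state=42
)

# The threshold comes from the TRAINING views only, then is applied to both
# splits, so no information from the test set leaks into the label definition.
threshold = np.quantile(v_train, 0.75)
y_train = (v_train >= threshold).astype(int)
y_test = (v_test >= threshold).astype(int)

tree = DecisionTreeClassifier(
    max_depth=8, class_weight="balanced", random_state=42
).fit(X_train, y_train)
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

f1_tree = f1_score(y_test, tree.predict(X_test))
# The dummy never predicts Class 1, so its F1 is defined as 0 here.
f1_dummy = f1_score(y_test, dummy.predict(X_test), zero_division=0)

print(f"threshold={threshold:.0f}  F1 tree={f1_tree:.3f}  F1 dummy={f1_dummy:.3f}")
```

Computing the threshold on the full dataset instead would let test-set views influence the training labels, which is the leakage the "training set only" rule is meant to prevent.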
---

## 🎁 Bonus Work

### 🧠 SHAP Explainability

```
SHAP mean |value| – RF Tuned regression (test observations)

description_density      ████████████  strongest positive impact ↑
desc_salary_interaction  ██████████░░  salary × description synergy ↑
salary_log               ████████░░░░  salary level ↑
has_salary_info          ██████░░░░░░  disclosed → more views ↑
posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓
```

`desc_salary_interaction` ranks #2 in SHAP but lower in Gini, which suggests it captures a genuine non-linear interaction that neither feature captures alone.

### 📊 Feature Importance: Regression vs Classification

```
                           Regression RF   Classification DT
description_density        #1              #2
desc_salary_interaction    #2 (SHAP)       varies
salary_log                 #7+             varies
is_entry_role              lower           rises in classification
is_data_role               #6              varies
──────────────────────────────────────────────────────────
Agreement:  description quality + salary dominate both models
Divergence: seniority/role flags matter more for threshold-crossing
            (classification) than for predicting exact counts (regression)
```

### 🔬 Additional Extras

- **Interactive K-Means Widget** – explore different k values visually (notebook cell 4.11)
- **Hierarchical Clustering Dendrogram** – Ward linkage, 300 obs sample (cell 4.12)
- **Agglomerative Clustering Diagnostic** – k=2–10 comparison (cell 4.13)
- **Outlier Robustness Test** – views capped at 99th percentile: RMSE_log 0.8147 vs 0.8347 uncapped
- **3-fold CV for regression** – mean RMSE_log 0.8747 ± 0.0125

---

## 🛠️ How to Use the Models

```python
import pickle

import numpy as np

with open("linkedin_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)

with open("linkedin_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_test_fe and X_clf are the feature matrices produced by the notebook pipeline.

# Regression – predict log(views+1), convert back to raw view estimate
log_views_pred = reg_model.predict(X_test_fe)
views_pred = np.expm1(log_views_pred)

# Classification – predict high-engagement label (0 = Normal, 1 = High)
label = clf_model.predict(X_clf)
```

> Regression model expects **30-column** `X_test_fe` (including cluster dummies).
> Classification model expects **24-column** `X_clf` (see notebook cell 207).
> Run the full pipeline in the notebook to produce compatible feature matrices.

---

*Assignment 2 – Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)*