| tags: | |
| - regression | |
| - classification | |
| - clustering | |
| - tabular | |
| - job-postings | |
| - sklearn | |
| - random-forest | |
| - decision-tree | |
| - kmeans | |
| - shap | |
| license: mit | |
| # LinkedIn Job Posting Engagement Analysis | |
| > **Which LinkedIn job posting characteristics predict candidate engagement (views), and how well can engagement be predicted or classified using only posting-level features?** | |
| **Personal motivation:** As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions. | |
| --- | |
| ## Presentation Video | |
| [Watch the presentation on Loom](https://www.loom.com/share/c7d9b89a54234f699204b16a9a313c7d) | |
| --- | |
| ## Interactive Dashboard | |
| **[Open the LinkedIn Job Engagement Dashboard](https://huggingface.co/spaces/MichaelYitzchak/linkedin_Job_Engagement)** | |
| | Tab | Description | | |
| |---|---| | |
| | Engagement Predictor | Enter posting details → get predicted views + High/Normal classification in real time | | |
| | EDA Dashboard | All 5 EDA findings as interactive charts | | |
| | About | Feature groups, model details, limitations | | |
| --- | |
| ## Dataset at a Glance | |
| | Property | Value | | |
| |---|---| | |
| | **Source** | [LinkedIn Job Postings: arshkon/linkedin-job-postings (Kaggle)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) | | |
| | **Original size** | 123,850 rows × 49 columns | | |
| | **Working sample** | 30,000 rows · `random_state=42` | | |
| | **After join with companies** | 30,000 rows × 40 columns | | |
| | **After cleaning** | 29,572 rows × 51 columns (in `df_model`) | | |
| | **Train / Test split** | 23,657 / 5,915 (80/20, `random_state=42`) | | |
| | **Regression target** | `log_views = log1p(views)`, log-transformed to handle right skew | | |
| | **Classification target** | `high_engagement`: top 25% of training views (threshold derived from training set only) | | |
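The two targets in the table can be sketched as follows. This is a minimal illustration on synthetic view counts, not the notebook's code; the DataFrame and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed stand-in for the cleaned postings table (not the real data).
rng = np.random.default_rng(42)
df = pd.DataFrame({"views": rng.poisson(lam=5, size=1000) * rng.integers(1, 10, size=1000)})

# Regression target: log1p compresses the right-skewed view counts.
df["log_views"] = np.log1p(df["views"])

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Classification target: the threshold is computed on the training split only,
# so no information from the test rows enters the label definition.
threshold = train["views"].quantile(0.75)
train["high_engagement"] = (train["views"] >= threshold).astype(int)
test["high_engagement"] = (test["views"] >= threshold).astype(int)
```

Because the cut-off is applied with `>=`, ties at the threshold can push the positive share in the training split slightly above 25%.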
| --- | |
| ## Scope & Limitations | |
| > LinkedIn's algorithm, sponsored status, and company follower counts drive the **majority of view variance** and are **unobservable** in this dataset. Models use posting-level features only. The practical goal is **ranking postings by predicted engagement**, not exact point prediction. Results show associations, not causal relationships. | |
| --- | |
| ## Repository Files | |
| | File | Description | | |
| |---|---| | |
| | `notebook.ipynb` | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus | | |
| | `linkedin_regression_model.pkl` | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) | | |
| | `linkedin_classification_model.pkl` | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") | | |
| | `regression_model_results.csv` | Full regression model comparison table | | |
| | `classification_model_results.csv` | Full classification model comparison table | | |
| --- | |
| ## Data Cleaning Pipeline | |
| **7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:** | |
| ``` | |
| Step 1: Reproducible sampling | |
|   123,850 rows → sample(n=30,000, random_state=42) | |
|   Joined with companies.csv on company_id (left join, rows preserved) | |
|   Result: 30,000 rows × 40 columns | |
| Step 2: Duplicate & missing target removal | |
|   Removed duplicate rows | |
|   Dropped rows where views is NaN or negative | |
|   Result: 29,572 usable rows | |
| Step 3: Date parsing | |
|   listed_time, original_listed_time, expiry, closed_time → parsed to datetime | |
|   Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend | |
| Step 4: Missing value analysis & column dropping | |
|   Threshold: >70% missing → drop | |
|   Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%), | |
|   remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%) | |
| Step 5: Leakage columns excluded | |
|   expiry, applies → removed (post-publication outcomes) | |
|   views → kept as target only, never as a feature | |
| Step 6: Salary imputation strategy | |
|   has_salary_info = 1 if salary present, else 0 | |
|   salary_midpoint computed from min/max salary where available | |
|   Missing salary → imputed inside sklearn Pipeline on training data only | |
| Step 7: Log transformation of target | |
|   Raw views: mean=14.9, std=98.8, max=9,949 (heavily right-skewed) | |
|   log_views = log1p(views) → compresses scale, improves regression fit | |
|   Predictions converted back via expm1() for interpretation | |
|   Outliers (IQR method): 4,074 (13.8%); kept, not removed | |
| ``` | |
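A condensed sketch of steps 1, 2, 4 and 7 on toy data may help make the order of operations concrete. The tiny frames below are invented stand-ins for the Kaggle CSVs, and the sample size is scaled down; only the 70% threshold, the left join on `company_id`, and the `log1p` target come from the pipeline description.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-ins for the postings and companies tables (hypothetical data).
postings = pd.DataFrame({
    "job_id": range(10),
    "company_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "views": [3, 5, np.nan, 7, 2, 400, 1, 9, 4, 6],
    "closed_time": [np.nan] * 9 + [1.0],  # 90% missing, so it should be dropped
    "min_salary": [50, np.nan, 60, np.nan, 70, 55, np.nan, 65, 80, np.nan],
})
companies = pd.DataFrame({"company_id": [1, 2, 3], "company_size": [5, 3, 7]})

# Step 1: reproducible sample, then a left join that preserves every posting row.
sample = postings.sample(n=8, random_state=42)
df = sample.merge(companies, on="company_id", how="left")

# Step 2: drop duplicates and rows whose target is missing or negative.
df = df.drop_duplicates()
df = df[df["views"].notna() & (df["views"] >= 0)]

# Step 4: drop columns with more than 70% missing values.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.70].index)

# Step 7: log-transform the target; expm1 inverts it for interpretation.
df["log_views"] = np.log1p(df["views"])
```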
| --- | |
| ## EDA: 5 Research Questions | |
| > **Note on notebook ordering:** Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact. | |
| --- | |
| ### Q2: Salary Transparency vs Views | |
| ``` | |
| No salary info    ~12 avg views   (70.1% of postings) | |
| Has salary info   ~21 avg views   (29.9% of postings) | |
|                   +74.3% lift | |
| ``` | |
| > Only **8,562 of 29,572 postings (29.9%)** disclose salary. Transparent postings attract **74.3% more views** on average. This is the highest-leverage, lowest-cost recruiter action available. | |
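The lift is the relative difference between the two group means. A small sketch with made-up numbers chosen to average 12 and 21 views (the column names match the card, the rows do not come from the dataset):

```python
import pandas as pd

# Hypothetical toy rows; the real analysis uses the cleaned postings table.
df = pd.DataFrame({
    "has_salary_info": [0, 0, 0, 0, 1, 1],
    "views": [10, 14, 12, 12, 20, 22],
})

mean_views = df.groupby("has_salary_info")["views"].mean()
lift_pct = (mean_views[1] / mean_views[0] - 1) * 100
print(f"lift: {lift_pct:.1f}%")
```

With the rounded averages of 12 and 21 this gives 75.0%; the 74.3% quoted above presumably comes from the unrounded group means.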
| --- | |
| ### Q3: Description Length vs Views | |
| ``` | |
| < 100 words      ~8 avg views    (signals incomplete posting) | |
| 100–250 words    ~13 avg views | |
| 250–500 words    ~24 avg views   PEAK (sweet spot) | |
| 500–750 words    ~18 avg views | |
| > 1000 words     ~10 avg views   (overwhelms candidates) | |
| ``` | |
| > Non-linear relationship confirmed. Sweet spot: **250–500 words**. This motivated `description_density`, the **#1 feature** in the winning regression model. | |
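One way to produce a binned comparison like this is `pd.cut` followed by a group mean. The sketch below uses invented rows, and the 750-1000 bin edge is an assumption, since the chart above does not list that bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical postings; only the column names follow the card.
df = pd.DataFrame({
    "description_word_count": [40, 180, 300, 450, 600, 1200],
    "views": [8, 13, 25, 23, 18, 10],
})

# Left-closed bins: [0, 100), [100, 250), ... The 750-1000 edge is assumed.
bins = [0, 100, 250, 500, 750, 1000, np.inf]
labels = ["<100", "100-250", "250-500", "500-750", "750-1000", ">1000"]
df["length_bin"] = pd.cut(
    df["description_word_count"], bins=bins, labels=labels, right=False
)

# observed=True keeps only the bins that actually contain rows.
avg_views = df.groupby("length_bin", observed=True)["views"].mean()
```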
| --- | |
| ### Q4: Day of Week vs Views | |
| ``` | |
| Monday      39 avg views   best day (n=1,837) | |
| Tuesday     25 avg views | |
| Wednesday   22 avg views | |
| Thursday    18 avg views | |
| Friday       7 avg views   worst day (n=10,076) | |
| Saturday    28 avg views   (weekend, n=2,116 total, noisier) | |
| Sunday      28 avg views   (weekend, noisier) | |
| ``` | |
| > **Counterintuitive finding:** Weekend postings show higher averages (~28), but the weekend sample is tiny (2,116 obs total) making these estimates unreliable. Monday is the clear best weekday at 39 avg views. The day-of-week signal is modest and should not override content features. | |
| --- | |
| ### Q1: Work Type vs Views | |
| ``` | |
| Contract     29.97 avg views   median: 7.0 | |
| Internship   25.71 avg views   median: 5.0 | |
| Full-time    13.70 avg views   median: 4.0   (80% of volume) | |
| Other        11.27 avg views   median: 4.0 | |
| Part-time     9.59 avg views   median: 4.0 | |
| ``` | |
| > Contract and Internship roles show the highest engagement. However, **Full-time dominates volume** (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal. | |
| --- | |
| ### Q5: Seniority Level vs Views | |
| ``` | |
| Entry-level    18 avg views   n=792 | |
| Senior-level   16 avg views   n=3,577 | |
| Other/Mid      15 avg views   n=25,203 | |
| Entry vs Senior: +12.4% more views | |
| Entry vs Other:  +18.9% more views | |
| ``` | |
| > Supply-side effect: more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for **candidate pool size**, not intrinsic desirability. | |
| --- | |
| ### Feature Correlation with log(views+1) | |
| ``` | |
| Feature                   Corr    Direction   Note | |
| ───────────────────────────────────────────────────────────────────── | |
| desc_salary_interaction   +0.18   ↑ views     strongest single predictor | |
| has_salary_info           +0.14   ↑ views     salary transparency | |
| salary_log                +0.12   ↑ views     salary level | |
| description_density       +0.10   ↑ views     content quality | |
| description_word_count    +0.08   ↑ views     description length | |
| is_software_role          +0.08   ↑ views     tech role demand | |
| is_data_role              +0.07   ↑ views     data role demand | |
| is_entry_role             +0.06   ↑ views     larger candidate pool | |
| posting_weekend           -0.04   ↓ views     small negative signal | |
| is_senior_role            -0.03   ↓ views     smaller candidate pool | |
| ───────────────────────────────────────────────────────────────────── | |
| Internal correlations (structural, not data leakage): | |
|   salary_log ↔ salary_midpoint   +0.96   log transform of same variable | |
|   desc_wc ↔ desc_density         +0.55   density uses length in formula | |
|   is_software ↔ is_data          +0.35   often co-occur in job titles | |
|   is_senior ↔ is_entry           -0.28   mutually exclusive by construction | |
| ``` | |
| > Most features show **weak linear correlation**; no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations. | |
| ### Correlation Heatmap (feature-to-feature + target) | |
| ``` | |
|                             log   desc    has    sal   desc    is_    is_    is_   post    is_ | |
|                           views   dens    sal    log     wc   soft   data   entr   wknd    snr | |
| ───────────────────────────────────────────────────────────────────────────────────────────── | |
| log_views              │   1.00   0.10   0.14   0.12   0.08   0.08   0.07   0.06  -0.04  -0.03 | |
| description_density    │   0.10   1.00   0.02   0.04   0.55   0.01   0.01  -0.01   0.00   0.00 | |
| has_salary_info        │   0.14   0.02   1.00   0.72   0.03   0.06   0.07  -0.03  -0.01  -0.02 | |
| salary_log             │   0.12   0.04   0.72   1.00   0.04   0.05   0.06  -0.02  -0.01  -0.01 | |
| description_word_count │   0.08   0.55   0.03   0.04   1.00   0.01   0.01  -0.01   0.00   0.00 | |
| is_software_role       │   0.08   0.01   0.06   0.05   0.01   1.00   0.35  -0.08   0.00  -0.05 | |
| is_data_role           │   0.07   0.01   0.07   0.06   0.01   0.35   1.00  -0.06   0.00  -0.04 | |
| is_entry_role          │   0.06  -0.01  -0.03  -0.02  -0.01  -0.08  -0.06   1.00   0.01  -0.28 | |
| posting_weekend        │  -0.04   0.00  -0.01  -0.01   0.00   0.00   0.00   0.01   1.00   0.00 | |
| is_senior_role         │  -0.03   0.00  -0.02  -0.01   0.00  -0.05  -0.04  -0.28   0.00   1.00 | |
| ───────────────────────────────────────────────────────────────────────────────────────────── | |
| Key structural correlations: | |
|   salary_log ↔ has_salary_info   +0.72   same underlying signal, different form | |
|   desc_wc ↔ desc_density         +0.55   density formula uses word count | |
|   is_software ↔ is_data          +0.35   frequently co-occur in job titles | |
|   is_entry ↔ is_senior           -0.28   mutually exclusive flags | |
| ``` | |
| > The heatmap confirms no multicollinearity crisis: the highest inter-feature correlation (salary_log ↔ has_salary_info at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with log_views are weak, validating the move to non-linear tree-based models. | |
| --- | |
| ## Feature Engineering: 30 Total Features (Including 6 Cluster Dummies) | |
| | Group | Features | | |
| |---|---| | |
| | Text length | `title_length`, `title_word_count`, `description_length`, `description_word_count` | | |
| | Text structure | `description_density` ★, `title_desc_ratio` | | |
| | Salary | `salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log` | | |
| | Role keywords | `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text` | | |
| | Interactions | `desc_salary_interaction` ★, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction` | | |
| | Clustering | `cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5` | | |
| **Missing value strategy:** | |
| - Columns with >70% missing → dropped | |
| - Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN imputed inside sklearn Pipeline on training data only | |
| - Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline | |
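Putting the imputer inside a `Pipeline` means the medians are learned when the pipeline is fitted on training rows and then reused unchanged at prediction time. A minimal sketch with toy arrays follows; the `LinearRegression` step is only a placeholder estimator, not the card's model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy training data with one missing value in the second column.
X_train = np.array([[1.0, np.nan], [2.0, 10.0], [3.0, 30.0], [4.0, 20.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[5.0, np.nan]])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians fitted on X_train only
    ("model", LinearRegression()),                 # placeholder estimator
])
pipe.fit(X_train, y_train)

# The medians stored here are the ones applied to X_test at predict time.
learned_medians = pipe.named_steps["impute"].statistics_
pred = pipe.predict(X_test)
```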
| --- | |
| ## Clustering: KMeans k=6 | |
| **Features used for clustering (12 total, leakage-checked):** | |
| `title_word_count`, `description_word_count`, `salary_log`, `description_density`, `has_salary_info`, `is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role` | |
| **Methods used to select k:** | |
| 1. Elbow method: inconclusive, no sharp elbow | |
| 2. KMeans silhouette scores on full training matrix | |
| 3. Cluster-size stability table | |
| 4. Interactive K-Means widget (visualization aid, uses a sample) | |
| 5. Hierarchical clustering dendrogram (Ward linkage, 300 obs) | |
| 6. Agglomerative clustering comparison (k=2–10) | |
| ``` | |
| Silhouette scores by k (full training matrix): | |
|   k=2    0.198   smallest cluster: 6,830 (28.9%) | |
|   k=3    0.221   smallest cluster: 2,100 (8.9%) | |
|   k=4    0.312   strong, BUT largest = 72% of data | |
|   k=5    0.250   smallest: 526 (unstable) | |
|   k=6    0.290   SELECTED; smallest: 583 (2.5%) | |
|   k=7    0.286   singleton cluster appeared | |
|   k=8+           singleton clusters appeared | |
| Why NOT k=10 (highest score): singleton cluster (1 observation) | |
| Why NOT k=4 (strong score): largest cluster = 72%, not meaningful separation | |
| Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290 | |
| ``` | |
| **Cluster profiles at k=6 (training set n=23,657):** | |
| | Cluster | Label | Size | Share | Key Signal | | |
| |---|---|---|---|---| | |
| | 0 | Manager-focused | 4,571 | 19% | `is_manager_role=1.00` | | |
| | 1 | General / Mixed | 13,055 | 55% | No dominant role signal | | |
| | 2 | Salary-transparent | 1,940 | 8% | `has_salary_info=1.00` | | |
| | 3 | Data roles | 1,451 | 6% | `is_data_role=1.00` | | |
| | 4 | Software roles | 2,057 | 9% | `is_software_role=1.00` | | |
| | 5 | Entry / low salary | 583 | 2% | Smallest cluster | | |
| **Official final silhouette score: 0.290** (full training matrix) | |
| Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them. | |
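The selection logic (silhouette score combined with a cluster-size check, then one-hot encoding of the labels) can be sketched as below. The data is synthetic, and the 2% and 70% size cut-offs are illustrative assumptions rather than the notebook's exact rule.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled clustering features.
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=42)
X = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sizes = np.bincount(labels, minlength=k)
    rows.append({
        "k": k,
        "silhouette": silhouette_score(X, labels),
        "smallest_share": sizes.min() / len(X),
        "largest_share": sizes.max() / len(X),
    })
report = pd.DataFrame(rows)

# Keep candidates whose clusters are neither tiny nor dominant (assumed cut-offs),
# then take the best silhouette among them.
ok = report[(report["smallest_share"] >= 0.02) & (report["largest_share"] <= 0.70)]
best_k = int(ok.sort_values("silhouette", ascending=False).iloc[0]["k"])

# One-hot encode the final labels as cluster_0 ... cluster_{k-1} dummy features.
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
cluster_dummies = pd.get_dummies(pd.Series(final.labels_), prefix="cluster")
```

In a train/test setting the final `KMeans` would be fitted on the training matrix and applied to the test rows with `predict`, so the dummies never depend on test data.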
| --- | |
| ## Regression: Predicting `log1p(views)` | |
| ### Baseline | |
| ``` | |
| Mean Baseline (predict training mean for all observations): | |
|   RMSE_log = 0.8708   R² = -0.0002   (floor every model must beat) | |
|   MAE_views ≈ 10.64 | |
| Baseline Linear Regression (20 features, no clustering): | |
|   RMSE_log = 0.8425   R² = 0.0639 | |
| ``` | |
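A mean baseline of this kind can be reproduced with `DummyRegressor`. The sketch below uses synthetic targets; it also shows why the baseline's test R² lands at or slightly below zero, since the training mean rarely equals the test mean exactly.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic log-scale targets (hypothetical; features are irrelevant to the baseline).
rng = np.random.default_rng(42)
y_train = np.log1p(rng.poisson(5, size=800))
y_test = np.log1p(rng.poisson(5, size=200))
X_train = np.zeros((800, 1))
X_test = np.zeros((200, 1))

# Predict the training mean for every observation.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

rmse_log = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)  # never above 0 for a constant prediction
```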
| ### Full Model Comparison (after feature engineering + clustering) | |
| | Model | RMSE_log ↓ | R² ↑ | Notes | | |
| |---|---|---|---| | |
| | **Random Forest (Tuned)** ★ | **0.8347** | **0.0811** | RandomizedSearchCV winner | | |
| | Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints | | |
| | Gradient Boosting | 0.8370 | 0.0770 | - | | |
| | Linear Regression + Features | 0.8420 | 0.0640 | - | | |
| | RidgeCV | 0.8420 | 0.0640 | - | | |
| | Lasso Regression | 0.8430 | 0.0640 | - | | |
| | PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance | | |
| | Mean Baseline | 0.8708 | -0.0002 | Floor | | |
| **Key lessons:** | |
| - Unrestricted RF: train R²=0.854, test R²=0.003 (massive overfit). Fixed by constraining `max_depth`, `min_samples_split`, `min_samples_leaf`, and `max_features`. | |
| - 3-fold CV mean RMSE_log: 0.8747 (±0.0125), stable across folds | |
| - Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812 | |
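A constrained random forest tuned with `RandomizedSearchCV` might look like the sketch below. The search space, `n_iter`, and the synthetic data are assumptions for illustration; the card does not list the values actually searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic regression problem so the search runs in seconds.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Every candidate limits tree growth, which is what curbs the overfit
# seen with an unrestricted forest (illustrative values, not the notebook's).
param_distributions = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)

# Comparing the two scores is the quick overfit check described above.
train_r2 = search.best_estimator_.score(X_train, y_train)
test_r2 = search.best_estimator_.score(X_test, y_test)
```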
| ### Top Feature Importances (RF Tuned) | |
| ``` | |
| description_density             #1    content quality proxy | |
| description_length              #2    raw description size | |
| description_word_count          #3    word count | |
| title-description interaction   #4    combined text signal | |
| is_software_role                #5    tech role demand | |
| is_data_role                    #6    data role demand | |
| salary_log / has_salary_info    #7+   salary signals | |
| ``` | |
| > `desc_salary_interaction` ranks #2 in SHAP analysis but further down in Gini importance; both agree on description quality and salary as top drivers. | |
| ### Why R² = 0.081 Is Acceptable | |
| ``` | |
| R² = 0.081: model explains ~8% of variance in log(views+1) | |
| - Beats mean baseline (R²≈0): real posting-level signal captured | |
| - Social engagement inherently noisy: platform factors dominate | |
| - 92% of variance from unobservable sources (algorithm, followers, ads) | |
| - Practical use = ranking postings, not forecasting exact counts | |
| ``` | |
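The reported RMSE and R² figures can be cross-checked against each other, because R² is one minus the ratio of model MSE to the variance of the test target. The arithmetic below uses only numbers from the comparison table; the variable names are illustrative.

```python
rmse_model = 0.8347      # Random Forest (Tuned)
rmse_baseline = 0.8708   # mean baseline
r2_baseline = -0.0002    # mean baseline R^2 on the test set

# MSE_baseline = (1 - R^2_baseline) * Var(y_test), which recovers the test variance.
var_test = rmse_baseline**2 / (1 - r2_baseline)

# R^2 = 1 - MSE_model / Var(y_test)
r2_model = 1 - rmse_model**2 / var_test

# The same improvement expressed as a relative RMSE reduction.
rmse_reduction_pct = (1 - rmse_model / rmse_baseline) * 100
print(f"R^2 = {r2_model:.4f}, RMSE reduction = {rmse_reduction_pct:.1f}%")
```

The result agrees with the reported 0.0811 to within rounding, and it shows that explaining about 8% of the variance corresponds to roughly a 4% lower RMSE than the baseline.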
| --- | |
| ## Classification: High Engagement vs. Normal | |
| ``` | |
| Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views | |
| Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1) | |
| Feature matrix: X_clf uses 24 features (see notebook cell 207) | |
| Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance) | |
| ``` | |
| ### Model Comparison | |
| | Model | F1 (Class 1) | Recall (Class 1) | Notes | | |
| |---|---|---|---| | |
| | **Decision Tree** ★ | **HIGHEST** | **HIGHEST** | max_depth=8, class_weight="balanced" | | |
| | Logistic Regression | near-best | high | Close to DT in F1 | | |
| | Random Forest | moderate | lower | Lowest FP count | | |
| | Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 | | |
| **5-fold CV F1: 0.4424 ± 0.0152** (stable, no lucky split) | |
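The comparison setup can be sketched as follows: a depth-limited, class-weighted tree, a most-frequent dummy baseline, and 5-fold cross-validated F1 for the positive class. The data is synthetic with a 75/25 class split, and the stratified split is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 75/25 imbalanced problem standing in for Normal vs High engagement.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same hyperparameters as the card's winning classifier.
tree = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=42)
tree.fit(X_train, y_train)

# Baseline that always predicts the majority class (Class 0).
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

f1_tree = f1_score(y_test, tree.predict(X_test), pos_label=1)
recall_tree = recall_score(y_test, tree.predict(X_test), pos_label=1)
f1_dummy = f1_score(y_test, dummy.predict(X_test), pos_label=1, zero_division=0)

# Fold-to-fold spread of F1 indicates whether a single split was lucky.
cv_f1 = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1")
```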
| ### Error Cost Analysis | |
| ``` | |
| FN (missed high-engagement) = most costly error | |
|   → Company fails to prioritize, promote, or learn from a strong posting | |
| FP (false alarm) = also costly | |
|   → Recruiter wastes time and budget on a posting that won't perform | |
| ``` | |
| Decision Tree minimises FN (catches most high-engagement postings) but produces more FP. | |
| Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings. | |
| --- | |
| ## Business Insights | |
| 1. **Salary transparency is the single highest-leverage action.** Postings that disclose salary average 74.3% more views, at no extra cost. Fewer than 30% of postings disclose salary today. | |
| 2. **Description structure matters.** `description_density` was the #1 feature in the regression model and #2 in the classification model. Sweet spot: 250–500 words. | |
| 3. **Tech roles attract disproportionate engagement.** `is_software_role` and `is_data_role` carry real signal beyond salary. | |
| 4. **Work type is associated with engagement.** Contract roles lead, but full-time dominates volume (80%). | |
| 5. **Platform factors dominate.** R²≈0.08 is expected and acceptable. Model value is in **ranking** postings, not exact prediction. | |
| --- | |
| ## Bonus Work | |
| ### SHAP Explainability | |
| ``` | |
| SHAP mean |value|: RF Tuned regression (test observations) | |
| description_density       strongest positive impact ↑ | |
| desc_salary_interaction   salary × description synergy ↑ | |
| salary_log                salary level ↑ | |
| has_salary_info           disclosed → more views ↑ | |
| posting_weekend           weekend → fewer views ↓ | |
| ``` | |
| `desc_salary_interaction` ranks #2 in SHAP but lower in Gini importance, which suggests it captures an interaction effect that neither feature carries alone. | |
| ### Feature Importance: Regression vs Classification | |
| ``` | |
|                           Regression RF   Classification DT | |
| description_density       #1              #2 | |
| desc_salary_interaction   #2 (SHAP)       varies | |
| salary_log                #7+             varies | |
| is_entry_role             lower           rises in classification | |
| is_data_role              #6              varies | |
| ────────────────────────────────────────────────────────── | |
| Agreement:  description quality + salary dominate both models | |
| Divergence: seniority/role flags matter more for threshold-crossing | |
|             (classification) than for predicting exact counts (regression) | |
| ``` | |
| ### Additional Extras | |
| - **Interactive K-Means Widget**: explore different k values visually (notebook cell 4.11) | |
| - **Hierarchical Clustering Dendrogram**: Ward linkage, 300 obs sample (cell 4.12) | |
| - **Agglomerative Clustering Diagnostic**: k=2–10 comparison (cell 4.13) | |
| - **Outlier Robustness Test**: with views capped at the 99th percentile, RMSE_log is 0.8147 vs 0.8347 uncapped | |
| - **3-fold CV for regression**: mean RMSE_log 0.8747 ± 0.0125 | |
| --- | |
| ## How to Use the Models | |
| ```python | |
| import pickle | |
| import numpy as np | |
| # Only unpickle files from sources you trust. | |
| with open("linkedin_regression_model.pkl", "rb") as f: | |
|     reg_model = pickle.load(f) | |
| with open("linkedin_classification_model.pkl", "rb") as f: | |
|     clf_model = pickle.load(f) | |
| # X_test_fe and X_clf are the feature matrices built by the notebook pipeline. | |
| # Regression: predict log(views+1), then convert back to a raw view estimate | |
| log_views_pred = reg_model.predict(X_test_fe) | |
| views_pred = np.expm1(log_views_pred) | |
| # Classification: predict the high-engagement label (0 = Normal, 1 = High) | |
| label = clf_model.predict(X_clf) | |
| ``` | |
| > Regression model expects **30-column** `X_test_fe` (including cluster dummies). | |
| > Classification model expects **24-column** `X_clf` (see notebook cell 207). | |
| > Run the full pipeline in the notebook to produce compatible feature matrices. | |
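Since the two models expect different column counts, a shape check before calling `predict` can catch a mismatched feature matrix early. The helper below is a hypothetical sketch (`check_features` is not part of the repository); it relies on the `n_features_in_` and `feature_names_in_` attributes that recent scikit-learn estimators set during `fit`, and it uses a tiny stand-in model instead of the real `.pkl` files.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a saved model: a hypothetical 3-feature tree round-tripped through pickle.
cols = ["f0", "f1", "f2"]
X = pd.DataFrame(np.arange(18, dtype=float).reshape(6, 3), columns=cols)
y = [0, 0, 0, 1, 1, 1]
model = pickle.loads(pickle.dumps(DecisionTreeClassifier(random_state=42).fit(X, y)))


def check_features(model, X_new):
    """Raise ValueError if X_new does not match what the model was fitted on."""
    expected = model.n_features_in_
    if X_new.shape[1] != expected:
        raise ValueError(f"expected {expected} columns, got {X_new.shape[1]}")
    names = getattr(model, "feature_names_in_", None)
    if names is not None and list(X_new.columns) != list(names):
        raise ValueError("column names or order differ from training")
    return True


check_features(model, X)  # passes: same 3 columns in the same order
```

Applied to the real models, the same check would be expected to report 30 columns for the regression model and 24 for the classifier, assuming both were fitted on DataFrames as described above.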
| --- | |
| *Assignment 2: Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)* | |