Update README.md
README.md
CHANGED
|
@@ -1,147 +1,423 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
| 5 |
-
> **Student:** Odeya Shmuel
|
| 6 |
-
> **Tools:** Python, Pandas, NumPy, Scikit-Learn, Seaborn, Matplotlib, HuggingFace
|
| 7 |
|
| 8 |
---
|
| 9 |
|
| 10 |
-
##
|
| 11 |
|
| 12 |
-
|
| 13 |
-
> [](<FILL_IN_VIDEO_LINK>)
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
-
## 1
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
-
- **Feature Engineering** – composite lifestyle scores, interactions, scaling, PCA, cluster-based features
|
| 27 |
-
- **Clustering** – discovering latent lifestyle “profiles”
|
| 28 |
-
- **Classification** – predicting **high vs. low wellbeing**
|
| 29 |
-
- **Regression** – predicting the **continuous wellbeing score**
|
| 30 |
-
- **Evaluation & Interpretation** – using appropriate metrics, comparing models, and interpreting feature importance
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
---
|
| 37 |
|
| 38 |
-
##
|
| 39 |
|
| 40 |
-
|
| 41 |
-
- **Rows:** `<FILL_IN>`
|
| 42 |
-
- **Columns:** `<FILL_IN>`
|
| 43 |
|
| 44 |
-
|
|
|
|
|
|
|
| 45 |
|
| 46 |
-
|
| 47 |
-
- `wellbeing_score` – continuous target used for regression
|
| 48 |
-
- `wellbeing_label` – derived binary/3-class label (e.g. low / medium / high) used for classification
|
| 49 |
|
| 50 |
-
|
| 51 |
-
- Screen time questions: `screen_time_1` … `screen_time_9`
|
| 52 |
-
- Physical activity: duration / intensity features (e.g. `activity_minutes`, `activity_intensity`, etc.)
|
| 53 |
-
- Work stress: `work_stress_1` … `work_stress_9`
|
| 54 |
-
- Financial stress: `financial_stress_1` … `financial_stress_9`
|
| 55 |
-
- Diet quality: `diet_quality`
|
| 56 |
-
- Sleep quality: `sleep_quality_1` … `sleep_quality_9`
|
| 57 |
|
| 58 |
-
|
| 59 |
-
Age, gender, employment, etc. – used for additional EDA but not all were kept in the final models.
|
| 60 |
|
| 61 |
---
|
| 62 |
|
| 63 |
-
## 3
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
-
|
| 68 |
-
2. **Data loading & inspection**
|
| 69 |
-
3. **Data cleaning & EDA**
|
| 70 |
-
4. **Feature engineering**
|
| 71 |
-
5. **Clustering (unsupervised)**
|
| 72 |
-
6. **Train/test split & modeling (classification + regression)**
|
| 73 |
-
7. **Evaluation & model comparison**
|
| 74 |
-
8. **Interpretation & storytelling**
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
---
|
| 79 |
|
| 80 |
-
##
|
| 81 |
|
| 82 |
-
|
| 83 |
|
| 84 |
-
|
| 85 |
|
| 86 |
-
-
|
| 87 |
-
- Checked with `df.isna().sum()` and missingness heatmaps.
|
| 88 |
-
- Strategy:
|
| 89 |
-
- Numeric features → **median imputation**
|
| 90 |
-
- Categorical features → **mode imputation**
|
| 91 |
-
- Verified there is no leakage from the target during imputation (fit only on train).
|
| 92 |
|
| 93 |
-
|
| 94 |
-
- Used `df.duplicated().sum()` and dropped exact duplicates when needed.
|
| 95 |
|
| 96 |
-
|
| 97 |
-
- Inspected distributions for key numeric variables with:
|
| 98 |
-
- Histograms / KDE plots
|
| 99 |
-
- Boxplots
|
| 100 |
-
- Used **IQR rule** (Q1 − 1.5·IQR, Q3 + 1.5·IQR) to identify extreme values in:
|
| 101 |
-
- `screen_time_*`
|
| 102 |
-
- activity features
|
| 103 |
-
- stress scores
|
| 104 |
-
- Instead of dropping many rows, I **capped / winsorized** extreme values for some features to reduce their influence without losing information.
|
| 105 |
|
| 106 |
-
|
| 107 |
|
| 108 |
-
|
| 109 |
|
| 110 |
-
-
|
| 111 |
-
- Histograms for wellbeing, lifestyle scores, and stress levels.
|
| 112 |
-
- Showed that wellbeing is **not perfectly normal**, with slight skew towards `<FILL_IN: e.g. lower wellbeing>`.
|
| 113 |
|
| 114 |
-
|
| 115 |
-
- **Correlation matrix** of numeric features (heatmap).
|
| 116 |
-
- Scatter plots of `wellbeing_score` vs:
|
| 117 |
-
- average sleep quality
|
| 118 |
-
- activity score
|
| 119 |
-
- work & financial stress
|
| 120 |
-
- Boxplots of wellbeing per demographic groups (if present).
|
| 121 |
|
| 122 |
-
|
| 123 |
-
- Higher wellbeing is associated with:
|
| 124 |
-
- **better sleep & diet quality**
|
| 125 |
-
- **higher physical activity**
|
| 126 |
-
- **lower work & financial stress**
|
| 127 |
-
- Some variables are **highly correlated** (e.g. individual work-stress items), motivating **composite scores** and **dimensionality reduction (PCA)** later on.
|
| 128 |
|
| 129 |
-
|
| 130 |
|
| 131 |
---
|
| 132 |
|
| 133 |
-
##
|
| 134 |
|
| 135 |
-
|
| 136 |
|
| 137 |
-
###
|
| 138 |
|
| 139 |
-
|
| 140 |
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
df["financial_stress_score"]= df[[c for c in df.columns if "financial_stress" in c]].mean(axis=1)
|
| 145 |
-
df["sleep_quality_score"] = df[[c for c in df.columns if "sleep_quality" in c]].mean(axis=1)
|
| 146 |
-
df["activity_score"] = df[[c for c in df.columns if "activity" in c]].mean(axis=1)
|
| 147 |
-
df["diet_quality_score"] = df["diet_quality"]
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
tags:
|
| 4 |
+
- regression
|
| 5 |
+
- classification
|
| 6 |
+
- mental-health
|
| 7 |
+
- wellbeing
|
| 8 |
+
- gradient-boosting
|
| 9 |
+
- sklearn
|
| 10 |
+
- lifestyle
|
| 11 |
+
- clustering
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
# Mental Health & Wellbeing Prediction
|
| 15 |
+
|
| 16 |
+
## 📹 Video Presentation
|
| 17 |
|
| 18 |
+
[Watch the video presentation](https://1drv.ms/f/c/e67fe0aaccf6536c/IgCf0p3QgN9PR6pi1VVDzivQAZY1mD5BqUUdajvKncgdiOg)
|
|
|
|
|
|
|
| 19 |
|
| 20 |
---
|
| 21 |
|
| 22 |
+
## 📋 Project Overview
|
| 23 |
|
| 24 |
+
This project predicts *mental wellbeing scores* from lifestyle and environmental factors. We built *regression* models to predict the exact score and *classification* models to separate Low wellbeing from Medium/High wellbeing.
|
|
|
|
| 25 |
|
| 26 |
+
| | |
|
| 27 |
+
|---|---|
|
| 28 |
+
| *Dataset* | Synthetic Mental Health, Lifestyle & Wellbeing (Kaggle) |
|
| 29 |
+
| *Size* | 400,000 individuals, 15 features |
|
| 30 |
+
| *Target* | mental_wellbeing_score (0-100) |
|
| 31 |
+
| *Train/Test* | 320,000 / 80,000 (80/20 split) |
|
| 32 |
+
|
| 33 |
+
### Main Question
|
| 34 |
+
Which lifestyle and environmental factors are most strongly associated with mental wellbeing, and how accurately can we predict wellbeing scores from these features?
|
| 35 |
+
|
| 36 |
+
### Goals
|
| 37 |
+
1. Explore relationships between lifestyle factors and mental wellbeing
|
| 38 |
+
2. Build baseline regression model and improve through feature engineering
|
| 39 |
+
3. Apply K-Means clustering to discover lifestyle segments
|
| 40 |
+
4. Convert to binary classification and identify at-risk individuals
|
| 41 |
|
| 42 |
---
|
| 43 |
|
| 44 |
+
## 📊 Part 1-2: Exploratory Data Analysis
|
| 45 |
|
| 46 |
+
### Dataset Features
|
| 47 |
|
| 48 |
+
| Feature Type | Features |
|
| 49 |
+
|--------------|----------|
|
| 50 |
+
| *Target* | mental_wellbeing_score (0-100) |
|
| 51 |
+
| *Lifestyle* | sleep_hours, screen_time, physical_activity, diet_quality, sleep_quality |
|
| 52 |
+
| *Stress* | work_stress, financial_stress |
|
| 53 |
+
| *Social* | social_interactions |
|
| 54 |
+
| *Environment* | air_quality_index, noise_level |
|
| 55 |
+
| *Demographics* | age, gender, city_type |
|
| 56 |
|
| 57 |
+
### Target Distribution
|
|
| 58 |
|
| 59 |
+

|
| 60 |
|
| 61 |
+
The mental wellbeing score ranges from 0 to 100, with scores concentrated in the 80-100 range.
|
| 62 |
|
| 63 |
---
|
| 64 |
|
| 65 |
+
### Research Question 1: Screen Time vs Wellbeing
|
| 66 |
|
| 67 |
+

|
|
|
|
|
|
|
| 68 |
|
| 69 |
+
*Finding:* Higher screen time is associated with slightly lower mental wellbeing. The relationship is negative but relatively weak.
|
| 70 |
+
|
| 71 |
+
---
|
| 72 |
|
| 73 |
+
### Research Question 2: Physical Activity vs Wellbeing
|
|
|
|
|
|
|
| 74 |
|
| 75 |
+

|
| 76 |
|
| 77 |
+
*Finding:* Higher physical activity levels are associated with better mental wellbeing. Physical activity is one of the protective lifestyle factors, with a moderate positive correlation.
|
|
|
|
| 78 |
|
| 79 |
---
|
| 80 |
|
| 81 |
+
### Research Question 3: Work Stress vs Wellbeing
|
| 82 |
|
| 83 |
+

|
| 84 |
|
| 85 |
+
*Finding:* Work stress has a strong negative relationship with mental wellbeing - one of the most impactful factors.
|
| 86 |
|
| 87 |
+
---
|
| 88 |
+
|
| 89 |
+
### Research Question 4: Sleep Quality vs Wellbeing
|
| 90 |
+
|
| 91 |
+

|
| 92 |
+
|
| 93 |
+
*Finding:* Better sleep quality strongly correlates with higher mental wellbeing scores. Sleep quality is one of the top positive predictors.
|
| 94 |
|
| 95 |
---
|
| 96 |
|
| 97 |
+
### Research Question 5: Diet Quality vs Wellbeing
|
| 98 |
|
| 99 |
+

|
| 100 |
|
| 101 |
+
*Finding:* Higher diet quality is associated with better mental wellbeing outcomes.
|
| 102 |
|
| 103 |
+
---
|
| 104 |
|
| 105 |
+
### Correlation Analysis
|
|
|
|
| 106 |
|
| 107 |
+

|
| 108 |
|
| 109 |
+
*Key Correlations with Mental Wellbeing:*
|
| 110 |
|
| 111 |
+
| Factor | Correlation | Direction |
|
| 112 |
+
|--------|-------------|-----------|
|
| 113 |
+
| Sleep Quality | Strong Positive | ↑ Better sleep = Higher wellbeing |
|
| 114 |
+
| Diet Quality | Moderate Positive | ↑ Better diet = Higher wellbeing |
|
| 115 |
+
| Physical Activity | Moderate Positive | ↑ More activity = Higher wellbeing |
|
| 116 |
+
| Work Stress | Strong Negative | ↑ More stress = Lower wellbeing |
|
| 117 |
+
| Financial Stress | Moderate Negative | ↑ More stress = Lower wellbeing |
|
| 118 |
+
| Screen Time | Weak Negative | ↑ More screen time = Lower wellbeing |
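
A ranking like the one in the table above can be produced with pandas `corrwith`. The sketch below is a hedged illustration on synthetic stand-in data: the three columns, the coefficients used to generate the target, and the sample size are assumptions, not values from the Kaggle dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2000

# Synthetic stand-in features on a 0-10 scale (assumption).
df = pd.DataFrame({
    "sleep_quality": rng.uniform(0, 10, n),
    "work_stress": rng.uniform(0, 10, n),
    "screen_time": rng.uniform(0, 10, n),
})

# Hypothetical target: strong sleep and stress effects, weak screen-time effect.
df["mental_wellbeing_score"] = (
    80
    + 2.0 * df["sleep_quality"]
    - 2.0 * df["work_stress"]
    - 0.3 * df["screen_time"]
    + rng.normal(0, 3, n)
)

# Pearson correlation of every feature with the target, most negative first.
corr = (
    df.drop(columns="mental_wellbeing_score")
    .corrwith(df["mental_wellbeing_score"])
    .sort_values()
)
print(corr.round(2))
```

Pearson correlation only measures linear association, so a weak value does not rule out a non-linear relationship.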
|
| 119 |
|
| 120 |
+
---
|
|
|
|
|
|
|
| 121 |
|
| 122 |
+
### Feature Correlation with Target
|
| 123 |
|
| 124 |
+

|
| 125 |
|
| 126 |
+
This visualization shows how each feature correlates with mental wellbeing score. Green bars indicate positive relationships (beneficial factors), while red bars indicate negative relationships (risk factors).
|
| 127 |
|
| 128 |
---
|
| 129 |
|
| 130 |
+
## 📈 Part 3: Baseline Model
|
| 131 |
+
|
| 132 |
+
### Baseline Configuration
|
| 133 |
+
|
| 134 |
+
| Setting | Value |
|
| 135 |
+
|---------|-------|
|
| 136 |
+
| Algorithm | Linear Regression |
|
| 137 |
+
| Features | 6 lifestyle scores |
|
| 138 |
+
| Preprocessing | StandardScaler |
|
| 139 |
+
| Train/Test Split | 80/20 |
|
| 140 |
+
|
| 141 |
+
### Baseline Results
|
| 142 |
+
|
| 143 |
+
| Metric | Value |
|
| 144 |
+
|--------|-------|
|
| 145 |
+
| R² Score | 0.672 |
|
| 146 |
+
| MAE | 4.11 |
|
| 147 |
+
| RMSE | 5.16 |
|
| 148 |
+
|
| 149 |
+
*Interpretation:* The baseline model explains 67.2% of the variance in wellbeing scores, with a mean absolute error of about 4.1 points on the 0-100 scale. This is a solid baseline for the engineered models to improve on.
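
The baseline configuration above can be sketched as a scikit-learn pipeline. This is a minimal illustration on synthetic stand-in data, not the notebook's code: the column names, the generated target, and `random_state=42` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): six lifestyle scores on a 0-10 scale.
rng = np.random.default_rng(42)
features = ["sleep_quality", "diet_quality", "physical_activity",
            "work_stress", "financial_stress", "screen_time"]
X = pd.DataFrame(rng.uniform(0, 10, size=(1000, 6)), columns=features)

# Hypothetical target: a linear mix of two features plus noise.
y = (70 + 2 * X["sleep_quality"] - 2 * X["work_stress"]
     + rng.normal(0, 3, size=len(X)))

# 80/20 split, as in the configuration table.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on the training split only, inside the pipeline.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Because the features are standardized, the fitted coefficients are on a common scale and can be compared directly, which is what a coefficient-based importance ranking relies on.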
|
| 150 |
+
|
| 151 |
+
### Baseline: Actual vs Predicted
|
| 152 |
+
|
| 153 |
+

|
| 154 |
+
|
| 155 |
+
### Baseline Feature Importance
|
| 156 |
+
|
| 157 |
+

|
| 158 |
+
|
| 159 |
+
*Top Features (Baseline):*
|
| 160 |
+
|
| 161 |
+
| Rank | Feature | Coefficient | Effect |
|
| 162 |
+
|------|---------|-------------|--------|
|
| 163 |
+
| 1 | Work Stress | -4.04 | Strongest negative |
|
| 164 |
+
| 2 | Sleep Quality | +4.03 | Strongest positive |
|
| 165 |
+
| 3 | Financial Stress | -2.69 | Negative |
|
| 166 |
+
| 4 | Diet Quality | +2.69 | Positive |
|
| 167 |
+
| 5 | Physical Activity | +2.24 | Positive |
|
| 168 |
+
| 6 | Screen Time | -1.56 | Weakest negative |
|
| 169 |
+
|
| 170 |
+
---
|
| 171 |
+
|
| 172 |
+
## 🔧 Part 4: Feature Engineering
|
| 173 |
+
|
| 174 |
+
### Engineered Features
|
| 175 |
+
|
| 176 |
+
We created additional features to capture more complex relationships:
|
| 177 |
+
|
| 178 |
+
| Feature | Description | Rationale |
|
| 179 |
+
|---------|-------------|-----------|
|
| 180 |
+
| *Weighted Lifestyle Risk* | Composite score combining all risk factors | Captures overall lifestyle health |
|
| 181 |
+
| *Cluster Labels* | K-Means lifestyle segments (k=3) | Non-linear pattern capture |
|
| 182 |
+
| *PCA Components* | Lifestyle_PCA_1, Lifestyle_PCA_2 | Dimensionality reduction |
|
| 183 |
+
|
| 184 |
+
### 4.1 Weighted Lifestyle Risk Score
|
| 185 |
+
|
| 186 |
+

|
| 187 |
+
|
| 188 |
+
We created a weighted lifestyle risk score based on EDA findings:
|
| 189 |
+
|
| 190 |
+
| Factor | Weight | Direction |
|
| 191 |
+
|--------|--------|-----------|
|
| 192 |
+
| Work Stress | 0.30 | Higher = More Risk |
|
| 193 |
+
| Financial Stress | 0.25 | Higher = More Risk |
|
| 194 |
+
| Poor Sleep Quality | 0.20 | Lower quality = More Risk |
|
| 195 |
+
| Poor Diet Quality | 0.15 | Lower quality = More Risk |
|
| 196 |
+
| Low Physical Activity | 0.05 | Lower = More Risk |
|
| 197 |
+
| Screen Time | 0.05 | Higher = More Risk |
|
| 198 |
+
|
| 199 |
+
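*Formula:* The score is a weighted sum of the six factors above, with the protective factors (sleep quality, diet quality, physical activity) inverted so that a higher score means a riskier lifestyle profile (worse for wellbeing).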
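
One way to compute such a score is sketched below. The weights come from the table above; the min-max scaling to 0-1, the function name `weighted_lifestyle_risk`, and the three-person example are assumptions made for illustration, since the README does not state how the factors were normalized.

```python
import pandas as pd

# Weights from the table above; they sum to 1.0.
WEIGHTS = {
    "work_stress": 0.30,
    "financial_stress": 0.25,
    "sleep_quality": 0.20,
    "diet_quality": 0.15,
    "physical_activity": 0.05,
    "screen_time": 0.05,
}
# Protective factors are inverted so that a higher value always means more risk.
PROTECTIVE = {"sleep_quality", "diet_quality", "physical_activity"}


def weighted_lifestyle_risk(df: pd.DataFrame) -> pd.Series:
    """Return a 0-1 risk score; min-max scaling is an assumption of this sketch."""
    score = pd.Series(0.0, index=df.index)
    for col, weight in WEIGHTS.items():
        col_min, col_max = df[col].min(), df[col].max()
        scaled = (df[col] - col_min) / (col_max - col_min)
        if col in PROTECTIVE:
            scaled = 1.0 - scaled
        score += weight * scaled
    return score


# Hypothetical three-person example: healthiest, middle, riskiest.
people = pd.DataFrame({
    "work_stress": [1, 5, 9],
    "financial_stress": [1, 5, 9],
    "sleep_quality": [9, 5, 1],
    "diet_quality": [9, 5, 1],
    "physical_activity": [9, 5, 1],
    "screen_time": [1, 5, 9],
})
risk = weighted_lifestyle_risk(people)
print(risk.round(2).tolist())
```

In a train/test setting, the minimum and maximum of each column would be taken from the training split and reused on the test split, so that the score does not leak test information.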
|
| 200 |
+
|
| 201 |
+
---
|
| 202 |
+
|
| 203 |
+
### 4.2 K-Means Clustering (k=3)
|
| 204 |
+
|
| 205 |
+

|
| 206 |
+
|
| 207 |
+
We applied K-Means clustering to identify distinct lifestyle segments:
|
| 208 |
+
|
| 209 |
+
| Cluster | Count | Profile |
|
| 210 |
+
|---------|-------|---------|
|
| 211 |
+
| 0 | 129,119 (32%) | Lifestyle Profile A |
|
| 212 |
+
| 1 | 142,014 (36%) | Lifestyle Profile B |
|
| 213 |
+
| 2 | 128,867 (32%) | Lifestyle Profile C |
|
| 214 |
+
|
| 215 |
+
The cluster label becomes a categorical feature that helps the model capture non-linear relationships between lifestyle factors.
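
A minimal sketch of this step follows, assuming the lifestyle features are standardized before clustering. The random data, `n_init=10`, `random_state=42`, and the one-hot encoding of the label are illustrative choices, not details confirmed by the README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the six lifestyle features (assumption).
X = rng.normal(size=(300, 6))

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Cluster sizes, analogous to the Count column above.
sizes = np.bincount(cluster_labels, minlength=3)
print(sizes)

# One-hot encode the label so models do not read it as an ordered number.
cluster_onehot = np.eye(3)[cluster_labels]
```

Cluster IDs are arbitrary, so a linear model could misread a raw 0/1/2 label as a ranking. Tree-based models are less sensitive to this.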
|
| 216 |
+
|
| 217 |
+
### 4.3 PCA Components
|
| 218 |
+
|
| 219 |
+
We added two PCA components (Lifestyle_PCA_1, Lifestyle_PCA_2) that compress the six lifestyle features into orthogonal dimensions capturing the main variance patterns.
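
The two components could be derived roughly as follows. This is a sketch on synthetic stand-in data; standardizing before PCA and the variable names `lifestyle_pca_1` and `lifestyle_pca_2` are assumptions that mirror the feature names above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for the six lifestyle features (assumption).
X = rng.normal(size=(500, 6))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
lifestyle_pca_1, lifestyle_pca_2 = components[:, 0], components[:, 1]

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

The explained variance ratio shows how much of the six-feature variation the two components retain.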
|
| 220 |
+
|
| 221 |
+
---
|
| 222 |
+
|
| 223 |
+
## 🎯 Part 5: Improved Regression Models
|
| 224 |
+
|
| 225 |
+

|
| 226 |
+
|
| 227 |
+

|
| 228 |
+
|
| 229 |
+
### Model Comparison Results
|
| 230 |
+
|
| 231 |
+
| Model | MAE | RMSE | R² |
|
| 232 |
+
|-------|-----|------|-----|
|
| 233 |
+
| Baseline Linear Regression | 4.11 | 5.16 | 0.672 |
|
| 234 |
+
| Linear Regression (engineered) | 4.09 | 5.14 | 0.675 |
|
| 235 |
+
| Random Forest (engineered) | 2.28 | 3.61 | 0.839 |
|
| 236 |
+
| *Gradient Boosting (engineered)* | *2.29* | *3.49* | *0.850* |
|
| 237 |
+
|
| 238 |
+
### Improvement Analysis
|
| 239 |
+
|
| 240 |
+
| Comparison | Improvement |
|
| 241 |
+
|------------|-------------|
|
| 242 |
+
| Baseline → Gradient Boosting | +26.5% R² improvement |
|
| 243 |
+
| MAE Reduction | 4.11 → 2.29 (44% reduction) |
|
| 244 |
+
| RMSE Reduction | 5.16 → 3.49 (32% reduction) |
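
The percentages in this table are relative changes from the baseline, which the short check below reproduces from the reported metrics. The helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


# Metrics reported in the model comparison table.
r2_gain = relative_change(0.672, 0.850)
mae_drop = relative_change(4.11, 2.29)
rmse_drop = relative_change(5.16, 3.49)

print(f"R2: {r2_gain:+.1f}%  MAE: {mae_drop:+.1f}%  RMSE: {rmse_drop:+.1f}%")
# prints "R2: +26.5%  MAE: -44.3%  RMSE: -32.4%"
```

The R² figure is a relative gain: in absolute terms R² rises by 0.178, from 0.672 to 0.850.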
|
| 245 |
+
|
| 246 |
+
---
|
| 247 |
+
|
| 248 |
+
### Feature Importance (Best Model)
|
| 249 |
+
|
| 250 |
+

|
| 251 |
+
|
| 252 |
+
*Key Insights:*
|
| 253 |
+
- Stress factors (work, financial) remain the strongest predictors
|
| 254 |
+
- Sleep quality continues to be the top positive factor
|
| 255 |
+
- Engineered features and cluster labels add predictive value
|
| 256 |
+
|
| 257 |
+
---
|
| 258 |
|
| 259 |
+
## 🏆 Part 6: Regression Winner
|
| 260 |
|
| 261 |
+
### Gradient Boosting Regressor
|
| 262 |
+
|
| 263 |
+
| Metric | Value |
|
| 264 |
+
|--------|-------|
|
| 265 |
+
| R² Score | 0.850 |
|
| 266 |
+
| MAE | 2.29 |
|
| 267 |
+
| RMSE | 3.49 |
|
| 268 |
+
|
| 269 |
+
*Why Gradient Boosting Won:*
|
| 270 |
+
- Captures non-linear relationships between lifestyle factors
|
| 271 |
+
- Handles feature interactions naturally
|
| 272 |
+
- Best balance of accuracy and generalization
|
| 273 |
+
- Lowest RMSE among all models
|
| 274 |
+
|
| 275 |
+
*Saved as:* winning_regressor.pkl
|
| 276 |
+
|
| 277 |
+
---
|
| 278 |
+
|
| 279 |
+
## 🔄 Part 7: Regression to Classification
|
| 280 |
+
|
| 281 |
+
We converted the wellbeing scores into *two classes* using a single quantile threshold (92.49, roughly the 33rd percentile):
|
| 282 |
+
|
| 283 |
+

|
| 284 |
+
|
| 285 |
+
| Class | Wellbeing Level | Threshold | Train Count | Percentage |
|
| 286 |
+
|-------|-----------------|-----------|-------------|------------|
|
| 287 |
+
| 0 | Low Wellbeing | < 92.49 | 105,600 | 33% |
|
| 288 |
+
| 1 | Medium/High Wellbeing | ≥ 92.49 | 214,400 | 67% |
|
| 289 |
+
|
| 290 |
+
*Note:* The classes are imbalanced (33% vs 67%), so we focus on F1-score and recall rather than accuracy alone.
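
The conversion can be sketched as below. The synthetic score distribution and the choice to compute the threshold on the training scores only are assumptions of this sketch; the README reports the threshold value but not which split it was computed on.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic stand-in scores, skewed toward the top of the 0-100 scale (assumption).
y_train = np.clip(100 - rng.exponential(8, size=10_000), 0, 100)
y_test = np.clip(100 - rng.exponential(8, size=2_500), 0, 100)

# Threshold taken from the training scores only, then reused for the test set.
threshold = np.quantile(y_train, 0.33)

# Class 0 = Low Wellbeing (below threshold), class 1 = Medium/High Wellbeing.
y_train_cls = (y_train >= threshold).astype(int)
y_test_cls = (y_test >= threshold).astype(int)

low_share = np.mean(y_train_cls == 0)
print(f"threshold={threshold:.2f}, low share={low_share:.2%}")
```

Reusing the training threshold on the test set keeps the class definition identical across both splits.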
|
| 291 |
+
|
| 292 |
+
---
|
| 293 |
+
|
| 294 |
+
### Precision vs Recall Analysis
|
| 295 |
+
|
| 296 |
+
*For mental health prediction, Recall is more important:*
|
| 297 |
+
|
| 298 |
+
In the context of predicting mental wellbeing, *recall is more important than precision* for the low-wellbeing class. Missing a person who is actually struggling (false negative) is more harmful than flagging someone as "at risk" when they are actually fine (false positive).
|
| 299 |
+
|
| 300 |
+
### False Positive vs False Negative
|
| 301 |
+
|
| 302 |
+
*False Negatives are more critical:*
|
| 303 |
+
|
| 304 |
+
| Error Type | Meaning | Consequence |
|
| 305 |
+
|------------|---------|-------------|
|
| 306 |
+
| False Positive | Predict Low, actually OK | Extra attention to someone who is fine (less harmful) |
|
| 307 |
+
| *False Negative* | Predict OK, actually Low | *Person who needs support is not identified (more harmful)* |
|
| 308 |
+
|
| 309 |
+
A false negative means the model predicts that someone is not in the low-wellbeing group, while in reality they are. This could result in a person who needs support not being identified.
|
| 310 |
+
|
| 311 |
+
*Conclusion:* We prioritize recall for Class 0 (Low Wellbeing) to minimize missed at-risk individuals.
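
To measure recall for Class 0, the low-wellbeing class has to be treated as the positive class. The sketch below shows one way to do that with scikit-learn on a small set of hypothetical labels, which are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = Low Wellbeing, 1 = Medium/High Wellbeing.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# pos_label=0 makes the low-wellbeing class the "positive" class.
recall_low = recall_score(y_true, y_pred, pos_label=0)
precision_low = precision_score(y_true, y_pred, pos_label=0)

# With labels=[0, 1], rows are true classes and columns are predictions.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
missed_low = cm[0, 1]    # truly low, predicted OK (false negatives for class 0)
false_alarms = cm[1, 0]  # truly OK, predicted low (false positives for class 0)

print(recall_low, precision_low, missed_low, false_alarms)
```

Here three of the four low-wellbeing cases are caught, so recall for Class 0 is 0.75, with one missed case and one false alarm.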
|
| 312 |
+
|
| 313 |
+
---
|
| 314 |
+
|
| 315 |
+
## 📊 Part 8: Classification Models
|
| 316 |
+
|
| 317 |
+
### Classification Results
|
| 318 |
+
|
| 319 |
+
| Model | Accuracy | F1 (macro) |
|
| 320 |
+
|-------|----------|------------|
|
| 321 |
+
| *Logistic Regression* | *90.55%* | *0.893* |
|
| 322 |
+
| Gradient Boosting | 90.47% | 0.892 |
|
| 323 |
+
| Random Forest | 90.39% | 0.891 |
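
A comparison table of this kind can be produced with a loop over the candidate models. The sketch below uses `make_classification` to generate stand-in data with a similar 33/67 class balance; the data, hyperparameters, and resulting scores are illustrative assumptions and will not match the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly a 33% / 67% class balance (assumption).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.33, 0.67], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Macro F1 averages the per-class F1 scores with equal weight.
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }
    print(f"{name}: acc={results[name]['accuracy']:.3f}, "
          f"f1_macro={results[name]['f1_macro']:.3f}")
```

Macro F1 weights both classes equally, so the smaller low-wellbeing class counts as much as the larger one.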
|
| 324 |
+
|
| 325 |
+
### Confusion Matrices
|
| 326 |
+
|
| 327 |
+

|
| 328 |
+
|
| 329 |
+
*Key Observations:*
|
| 330 |
+
- All models achieve ~90% accuracy
|
| 331 |
+
- With only two classes, every error is either a false positive or a false negative for the low-wellbeing class
|
| 332 |
+
- False negatives for the low-wellbeing class are the costlier error, so the off-diagonal cells matter more than overall accuracy
|
| 333 |
+
- Logistic Regression achieves the highest F1 score despite being the simplest model
|
| 334 |
+
|
| 335 |
+
### Confusion Matrix Analysis
|
| 336 |
+
|
| 337 |
+
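Because the task is binary (class 0 = Low Wellbeing, class 1 = Medium/High Wellbeing), every error in the confusion matrices is one of two kinds: a person in the low group predicted as Medium/High (a false negative for the low-wellbeing class), or a person in the Medium/High group predicted as low (a false positive). At roughly 90% accuracy, about one test prediction in ten falls into one of these two off-diagonal cells. From a practical perspective the first kind is the one to watch, because it means a person in the low group is missed.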
|
| 338 |
+
|
| 339 |
+
---
|
| 340 |
+
|
| 341 |
+
## 🏆 Part 8.4: Classification Winner
|
| 342 |
+
|
| 343 |
+
### Logistic Regression
|
| 344 |
+
|
| 345 |
+
| Metric | Value |
|
| 346 |
+
|--------|-------|
|
| 347 |
+
| Accuracy | 90.55% |
|
| 348 |
+
| Macro F1 | 0.893 |
|
| 349 |
+
|
| 350 |
+
*Why Logistic Regression Won:*
|
| 351 |
+
- Highest accuracy and F1 score
|
| 352 |
+
- Simple, interpretable model
|
| 353 |
+
- Fast training (1.48 seconds vs 117-247 seconds for the others)
|
| 354 |
+
- Outputs class probabilities, which are typically well calibrated for logistic regression
|
| 355 |
+
- Performs well because the boundary between the two classes is close to linear
|
| 356 |
+
|
| 357 |
+
*Saved as:* winning_classifier.pkl
|
| 358 |
+
|
| 359 |
+
---
|
| 360 |
+
|
| 361 |
+
## 📁 Repository Files
|
| 362 |
+
|
| 363 |
+
| File | Description |
|
| 364 |
+
|------|-------------|
|
| 365 |
+
| winning_regressor.pkl | Gradient Boosting regression model (R²=0.85) |
|
| 366 |
+
| winning_classifier.pkl | Logistic Regression classifier (90.55% accuracy) |
|
| 367 |
+
| notebook.ipynb | Complete Jupyter notebook with all code |
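
The `.pkl` extension suggests the models were serialized with Python pickling, but the README does not say whether `pickle` or `joblib` wrote the files. The sketch below assumes plain `pickle` and uses a tiny stand-in model in a temporary directory to show the save and load round trip; it does not read the repository's actual files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (assumption); the real file holds the trained classifier.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "winning_classifier.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    # Only unpickle files from a source you trust.
    with path.open("rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and any preprocessing applied at training time must be repeated before calling `predict`.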
|
| 368 |
+
|
| 369 |
+
---
|
| 370 |
+
|
| 371 |
+
## 💡 Key Takeaways
|
| 372 |
+
|
| 373 |
+
### What Affects Mental Wellbeing Most?
|
| 374 |
+
|
| 375 |
+
*Negative Factors (Risk):*
|
| 376 |
+
1. 🔴 *Work Stress* - Strongest negative impact (coefficient: -4.04)
|
| 377 |
+
2. 🔴 *Financial Stress* - Significant negative impact (coefficient: -2.69)
|
| 378 |
+
3. 🟡 *Screen Time* - Weak negative impact (coefficient: -1.56)
|
| 379 |
+
|
| 380 |
+
*Positive Factors (Protective):*
|
| 381 |
+
1. 🟢 *Sleep Quality* - Strongest positive impact (coefficient: +4.03)
|
| 382 |
+
2. 🟢 *Diet Quality* - Significant positive impact (coefficient: +2.69)
|
| 383 |
+
3. 🟢 *Physical Activity* - Moderate positive impact (coefficient: +2.24)
|
| 384 |
+
|
| 385 |
+
### Model Performance Summary
|
| 386 |
+
|
| 387 |
+
| Task | Best Model | Performance |
|
| 388 |
+
|------|------------|-------------|
|
| 389 |
+
| Regression | Gradient Boosting | R² = 0.850, RMSE = 3.49 |
|
| 390 |
+
| Classification | Logistic Regression | 90.55% accuracy, F1 = 0.893 |
|
| 391 |
+
|
| 392 |
+
### Feature Engineering Impact
|
| 393 |
+
|
| 394 |
+
| Model | MAE | RMSE | R² |
|
| 395 |
+
|-------|-----|------|-----|
|
| 396 |
+
| Baseline (6 features) | 4.11 | 5.16 | 0.672 |
|
| 397 |
+
| Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |
|
| 398 |
+
| *Improvement* | *-44%* | *-32%* | *+26.5%* |
|
| 399 |
+
|
| 400 |
+
### Lessons Learned
|
| 401 |
+
|
| 402 |
+
1. *Stress management is crucial* - Work and financial stress are the strongest predictors of low wellbeing
|
| 403 |
+
2. *Sleep quality matters most* among positive lifestyle factors
|
| 404 |
+
3. *Feature engineering helps* - Weighted risk score and cluster features improved predictions
|
| 405 |
+
4. *Simple models can win* - Logistic Regression narrowly beat the more complex models for classification (macro F1 0.893 vs 0.892 and 0.891)
|
| 406 |
+
5. *Ensemble methods excel for regression* - Gradient Boosting captured non-linear patterns
|
| 407 |
+
6. *Recall matters for mental health* - Don't miss at-risk individuals (minimize false negatives)
|
| 408 |
+
|
| 409 |
+
---
|
| 410 |
+
|
| 411 |
+
## 👤 Author
|
| 412 |
+
|
| 413 |
+
*Odeya*
|
| 414 |
+
|
| 415 |
+
Assignment #2: Classification, Regression, Clustering, Evaluation
|
| 416 |
+
|
| 417 |
+
---
|
| 418 |
|
| 419 |
+
## 📚 References
|
| 420 |
|
| 421 |
+
- *Dataset:* [Synthetic Mental Health, Lifestyle & Wellbeing Dataset - Kaggle](https://www.kaggle.com/datasets/raghavgour/syntheticmental-healthlifestyle-wellbeing-dataset)
|
| 422 |
+
- *Tools:* scikit-learn, pandas, numpy, matplotlib, seaborn
|
| 423 |
+
- *Algorithms:* Linear Regression, Random Forest, Gradient Boosting, Logistic Regression, K-Means
|