odeyaaa committed on
Commit
1ddd346
·
verified ·
1 Parent(s): ac017bd

Update README.md

Files changed (1)
  1. README.md +374 -98
README.md CHANGED
@@ -1,147 +1,423 @@
1
- # From Lifestyle Patterns to Wellbeing
2
- ### Classification, Regression & Clustering on a Mental-Health & Lifestyle Dataset
3
 
4
- > **Course:** Introduction to Data Science Assignment #2
5
- > **Student:** Odeya Shmuel
6
- > **Tools:** Python, Pandas, NumPy, Scikit-Learn, Seaborn, Matplotlib, HuggingFace
7
 
8
  ---
9
 
10
- ## 🎥 Presentation Video
11
 
12
- > ▶️ **Watch the 4–6 min walkthrough:**
13
- > [![Assignment 2 Video](https://img.shields.io/badge/Video-YouTube-red)](<FILL_IN_VIDEO_LINK>)
14
 
15
- The video walks through this repository, the notebook, main visualizations, models, and key takeaways – with screen-share and camera on, as required in the assignment.
16
 
17
  ---
18
 
19
- ## 1. Project Overview
20
 
21
- This project explores how **daily lifestyle patterns** (screen time, physical activity, work & financial stress, diet, sleep) are related to **mental wellbeing**.
22
 
23
- Using a synthetic mental-health & lifestyle dataset, I built a **full DS pipeline**:
24
 
25
- - **EDA** – understanding distributions, relationships, and potential data issues
26
- - **Feature Engineering** – composite lifestyle scores, interactions, scaling, PCA, cluster-based features
27
- - **Clustering** – discovering latent lifestyle “profiles”
28
- - **Classification** – predicting **high vs. low wellbeing**
29
- - **Regression** – predicting the **continuous wellbeing score**
30
- - **Evaluation & Interpretation** – using appropriate metrics, comparing models, and interpreting feature importance
31
 
32
- The main **business/research question**:
33
 
34
- > **“Which lifestyle factors and behavior patterns are most predictive of wellbeing, and can we meaningfully separate low-, medium-, and high-wellbeing individuals?”**
35
 
36
  ---
37
 
38
- ## 2. Dataset
39
 
40
- - **Source:** Synthetic Mental Health, Lifestyle & Wellbeing dataset (Kaggle)
41
- - **Rows:** `<FILL_IN>`
42
- - **Columns:** `<FILL_IN>`
43
 
44
- ### Main variable groups
 
 
45
 
46
- - **Wellbeing target**
47
- - `wellbeing_score` – continuous target used for regression
48
- - `wellbeing_label` – derived binary/3-class label (e.g. low / medium / high) used for classification
49
 
50
- - **Lifestyle & behavior**
51
- - Screen time questions: `screen_time_1` … `screen_time_9`
52
- - Physical activity: duration / intensity features (e.g. `activity_minutes`, `activity_intensity`, etc.)
53
- - Work stress: `work_stress_1` … `work_stress_9`
54
- - Financial stress: `financial_stress_1` … `financial_stress_9`
55
- - Diet quality: `diet_quality`
56
- - Sleep quality: `sleep_quality_1` … `sleep_quality_9`
57
 
58
- - **Demographics or context (if present)**
59
- Age, gender, employment, etc. – used for additional EDA but not all were kept in the final models.
60
 
61
  ---
62
 
63
- ## 3. End-to-End Pipeline
64
 
65
- The notebook follows the standard **data-science lifecycle**:
66
 
67
- 1. **Problem framing** define targets & success criteria
68
- 2. **Data loading & inspection**
69
- 3. **Data cleaning & EDA**
70
- 4. **Feature engineering**
71
- 5. **Clustering (unsupervised)**
72
- 6. **Train/test split & modeling (classification + regression)**
73
- 7. **Evaluation & model comparison**
74
- 8. **Interpretation & storytelling**
75
 
76
- Random seeds (`random_state=<FILL_IN>`) are set to keep results reproducible.
77
 
78
  ---
79
 
80
- ## 4. Data Handling & EDA (20%)
81
 
82
- ### 4.1 Data cleaning
83
 
84
- Steps taken to ensure data quality:
85
 
86
- - **Missing values**
87
- - Checked with `df.isna().sum()` and missingness heatmaps.
88
- - Strategy:
89
- - Numeric features → **median imputation**
90
- - Categorical features → **mode imputation**
91
- - Verified there is no leakage from the target during imputation (fit only on train).
92
 
93
- - **Duplicates**
94
- - Used `df.duplicated().sum()` and dropped exact duplicates when needed.
95
 
96
- - **Outliers**
97
- - Inspected distributions for key numeric variables with:
98
- - Histograms / KDE plots
99
- - Boxplots
100
- - Used **IQR rule** (Q1 − 1.5·IQR, Q3 + 1.5·IQR) to identify extreme values in:
101
- - `screen_time_*`
102
- - activity features
103
- - stress scores
104
- - Instead of dropping many rows, I **capped / winsorized** extreme values for some features to reduce their influence without losing information.
105
 
106
- ### 4.2 Exploratory Data Analysis
107
 
108
- Key EDA elements (with plots shown in the notebook):
109
 
110
- - **Univariate distributions**
111
- - Histograms for wellbeing, lifestyle scores, and stress levels.
112
- - Showed that wellbeing is **not perfectly normal**, with slight skew towards `<FILL_IN: e.g. lower wellbeing>`.
113
 
114
- - **Bivariate relationships**
115
- - **Correlation matrix** of numeric features (heatmap).
116
- - Scatter plots of `wellbeing_score` vs:
117
- - average sleep quality
118
- - activity score
119
- - work & financial stress
120
- - Boxplots of wellbeing per demographic groups (if present).
121
 
122
- - **Insights from EDA**
123
- - Higher wellbeing is associated with:
124
- - **better sleep & diet quality**
125
- - **higher physical activity**
126
- - **lower work & financial stress**
127
- - Some variables are **highly correlated** (e.g. individual work-stress items), motivating **composite scores** and **dimensionality reduction (PCA)** later on.
128
 
129
- EDA directly guided the **feature engineering choices** and which variables to prefer in modeling.
130
 
131
  ---
132
 
133
- ## 5. Feature Engineering (20%)
134
 
135
- The goal was to create **meaningful, low-noise features** reflecting lifestyle patterns, and to reduce multicollinearity.
136
 
137
- ### 5.1 Composite lifestyle scores
138
 
139
- I aggregated question-level items into **domain scores**:
140
 
141
- ```python
142
- df["screen_time_score"] = df[[c for c in df.columns if "screen_time" in c]].mean(axis=1)
143
- df["work_stress_score"] = df[[c for c in df.columns if "work_stress" in c]].mean(axis=1)
144
- df["financial_stress_score"]= df[[c for c in df.columns if "financial_stress" in c]].mean(axis=1)
145
- df["sleep_quality_score"] = df[[c for c in df.columns if "sleep_quality" in c]].mean(axis=1)
146
- df["activity_score"] = df[[c for c in df.columns if "activity" in c]].mean(axis=1)
147
- df["diet_quality_score"] = df["diet_quality"]
 
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - regression
5
+ - classification
6
+ - mental-health
7
+ - wellbeing
8
+ - gradient-boosting
9
+ - sklearn
10
+ - lifestyle
11
+ - clustering
12
+ ---
13
+
14
+ # Mental Health & Wellbeing Prediction
15
+
16
+ ## 📹 Video Presentation
17
 
18
+ [Video presentation (OneDrive)](https://1drv.ms/f/c/e67fe0aaccf6536c/IgCf0p3QgN9PR6pi1VVDzivQAZY1mD5BqUUdajvKncgdiOg)
 
 
19
 
20
  ---
21
 
22
+ ## 📋 Project Overview
23
 
24
+ This project predicts *mental wellbeing scores* based on lifestyle and environmental factors. We built both *regression* models (to predict exact scores) and *classification* models (to categorize wellbeing levels as Low vs Medium/High).
 
25
 
26
+ | | |
27
+ |---|---|
28
+ | *Dataset* | Synthetic Mental Health, Lifestyle & Wellbeing (Kaggle) |
29
+ | *Size* | 400,000 individuals, 15 features |
30
+ | *Target* | mental_wellbeing_score (0-100) |
31
+ | *Train/Test* | 320,000 / 80,000 (80/20 split) |
32
+
33
+ ### Main Question
34
+ Which lifestyle and environmental factors are most strongly associated with mental wellbeing, and how accurately can we predict wellbeing scores from these features?
35
+
36
+ ### Goals
37
+ 1. Explore relationships between lifestyle factors and mental wellbeing
38
+ 2. Build baseline regression model and improve through feature engineering
39
+ 3. Apply K-Means clustering to discover lifestyle segments
40
+ 4. Convert to binary classification and identify at-risk individuals
41
 
42
  ---
43
 
44
+ ## 📊 Part 1-2: Exploratory Data Analysis
45
 
46
+ ### Dataset Features
47
 
48
+ | Feature Type | Features |
49
+ |--------------|----------|
50
+ | *Target* | mental_wellbeing_score (0-100) |
51
+ | *Lifestyle* | sleep_hours, screen_time, physical_activity, diet_quality, sleep_quality |
52
+ | *Stress* | work_stress, financial_stress |
53
+ | *Social* | social_interactions |
54
+ | *Environment* | air_quality_index, noise_level |
55
+ | *Demographics* | age, gender, city_type |
56
 
57
+ ### Target Distribution
58
 
59
+ ![Target Distribution](./01_target_distribution.png)
60
 
61
+ The mental wellbeing score ranges from 0 to 100, with scores concentrated in the 80-100 range.
62
 
63
  ---
64
 
65
+ ### Research Question 1: Screen Time vs Wellbeing
66
 
67
+ ![Screen Time vs Wellbeing](./05_screen_time_vs_wellbeing.png)
 
 
68
 
69
+ *Finding:* Higher screen time is associated with slightly lower mental wellbeing. The relationship is negative but relatively weak.
70
+
71
+ ---
72
 
73
+ ### Research Question 2: Physical Activity vs Wellbeing
 
 
74
 
75
+ ![Physical Activity vs Wellbeing](./06_physical_activity_vs_wellbeing.png)
76
 
77
+ *Finding:* Higher physical activity levels are associated with better mental wellbeing. This is one of the positive lifestyle factors.
 
78
 
79
  ---
80
 
81
+ ### Research Question 3: Work Stress vs Wellbeing
82
 
83
+ ![Work Stress vs Wellbeing](./07_work_stress_vs_wellbeing.png)
84
 
85
+ *Finding:* Work stress has a strong negative relationship with mental wellbeing - one of the most impactful factors.
86
 
87
+ ---
88
+
89
+ ### Research Question 4: Sleep Quality vs Wellbeing
90
+
91
+ ![Sleep Quality vs Wellbeing](./08_sleep_quality_vs_wellbeing.png)
92
+
93
+ *Finding:* Better sleep quality strongly correlates with higher mental wellbeing scores. Sleep quality is one of the top positive predictors.
94
 
95
  ---
96
 
97
+ ### Research Question 5: Diet Quality vs Wellbeing
98
 
99
+ ![Diet Quality vs Wellbeing](./09_diet_quality_vs_wellbeing.png)
100
 
101
+ *Finding:* Higher diet quality is associated with better mental wellbeing outcomes.
102
 
103
+ ---
104
 
105
+ ### Correlation Analysis
 
106
 
107
+ ![Correlation Heatmap](./04_correlation_heatmap.png)
108
 
109
+ *Key Correlations with Mental Wellbeing:*
110
 
111
+ | Factor | Correlation | Direction |
112
+ |--------|-------------|-----------|
113
+ | Sleep Quality | Strong Positive | ↑ Better sleep = Higher wellbeing |
114
+ | Diet Quality | Moderate Positive | ↑ Better diet = Higher wellbeing |
115
+ | Physical Activity | Moderate Positive | ↑ More activity = Higher wellbeing |
116
+ | Work Stress | Strong Negative | ↑ More stress = Lower wellbeing |
117
+ | Financial Stress | Moderate Negative | ↑ More stress = Lower wellbeing |
118
+ | Screen Time | Weak Negative | ↑ More screen time = Lower wellbeing |
119
 
120
+ ---
 
 
121
 
122
+ ### Feature Correlation with Target
123
 
124
+ ![Feature Correlation](./10_feature_correlation_target.png)
125
 
126
+ This visualization shows how each feature correlates with mental wellbeing score. Green bars indicate positive relationships (beneficial factors), while red bars indicate negative relationships (risk factors).
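A chart of this kind can be derived with pandas' `DataFrame.corrwith`. The sketch below is illustrative only: it runs on a small randomly generated frame, the column names are borrowed from the feature table above, and the toy relationship between the columns is an assumption made purely so the correlations have a sign.

```python
import numpy as np
import pandas as pd

# Toy stand-in data; the notebook would load the Kaggle CSV instead.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "sleep_quality": rng.normal(size=n),
    "work_stress": rng.normal(size=n),
    "screen_time": rng.normal(size=n),
})
# Assumed toy relationship (not the dataset's true generating process).
df["mental_wellbeing_score"] = (
    4 * df["sleep_quality"]
    - 4 * df["work_stress"]
    - 1.5 * df["screen_time"]
    + rng.normal(scale=5, size=n)
)

target = "mental_wellbeing_score"
# Pearson correlation of every feature with the target, sorted for plotting.
corr_with_target = df.drop(columns=target).corrwith(df[target]).sort_values()
# Green for positive correlations, red for negative ones.
colors = ["green" if c > 0 else "red" for c in corr_with_target]
print(corr_with_target)
```

Passing `corr_with_target` and `colors` to a horizontal bar plot would reproduce the green/red layout described above.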
127
 
128
  ---
129
 
130
+ ## 📈 Part 3: Baseline Model
131
+
132
+ ### Baseline Configuration
133
+
134
+ | Setting | Value |
135
+ |---------|-------|
136
+ | Algorithm | Linear Regression |
137
+ | Features | 6 lifestyle scores |
138
+ | Preprocessing | StandardScaler |
139
+ | Train/Test Split | 80/20 |
140
+
141
+ ### Baseline Results
142
+
143
+ | Metric | Value |
144
+ |--------|-------|
145
+ | R² Score | 0.672 |
146
+ | MAE | 4.11 |
147
+ | RMSE | 5.16 |
148
+
149
+ *Interpretation:* The baseline model explains 67.2% of the variance in wellbeing scores, with a mean absolute error of about 4 points on the 0-100 scale. This gives a reference point for the engineered models in Part 5.
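For readers who want to reproduce the shape of this baseline, here is a minimal sketch of the configuration in the table (StandardScaler, Linear Regression, 80/20 split) with the three reported metrics. The data is synthetic and the coefficients used to generate it are illustrative assumptions, so the printed numbers will not match the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six lifestyle scores (assumption, not real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(2_000, 6))
toy_coefs = np.array([-4.0, 4.0, -2.7, 2.7, 2.2, -1.6])  # illustrative only
y = 85 + X @ toy_coefs + rng.normal(scale=5, size=2_000)

# 80/20 split, as in the baseline configuration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training fold only, avoiding leakage.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Because the inputs are standardized, the fitted coefficients are comparable across features, which is what makes a coefficient ranking meaningful.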
150
+
151
+ ### Baseline: Actual vs Predicted
152
+
153
+ ![Baseline Actual vs Predicted](./11_baseline_actual_vs_predicted.png)
154
+
155
+ ### Baseline Feature Importance
156
+
157
+ ![Baseline Feature Importance](./12_baseline_feature_importance.png)
158
+
159
+ *Top Features (Baseline):*
160
+
161
+ | Rank | Feature | Coefficient | Effect |
162
+ |------|---------|-------------|--------|
163
+ | 1 | Work Stress | -4.04 | Strongest negative |
164
+ | 2 | Sleep Quality | +4.03 | Strongest positive |
165
+ | 3 | Financial Stress | -2.69 | Negative |
166
+ | 4 | Diet Quality | +2.69 | Positive |
167
+ | 5 | Physical Activity | +2.24 | Positive |
168
+ | 6 | Screen Time | -1.56 | Weakest negative |
169
+
170
+ ---
171
+
172
+ ## 🔧 Part 4: Feature Engineering
173
+
174
+ ### Engineered Features
175
+
176
+ We created additional features to capture more complex relationships:
177
+
178
+ | Feature | Description | Rationale |
179
+ |---------|-------------|-----------|
180
+ | *Weighted Lifestyle Risk* | Composite score combining all risk factors | Captures overall lifestyle health |
181
+ | *Cluster Labels* | K-Means lifestyle segments (k=3) | Non-linear pattern capture |
182
+ | *PCA Components* | Lifestyle_PCA_1, Lifestyle_PCA_2 | Dimensionality reduction |
183
+
184
+ ### 4.1 Weighted Lifestyle Risk Score
185
+
186
+ ![Lifestyle Risk Distribution](./13_lifestyle_risk_distribution.png)
187
+
188
+ We created a weighted lifestyle risk score based on EDA findings:
189
+
190
+ | Factor | Weight | Direction |
191
+ |--------|--------|-----------|
192
+ | Work Stress | 0.30 | Higher = More Risk |
193
+ | Financial Stress | 0.25 | Higher = More Risk |
194
+ | Poor Sleep Quality | 0.20 | Lower quality = More Risk |
195
+ | Poor Diet Quality | 0.15 | Lower quality = More Risk |
196
+ | Low Physical Activity | 0.05 | Lower = More Risk |
197
+ | Screen Time | 0.05 | Higher = More Risk |
198
+
199
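+ *Interpretation:* The score is a weighted sum of the six factors, each oriented so that larger values mean more risk. A higher score indicates a riskier lifestyle profile (worse for wellbeing).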
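One way such a score could be computed is sketched below. The weights come from the table above; everything else is an assumption for illustration: the column names, the min-max scaling to [0, 1], and the `1 - x` inversion for protective factors are not confirmed by the README, and `lifestyle_risk` is a hypothetical helper name.

```python
import pandas as pd

# Weights from the table above.
WEIGHTS = {
    "work_stress": 0.30,
    "financial_stress": 0.25,
    "sleep_quality": 0.20,
    "diet_quality": 0.15,
    "physical_activity": 0.05,
    "screen_time": 0.05,
}
# Factors where a LOWER raw value means MORE risk (assumed orientation).
PROTECTIVE = {"sleep_quality", "diet_quality", "physical_activity"}


def lifestyle_risk(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: weighted sum of min-max-scaled risk components."""
    x = df[list(WEIGHTS)].astype(float)
    scaled = (x - x.min()) / (x.max() - x.min())  # each column mapped to [0, 1]
    risk = pd.Series(0.0, index=df.index)
    for col, weight in WEIGHTS.items():
        # Invert protective factors so that 1.0 always means "most risk".
        component = 1 - scaled[col] if col in PROTECTIVE else scaled[col]
        risk += weight * component
    return risk


demo = pd.DataFrame({
    "work_stress": [1, 5, 9],
    "financial_stress": [2, 5, 8],
    "sleep_quality": [9, 5, 1],
    "diet_quality": [8, 5, 2],
    "physical_activity": [7, 4, 1],
    "screen_time": [1, 4, 7],
})
risk = lifestyle_risk(demo)
print(risk)
```

Since the weights sum to 1, the score stays in [0, 1] under this scaling. A constant column would divide by zero, so real data would need a guard for that case.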
200
+
201
+ ---
202
+
203
+ ### 4.2 K-Means Clustering (k=3)
204
+
205
+ ![Clustering PCA](./13_clustering_pca.png)
206
+
207
+ We applied K-Means clustering to identify distinct lifestyle segments:
208
+
209
+ | Cluster | Count | Profile |
210
+ |---------|-------|---------|
211
+ | 0 | 129,119 (32%) | Lifestyle Profile A |
212
+ | 1 | 142,014 (36%) | Lifestyle Profile B |
213
+ | 2 | 128,867 (32%) | Lifestyle Profile C |
214
+
215
+ The cluster label becomes a categorical feature that helps the model capture non-linear relationships between lifestyle factors.
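A minimal sketch of turning K-Means segments into model features follows. It uses random stand-in data, and the one-hot encoding step is an assumption: the README does not say how the cluster label was encoded.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random stand-in for the six lifestyle features (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))

# K-Means is distance based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster = kmeans.fit_predict(X_scaled)

# One-hot encode so models do not read the arbitrary cluster ids as ordered.
cluster_onehot = OneHotEncoder().fit_transform(cluster.reshape(-1, 1)).toarray()
X_with_clusters = np.hstack([X, cluster_onehot])
print(X_with_clusters.shape)
```

In a train/test setting, fitting the scaler and K-Means on the training fold and calling `predict` on the test fold keeps test information out of the features.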
216
+
217
+ ### 4.3 PCA Components
218
+
219
+ We added two PCA components (Lifestyle_PCA_1, Lifestyle_PCA_2) that compress the six lifestyle features into orthogonal dimensions capturing the main variance patterns.
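The sketch below shows how two such components could be produced with scikit-learn. The input is random stand-in data with some correlation injected by hand, so the explained-variance figures it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the six lifestyle features (assumption, not real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
X[:, 1] += 0.8 * X[:, 0]  # inject correlation so the components are not trivial

# PCA is scale sensitive, so standardize before projecting.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=0)
components = pca.fit_transform(X_scaled)

lifestyle_pca_1 = components[:, 0]
lifestyle_pca_2 = components[:, 1]
print(pca.explained_variance_ratio_)
```

The two score columns are uncorrelated by construction, which is what "orthogonal dimensions" means here.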
220
+
221
+ ---
222
+
223
+ ## 🎯 Part 5: Improved Regression Models
224
+
225
+ ![Model Comparison R²](./14_model_comparison_r2.png)
226
+
227
+ ![Model Comparison RMSE](./15_model_comparison_rmse.png)
228
+
229
+ ### Model Comparison Results
230
+
231
+ | Model | MAE | RMSE | R² |
232
+ |-------|-----|------|-----|
233
+ | Baseline Linear Regression | 4.11 | 5.16 | 0.672 |
234
+ | Linear Regression (engineered) | 4.09 | 5.14 | 0.675 |
235
+ | Random Forest (engineered) | 2.28 | 3.61 | 0.839 |
236
+ | *Gradient Boosting (engineered)* | *2.29* | *3.49* | *0.850* |
237
+
238
+ ### Improvement Analysis
239
+
240
+ | Comparison | Improvement |
241
+ |------------|-------------|
242
+ | R² (Baseline → Gradient Boosting) | 0.672 → 0.850 (+26.5% relative, +0.178 absolute) |
243
+ | MAE Reduction | 4.11 → 2.29 (44% reduction) |
244
+ | RMSE Reduction | 5.16 → 3.49 (32% reduction) |
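The percentages in this table are relative changes against the baseline. The arithmetic can be checked directly from the reported metrics:

```python
# Metrics copied from the model comparison table.
baseline = {"MAE": 4.11, "RMSE": 5.16, "R2": 0.672}
gradient_boosting = {"MAE": 2.29, "RMSE": 3.49, "R2": 0.850}


def relative_change(old: float, new: float) -> float:
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100


changes = {m: relative_change(baseline[m], gradient_boosting[m]) for m in baseline}
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
# MAE: -44.3%
# RMSE: -32.4%
# R2: +26.5%
```

Note that the R² figure is a relative gain; in absolute terms R² rises by 0.178.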
245
+
246
+ ---
247
+
248
+ ### Feature Importance (Best Model)
249
+
250
+ ![Feature Importance](./16_best_model_feature_importance.png)
251
+
252
+ *Key Insights:*
253
+ - Stress factors (work, financial) remain the strongest predictors
254
+ - Sleep quality continues to be the top positive factor
255
+ - Engineered features and cluster labels add predictive value
256
+
257
+ ---
258
 
259
+ ## 🏆 Part 6: Regression Winner
260
 
261
+ ### Gradient Boosting Regressor
262
+
263
+ | Metric | Value |
264
+ |--------|-------|
265
+ | R² Score | 0.850 |
266
+ | MAE | 2.29 |
267
+ | RMSE | 3.49 |
268
+
269
+ *Why Gradient Boosting Won:*
270
+ - Captures non-linear relationships between lifestyle factors
271
+ - Handles feature interactions naturally
272
+ - Best balance of accuracy and generalization
273
+ - Lowest RMSE among all models
274
+
275
+ *Saved as:* winning_regressor.pkl
276
+
277
+ ---
278
+
279
+ ## 🔄 Part 7: Regression to Classification
280
+
281
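+ We converted wellbeing scores into *two classes* (binary classification) using a quantile threshold of 92.49, which places about one third of the training samples in the low-wellbeing class: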
282
+
283
+ ![Class Distribution](./17_class_distribution.png)
284
+
285
+ | Class | Wellbeing Level | Threshold | Train Count | Percentage |
286
+ |-------|-----------------|-----------|-------------|------------|
287
+ | 0 | Low Wellbeing | < 92.49 | 105,600 | 33% |
288
+ | 1 | Medium/High Wellbeing | ≥ 92.49 | 214,400 | 67% |
289
+
290
+ *Note:* The classes are imbalanced (33% vs 67%), so we focus on F1-score and recall rather than accuracy alone.
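A quantile-based split like this can be sketched as follows. The 0.33 quantile is inferred from the 33% share in the table rather than stated in the README, and the scores are randomly generated stand-ins, so the threshold printed here will not be 92.49.

```python
import numpy as np

# Random stand-in scores on a 0-100 scale (assumption, not real data).
rng = np.random.default_rng(0)
y_train = np.clip(rng.normal(88, 9, size=8_000), 0, 100)
y_test = np.clip(rng.normal(88, 9, size=2_000), 0, 100)

# Derive the threshold from the training targets only, then reuse it on test.
threshold = np.quantile(y_train, 0.33)

# 0 = Low Wellbeing (below threshold), 1 = Medium/High Wellbeing.
y_train_cls = (y_train >= threshold).astype(int)
y_test_cls = (y_test >= threshold).astype(int)

low_share = (y_train_cls == 0).mean()
print(f"threshold={threshold:.2f}, low share={low_share:.2%}")
```

Computing the threshold on the training fold alone keeps the test labels from influencing how the classes are defined.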
291
+
292
+ ---
293
+
294
+ ### Precision vs Recall Analysis
295
+
296
+ *For mental health prediction, Recall is more important:*
297
+
298
+ For the low-wellbeing class, *recall matters more than precision*: missing a person who is actually struggling (false negative) is more harmful than flagging someone as "at risk" when they are fine (false positive).
299
+
300
+ ### False Positive vs False Negative
301
+
302
+ *False Negatives are more critical:*
303
+
304
+ | Error Type | Meaning | Consequence |
305
+ |------------|---------|-------------|
306
+ | False Positive | Predict Low, actually OK | Extra attention to someone who is fine (less harmful) |
307
+ | *False Negative* | Predict OK, actually Low | *Person who needs support is not identified (more harmful)* |
308
+
309
+ A false negative means the model predicts that someone is not in the low-wellbeing group, while in reality they are. This could result in a person who needs support not being identified.
310
+
311
+ *Conclusion:* We prioritize recall for Class 0 (Low Wellbeing) to minimize missed at-risk individuals.
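Recall for Class 0 can be measured with `recall_score(..., pos_label=0)`. The sketch below also shows one possible way to trade precision for recall, by lowering the probability cutoff for predicting "Low". This cutoff adjustment is an illustration of the idea, not something the README says the notebook does, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a roughly 33/67 class split (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(3_000, 4))
score = X @ np.array([1.5, -1.5, 1.0, -0.5]) + rng.normal(size=3_000)
y = (score >= np.quantile(score, 0.33)).astype(int)  # 0 = Low, 1 = Medium/High

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = LogisticRegression().fit(X_tr, y_tr)

# predict_proba columns follow clf.classes_, which is [0, 1] here.
p_low = clf.predict_proba(X_te)[:, 0]


def recall_low(cutoff: float) -> float:
    """Recall for class 0 when predicting Low whenever P(Low) >= cutoff."""
    pred = np.where(p_low >= cutoff, 0, 1)
    return recall_score(y_te, pred, pos_label=0)


recall_default = recall_low(0.5)
recall_sensitive = recall_low(0.3)
print(f"recall@0.5={recall_default:.3f}  recall@0.3={recall_sensitive:.3f}")
```

A lower cutoff can only flag more people as Low, so Class 0 recall never decreases, at the cost of more false positives.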
312
+
313
+ ---
314
+
315
+ ## 📊 Part 8: Classification Models
316
+
317
+ ### Classification Results
318
+
319
+ | Model | Accuracy | F1 (macro) |
320
+ |-------|----------|------------|
321
+ | *Logistic Regression* | *90.55%* | *0.893* |
322
+ | Gradient Boosting | 90.47% | 0.892 |
323
+ | Random Forest | 90.39% | 0.891 |
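For reference, the two metrics in this table can be computed as below. The labels are a tiny hand-made toy example, chosen only to show how accuracy, the confusion matrix, and macro F1 relate; they are not results from the project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels (0 = Low, 1 = Medium/High); illustrative only.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores with equal weight,
# so the minority class counts as much as the majority class.
f1_per_class = f1_score(y_true, y_pred, average=None, labels=[0, 1])
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(cm, accuracy, f1_per_class, macro_f1)
```

With imbalanced classes, macro F1 drops faster than accuracy when the smaller class is predicted poorly, which is why it is reported alongside accuracy.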
324
+
325
+ ### Confusion Matrices
326
+
327
+ ![Confusion Matrix](./18_confusion_matrix.png)
328
+
329
+ *Key Observations:*
330
+ - All models achieve ~90% accuracy
331
+ - With only two classes, every error is either a false positive or a false negative for the Low Wellbeing class
332
+ - False negatives (Low Wellbeing predicted as Medium/High) are the costlier error type, since they correspond to at-risk individuals the model misses
333
+ - Logistic Regression achieves the highest F1 score despite being the simplest model
334
+
335
+ ### Confusion Matrix Analysis
336
+
337
+ The confusion matrices split the roughly 9.5% of misclassified test samples into false positives (predicted Low, actually Medium/High) and false negatives (predicted Medium/High, actually Low). Because the task is binary, these are the only two error types. From a practical perspective the false-negative cell for class 0 is the one to watch, since it counts people in the low group whom the model fails to flag.
338
+
339
+ ---
340
+
341
+ ## 🏆 Part 8.4: Classification Winner
342
+
343
+ ### Logistic Regression
344
+
345
+ | Metric | Value |
346
+ |--------|-------|
347
+ | Accuracy | 90.55% |
348
+ | Macro F1 | 0.893 |
349
+
350
+ *Why Logistic Regression Won:*
351
+ - Highest accuracy and F1 score, by a narrow margin (90.55% vs 90.47% and 90.39%)
352
+ - Simple, interpretable model
353
+ - Fast training (1.48 seconds vs 117-247 seconds for the others)
354
+ - Well-calibrated probability outputs
355
+ - Performs well, which suggests the two classes are close to linearly separable in these features
356
+
357
+ *Saved as:* winning_classifier.pkl
358
+
359
+ ---
360
+
361
+ ## 📁 Repository Files
362
+
363
+ | File | Description |
364
+ |------|-------------|
365
+ | winning_regressor.pkl | Gradient Boosting regression model (R²=0.85) |
366
+ | winning_classifier.pkl | Logistic Regression classifier (90.55% accuracy) |
367
+ | notebook.ipynb | Complete Jupyter notebook with all code |
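A minimal sketch for loading one of the saved models is shown below. It assumes the `.pkl` files were written with Python's standard `pickle` module; if they were saved with `joblib`, `joblib.load` would be the matching call. `load_model` is a hypothetical helper name, and the feature columns and their order expected by each model are not documented here, so they would need to be taken from the notebook.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Hypothetical helper: load a pickled estimator from disk.

    Only unpickle files from sources you trust, since unpickling
    can execute arbitrary code.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Example usage (assumes the file sits in the working directory):
# regressor = load_model("winning_regressor.pkl")
# predictions = regressor.predict(X_new)  # X_new must match the training features
```

Loading should be done with the same scikit-learn version used for training where possible, since pickled estimators are not guaranteed to be compatible across versions.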
368
+
369
+ ---
370
+
371
+ ## 💡 Key Takeaways
372
+
373
+ ### What Affects Mental Wellbeing Most?
374
+
375
+ *Negative Factors (Risk):*
376
+ 1. 🔴 *Work Stress* - Strongest negative impact (coefficient: -4.04)
377
+ 2. 🔴 *Financial Stress* - Significant negative impact (coefficient: -2.69)
378
+ 3. 🟡 *Screen Time* - Weak negative impact (coefficient: -1.56)
379
+
380
+ *Positive Factors (Protective):*
381
+ 1. 🟢 *Sleep Quality* - Strongest positive impact (coefficient: +4.03)
382
+ 2. 🟢 *Diet Quality* - Significant positive impact (coefficient: +2.69)
383
+ 3. 🟢 *Physical Activity* - Moderate positive impact (coefficient: +2.24)
384
+
385
+ ### Model Performance Summary
386
+
387
+ | Task | Best Model | Performance |
388
+ |------|------------|-------------|
389
+ | Regression | Gradient Boosting | R² = 0.850, RMSE = 3.49 |
390
+ | Classification | Logistic Regression | 90.55% accuracy, F1 = 0.893 |
391
+
392
+ ### Feature Engineering Impact
393
+
394
+ | Model | MAE | RMSE | R² |
395
+ |-------|-----|------|-----|
396
+ | Baseline (6 features) | 4.11 | 5.16 | 0.672 |
397
+ | Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |
398
+ | *Improvement* | *-44%* | *-32%* | *+26.5%* |
399
+
400
+ ### Lessons Learned
401
+
402
+ 1. *Stress management is crucial* - Work and financial stress are the strongest predictors of low wellbeing
403
+ 2. *Sleep quality matters most* among positive lifestyle factors
404
+ 3. *Feature engineering helps* - Weighted risk score and cluster features improved predictions
405
+ 4. *Simple models can win* - Logistic Regression beat complex models for classification
406
+ 5. *Ensemble methods excel for regression* - Gradient Boosting captured non-linear patterns
407
+ 6. *Recall matters for mental health* - Don't miss at-risk individuals (minimize false negatives)
408
+
409
+ ---
410
+
411
+ ## 👤 Author
412
+
413
+ *Odeya*
414
+
415
+ Assignment #2: Classification, Regression, Clustering, Evaluation
416
+
417
+ ---
418
 
419
+ ## 📚 References
420
 
421
+ - *Dataset:* [Synthetic Mental Health, Lifestyle & Wellbeing Dataset - Kaggle](https://www.kaggle.com/datasets/raghavgour/syntheticmental-healthlifestyle-wellbeing-dataset)
422
+ - *Tools:* scikit-learn, pandas, numpy, matplotlib, seaborn
423
+ - *Algorithms:* Linear Regression, Random Forest, Gradient Boosting, Logistic Regression, K-Means