odeyaaa/Predicting_level_of_mental_well-being_based_on_lifestyle

@@ -1,147 +1,423 @@
-# From Lifestyle Patterns to Wellbeing
-### Classification, Regression & Clustering on a Mental-Health & Lifestyle Dataset
-> **Course:** Introduction to Data Science – Assignment #2
-> **Student:** Odeya Shmuel
-> **Tools:** Python, Pandas, NumPy, Scikit-Learn, Seaborn, Matplotlib, HuggingFace
 ---
-## 🎥 Presentation Video
-> ▶️ **Watch the 4–6 min walkthrough:**
-> [![Assignment 2 Video](https://img.shields.io/badge/Video-YouTube-red)](<FILL_IN_VIDEO_LINK>)
-The video walks through this repository, the notebook, main visualizations, models, and key takeaways – with screen-share and camera on, as required in the assignment.
 ---
-## 1. Project Overview
-This project explores how **daily lifestyle patterns** (screen time, physical activity, work & financial stress, diet, sleep) are related to **mental wellbeing**.
-Using a synthetic mental-health & lifestyle dataset, I built a **full DS pipeline**:
-- **EDA** – understanding distributions, relationships, and potential data issues
-- **Feature Engineering** – composite lifestyle scores, interactions, scaling, PCA, cluster-based features
-- **Clustering** – discovering latent lifestyle “profiles”
-- **Classification** – predicting **high vs. low wellbeing**
-- **Regression** – predicting the **continuous wellbeing score**
-- **Evaluation & Interpretation** – using appropriate metrics, comparing models, and interpreting feature importance
-The main **business/research question**:
-> **“Which lifestyle factors and behavior patterns are most predictive of wellbeing, and can we meaningfully separate low-, medium-, and high-wellbeing individuals?”**
 ---
-## 2. Dataset
-- **Source:** Synthetic Mental Health, Lifestyle & Wellbeing dataset (Kaggle)
-- **Rows:** `<FILL_IN>`
-- **Columns:** `<FILL_IN>`
-### Main variable groups
-- **Wellbeing target**
-  - `wellbeing_score` – continuous target used for regression
-  - `wellbeing_label` – derived binary/3-class label (e.g. low / medium / high) used for classification
-- **Lifestyle & behavior**
-  - Screen time questions: `screen_time_1` … `screen_time_9`
-  - Physical activity: duration / intensity features (e.g. `activity_minutes`, `activity_intensity`, etc.)
-  - Work stress: `work_stress_1` … `work_stress_9`
-  - Financial stress: `financial_stress_1` … `financial_stress_9`
-  - Diet quality: `diet_quality`
-  - Sleep quality: `sleep_quality_1` … `sleep_quality_9`
-- **Demographics or context (if present)**
-  Age, gender, employment, etc. – used for additional EDA but not all were kept in the final models.
 ---
-## 3. End-to-End Pipeline
-The notebook follows the standard **data-science lifecycle**:
-1. **Problem framing** – define targets & success criteria
-2. **Data loading & inspection**
-3. **Data cleaning & EDA**
-4. **Feature engineering**
-5. **Clustering (unsupervised)**
-6. **Train/test split & modeling (classification + regression)**
-7. **Evaluation & model comparison**
-8. **Interpretation & storytelling**
-Random seeds (`random_state=<FILL_IN>`) are set to keep results reproducible.
 ---
-## 4. Data Handling & EDA (20%)
-### 4.1 Data cleaning
-Steps taken to ensure data quality:
-- **Missing values**
-  - Checked with `df.isna().sum()` and missingness heatmaps.
-  - Strategy:
-    - Numeric features → **median imputation**
-    - Categorical features → **mode imputation**
-  - Verified there is no leakage from the target during imputation (fit only on train).
-- **Duplicates**
-  - Used `df.duplicated().sum()` and dropped exact duplicates when needed.
-- **Outliers**
-  - Inspected distributions for key numeric variables with:
-    - Histograms / KDE plots
-    - Boxplots
-  - Used **IQR rule** (Q1 − 1.5·IQR, Q3 + 1.5·IQR) to identify extreme values in:
-    - `screen_time_*`
-    - activity features
-    - stress scores
-  - Instead of dropping many rows, I **capped / winsorized** extreme values for some features to reduce their influence without losing information.
-### 4.2 Exploratory Data Analysis
-Key EDA elements (with plots shown in the notebook):
-- **Univariate distributions**
-  - Histograms for wellbeing, lifestyle scores, and stress levels.
-  - Showed that wellbeing is **not perfectly normal**, with slight skew towards `<FILL_IN: e.g. lower wellbeing>`.
-- **Bivariate relationships**
-  - **Correlation matrix** of numeric features (heatmap).
-  - Scatter plots of `wellbeing_score` vs:
-    - average sleep quality
-    - activity score
-    - work & financial stress
-  - Boxplots of wellbeing per demographic groups (if present).
-- **Insights from EDA**
-  - Higher wellbeing is associated with:
-    - **better sleep & diet quality**
-    - **higher physical activity**
-    - **lower work & financial stress**
-  - Some variables are **highly correlated** (e.g. individual work-stress items), motivating **composite scores** and **dimensionality reduction (PCA)** later on.
-EDA directly guided the **feature engineering choices** and which variables to prefer in modeling.
 ---
-## 5. Feature Engineering (20%)
-The goal was to create **meaningful, low-noise features** reflecting lifestyle patterns, and to reduce multicollinearity.
-### 5.1 Composite lifestyle scores
-I aggregated question-level items into **domain scores**:
-```python
-df["screen_time_score"]     = df[[c for c in df.columns if "screen_time" in c]].mean(axis=1)
-df["work_stress_score"]     = df[[c for c in df.columns if "work_stress" in c]].mean(axis=1)
-df["financial_stress_score"]= df[[c for c in df.columns if "financial_stress" in c]].mean(axis=1)
-df["sleep_quality_score"]   = df[[c for c in df.columns if "sleep_quality" in c]].mean(axis=1)
-df["activity_score"]        = df[[c for c in df.columns if "activity" in c]].mean(axis=1)
-df["diet_quality_score"]    = df["diet_quality"]

+---
+license: mit
+tags:
+  - regression
+  - classification
+  - mental-health
+  - wellbeing
+  - gradient-boosting
+  - sklearn
+  - lifestyle
+  - clustering
+---
+# Mental Health & Wellbeing Prediction
+## 📹 Video Presentation
+[Watch the video presentation](https://1drv.ms/f/c/e67fe0aaccf6536c/IgCf0p3QgN9PR6pi1VVDzivQAZY1mD5BqUUdajvKncgdiOg)
 ---
+## 📋 Project Overview
+This project predicts *mental wellbeing scores* based on lifestyle and environmental factors. We built both *regression* models (to predict exact scores) and *classification* models (to categorize wellbeing levels as Low vs Medium/High).
+| | |
+|---|---|
+| *Dataset* | Synthetic Mental Health, Lifestyle & Wellbeing (Kaggle) |
+| *Size* | 400,000 individuals, 15 features |
+| *Target* | mental_wellbeing_score (0-100) |
+| *Train/Test* | 320,000 / 80,000 (80/20 split) |
+### Main Question
+Which lifestyle and environmental factors are most strongly associated with mental wellbeing, and how accurately can we predict wellbeing scores from these features?
+### Goals
+1. Explore relationships between lifestyle factors and mental wellbeing
+2. Build baseline regression model and improve through feature engineering
+3. Apply K-Means clustering to discover lifestyle segments
+4. Convert to binary classification and identify at-risk individuals
 ---
+## 📊 Part 1-2: Exploratory Data Analysis
+### Dataset Features
+| Feature Type | Features |
+|--------------|----------|
+| *Target* | mental_wellbeing_score (0-100) |
+| *Lifestyle* | sleep_hours, screen_time, physical_activity, diet_quality, sleep_quality |
+| *Stress* | work_stress, financial_stress |
+| *Social* | social_interactions |
+| *Environment* | air_quality_index, noise_level |
+| *Demographics* | age, gender, city_type |
+### Target Distribution
+![Target Distribution](./01_target_distribution.png)
+The mental wellbeing score ranges from 0 to 100, with scores concentrated in the 80-100 range.
 ---
+### Research Question 1: Screen Time vs Wellbeing
+![Screen Time vs Wellbeing](./05_screen_time_vs_wellbeing.png)
+*Finding:* Higher screen time is associated with slightly lower mental wellbeing. The relationship is negative but relatively weak.
+---
+### Research Question 2: Physical Activity vs Wellbeing
+![Physical Activity vs Wellbeing](./06_physical_activity_vs_wellbeing.png)
+*Finding:* Higher physical activity levels are associated with better mental wellbeing. This is one of the positive lifestyle factors.
 ---
+### Research Question 3: Work Stress vs Wellbeing
+![Work Stress vs Wellbeing](./07_work_stress_vs_wellbeing.png)
+*Finding:* Work stress has a strong negative relationship with mental wellbeing - one of the most impactful factors.
+---
+### Research Question 4: Sleep Quality vs Wellbeing
+![Sleep Quality vs Wellbeing](./08_sleep_quality_vs_wellbeing.png)
+*Finding:* Better sleep quality strongly correlates with higher mental wellbeing scores. Sleep quality is one of the top positive predictors.
 ---
+### Research Question 5: Diet Quality vs Wellbeing
+![Diet Quality vs Wellbeing](./09_diet_quality_vs_wellbeing.png)
+*Finding:* Higher diet quality is associated with better mental wellbeing outcomes.
+---
+### Correlation Analysis
+![Correlation Heatmap](./04_correlation_heatmap.png)
+*Key Correlations with Mental Wellbeing:*
+| Factor | Correlation | Direction |
+|--------|-------------|-----------|
+| Sleep Quality | Strong Positive | ↑ Better sleep = Higher wellbeing |
+| Diet Quality | Moderate Positive | ↑ Better diet = Higher wellbeing |
+| Physical Activity | Moderate Positive | ↑ More activity = Higher wellbeing |
+| Work Stress | Strong Negative | ↑ More stress = Lower wellbeing |
+| Financial Stress | Moderate Negative | ↑ More stress = Lower wellbeing |
+| Screen Time | Weak Negative | ↑ More screen time = Lower wellbeing |
+---
+### Feature Correlation with Target
+![Feature Correlation](./10_feature_correlation_target.png)
+This visualization shows how each feature correlates with mental wellbeing score. Green bars indicate positive relationships (beneficial factors), while red bars indicate negative relationships (risk factors).
 ---
+## 📈 Part 3: Baseline Model
+### Baseline Configuration
+| Setting | Value |
+|---------|-------|
+| Algorithm | Linear Regression |
+| Features | 6 lifestyle scores |
+| Preprocessing | StandardScaler |
+| Train/Test Split | 80/20 |
+### Baseline Results
+| Metric | Value |
+|--------|-------|
+| R² Score | 0.672 |
+| MAE | 4.11 |
+| RMSE | 5.16 |
+*Interpretation:* The baseline model explains 67.2% of variance in wellbeing scores with an average error of about 4 points on the 0-100 scale. This is a solid baseline.
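+
+The baseline setup above can be sketched roughly as follows. This is a minimal illustration on randomly generated stand-in data, not the notebook's code: the column names, the synthetic target, and `random_state=42` are all assumptions.
+
+```python
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+from sklearn.model_selection import train_test_split
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+
+# Assumed names for the six lifestyle scores.
+FEATURES = ["work_stress", "financial_stress", "sleep_quality",
+            "diet_quality", "physical_activity", "screen_time"]
+
+# Stand-in data so the sketch runs on its own; replace with the real dataset.
+rng = np.random.default_rng(0)
+X = pd.DataFrame(rng.uniform(0, 10, size=(1000, 6)), columns=FEATURES)
+y = (80 - 1.5 * X["work_stress"] + 1.5 * X["sleep_quality"]
+     + rng.normal(0, 3, size=1000))
+
+# 80/20 split, as in the configuration table.
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42)
+
+# The scaler is fit on the training split only, inside the pipeline.
+baseline = make_pipeline(StandardScaler(), LinearRegression())
+baseline.fit(X_train, y_train)
+pred = baseline.predict(X_test)
+
+r2 = r2_score(y_test, pred)
+mae = mean_absolute_error(y_test, pred)
+rmse = np.sqrt(mean_squared_error(y_test, pred))
+print(f"R2={r2:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
+```
+
+Wrapping the scaler and the model in one pipeline keeps the scaling statistics from leaking test-set information into training.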
+### Baseline: Actual vs Predicted
+![Baseline Actual vs Predicted](./11_baseline_actual_vs_predicted.png)
+### Baseline Feature Importance
+![Baseline Feature Importance](./12_baseline_feature_importance.png)
+*Top Features (Baseline):*
+| Rank | Feature | Coefficient | Effect |
+|------|---------|-------------|--------|
+| 1 | Work Stress | -4.04 | Strongest negative |
+| 2 | Sleep Quality | +4.03 | Strongest positive |
+| 3 | Financial Stress | -2.69 | Negative |
+| 4 | Diet Quality | +2.69 | Positive |
+| 5 | Physical Activity | +2.24 | Positive |
+| 6 | Screen Time | -1.56 | Weakest negative |
+---
+## 🔧 Part 4: Feature Engineering
+### Engineered Features
+We created additional features to capture more complex relationships:
+| Feature | Description | Rationale |
+|---------|-------------|-----------|
+| *Weighted Lifestyle Risk* | Composite score combining all risk factors | Captures overall lifestyle health |
+| *Cluster Labels* | K-Means lifestyle segments (k=3) | Non-linear pattern capture |
+| *PCA Components* | Lifestyle_PCA_1, Lifestyle_PCA_2 | Dimensionality reduction |
+### 4.1 Weighted Lifestyle Risk Score
+![Lifestyle Risk Distribution](./13_lifestyle_risk_distribution.png)
+We created a weighted lifestyle risk score based on EDA findings:
+| Factor | Weight | Direction |
+|--------|--------|-----------|
+| Work Stress | 0.30 | Higher = More Risk |
+| Financial Stress | 0.25 | Higher = More Risk |
+| Poor Sleep Quality | 0.20 | Lower quality = More Risk |
+| Poor Diet Quality | 0.15 | Lower quality = More Risk |
+| Low Physical Activity | 0.05 | Lower = More Risk |
+| Screen Time | 0.05 | Higher = More Risk |
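+*Interpretation:* The score is a weighted sum of the six factors above, with sleep quality, diet quality, and physical activity reversed so that a higher score means a riskier lifestyle profile (worse for wellbeing).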
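+
+One way such a weighted score could be computed is sketched below. The weights come from the table; everything else is an assumption for illustration, in particular the column names, the min-max scaling of each factor to [0, 1], and the `1 - scaled` inversion of the protective factors. The notebook may scale or invert differently.
+
+```python
+import pandas as pd
+
+# Weights from the table above; they sum to 1.0.
+WEIGHTS = {
+    "work_stress": 0.30,
+    "financial_stress": 0.25,
+    "sleep_quality": 0.20,
+    "diet_quality": 0.15,
+    "physical_activity": 0.05,
+    "screen_time": 0.05,
+}
+# Factors where a LOWER value means more risk, so they are inverted.
+PROTECTIVE = {"sleep_quality", "diet_quality", "physical_activity"}
+
+
+def lifestyle_risk(df: pd.DataFrame) -> pd.Series:
+    """Weighted risk in [0, 1]; assumes min-max scaling of each column."""
+    risk = pd.Series(0.0, index=df.index)
+    for col, weight in WEIGHTS.items():
+        span = df[col].max() - df[col].min()
+        scaled = (df[col] - df[col].min()) / span  # assumes span > 0
+        if col in PROTECTIVE:
+            scaled = 1.0 - scaled
+        risk += weight * scaled
+    return risk
+
+
+# Hypothetical rows: a best-case profile and a worst-case profile.
+demo = pd.DataFrame({
+    "work_stress": [1, 9], "financial_stress": [1, 9],
+    "sleep_quality": [9, 1], "diet_quality": [9, 1],
+    "physical_activity": [9, 1], "screen_time": [1, 9],
+})
+print(lifestyle_risk(demo).round(2).tolist())
+```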
+---
+### 4.2 K-Means Clustering (k=3)
+![Clustering PCA](./13_clustering_pca.png)
+We applied K-Means clustering to identify distinct lifestyle segments:
+| Cluster | Count | Profile |
+|---------|-------|---------|
+| 0 | 129,119 (32%) | Lifestyle Profile A |
+| 1 | 142,014 (36%) | Lifestyle Profile B |
+| 2 | 128,867 (32%) | Lifestyle Profile C |
+The cluster label becomes a categorical feature that helps the model capture non-linear relationships between lifestyle factors.
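+
+A minimal sketch of turning K-Means assignments into model features is shown here. The random stand-in data, the standardization step, `n_init=10`, and `random_state=42` are assumptions rather than the notebook's confirmed settings.
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.preprocessing import StandardScaler
+
+rng = np.random.default_rng(0)
+# Stand-in for the six lifestyle columns.
+X = rng.normal(size=(600, 6))
+
+# K-Means is distance based, so features are standardized first.
+X_scaled = StandardScaler().fit_transform(X)
+kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
+labels = kmeans.fit_predict(X_scaled)
+
+# One-hot encode the label so a linear model does not read it as ordinal.
+cluster_onehot = np.eye(3)[labels]
+counts = np.bincount(labels, minlength=3)
+print(counts, cluster_onehot.shape)
+```
+
+One-hot encoding matters mostly for linear models; tree-based models can split on the raw integer label.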
+### 4.3 PCA Components
+We added two PCA components (Lifestyle_PCA_1, Lifestyle_PCA_2) that compress the six lifestyle features into orthogonal dimensions capturing the main variance patterns.
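+
+The two components could be derived along these lines. This is a sketch on stand-in data; standardizing before PCA and the variable names `lifestyle_pca_1` / `lifestyle_pca_2` are assumptions.
+
+```python
+import numpy as np
+from sklearn.decomposition import PCA
+from sklearn.preprocessing import StandardScaler
+
+rng = np.random.default_rng(0)
+# Stand-in for the six lifestyle features, with some correlation added.
+X = rng.normal(size=(500, 6))
+X[:, 1] += 0.8 * X[:, 0]
+
+X_scaled = StandardScaler().fit_transform(X)
+pca = PCA(n_components=2)
+components = pca.fit_transform(X_scaled)
+
+lifestyle_pca_1 = components[:, 0]
+lifestyle_pca_2 = components[:, 1]
+
+# Fraction of total variance captured by each component.
+print(pca.explained_variance_ratio_)
+# The two components are uncorrelated by construction.
+pc_corr = np.corrcoef(lifestyle_pca_1, lifestyle_pca_2)[0, 1]
+print(pc_corr)
+```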
+---
+## 🎯 Part 5: Improved Regression Models
+![Model Comparison R²](./14_model_comparison_r2.png)
+![Model Comparison RMSE](./15_model_comparison_rmse.png)
+### Model Comparison Results
+| Model | MAE | RMSE | R² |
+|-------|-----|------|-----|
+| Baseline Linear Regression | 4.11 | 5.16 | 0.672 |
+| Linear Regression (engineered) | 4.09 | 5.14 | 0.675 |
+| Random Forest (engineered) | 2.28 | 3.61 | 0.839 |
+| *Gradient Boosting (engineered)* | *2.29* | *3.49* | *0.850* |
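+
+A comparison table like the one above can be produced with a loop of this shape. The sketch uses a small synthetic dataset with an interaction term and default hyperparameters, which are assumptions; it does not reproduce the reported numbers.
+
+```python
+import numpy as np
+from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+from sklearn.model_selection import train_test_split
+
+rng = np.random.default_rng(0)
+X = rng.uniform(0, 10, size=(800, 6))
+# Stand-in target with an interaction term, so it is not purely linear.
+y = 50 + 0.5 * X[:, 0] * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0, 2, size=800)
+
+X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
+
+models = {
+    "Linear Regression": LinearRegression(),
+    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
+    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
+}
+
+# Fit every model on the same split and collect the same three metrics.
+results = {}
+for name, model in models.items():
+    pred = model.fit(X_tr, y_tr).predict(X_te)
+    results[name] = {
+        "MAE": mean_absolute_error(y_te, pred),
+        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
+        "R2": r2_score(y_te, pred),
+    }
+
+for name, m in results.items():
+    print(f"{name:18s} MAE={m['MAE']:.2f} RMSE={m['RMSE']:.2f} R2={m['R2']:.3f}")
+```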
+### Improvement Analysis
+| Comparison | Improvement |
+|------------|-------------|
+| R² Gain (Baseline → Gradient Boosting) | 0.672 → 0.850 (+26.5% relative) |
+| MAE Reduction | 4.11 → 2.29 (44% reduction) |
+| RMSE Reduction | 5.16 → 3.49 (32% reduction) |
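+
+The percentages in this table are relative changes against the baseline. The short check below recomputes them from the reported metrics; the helper name `relative_change` is introduced here for illustration only.
+
+```python
+def relative_change(before: float, after: float) -> float:
+    """Signed relative change from `before` to `after`, in percent."""
+    return (after - before) / before * 100.0
+
+
+# Baseline Linear Regression vs Gradient Boosting (engineered).
+print(f"R2:   {relative_change(0.672, 0.850):+.1f}%")  # +26.5%
+print(f"MAE:  {relative_change(4.11, 2.29):+.1f}%")    # -44.3%
+print(f"RMSE: {relative_change(5.16, 3.49):+.1f}%")    # -32.4%
+```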
+---
+### Feature Importance (Best Model)
+![Feature Importance](./16_best_model_feature_importance.png)
+*Key Insights:*
+- Stress factors (work, financial) remain the strongest predictors
+- Sleep quality continues to be the top positive factor
+- Engineered features and cluster labels add predictive value
+---
+## 🏆 Part 6: Regression Winner
+### Gradient Boosting Regressor
+| Metric | Value |
+|--------|-------|
+| R² Score | 0.850 |
+| MAE | 2.29 |
+| RMSE | 3.49 |
+*Why Gradient Boosting Won:*
+- Captures non-linear relationships between lifestyle factors
+- Handles feature interactions naturally
+- Best balance of accuracy and generalization
+- Lowest RMSE among all models
+*Saved as:* winning_regressor.pkl
+---
+## 🔄 Part 7: Regression to Classification
+We converted wellbeing scores into *two classes* (a binary label) using a single quantile threshold:
+![Class Distribution](./17_class_distribution.png)
+| Class | Wellbeing Level | Threshold | Train Count | Percentage |
+|-------|-----------------|-----------|-------------|------------|
+| 0 | Low Wellbeing | < 92.49 | 105,600 | 33% |
+| 1 | Medium/High Wellbeing | ≥ 92.49 | 214,400 | 67% |
+*Note:* The classes are imbalanced (33% vs 67%), so we focus on F1-score and recall rather than accuracy alone.
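+
+A quantile-based split of this kind can be sketched as follows. The stand-in scores are random, and the choice of the 33rd percentile computed on the training scores is an assumption inferred from the 33% / 67% split in the table.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+# Stand-in wellbeing scores for the train and test splits.
+y_train = rng.uniform(0, 100, size=3000)
+y_test = rng.uniform(0, 100, size=750)
+
+# The threshold is computed from the training scores only (assumed 33rd percentile).
+threshold = np.quantile(y_train, 0.33)
+
+# Class 0 = Low (score below threshold), class 1 = Medium/High.
+y_train_cls = (y_train >= threshold).astype(int)
+y_test_cls = (y_test >= threshold).astype(int)
+
+low_share = np.mean(y_train_cls == 0)
+print(f"threshold={threshold:.2f}, low share={low_share:.2f}")
+```
+
+Reusing the training threshold for the test split avoids deriving the label definition from test data.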
+---
+### Precision vs Recall Analysis
+*For mental health prediction, recall on the low-wellbeing class matters more than precision.*
+Missing a person who is actually struggling (false negative) is more harmful than flagging someone as "at risk" when they are actually fine (false positive). In this analysis the low-wellbeing class (Class 0) is treated as the "positive" class, meaning the one being flagged.
+### False Positive vs False Negative
+*False Negatives are more critical:*
+| Error Type | Meaning | Consequence |
+|------------|---------|-------------|
+| False Positive | Predict Low, actually OK | Extra attention to someone who is fine (less harmful) |
+| *False Negative* | Predict OK, actually Low | *Person who needs support is not identified (more harmful)* |
+A false negative means the model predicts that someone is not in the low-wellbeing group, while in reality they are. This could result in a person who needs support not being identified.
+*Conclusion:* We prioritize recall for Class 0 (Low Wellbeing) to minimize missed at-risk individuals.
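+
+Because the class of interest is Class 0, the metric calls need to be told which label counts as "positive". The sketch below uses a small hypothetical set of labels to show one way to read Class 0 recall and the two error types with scikit-learn.
+
+```python
+import numpy as np
+from sklearn.metrics import confusion_matrix, f1_score, recall_score
+
+# Hypothetical labels: 0 = Low Wellbeing, 1 = Medium/High Wellbeing.
+y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
+y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])
+
+# Treat class 0 as the "positive" class that should not be missed.
+low_recall = recall_score(y_true, y_pred, pos_label=0)
+macro_f1 = f1_score(y_true, y_pred, average="macro")
+
+# With labels=[0, 1], rows are actual classes and columns are predictions.
+cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
+missed_low = cm[0, 1]   # actual Low predicted Medium/High (false negative)
+false_alarm = cm[1, 0]  # actual Medium/High predicted Low (false positive)
+
+print(low_recall, missed_low, false_alarm, round(macro_f1, 3))
+```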
+---
+## 📊 Part 8: Classification Models
+### Classification Results
+| Model | Accuracy | F1 (macro) |
+|-------|----------|------------|
+| *Logistic Regression* | *90.55%* | *0.893* |
+| Gradient Boosting | 90.47% | 0.892 |
+| Random Forest | 90.39% | 0.891 |
+### Confusion Matrices
+![Confusion Matrix](./18_confusion_matrix.png)
+*Key Observations:*
+- All models achieve ~90% accuracy
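+- With only two classes, every error is either a false negative (Low predicted as Medium/High) or a false positive (Medium/High predicted as Low)
+- Each model misclassifies roughly 9.5% of test samples, so the split between these two error types is what matters for identifying at-risk individuals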
+- Logistic Regression achieves the highest F1 score despite being the simplest model
+### Confusion Matrix Analysis
+Because the task is binary, the confusion matrices contain only two kinds of error. The cell to watch is actual class 0 (Low Wellbeing) predicted as class 1 (Medium/High Wellbeing): these are the at-risk individuals the model fails to flag. Accuracy and macro F1 summarize both classes together, so class 0 recall should be read from the matrix directly.
+---
+## 🏆 Part 8.4: Classification Winner
+### Logistic Regression
+| Metric | Value |
+|--------|-------|
+| Accuracy | 90.55% |
+| Macro F1 | 0.893 |
+*Why Logistic Regression Won:*
+- Highest accuracy and F1 score
+- Simple, interpretable model
+- Fast to train (1.48 seconds vs 117-247 seconds for the other models)
+- Probability outputs that tend to be well calibrated
+- Performs well on this problem, which suggests the class boundary is close to linear
+*Saved as:* winning_classifier.pkl
+---
+## 📁 Repository Files
+| File | Description |
+|------|-------------|
+| winning_regressor.pkl | Gradient Boosting regression model (R²=0.85) |
+| winning_classifier.pkl | Logistic Regression classifier (90.55% accuracy) |
+| notebook.ipynb | Complete Jupyter notebook with all code |
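+
+The `.pkl` extension suggests the models were saved with Python's `pickle` module, though that is an assumption (they could also have been written with `joblib`). The sketch below shows a save/load round trip on a tiny stand-in model with a hypothetical file name, so it runs without the real files.
+
+```python
+import pickle
+import tempfile
+from pathlib import Path
+
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+# Tiny stand-in model so the round trip can run without the real files.
+X = np.array([[0.0], [1.0], [2.0], [3.0]])
+y = np.array([0, 0, 1, 1])
+model = LogisticRegression().fit(X, y)
+
+with tempfile.TemporaryDirectory() as tmp:
+    path = Path(tmp) / "demo_classifier.pkl"  # hypothetical file name
+    with open(path, "wb") as f:
+        pickle.dump(model, f)
+    with open(path, "rb") as f:
+        restored = pickle.load(f)
+
+# The restored model should give the same predictions as the original.
+same = np.array_equal(model.predict(X), restored.predict(X))
+print(same)
+```
+
+Only unpickle files from a trusted source, and load them with the same scikit-learn version that was used to save them where possible.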
+---
+## 💡 Key Takeaways
+### What Affects Mental Wellbeing Most?
+*Negative Factors (Risk):*
+1. 🔴 *Work Stress* - Strongest negative impact (coefficient: -4.04)
+2. 🔴 *Financial Stress* - Significant negative impact (coefficient: -2.69)
+3. 🟡 *Screen Time* - Weak negative impact (coefficient: -1.56)
+*Positive Factors (Protective):*
+1. 🟢 *Sleep Quality* - Strongest positive impact (coefficient: +4.03)
+2. 🟢 *Diet Quality* - Significant positive impact (coefficient: +2.69)
+3. 🟢 *Physical Activity* - Moderate positive impact (coefficient: +2.24)
+### Model Performance Summary
+| Task | Best Model | Performance |
+|------|------------|-------------|
+| Regression | Gradient Boosting | R² = 0.850, RMSE = 3.49 |
+| Classification | Logistic Regression | 90.55% accuracy, F1 = 0.893 |
+### Feature Engineering Impact
+| Model | MAE | RMSE | R² |
+|-------|-----|------|-----|
+| Baseline (6 features) | 4.11 | 5.16 | 0.672 |
+| Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |
+| *Improvement* | *-44%* | *-32%* | *+26.5%* |
+### Lessons Learned
+1. *Stress management is crucial* - Work and financial stress are the strongest predictors of low wellbeing
+2. *Sleep quality matters most* among positive lifestyle factors
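+3. *Feature engineering helps modestly* - The engineered features raised Linear Regression R² only from 0.672 to 0.675; most of the gain came from switching to tree-based ensembles
+4. *Simple models can win* - Logistic Regression narrowly beat the more complex models for classification (90.55% vs 90.47% and 90.39% accuracy)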
+5. *Ensemble methods excel for regression* - Gradient Boosting captured non-linear patterns
+6. *Recall matters for mental health* - Don't miss at-risk individuals (minimize false negatives)
+---
+## 👤 Author
+*Odeya*
+Assignment #2: Classification, Regression, Clustering, Evaluation
+---
+## 📚 References
+- *Dataset:* [Synthetic Mental Health, Lifestyle & Wellbeing Dataset - Kaggle](https://www.kaggle.com/datasets/raghavgour/syntheticmental-healthlifestyle-wellbeing-dataset)
+- *Tools:* scikit-learn, pandas, numpy, matplotlib, seaborn
+- *Algorithms:* Linear Regression, Random Forest, Gradient Boosting, Logistic Regression, K-Means