Youth Smoking & Drug Use – Regression & Classification Project
Video Link:
Goal: Analyze factors related to youth smoking and drug use and build models that
- Predict smoking prevalence (regression)
- Classify individuals into low vs high smoking groups (classification).
1. Dataset
Dataset Link: https://www.kaggle.com/datasets/waqi786/youth-smoking-and-drug-dataset
- Name: Youth Smoking & Drug Use Dataset
- Size: 10,000 rows
- Type: Tabular - numeric + categorical
- Target (regression):
Smoking_Prevalence – a continuous score indicating the level of smoking.
2. Exploratory Data Analysis (EDA)
Exploratory analysis was performed to understand the distribution of key variables and explore relationships relevant to smoking and drug use.
2.1 Distributions:
- Smoking_Prevalence is roughly centered around ~27–28 with moderate spread.
- Many features, such as Peer_Influence, Media_Influence, and Mental_Health, are on a 1–10 scale.
- No strong class imbalance issues were found once the target was binarized.
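The distribution checks above can be reproduced with basic pandas summaries. This is a minimal sketch on a toy frame; the column names follow the report but the values are made up:

```python
import pandas as pd

# Toy stand-in for the Kaggle file; column names mirror the report
# (Smoking_Prevalence, Peer_Influence, ...) and the values are invented.
df = pd.DataFrame({
    "Smoking_Prevalence": [27.5, 28.1, 26.9, 27.8, 28.3],
    "Peer_Influence": [3, 7, 5, 8, 2],    # 1-10 scale
    "Mental_Health": [6, 4, 7, 5, 8],     # 1-10 scale
})

# Summary statistics used to eyeball center and spread of the target
summary = df["Smoking_Prevalence"].describe()
print(summary[["mean", "std", "min", "max"]])
```

On the real dataset, the same `describe()` call is what shows the ~27–28 center noted above.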
Relationships:
Key exploratory plots included:
Smoking Prevalence vs. Peer Influence
Line/point plots of average smoking across peer-influence levels showed a weak but
mostly increasing trend: higher peer influence is loosely associated with higher smoking levels.
Drug Experimentation vs. Family Background
A boxplot suggested that higher family risk (a worse family background) tends to be associated with slightly higher levels of drug experimentation, though the relationship is noisy.

Peer Influence by Age Group
A boxplot showed how peer-influence scores are distributed across different age groups. Although the median peer-influence levels remain fairly consistent across ages, younger groups (10–24) displayed slightly wider variability. Overall, peer influence does not show strong age-related trends and remains relatively stable across the population.
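The numbers behind such a boxplot are just per-group medians and ranges; a small sketch with hypothetical age-group labels (the real dataset's bins may differ):

```python
import pandas as pd

# Hypothetical age groups and scores; illustrative only.
df = pd.DataFrame({
    "Age_Group": ["10-14", "10-14", "15-19", "15-19", "20-24", "25+"],
    "Peer_Influence": [2, 9, 4, 6, 5, 5],
})

# Median and spread of peer influence per age group
stats = df.groupby("Age_Group")["Peer_Influence"].agg(["median", "min", "max"])
print(stats)
```

A wider `min`–`max` range for the younger groups is what "slightly wider variability" refers to.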
Research Questions:
- Does peer influence affect smoking rates?
- Does peer influence affect drug use?
- Does family background affect drug experimentation?
- Does smoking prevalence change as mental-health scores increase?
- Are younger age groups more influenced by peers than older groups?
3. Baseline Regression Model:
The initial regression model aimed to predict smoking prevalence from demographic, behavioral, and social factors.
Conclusion from the feature-importance analysis:
The baseline linear regression model performed poorly. The near-zero explanatory power suggests that a simple linear relationship does not adequately capture the complexity of smoking behavior. This motivated more advanced modeling approaches and feature engineering.
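The near-zero explanatory power can be illustrated on synthetic data. This sketch fits a plain LinearRegression on features that carry almost no signal, mimicking (by assumption) the weak relationships in the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in: five features unrelated to the target,
# with the target just noise around the observed mean (~27.5).
X = rng.normal(size=(1000, 5))
y = 27.5 + rng.normal(scale=12, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
score = r2_score(y_te, model.predict(X_te))
print("R^2 on test set:", score)
# With no real signal, the test-set R^2 lands near zero.
```

This is the pattern the report describes: the model effectively predicts a constant near the target mean.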
4. Feature Engineering
To improve model performance, several engineered features were created:
Scaling
- Numeric features were standardized for models sensitive to scale (e.g., Linear Regression, Logistic Regression).
Polynomial Feature
Peer_Influence_Sq = Peer_Influence^2
Captures potential non-linear effects of peer influence on smoking.
Combined Feature
Family_Community_Support – a sum/combination of the family and community support variables.
Clustering Feature
- A clustering algorithm (e.g., K-Means) was applied to the feature space to produce Cluster_ID. This feature encodes “behavioral groups” that may differ in smoking/drug patterns.
These engineered features were reused for both the regression and classification tasks.
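The three feature-engineering steps can be sketched on a toy frame. Column names mirror the report but are assumptions here; the values are invented:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "Peer_Influence": [2, 5, 8, 3, 9, 1],
    "Family_Support": [4, 7, 2, 8, 3, 6],
    "Community_Support": [5, 6, 3, 7, 4, 5],
})

# Polynomial feature: squared peer influence for non-linear effects
df["Peer_Influence_Sq"] = df["Peer_Influence"] ** 2

# Combined feature: sum of family and community support
df["Family_Community_Support"] = df["Family_Support"] + df["Community_Support"]

# Clustering feature: K-Means labels as a "behavioral group" id
scaled = StandardScaler().fit_transform(df)
df["Cluster_ID"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

print(df[["Peer_Influence_Sq", "Family_Community_Support", "Cluster_ID"]])
```

Scaling before K-Means matters because the squared term would otherwise dominate the distance computation.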
5. Regression: Predicting Smoking Prevalence
Model Number 1 - Baseline vs. Improved Model:
The improved model revealed:
- Strong positive effect of Peer_Influence_Sq, indicating a nonlinear relationship
- Positive contributions from Mental_Health, Media_Influence, and Community_Support
- Negative effect from Family_Background, suggesting stronger families reduce smoking
- Minor contributions from Social_Risk and Cluster_ID
This model uncovered patterns the baseline model failed to capture.
Model Number 2 - Random Forest:
- Drug_Experimentation emerged as the strongest predictor
- Followed by Social_Risk, Mental_Health, and Family_Community_Support
- Peer influence (linear or squared) was less impactful
- Cluster membership contributed minimally
Random Forest captured complex nonlinear interactions that differ from the linear model.
Model Number 3 - Gradient Boosting:
- Again, Drug_Experimentation dominated as the key predictor
- Social_Risk and Family_Community_Support also showed significant influence
- Other features played only small roles
- The model produced a slightly negative R², indicating overall weak predictive power
5.2 Evaluation Metrics
Common regression metrics were used:
- MAE – Mean Absolute Error
- MSE – Mean Squared Error
- RMSE – Root Mean Squared Error
- R² – Coefficient of determination
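All four metrics are available in scikit-learn; a minimal check on toy values (not the project's data) makes their relationship explicit:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

mae = mean_absolute_error(y_true, y_pred)   # mean |error| = 1/3
mse = mean_squared_error(y_true, y_pred)    # mean error^2 = 1/3
rmse = np.sqrt(mse)                         # sqrt(1/3)
r2 = r2_score(y_true, y_pred)               # 1 - SS_res/SS_tot = 1 - 1/2 = 0.5
print(mae, mse, rmse, r2)
```

R² near zero, as in the table below, means the model's errors are about as large as simply predicting the mean.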
| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Baseline Linear Regression | ~11.1 | ~167 | ~12.9 | ≈ 0 or slightly negative |
| Improved Linear Regression | ~11.1 | ~167 | ~12.9 | ≈ 0 |
| Random Forest Regressor | ~11.3 | ~176 | ~13.3 | ≈ 0 (slightly negative) |
| Gradient Boosting Regr. | ~11.1 | ~168 | ~12.98 | ≈ 0 (slightly negative) |
All models reached similar error levels, and the R² values near zero indicate that the dataset contains weak predictive signals for smoking prevalence.
The baseline linear regression model fails to capture meaningful patterns in the data, predicting nearly constant values regardless of the actual smoking prevalence.
5.3 Coefficients & Feature Importance
In the improved Linear Regression model, the most influential coefficients were:
- Positive: Peer_Influence_Sq, Mental_Health, Media_Influence, Community_Support
- Negative: Family_Background (suggesting that a stronger family background tends to reduce smoking)
In Random Forest and Gradient Boosting, Drug_Experimentation was the most important predictor, followed by Social_Risk and Family_Community_Support.
5.4 Winning Regression Model
Despite overall limited predictive strength, Gradient Boosting consistently achieved the lowest error metrics among the tested regression models. Therefore, Gradient Boosting is selected as the winning regression model.
6. Classification: Predicting Low vs. High Smoking Group
After completing the regression task, the problem was reformulated as a binary classification task.
The continuous target Smoking_Prevalence was converted into two classes:
- 0 = Low Smoking (≤ median)
- 1 = High Smoking (> median)
This produced a balanced dataset appropriate for training classification models.
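The median split described above is a one-liner; a sketch on toy prevalence values (the real column is Smoking_Prevalence):

```python
import pandas as pd

# Toy prevalence values; illustrative only.
prevalence = pd.Series([10.0, 20.0, 30.0, 40.0])

median = prevalence.median()                       # 25.0 here
smoking_class = (prevalence > median).astype(int)  # 0 = low (<= median), 1 = high
print(smoking_class.tolist())                      # [0, 0, 1, 1]
```

Splitting at the median guarantees the roughly 50/50 class balance reported in the next subsection.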
6.1 Checking Class Balance
The distribution of the new target variable was examined:
- Low Smoking: ~50%
- High Smoking: ~50%
Since the classes were balanced, accuracy remained a meaningful metric.
However, precision, recall, and F1-score were also evaluated to gain deeper insight.
6.2 Models Trained
Three classification models were trained using the engineered features:
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
All models were trained using the same train/test split to ensure fair comparison.
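A shared train/test split is what makes the comparison fair; a sketch on synthetic balanced data standing in for the engineered feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real engineered features
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# One split, reused by all three models, so they see identical data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Fixing `random_state` on both the split and the models also makes the comparison reproducible.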
6.3 Classification Results
Model 1 – Logistic Regression
- Struggles to capture nonlinear relationships
- Misclassifies many high-smoking individuals
- Overall moderate performance
Model 2 – Random Forest Classifier
- Best performance among the three models
- Captures nonlinear patterns effectively
- More accurate at identifying high-smoking individuals
- Higher precision and recall
Model 3 – Gradient Boosting Classifier
- Better than Logistic Regression
- Close to Random Forest performance but less consistent
- Slightly lower recall for high-smoking individuals
6.4 Classification Metrics Summary
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | ~0.50–0.60 | Moderate | Lower | Moderate |
| Random Forest Classifier | Highest | Highest | Highest | Best Overall |
| Gradient Boosting Class. | Close to RF | Slightly lower | Slightly lower | Strong |
The confusion matrices clearly show that Random Forest makes the fewest classification errors.
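Reading those confusion matrices is easiest with a toy example; here, row index is the actual class and column index the predicted class:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = low smoking, 1 = high smoking
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# cm[1, 0] counts high-smoking individuals the model missed -
# the cell where Random Forest outperformed the other two models.
```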
6.5 Winning Classification Model
The Random Forest Classifier was selected as the winning model because it:
- Achieved the highest accuracy and F1-score
- Performed best at identifying high-smoking individuals
- Captured nonlinear interactions effectively
- Outperformed both Logistic Regression and Gradient Boosting on most evaluation metrics
7. Model Deployment (Hugging Face Repository)
Both winning models were uploaded to the Hugging Face repository:
- Winning Regression Model: winning_model.pkl (Gradient Boosting)
- Winning Classification Model: winning_classifier.pkl (Random Forest)
8. Limitations & Future Work
Limitations
- The dataset appears synthetic and may lack strong predictive signals, which limits the performance of both regression and classification models.
- Regression performance was generally low (R² ≈ 0), indicating that the features in the dataset do not strongly explain smoking prevalence.
- Important behavioral, social, or environmental variables that influence smoking may be missing from the dataset.
- Classification performance was stronger, but still constrained by the limited variability represented in the available features.
Future Improvements
- Incorporate richer psychological, environmental, or temporal factors to improve predictive capacity.
- Experiment with alternative binning strategies for the classification target (e.g., quartiles instead of median-split).
- Test additional model families such as XGBoost, LightGBM, or neural networks for both regression and classification tasks.
- Collect or generate more realistic real-world data with higher signal-to-noise ratio.
- Explore feature selection or dimensionality reduction techniques to identify stronger predictive patterns.