
Youth Smoking & Drug Use – Regression & Classification Project

Video Link:

https://youtu.be/TROrsYcZHXw

Goal: Analyze factors related to youth smoking and drug use, and build models that:

  1. Predict smoking prevalence (regression).
  2. Classify individuals into low vs. high smoking groups (classification).

1. Dataset

Dataset Link: https://www.kaggle.com/datasets/waqi786/youth-smoking-and-drug-dataset

  • Name: Youth Smoking & Drug Use Dataset
  • Size: 10,000 rows
  • Type: Tabular - numeric + categorical
  • Target (regression): Smoking_Prevalence – continuous score indicating level of smoking.

2. Exploratory Data Analysis (EDA)

Exploratory analysis was performed to understand the distribution of key variables and explore relationships relevant to smoking and drug use.

2.1 Distributions:

  • Smoking_Prevalence is centered around 27–28 with moderate spread.
  • Many features such as Peer_Influence, Media_Influence, Mental_Health are on a 1–10 scale.
  • No strong class imbalance issues were found once the target was binarized.

[Figure: Smoking Prevalence Distribution]

Relationships:

Key exploratory plots included:

Smoking Prevalence vs. Peer Influence
Line/point plots of average smoking across peer-influence levels showed a weak but
mostly increasing trend: higher peer influence is loosely associated with higher smoking levels.

[Figure: Average smoking by peer influence]
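The per-level averaging behind this plot can be sketched as follows. This is a minimal illustration on synthetic data, not the project's notebook code: the column names come from this README, while the random values and the use of a Pearson correlation as a summary are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (assumption: NOT the real Kaggle data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Peer_Influence": rng.integers(1, 11, size=1000),      # 1-10 scale
    "Smoking_Prevalence": rng.uniform(5, 50, size=1000),   # continuous score
})

# Mean smoking prevalence at each peer-influence level (the quantity plotted).
avg_by_peer = df.groupby("Peer_Influence")["Smoking_Prevalence"].mean()

# A single-number summary of how strong the linear trend is.
corr = df["Peer_Influence"].corr(df["Smoking_Prevalence"])

print(avg_by_peer.round(2))
print(f"Pearson r = {corr:.3f}")
```

On the real data, a correlation close to zero would be consistent with the "weak but mostly increasing" trend described above.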

Drug Experimentation vs. Family Background
A boxplot suggested that higher family risk (worse family background) tends to be associated
with slightly higher levels of drug experimentation, but the relationship is noisy.

[Figure: Average drug experimentation by peer influence]

Peer Influence by Age Group
A boxplot showed how peer-influence scores are distributed across different age groups. Although the median peer-influence levels remain fairly consistent across ages, younger groups (10–24) displayed slightly wider variability. Overall, peer influence does not show strong age-related trends but remains relatively stable across the population.

[Figure: Peer influence by age group]


Research Questions:

  1. Does peer influence affect smoking rates?
  2. Does peer influence affect drug use?
  3. Does family background affect drug experimentation?
  4. Does smoking prevalence change as mental-health scores increase?
  5. Are younger age groups more influenced by peers than older groups?

3. Baseline Regression Model:

The initial regression model aimed to predict smoking prevalence from demographic, behavioral, and social factors.

[Figure: Actual vs. predicted values]

Feature importance (linear regression coefficients):

[Figure: Linear regression coefficients]

The baseline linear regression model performed poorly. The near-zero explanatory power suggests that a simple linear relationship does not adequately capture the complexity of smoking behavior. This motivated more advanced modeling approaches and feature engineering.
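A baseline of this kind can be sketched as below. It is a hedged illustration only: the features and target are synthetic and independent by construction (an assumption chosen to mimic a dataset with little signal), and the split ratio and random seeds are not taken from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): four 1-10 features and a target unrelated to them.
rng = np.random.default_rng(42)
X = rng.integers(1, 11, size=(2000, 4)).astype(float)
y = rng.uniform(5, 50, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
y_pred = baseline.predict(X_test)
r2 = r2_score(y_test, y_pred)

# With no signal, predictions collapse toward the training mean.
print(f"R^2 on test set: {r2:.4f}")
print(f"Std of predictions: {y_pred.std():.2f} vs. std of targets: {y_test.std():.2f}")
```

When the features carry little information about the target, the fitted coefficients are close to zero and the model predicts almost the same value for every row, which is what a near-zero R² reflects.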

4. Feature Engineering

To improve model performance, several engineered features were created:

  1. Scaling

    • Numeric features were standardized for models sensitive to scale (e.g., Linear Regression, Logistic Regression).
  2. Polynomial Feature

    • Peer_Influence_Sq = Peer_Influence^2
      Captures potential non-linear effects of peer influence on smoking.
  3. Combined Feature

    • Family_Community_Support – sum/combination of family and community support variables.
  4. Clustering Feature

    • A clustering algorithm (e.g., K-Means) was applied on the feature space to produce Cluster_ID.
      This feature encodes “behavioral groups” that may differ in smoking/drug patterns.

These engineered features were re-used both for regression and classification tasks.
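The four steps above could look roughly like this in scikit-learn. This is a sketch under stated assumptions: the data is synthetic, Family_Community_Support is taken to be a plain sum of Family_Background and Community_Support, and the number of clusters (3) is hypothetical, since the README does not specify either detail.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic 1-10 scores (assumption: stand-in for the real columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 11, size=(500, 3)),
    columns=["Peer_Influence", "Family_Background", "Community_Support"],
)

# Polynomial feature: squared peer influence.
df["Peer_Influence_Sq"] = df["Peer_Influence"] ** 2

# Combined feature (assumption: a simple sum of the two support variables).
df["Family_Community_Support"] = df["Family_Background"] + df["Community_Support"]

# Scaling: zero mean and unit variance per column.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Clustering feature: K-Means labels on the scaled feature space.
# n_clusters=3 is a hypothetical choice for illustration.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Cluster_ID"] = kmeans.fit_predict(X_scaled)

print(df.head())
```

In a train/test workflow, the scaler and the clustering model would normally be fitted on the training split only and then applied to the test split, so that no test information leaks into the features.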

[Figure: Clusters]


5. Regression: Predicting Smoking Prevalence

Model Number 1 - Baseline vs. Improved Model:

The improved model revealed:

  • Strong positive effect of Peer_Influence_Sq, indicating a nonlinear relationship
  • Positive contributions from Mental_Health, Media_Influence, and Community_Support
  • Negative effect from Family_Background, suggesting stronger families reduce smoking
  • Minor contributions from Social_Risk and Cluster_ID

This model uncovered patterns the baseline model failed to capture.

[Figure: Improved linear model]

Model Number 2 - Random Forest:

  • Drug_Experimentation emerged as the strongest predictor
  • Followed by Social_Risk, Mental_Health, and Family_Community_Support
  • Peer influence (linear or squared) was less impactful
  • Cluster membership contributed minimally

Random Forest captured complex nonlinear interactions that differ from the linear model.

[Figure: Random Forest]

Model Number 3 - Gradient Boosting:

  • Again, Drug_Experimentation dominated as the key predictor
  • Social_Risk and Family_Community_Support also showed significant influence
  • Other features played only small roles
  • The model produced a slightly negative R², indicating overall weak predictive power

[Figure: Gradient Boosting]

5.2 Evaluation Metrics

Common regression metrics were used:

  • MAE – Mean Absolute Error
  • MSE – Mean Squared Error
  • RMSE – Root Mean Squared Error
  • R² – Coefficient of determination

| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Baseline Linear Regression | ~11.1 | ~167 | ~12.9 | ≈ 0 or slightly negative |
| Improved Linear Regression | ~11.1 | ~167 | ~12.9 | ≈ 0 |
| Random Forest Regressor | ~11.3 | ~176 | ~13.3 | ≈ 0 (slightly negative) |
| Gradient Boosting Regressor | ~11.1 | ~168 | ~12.98 | ≈ 0 (slightly negative) |

All models reached similar error levels, and the R² values near zero indicate that the dataset contains weak predictive signals for smoking prevalence.

The baseline linear regression model fails to capture meaningful patterns in the data, predicting nearly constant values regardless of the actual smoking prevalence.
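For reference, the four metrics can be computed with scikit-learn as in the small sketch below. The numbers are made up for illustration and are not project results. The last line shows why R² ≈ 0 is equivalent to "no better than always predicting the mean".

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (assumption: illustrative only, not taken from the project).
y_true = np.array([20.0, 30.0, 40.0, 25.0])
y_pred = np.array([27.0, 28.0, 29.0, 28.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

# A constant prediction equal to the mean of y_true gives R^2 = 0 by definition.
r2_const = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}  R2(mean)={r2_const:.3f}")
```

A slightly negative R² on a test set, as reported for some models above, means the model did marginally worse than that constant-mean predictor.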

5.3 Coefficients & Feature Importance

  • In the improved Linear Regression model, the most influential coefficients were:

    • Positive: Peer_Influence_Sq, Mental_Health, Media_Influence, Community_Support
    • Negative: Family_Background (suggesting that a stronger family background tends to reduce smoking)
  • In Random Forest and Gradient Boosting,
    Drug_Experimentation was the most important predictor, followed by Social_Risk and Family_Community_Support.
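Coefficients and tree-based importances are read from different attributes, which is one reason the two rankings need not agree. The sketch below uses synthetic data with a deliberately injected signal so that a ranking is visible. That signal is an assumption for illustration and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic features named after columns in this README (values are made up).
rng = np.random.default_rng(1)
features = ["Drug_Experimentation", "Social_Risk", "Peer_Influence_Sq"]
X = pd.DataFrame(rng.normal(size=(800, 3)), columns=features)

# Injected signal (assumption): the target depends mostly on the first column.
y = 3.0 * X["Drug_Experimentation"] + 0.5 * X["Social_Risk"] + rng.normal(scale=0.5, size=800)

# Linear model: signed coefficients, ranked by absolute size.
linear = LinearRegression().fit(X, y)
coefs = pd.Series(linear.coef_, index=features).sort_values(key=abs, ascending=False)

# Random forest: non-negative impurity-based importances that sum to 1.
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=features).sort_values(ascending=False)

print(coefs.round(3))
print(importances.round(3))
```

Coefficients carry a sign and depend on feature scale, while impurity-based importances only indicate how much a feature was used for splitting. When overall R² is near zero, either ranking should be interpreted cautiously.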

5.4 Winning Regression Model

Despite overall limited predictive strength, Gradient Boosting was selected as the winning regression model. Its error metrics were among the lowest of the tested models, although the rounded figures in Section 5.2 show that the differences between models are very small.


6. Classification: Predicting Low vs. High Smoking Group

After completing the regression task, the problem was reformulated as a binary classification task.
The continuous target Smoking_Prevalence was converted into two classes:

  • 0 = Low Smoking (≤ median)
  • 1 = High Smoking (> median)

This produced a balanced dataset appropriate for training classification models.


6.1 Checking Class Balance

The distribution of the new target variable was examined:

  • Low Smoking: ~50%
  • High Smoking: ~50%

Since the classes were balanced, accuracy remained a meaningful metric.
However, precision, recall, and F1-score were also evaluated to gain deeper insight.
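The median split and the balance check can be sketched as follows, using a synthetic Smoking_Prevalence column (an assumption, not the real data).

```python
import numpy as np
import pandas as pd

# Synthetic continuous target (assumption: stand-in for Smoking_Prevalence).
rng = np.random.default_rng(0)
smoking = pd.Series(rng.uniform(5, 50, size=1000), name="Smoking_Prevalence")

# 0 = Low Smoking (<= median), 1 = High Smoking (> median).
median = smoking.median()
high_smoking = (smoking > median).astype(int)

# Share of each class; a median split yields roughly 50/50 by construction.
balance = high_smoking.value_counts(normalize=True).sort_index()
print(balance)
```

If many rows share exactly the median value, the split can drift away from 50/50 because all ties fall into the low group, so checking the resulting proportions remains worthwhile.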


6.2 Models Trained

Three classification models were trained using the engineered features:

  1. Logistic Regression
  2. Random Forest Classifier
  3. Gradient Boosting Classifier

All models were trained using the same train/test split to ensure fair comparison.
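A shared-split comparison of the three model families might look like the sketch below. The data, the hyperparameters, and the 75/25 stratified split are assumptions for illustration. The README does not state the settings actually used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with a nonlinear (interaction) signal, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# One split, reused for every model, so the scores are directly comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```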


6.3 Classification Results

Model 1 – Logistic Regression

  • Struggles to capture nonlinear relationships
  • Misclassifies many high-smoking individuals
  • Overall moderate performance

Model 2 – Random Forest Classifier

  • Best performance among the three models
  • Captures nonlinear patterns effectively
  • More accurate at identifying high-smoking individuals
  • Higher precision and recall

Model 3 – Gradient Boosting Classifier

  • Better than Logistic Regression
  • Close to Random Forest performance but less consistent
  • Slightly lower recall for high-smoking individuals

6.4 Classification Metrics Summary

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | ~0.50–0.60 | Moderate | Lower | Moderate |
| Random Forest Classifier | Highest | Highest | Highest | Best Overall |
| Gradient Boosting Classifier | Close to RF | Slightly lower | Slightly lower | Strong |

The confusion matrices clearly show that Random Forest makes the fewest classification errors.
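As a reminder of how a confusion matrix maps to precision, recall, and F1 for the high-smoking class, here is a tiny worked example. The labels are made up and do not reproduce any of the project's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (assumption): 1 = High Smoking, 0 = Low Smoking.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

The off-diagonal cells (fp and fn) are the classification errors, so "fewest errors" corresponds to the smallest off-diagonal total.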


6.5 Winning Classification Model

The Random Forest Classifier was selected as the winning model because it:

  • Achieved the highest accuracy and F1-score
  • Performed best at identifying high-smoking individuals
  • Captured nonlinear interactions effectively
  • Outperformed both Logistic Regression and Gradient Boosting on most evaluation metrics

[Figure: Random Forest heatmap]


7. Model Deployment (Hugging Face Repository)

Both winning models were uploaded to the Hugging Face repository:

  • Winning Regression Model: winning_model.pkl (Gradient Boosting)
  • Winning Classification Model: winning_classifier.pkl (Random Forest)
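A save-and-reload round trip for a .pkl model file can be sketched as below. The file name winning_model.pkl comes from this README. The use of Python's standard pickle module (rather than, say, joblib), the temporary directory, and the synthetic training data are assumptions, since the README does not say how the files were written.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Train a small stand-in model on synthetic data (assumption: illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "winning_model.pkl"

    # Serialize the fitted estimator to disk.
    with path.open("wb") as f:
        pickle.dump(model, f)

    # Load it back and confirm the predictions are unchanged.
    with path.open("rb") as f:
        restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same)
```

Pickle files can execute arbitrary code when loaded, so they should only be opened from trusted sources, and ideally with the same scikit-learn version that produced them.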

8. Limitations & Future Work

Limitations

  • The dataset appears synthetic and may lack strong predictive signals, which limits the performance of both regression and classification models.
  • Regression performance was generally low (R² ≈ 0), indicating that the features in the dataset do not strongly explain smoking prevalence.
  • Important behavioral, social, or environmental variables that influence smoking may be missing from the dataset.
  • Classification performance was stronger, but still constrained by the limited variability represented in the available features.

Future Improvements

  • Incorporate richer psychological, environmental, or temporal factors to improve predictive capacity.
  • Experiment with alternative binning strategies for the classification target (e.g., quartiles instead of median-split).
  • Test additional model families such as XGBoost, LightGBM, or neural networks for both regression and classification tasks.
  • Collect real-world data, or generate more realistic data, with a higher signal-to-noise ratio.
  • Explore feature selection or dimensionality reduction techniques to identify stronger predictive patterns.
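The quartile-based binning mentioned in the list above could be prototyped with pandas as follows. This is a sketch on synthetic values, and the labels Q1 to Q4 are hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic continuous target (assumption: stand-in for Smoking_Prevalence).
rng = np.random.default_rng(0)
smoking = pd.Series(rng.uniform(5, 50, size=1000), name="Smoking_Prevalence")

# Four equal-frequency bins instead of a single median split.
quartile = pd.qcut(smoking, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
counts = quartile.value_counts().sort_index()

print(counts)
```

A four-class target turns the task into multi-class classification, so metrics such as macro-averaged F1 would replace the binary precision and recall used above.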
