Youth Smoking & Drug Use – Regression & Classification Project
Video Link:
Goal: Analyze factors related to youth smoking and drug use and build models that
- Predict smoking prevalence (regression)
- Classify individuals into low vs high smoking groups (classification).
1. Dataset
Dataset Link: https://www.kaggle.com/datasets/waqi786/youth-smoking-and-drug-dataset
- Name: Youth Smoking & Drug Use Dataset
- Size: 10,000 rows
- Type: Tabular - numeric + categorical
- Target (regression):
Smoking_Prevalence – a continuous score indicating the level of smoking.
2. Exploratory Data Analysis (EDA)
Exploratory analysis was performed to understand the distribution of key variables and explore relationships relevant to smoking and drug use.
2.1 Distributions:
- Smoking_Prevalence is roughly centered around ~27–28 with moderate spread.
- Many features, such as Peer_Influence, Media_Influence, and Mental_Health, are on a 1–10 scale.
- No strong class imbalance issues were found once the target was binarized.
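The distribution checks above can be reproduced with basic pandas summaries. This is a minimal sketch on a toy frame; the column names follow the report but the values are made up:

```python
import pandas as pd

# Toy stand-in for the Kaggle file; column names mirror the report
# (Smoking_Prevalence, Peer_Influence, ...) and the values are invented.
df = pd.DataFrame({
    "Smoking_Prevalence": [27.5, 28.1, 26.9, 27.8, 28.3],
    "Peer_Influence": [3, 7, 5, 8, 2],    # 1-10 scale
    "Mental_Health": [6, 4, 7, 5, 8],     # 1-10 scale
})

# Summary statistics used to eyeball center and spread of the target
summary = df["Smoking_Prevalence"].describe()
print(summary[["mean", "std", "min", "max"]])
```

On the real dataset, the same `describe()` call is what shows the ~27–28 center noted above.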
Relationships:
Key exploratory plots included:
Smoking Prevalence vs. Peer Influence
Line/point plots of average smoking across peer-influence levels showed a weak but
mostly increasing trend: higher peer influence is loosely associated with higher smoking levels.
Drug Experimentation vs. Family Background
A boxplot suggested that higher family risk (a worse family background) tends to be associated with slightly higher levels of drug experimentation, though the relationship is noisy.

Peer Influence by Age Group
A boxplot showed how peer-influence scores are distributed across different age groups. Although the median peer-influence levels remain fairly consistent across ages, younger groups (10–24) displayed slightly wider variability. Overall, peer influence does not show strong age-related trends and remains relatively stable across the population.
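The numbers behind such a boxplot are just per-group medians and ranges; a small sketch with hypothetical age-group labels (the real dataset's bins may differ):

```python
import pandas as pd

# Hypothetical age groups and scores; illustrative only.
df = pd.DataFrame({
    "Age_Group": ["10-14", "10-14", "15-19", "15-19", "20-24", "25+"],
    "Peer_Influence": [2, 9, 4, 6, 5, 5],
})

# Median and spread of peer influence per age group
stats = df.groupby("Age_Group")["Peer_Influence"].agg(["median", "min", "max"])
print(stats)
```

A wider `min`–`max` range for the younger groups is what "slightly wider variability" refers to.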
Research Questions:
- Does peer influence affect smoking rates?
- Does peer influence affect drug use?
- Does family background affect drug experimentation?
- Does smoking prevalence change as mental-health scores increase?
- Are younger age groups more influenced by peers than older groups?
3. Baseline Regression Model:
The initial regression model aimed to predict smoking prevalence from demographic, behavioral, and social factors.
Conclusion from the feature-importance analysis:
The baseline linear regression model performed poorly. The near-zero explanatory power suggests that a simple linear relationship does not adequately capture the complexity of smoking behavior. This motivated more advanced modeling approaches and feature engineering.
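The near-zero explanatory power can be illustrated on synthetic data. This sketch fits a plain LinearRegression on features that carry almost no signal, mimicking (by assumption) the weak relationships in the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in: five features unrelated to the target,
# with the target just noise around the observed mean (~27.5).
X = rng.normal(size=(1000, 5))
y = 27.5 + rng.normal(scale=12, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
score = r2_score(y_te, model.predict(X_te))
print("R^2 on test set:", score)
# With no real signal, the test-set R^2 lands near zero.
```

This is the pattern the report describes: the model effectively predicts a constant near the target mean.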
4. Feature Engineering
To improve model performance, several engineered features were created:
Scaling
- Numeric features were standardized for models sensitive to scale (e.g., Linear Regression, Logistic Regression).
Polynomial Feature
Peer_Influence_Sq = Peer_Influence^2
Captures potential non-linear effects of peer influence on smoking.
Combined Feature
Family_Community_Support – a sum/combination of the family and community support variables.
Clustering Feature
- A clustering algorithm (e.g., K-Means) was applied to the feature space to produce Cluster_ID. This feature encodes “behavioral groups” that may differ in smoking/drug patterns.
These engineered features were reused for both the regression and classification tasks.
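The three feature-engineering steps can be sketched on a toy frame. Column names mirror the report but are assumptions here; the values are invented:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "Peer_Influence": [2, 5, 8, 3, 9, 1],
    "Family_Support": [4, 7, 2, 8, 3, 6],
    "Community_Support": [5, 6, 3, 7, 4, 5],
})

# Polynomial feature: squared peer influence for non-linear effects
df["Peer_Influence_Sq"] = df["Peer_Influence"] ** 2

# Combined feature: sum of family and community support
df["Family_Community_Support"] = df["Family_Support"] + df["Community_Support"]

# Clustering feature: K-Means labels as a "behavioral group" id
scaled = StandardScaler().fit_transform(df)
df["Cluster_ID"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

print(df[["Peer_Influence_Sq", "Family_Community_Support", "Cluster_ID"]])
```

Scaling before K-Means matters because the squared term would otherwise dominate the distance computation.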
5. Regression: Predicting Smoking Prevalence
Model Number 1 - Baseline vs. Improved Model:
The improved model revealed:
- Strong positive effect of Peer_Influence_Sq, indicating a nonlinear relationship
- Positive contributions from Mental_Health, Media_Influence, and Community_Support
- Negative effect from Family_Background, suggesting stronger families reduce smoking
- Minor contributions from Social_Risk and Cluster_ID
This model uncovered patterns the baseline model failed to capture.
Model Number 2 - Random Forest:
- Drug_Experimentation emerged as the strongest predictor
- Followed by Social_Risk, Mental_Health, and Family_Community_Support
- Peer influence (linear or squared) was less impactful
- Cluster membership contributed minimally
Random Forest captured complex nonlinear interactions that differ from the linear model.
Model Number 3 - Gradient Boosting:
- Again, Drug_Experimentation dominated as the key predictor
- Social_Risk and Family_Community_Support also showed significant influence
- Other features played only small roles
- The model produced a slightly negative R², indicating overall weak predictive power
5.2 Evaluation Metrics
Common regression metrics were used:
- MAE – Mean Absolute Error
- MSE – Mean Squared Error
- RMSE – Root Mean Squared Error
- R² – Coefficient of determination
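All four metrics are available in scikit-learn; a minimal check on toy values (not the project's data) makes their relationship explicit:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

mae = mean_absolute_error(y_true, y_pred)   # mean |error| = 1/3
mse = mean_squared_error(y_true, y_pred)    # mean error^2 = 1/3
rmse = np.sqrt(mse)                         # sqrt(1/3)
r2 = r2_score(y_true, y_pred)               # 1 - SS_res/SS_tot = 1 - 1/2 = 0.5
print(mae, mse, rmse, r2)
```

R² near zero, as in the table below, means the model's errors are about as large as simply predicting the mean.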
| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Baseline Linear Regression | ~11.1 | ~167 | ~12.9 | ≈ 0 or slightly negative |
| Improved Linear Regression | ~11.1 | ~167 | ~12.9 | ≈ 0 |
| Random Forest Regressor | ~11.3 | ~176 | ~13.3 | ≈ 0 (slightly negative) |
| Gradient Boosting Regr. | ~11.1 | ~168 | ~12.98 | ≈ 0 (slightly negative) |
All models reached similar error levels, and the R² values near zero indicate that the dataset contains weak predictive signals for smoking prevalence.
The baseline linear regression model fails to capture meaningful patterns in the data, predicting nearly constant values regardless of the actual smoking prevalence.
5.3 Coefficients & Feature Importance
In the improved Linear Regression model, the most influential coefficients were:
- Positive: Peer_Influence_Sq, Mental_Health, Media_Influence, Community_Support
- Negative: Family_Background (suggesting that a stronger family background tends to reduce smoking)
In Random Forest and Gradient Boosting, Drug_Experimentation was the most important predictor, followed by Social_Risk and Family_Community_Support.
5.4 Winning Regression Model
Despite overall limited predictive strength, Gradient Boosting consistently achieved the lowest error metrics among the tested regression models. Therefore, Gradient Boosting is selected as the winning regression model.
6. Classification: Predicting Low vs. High Smoking Group
After completing the regression task, the problem was reformulated as a binary classification task.
The continuous target Smoking_Prevalence was converted into two classes:
- 0 = Low Smoking (≤ median)
- 1 = High Smoking (> median)
This produced a balanced dataset appropriate for training classification models.
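The median split described above is a one-liner; a sketch on toy prevalence values (the real column is Smoking_Prevalence):

```python
import pandas as pd

# Toy prevalence values; illustrative only.
prevalence = pd.Series([10.0, 20.0, 30.0, 40.0])

median = prevalence.median()                       # 25.0 here
smoking_class = (prevalence > median).astype(int)  # 0 = low (<= median), 1 = high
print(smoking_class.tolist())                      # [0, 0, 1, 1]
```

Splitting at the median guarantees the roughly 50/50 class balance reported in the next subsection.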
6.1 Checking Class Balance
The distribution of the new target variable was examined:
- Low Smoking: ~50%
- High Smoking: ~50%
Since the classes were balanced, accuracy remained a meaningful metric.
However, precision, recall, and F1-score were also evaluated to gain deeper insight.
6.2 Models Trained
Three classification models were trained using the engineered features:
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
All models were trained using the same train/test split to ensure fair comparison.
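A shared train/test split is what makes the comparison fair; a sketch on synthetic balanced data standing in for the engineered feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real engineered features
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# One split, reused by all three models, so they see identical data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Fixing `random_state` on both the split and the models also makes the comparison reproducible.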
6.3 Classification Results
Model 1 – Logistic Regression
- Struggles to capture nonlinear relationships
- Misclassifies many high-smoking individuals
- Overall moderate performance
Model 2 – Random Forest Classifier
- Best performance among the three models
- Captures nonlinear patterns effectively
- More accurate at identifying high-smoking individuals
- Higher precision and recall
Model 3 – Gradient Boosting Classifier
- Better than Logistic Regression
- Close to Random Forest performance but less consistent
- Slightly lower recall for high-smoking individuals
6.4 Classification Metrics Summary
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | ~0.50–0.60 | Moderate | Lower | Moderate |
| Random Forest Classifier | Highest | Highest | Highest | Best Overall |
| Gradient Boosting Class. | Close to RF | Slightly lower | Slightly lower | Strong |
The confusion matrices clearly show that Random Forest makes the fewest classification errors.
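Reading those confusion matrices is easiest with a toy example; here, row index is the actual class and column index the predicted class:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = low smoking, 1 = high smoking
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# cm[1, 0] counts high-smoking individuals the model missed -
# the cell where Random Forest outperformed the other two models.
```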
6.5 Winning Classification Model
The Random Forest Classifier was selected as the winning model because it:
- Achieved the highest accuracy and F1-score
- Performed best at identifying high-smoking individuals
- Captured nonlinear interactions effectively
- Outperformed both Logistic Regression and Gradient Boosting on most evaluation metrics
7. Model Deployment (Hugging Face Repository)
Both winning models were uploaded to the Hugging Face repository:
- Winning Regression Model: winning_model.pkl (Gradient Boosting)
- Winning Classification Model: winning_classifier.pkl (Random Forest)
8. Limitations & Future Work
Limitations
- The dataset appears synthetic and may lack strong predictive signals, which limits the performance of both regression and classification models.
- Regression performance was generally low (R² ≈ 0), indicating that the features in the dataset do not strongly explain smoking prevalence.
- Important behavioral, social, or environmental variables that influence smoking may be missing from the dataset.
- Classification performance was stronger, but still constrained by the limited variability represented in the available features.
Future Improvements
- Incorporate richer psychological, environmental, or temporal factors to improve predictive capacity.
- Experiment with alternative binning strategies for the classification target (e.g., quartiles instead of median-split).
- Test additional model families such as XGBoost, LightGBM, or neural networks for both regression and classification tasks.
- Collect or generate more realistic real-world data with higher signal-to-noise ratio.
- Explore feature selection or dimensionality reduction techniques to identify stronger predictive patterns.