# 🚀 Predicting Viral News: A Data Science Pipeline
**Author:** Matan Kriel
## 📌 Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.
The goal was to transform raw data into actionable insights by:
1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
2. **Regression Analysis:** Attempting to predict the exact share count.
3. **Classification Analysis:** Successfully predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".
---
## 🛠️ The Dataset
* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
* **Target:** `shares` (Number of social media shares).
---
## 🧹 Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from columns, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we sorted data by date (`timedelta`) to prevent "data leakage" from the future into the training set.
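A minimal sketch of this split, assuming a pandas DataFrame `df` with the UCI column names (in this dataset a *larger* `timedelta` means an *older* article); the 80/20 ratio here is illustrative, not necessarily the notebook's exact choice:

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Split chronologically: train on older articles, test on newer ones."""
    # Larger timedelta = older article, so descending order = oldest first.
    df_sorted = df.sort_values("timedelta", ascending=False)
    cut = int(len(df_sorted) * train_frac)
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]
```

Because the test set contains only articles newer than everything in the training set, no "future" information can leak backwards.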
### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/SXWeSP_FB4xkGCfGMp64F.png)
*Above: Correlation Heatmap showing feature relationships.*
**📉 Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
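The check itself is short in pandas; a sketch, assuming the cleaned DataFrame `df` still contains the `shares` column:

```python
import pandas as pd

def correlations_with_shares(df: pd.DataFrame) -> pd.Series:
    """Rank features by absolute Pearson correlation with `shares`."""
    corr = df.corr(numeric_only=True)["shares"].drop("shares")
    # Sort by magnitude so strong negative correlations also surface.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)
```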
---
## 📊 Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
**The 3 Models Compared:**
1. **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *Sentiment* vs. *Subjectivity*).
3. **Gradient Boosting Regressor:** Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/U1FQ43jtqphLJnGa0cwWF.png)
*Results of this comparison are detailed in Phase 4.*
---
## 🧪 Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
1. **Sentiment:** (Positive vs. Negative)
2. **Subjectivity:** (Opinion vs. Fact)
### Choosing the Optimal 'k'
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/DaPKkYm1rCC6ShOKSQJmw.png)
*Above: Elbow Method (Left) and Silhouette Score (Right).*
**Decisions & Logic:**
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
---
## 📉 Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (Root Mean Squared Error) and **R2 Score**.
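The comparison loop, as a sketch: it assumes `X_train`/`X_test` and targets from the time-based split, and the hyperparameters shown are illustrative defaults rather than the notebook's tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

def compare_regressors(X_train, y_train, X_test, y_test):
    """Fit each model and report RMSE / R2 on the held-out set."""
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {"RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
                         "R2": float(r2_score(y_test, pred))}
    return results
```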
**๐Ÿ† Result:**
* All models struggled to predict the exact number (Low R2 scores across the board).
* **Gradient Boosting** performed best, achieving a lower error than the Linear Regression baseline.
* **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/s3ORI1PVyYMymPVc6t8oN.png)
---
## 🚀 Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
**Threshold:** Median split (>1400 shares).
We repeated the comparison process, pitting **Logistic Regression** (Baseline) against **Random Forest** and **Gradient Boosting**.
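A sketch of the setup, assuming raw share counts carried over from the earlier split; 1400 is the dataset median, and the hyperparameters shown are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def compare_classifiers(X_train, shares_train, X_test, shares_test, threshold=1400):
    """Binarize `shares` at the threshold and compare models by test AUC."""
    y_train = (np.asarray(shares_train) > threshold).astype(int)
    y_test = (np.asarray(shares_test) > threshold).astype(int)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    }
    # AUC is computed from predicted probabilities, not hard labels.
    return {name: roc_auc_score(y_test,
                                m.fit(X_train, y_train).predict_proba(X_test)[:, 1])
            for name, m in models.items()}
```

Using AUC rather than accuracy keeps the comparison threshold-free, which matters when editors may want to tune the alert cutoff later.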
### Model Showdown
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/89myhhhLgf_LA4oDIqto0.png)
*Above: ROC Curves comparing the 3 models.*
**๐Ÿ† Result:**
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the baseline (AUC 0.64), showing that the model learned complex non-linear patterns.
### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/nI0qJkf_gsC7ckR-vJAs8.png)
**💡 Key Insights:**
* **#1 Predictor (kw_avg_avg):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
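Reading the importances off the fitted ensemble is straightforward; a sketch, where `feature_names` is assumed to match the training columns in order:

```python
import pandas as pd

def top_features(model, feature_names, n=10):
    """Return the n most important features of a fitted tree ensemble."""
    return (pd.Series(model.feature_importances_, index=feature_names)
              .sort_values(ascending=False)
              .head(n))
```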
---
## ⚖️ Phase 6: Final Evaluation
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).
* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/0PpsAHMoGwCVUJ8TY2rBQ.png)
---
## 🎮 Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
---
## 📂 Files in this Repo
* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.
Video Link: https://youtu.be/Al665qltkDg
---
license: mit
---