# Predicting Viral News: A Data Science Pipeline
**Author:** Matan Kriel
## Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.
The goal was to transform raw data into actionable insights by:
1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
2. **Regression Analysis:** Attempting to predict the exact share count.
3. **Classification Analysis:** Successfully predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".
---
## The Dataset
* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
* **Target:** `shares` (Number of social media shares).
---
## Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from columns, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we sorted data by date (`timedelta`) to prevent "data leakage" from the future into the training set.
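A minimal sketch of the time-based split, assuming a pandas DataFrame with the dataset's `timedelta` column; in the UCI data, `timedelta` counts days between publication and data acquisition, so larger values mean an *older* article, and we sort descending to put the oldest articles first:

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timedelta",
                     test_frac: float = 0.2):
    """Hold out the most recent articles as the test set.

    Larger `timedelta` = published longer ago, so sorting descending
    puts the oldest articles first; the tail of the sorted frame
    (the newest articles) becomes the test set, preventing leakage
    from the future into training.
    """
    df_sorted = df.sort_values(time_col, ascending=False).reset_index(drop=True)
    cut = int(len(df_sorted) * (1 - test_frac))
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]
```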
### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.

*Above: Correlation Heatmap showing feature relationships.*
**Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
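The heatmap boils down to a simple computation; a sketch, assuming the cleaned DataFrame is named `df`:

```python
import pandas as pd

def shares_correlations(df: pd.DataFrame, target: str = "shares") -> pd.Series:
    """Pearson correlation of every numeric feature with the target,
    sorted by absolute strength (strongest first)."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)
```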
---
## Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
**The 3 Models Compared:**
1. **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *Sentiment* vs. *Subjectivity*).
3. **Gradient Boosting Regressor:** Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.

*Results of this comparison are detailed in Phase 4.*
---
## Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
1. **Sentiment:** (Positive vs. Negative)
2. **Subjectivity:** (Opinion vs. Fact)
### Choosing the Optimal 'k'
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.

*Above: Elbow Method (Left) and Silhouette Score (Right).*
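The two diagnostics behind the plot can be sketched as follows, assuming `X` is the scaled sentiment/subjectivity matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def scan_k(X, k_range=range(2, 9), seed=42):
    """For each candidate k, record the WCSS (KMeans inertia, used for
    the elbow plot) and the silhouette score (cluster separation)."""
    rows = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({"k": k,
                     "wcss": km.inertia_,
                     "silhouette": silhouette_score(X, km.labels_)})
    return rows
```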
**Decisions & Logic:**
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
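A sketch of the assignment step, assuming the two tone dimensions are the dataset's `global_sentiment_polarity` and `global_subjectivity` columns:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def add_cluster_vibe(df, cols=("global_sentiment_polarity", "global_subjectivity"),
                     k=4, seed=42):
    """Scale the two tone dimensions, fit K-Means with k=4, and attach
    the cluster label as a new categorical feature `cluster_vibe`."""
    X = StandardScaler().fit_transform(df[list(cols)])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    out = df.copy()
    out["cluster_vibe"] = labels
    return out
```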
---
## Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (root mean squared error) and the **R² score**.
**Result:**
* All models struggled to predict the exact share count (low R² scores across the board).
* **Gradient Boosting** performed best, minimizing the error more than the Linear Baseline.
* **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
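The comparison loop can be sketched as below, with synthetic data standing in for the real feature matrix; `np.sqrt(mean_squared_error(...))` keeps the RMSE computation compatible across scikit-learn versions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def compare_regressors(X_train, y_train, X_test, y_test):
    """Fit the three candidate models and report RMSE / R² on the test set."""
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {"rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
                        "r2": float(r2_score(y_test, pred))}
    return scores
```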

---
## Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
**Threshold:** Median split (>1400 shares).
We repeated the comparison process, pitting **Logistic Regression** (Baseline) against **Random Forest** and **Gradient Boosting**.
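The median split and the AUC comparison can be sketched as follows (again with synthetic data in place of the real features):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_classifiers(X_train, shares_train, X_test, shares_test):
    """Binarize shares at the training-set median (~1400 on the full
    dataset), then compare the three models by ROC AUC."""
    threshold = np.median(shares_train)
    y_train = (shares_train > threshold).astype(int)
    y_test = (shares_test > threshold).astype(int)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    }
    aucs = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return aucs
```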
### Model Showdown

*Above: ROC Curves comparing the 3 models.*
**Result:**
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.
### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.

**Key Insights:**
* **#1 Predictor (`kw_avg_avg`):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
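Extracting this ranking is a one-liner on the fitted model; a sketch, with hypothetical feature names standing in for the dataset's columns:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def top_features(model: GradientBoostingClassifier, feature_names, n=10):
    """Rank features by the fitted model's impurity-based importances
    (they sum to 1, so each value reads as a share of the total)."""
    imp = pd.Series(model.feature_importances_, index=list(feature_names))
    return imp.sort_values(ascending=False).head(n)
```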
---
## Phase 6: Final Evaluation
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).
* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.

---
## Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
---
## Files in this Repo
* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.
Video Link: https://youtu.be/Al665qltkDg
---
license: mit
---