# 🚀 Predicting Viral News: A Data Science Pipeline
**Author:** Matan Kriel


## 📌 Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.

The goal was to transform raw data into actionable insights by:
1.  **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
2.  **Regression Analysis:** Attempting to predict the exact share count.
3.  **Classification Analysis:** Successfully predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".

---

## ๐Ÿ› ๏ธ The Dataset
* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
* **Target:** `shares` (Number of social media shares).

---

## 🧹 Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from column names, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we ordered articles chronologically (via the `timedelta` column) so that no information from the future could leak into the training set.
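The cleaning and time-based split can be sketched as below. This is a minimal pandas example on toy data, not the notebook's actual code; the real pipeline runs on the full UCI CSV, whose raw header carries leading spaces in its column names.

```python
import pandas as pd

# Toy stand-in for the raw UCI CSV (note the leading spaces in the header,
# which the real file also has).
df = pd.DataFrame({
    " timedelta": [731, 30, 400, 150, 600, 90],   # days since publication
    " shares": [500, 2000, 1200, 3100, 800, 1500],
})

# 1. Clean: strip stray whitespace from column names, drop duplicate rows.
df.columns = df.columns.str.strip()
df = df.drop_duplicates()

# 2. Time-based split: a larger timedelta means an older article, so sort
#    descending (oldest first) and hold out the newest rows as the "future"
#    test set, preventing leakage.
df = df.sort_values("timedelta", ascending=False).reset_index(drop=True)
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

print(len(train), "train rows /", len(test), "test rows")
```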

### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.


![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/SXWeSP_FB4xkGCfGMp64F.png)

*Above: Correlation Heatmap showing feature relationships.*

**📉 Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
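The correlation check itself is a one-liner in pandas. The sketch below uses synthetic stand-ins for a few of the 61 features; the column names match the real dataset, the values do not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for a few of the 61 features.
df = pd.DataFrame({
    "n_tokens_content": rng.integers(100, 2000, n),
    "num_imgs": rng.integers(0, 20, n),
    "global_sentiment_polarity": rng.normal(0.1, 0.2, n),
})
# Shares are mostly noise plus heavy-tailed outliers, mimicking virality.
df["shares"] = rng.lognormal(7, 1, n).astype(int)

# Pearson correlation of every feature with the target.
corr_with_shares = df.corr(numeric_only=True)["shares"].drop("shares").sort_values()
print(corr_with_shares)
```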

---

## 📊 Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.

**The 3 Models Compared:**
1.  **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
2.  **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *Sentiment* vs. *Subjectivity*).
3.  **Gradient Boosting Regressor:** Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.
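A minimal sketch of the line-up in scikit-learn; the hyperparameters here are illustrative defaults, not the tuned values from the notebook.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# The three contenders. Hyperparameters are placeholders for illustration.
models = {
    "Linear Regression (baseline)": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    print(name, "->", type(model).__name__)
```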


![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/U1FQ43jtqphLJnGa0cwWF.png)

*Results of this comparison are detailed in Phase 4.*

---

## 🧪 Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
1.  **Sentiment:** (Positive vs. Negative)
2.  **Subjectivity:** (Opinion vs. Fact)

### Choosing the Optimal 'k'
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.


![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/DaPKkYm1rCC6ShOKSQJmw.png)

*Above: Elbow Method (Left) and Silhouette Score (Right).*

**Decisions & Logic:**
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
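The elbow sweep and final k=4 fit can be sketched as follows, on synthetic sentiment/subjectivity pairs; the exact input columns the notebook uses (e.g. `global_sentiment_polarity`, `global_subjectivity`) are an assumption here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic (sentiment, subjectivity) pairs: four tight "vibe" groups.
centers = [(-0.3, 0.2), (-0.3, 0.8), (0.3, 0.2), (0.3, 0.8)]
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(100, 2)) for c in centers])

# Elbow method: within-cluster sum of squares (inertia) across candidate k.
wcss = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in range(2, 7)}

# Final model at the chosen k=4; each article gets a cluster_vibe label.
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X)
sil = silhouette_score(X, labels)
print("WCSS:", {k: round(v, 2) for k, v in wcss.items()})
print("silhouette at k=4:", round(sil, 2))
```

The `labels` array is what gets appended to the feature matrix as the categorical `cluster_vibe` column.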

---

## 📉 Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (Root Mean Squared Error) and **R2 Score**.
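Both metrics come straight from scikit-learn. The sketch below runs the evaluation on synthetic heavy-tailed data, so the numbers it prints are illustrative only, not the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Heavy-tailed target mimics share counts: weak signal plus viral outliers.
y = 1200 + 50 * X[:, 0] + rng.lognormal(6, 1, 400)

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]
model = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))  # version-safe RMSE
r2 = r2_score(y_te, pred)
print(f"RMSE={rmse:,.0f}  R2={r2:.2f}")
```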

**๐Ÿ† Result:**
* All models struggled to predict the exact number (Low R2 scores across the board).
* **Gradient Boosting** performed best, minimizing the error more than the Linear Baseline.
* **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*


![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/s3ORI1PVyYMymPVc6t8oN.png)

---

## 🚀 Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
**Threshold:** Median split (>1400 shares).

We repeated the comparison process, pitting **Logistic Regression** (Baseline) against **Random Forest** and **Gradient Boosting**.
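The labeling and showdown can be sketched like this, on synthetic data; the printed AUCs will not match the real ~0.75.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
# Log-normal share counts driven weakly by the first feature.
shares = np.exp(7 + 0.8 * X[:, 0] + rng.normal(0, 1, 600))

# Median split: in the real data this threshold works out to ~1400 shares.
y = (shares > np.median(shares)).astype(int)

X_tr, X_te, y_tr, y_te = X[:450], X[450:], y[:450], y[450:]
aucs = {}
for name, clf in {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}.items():
    clf.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC={aucs[name]:.2f}")
```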

### Model Showdown

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/89myhhhLgf_LA4oDIqto0.png)

*Above: ROC Curves comparing the 3 models.*

**๐Ÿ† Result:**
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.

### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.


![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/nI0qJkf_gsC7ckR-vJAs8.png)


**💡 Key Insights:**
* **#1 Predictor (kw_avg_avg):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
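Extracting such a ranking from a fitted scikit-learn model is straightforward. This sketch manufactures data in which `kw_avg_avg` dominates, mirroring (not reproducing) the real result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
cols = ["kw_avg_avg", "is_weekend", "num_imgs", "cluster_vibe"]
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=cols)
# By construction, kw_avg_avg carries almost all of the signal.
y = (X["kw_avg_avg"] + 0.1 * rng.normal(size=400) > 0).astype(int)

clf = GradientBoostingClassifier(random_state=42).fit(X, y)
ranking = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(ranking)
```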

---

## โš–๏ธ Phase 6: Final Evaluation
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).

* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/0PpsAHMoGwCVUJ8TY2rBQ.png)

---

## 🎮 Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
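A minimal sketch of how such a dashboard can be wired up with Gradio. The scorer below is a hypothetical stand-in (the real app loads `gradient_boosting_viral_predictor.joblib`), and the input fields are assumptions, not the notebook's actual widgets.

```python
def predict_viral(n_tokens_content: float, num_imgs: float, kw_avg_avg: float) -> str:
    """Hypothetical stand-in scorer; the real app queries the saved model."""
    score = 0.4 * kw_avg_avg + 0.05 * num_imgs + 0.0001 * n_tokens_content
    return "Viral 🚀" if score > 0.5 else "Flop 😴"

def launch_dashboard():
    # Import inside the function so the sketch runs without Gradio installed.
    import gradio as gr
    gr.Interface(
        fn=predict_viral,
        inputs=[gr.Number(label="Word count"),
                gr.Number(label="Images"),
                gr.Number(label="Keyword avg. performance")],
        outputs=gr.Textbox(label="Prediction"),
        title="Viral-O-Meter",
    ).launch()
```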

---

## 📂 Files in this Repo
* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.
  

Video Link: https://youtu.be/Al665qltkDg

---
license: mit
---