# Predicting Viral News: A Data Science Pipeline
**Author:** Matan Kriel
## Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.
The goal was to transform raw data into actionable insights by:
1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
2. **Regression Analysis:** Attempting to predict the exact share count.
3. **Classification Analysis:** Successfully predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".
---
## The Dataset
* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
* **Target:** `shares` (Number of social media shares).
---
## Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from columns, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we sorted data by date (`timedelta`) to prevent "data leakage" from the future into the training set.
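A minimal sketch of the time-based split, assuming a pandas DataFrame with the dataset's `timedelta` column; in the UCI data, `timedelta` counts days between publication and data acquisition, so larger values mean an *older* article, and we sort descending to put the oldest articles first:

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timedelta",
                     test_frac: float = 0.2):
    """Hold out the most recent articles as the test set.

    Larger `timedelta` = published longer ago, so sorting descending
    puts the oldest articles first; the tail of the sorted frame
    (the newest articles) becomes the test set, preventing leakage
    from the future into training.
    """
    df_sorted = df.sort_values(time_col, ascending=False).reset_index(drop=True)
    cut = int(len(df_sorted) * (1 - test_frac))
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]
```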
### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.

*Above: Correlation Heatmap showing feature relationships.*
**Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
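The heatmap boils down to a simple computation; a sketch, assuming the cleaned DataFrame is named `df`:

```python
import pandas as pd

def shares_correlations(df: pd.DataFrame, target: str = "shares") -> pd.Series:
    """Pearson correlation of every numeric feature with the target,
    sorted by absolute strength (strongest first)."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)
```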
---
## Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
**The 3 Models Compared:**
1. **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *Sentiment* vs. *Subjectivity*).
3. **Gradient Boosting Regressor:** Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.

*Results of this comparison are detailed in Phase 4.*
---
## Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
1. **Sentiment:** (Positive vs. Negative)
2. **Subjectivity:** (Opinion vs. Fact)
### Choosing the Optimal 'k'
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.

*Above: Elbow Method (Left) and Silhouette Score (Right).*
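The two diagnostics behind the plot can be sketched as follows, assuming `X` is the scaled sentiment/subjectivity matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def scan_k(X, k_range=range(2, 9), seed=42):
    """For each candidate k, record the WCSS (KMeans inertia, used for
    the elbow plot) and the silhouette score (cluster separation)."""
    rows = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({"k": k,
                     "wcss": km.inertia_,
                     "silhouette": silhouette_score(X, km.labels_)})
    return rows
```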
**Decisions & Logic:**
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
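A sketch of the assignment step, assuming the two tone dimensions are the dataset's `global_sentiment_polarity` and `global_subjectivity` columns:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def add_cluster_vibe(df, cols=("global_sentiment_polarity", "global_subjectivity"),
                     k=4, seed=42):
    """Scale the two tone dimensions, fit K-Means with k=4, and attach
    the cluster label as a new categorical feature `cluster_vibe`."""
    X = StandardScaler().fit_transform(df[list(cols)])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    out = df.copy()
    out["cluster_vibe"] = labels
    return out
```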
---
## Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (root mean squared error) and the **R² score**.
**Result:**
* All models struggled to predict the exact share count (low R² scores across the board).
* **Gradient Boosting** performed best, minimizing the error more than the Linear Baseline.
* **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
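The comparison loop can be sketched as below, with synthetic data standing in for the real feature matrix; `np.sqrt(mean_squared_error(...))` keeps the RMSE computation compatible across scikit-learn versions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def compare_regressors(X_train, y_train, X_test, y_test):
    """Fit the three candidate models and report RMSE / R² on the test set."""
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {"rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
                        "r2": float(r2_score(y_test, pred))}
    return scores
```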

---
## Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
**Threshold:** Median split (>1400 shares).
We repeated the comparison process, pitting **Logistic Regression** (Baseline) against **Random Forest** and **Gradient Boosting**.
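The median split and the AUC comparison can be sketched as follows (again with synthetic data in place of the real features):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_classifiers(X_train, shares_train, X_test, shares_test):
    """Binarize shares at the training-set median (~1400 on the full
    dataset), then compare the three models by ROC AUC."""
    threshold = np.median(shares_train)
    y_train = (shares_train > threshold).astype(int)
    y_test = (shares_test > threshold).astype(int)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    }
    aucs = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return aucs
```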
### Model Showdown

*Above: ROC Curves comparing the 3 models.*
**Result:**
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.
### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.

**Key Insights:**
* **#1 Predictor (`kw_avg_avg`):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
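Extracting this ranking is a one-liner on the fitted model; a sketch, with hypothetical feature names standing in for the dataset's columns:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def top_features(model: GradientBoostingClassifier, feature_names, n=10):
    """Rank features by the fitted model's impurity-based importances
    (they sum to 1, so each value reads as a share of the total)."""
    imp = pd.Series(model.feature_importances_, index=list(feature_names))
    return imp.sort_values(ascending=False).head(n)
```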
---
## Phase 6: Final Evaluation
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).
* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.

---
## Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
---
## Files in this Repo
* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.
Video Link: https://youtu.be/Al665qltkDg
---
license: mit
---