---
license: mit
---

# 📰 Predicting Viral News: A Data Science Pipeline

**Author:** [Your Name]
**Course:** [Course Name/Number]
|
| 4 |
|
| 5 |
+
## ๐ Project Overview
|
| 6 |
+
In the fast-paced world of digital media, predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.
|
| 7 |
+
|
| 8 |
+
The goal was to transform raw data into actionable insights by:
|
| 9 |
+
1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
|
| 10 |
+
2. **Regression Analysis:** Attempting to predict the exact share count.
|
| 11 |
+
3. **Classification Analysis:** Successfully predicting if an article will be a "Hit" (>1400 shares) or a "Flop".
|
| 12 |
+
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
## ๐ ๏ธ The Dataset
|
| 16 |
+
* **Source:** UCI Machine Learning Repository (Online News Popularity).
|
| 17 |
+
* **Size:** ~39,000 articles.
|
| 18 |
+
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
|
| 19 |
+
* **Target:** `shares` (Number of social media shares).
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## ๐งน Phase 1: Data Handling & EDA
|
| 24 |
+
We began by cleaning the dataset (stripping whitespace from columns, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we sorted data by date (`timedelta`) to prevent "data leakage" from the future into the training set.
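
A chronological hold-out like the one described can be sketched as follows (the helper name and the 80/20 ratio are illustrative assumptions, not the project's actual code; in the UCI dataset `timedelta` counts days since publication, so smaller values mean newer articles):

```python
import numpy as np
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timedelta",
                     test_frac: float = 0.2):
    """Sort chronologically and hold out the most recent rows as the test set."""
    # Clean column names (the raw CSV ships with leading spaces) and dedupe.
    df = df.rename(columns=str.strip).drop_duplicates()
    # Larger timedelta = older article, so sort descending to put oldest first.
    df = df.sort_values(time_col, ascending=False).reset_index(drop=True)
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]

# Toy example: 10 articles, newest have the smallest timedelta.
toy = pd.DataFrame({"timedelta": np.arange(10, 0, -1), "shares": np.arange(10)})
train, test = time_based_split(toy, test_frac=0.2)
```

Because the split is positional after sorting, the test set always contains the newest articles, mimicking "train on the past, evaluate on the future".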
|
| 25 |
+
|
| 26 |
+
### Correlation Analysis
|
| 27 |
+
We analyzed the relationship between content features (images, links, sentiment) and the target variable.
|
| 28 |
+
|
| 29 |
+

|
| 30 |
+
*Above: Correlation Heatmap showing feature relationships.*
|
| 31 |
+
|
| 32 |
+
**๐ Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
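
The correlation check behind the heatmap can be reproduced in a few lines; the frame below is synthetic stand-in data (the column names echo the UCI CSV, the values do not):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "n_tokens_content": rng.integers(100, 2000, 500),
    "num_imgs": rng.integers(0, 20, 500),
    "global_sentiment_polarity": rng.normal(0.1, 0.2, 500),
})
df["shares"] = rng.lognormal(mean=7, sigma=1, size=500)  # heavy-tailed, like real share counts

# Pearson correlation of every feature against the target.
corr_with_target = df.corr(numeric_only=True)["shares"].drop("shares")
print(corr_with_target.abs().sort_values(ascending=False))
```

On the real data, the largest absolute value is only about 0.06, which is what motivated the move to tree-based models.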
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## ๐งช Phase 2: Feature Engineering (Clustering)
|
| 37 |
+
To capture the subtle "tone" of an article, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
|
| 38 |
+
1. **Sentiment:** (Positive vs. Negative)
|
| 39 |
+
2. **Subjectivity:** (Opinion vs. Fact)
|
| 40 |
+
|
| 41 |
+
### Choosing the Optimal 'k'
|
| 42 |
+
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.
|
| 43 |
+
|
| 44 |
+

|
| 45 |
+
*Above: Elbow Method (Left) and Silhouette Score (Right).*
|
| 46 |
+
|
| 47 |
+
**Decisions & Logic:**
|
| 48 |
+
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
|
| 49 |
+
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
|
| 50 |
+
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature.
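
The k selection can be sketched like this; the two-dimensional "tone" data here is synthetic (four loose blobs standing in for the real sentiment/subjectivity columns), so the exact scores differ from the project's:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for (sentiment polarity, subjectivity) pairs:
# e.g. negative-fact, positive-fact, negative-opinion, positive-opinion.
rng = np.random.default_rng(42)
centers = [(-0.5, 0.2), (0.5, 0.2), (-0.5, 0.8), (0.5, 0.8)]
X = np.vstack([rng.normal(c, 0.08, size=(100, 2)) for c in centers])

wcss, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                 # plotted for the Elbow Method
    sil[k] = silhouette_score(X, km.labels_)

# Final choice: k=4, giving each article a categorical "vibe" label.
cluster_vibe = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

WCSS always decreases as k grows; the "elbow" is the point where the decrease flattens, which is why it is read off a plot rather than a single number.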
|
| 51 |
+
|
| 52 |
+
---
|
| 53 |
+
|
| 54 |
+
## ๐ Phase 3: Regression Analysis
|
| 55 |
+
**Goal:** Predict the exact number of shares (`log_shares`).
|
| 56 |
+
|
| 57 |
+
We trained three models to compare performance:
|
| 58 |
+
1. **Linear Regression** (Baseline)
|
| 59 |
+
2. **Random Forest Regressor**
|
| 60 |
+
3. **Gradient Boosting Regressor**
|
| 61 |
+
|
| 62 |
+
**๐ Result:**
|
| 63 |
+
* All models struggled to predict the exact number (Low R2).
|
| 64 |
+
* **Gradient Boosting** performed best, but the error margin was still too high for business use.
|
| 65 |
+
* **Pivot:** We concluded that predicting the *exact* share count is noisy due to viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
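
The three-way comparison follows a standard pattern; this sketch uses `make_regression` as a stand-in for the news features and log-transformed target (model settings are illustrative, not the tuned ones):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and log_shares target.
X, y = make_regression(n_samples=1000, n_features=20, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
# Fit each model and score it on the held-out fold.
r2 = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
      for name, m in models.items()}
print(r2)
```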
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
## ๐ Phase 4: Classification Analysis (The Solution)
|
| 70 |
+
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
|
| 71 |
+
**Threshold:** Median split (>1400 shares).
|
| 72 |
+
|
| 73 |
+
We compared Logistic Regression (Baseline) against Tree-based models.
|
| 74 |
+
|
| 75 |
+
### Model Showdown
|
| 76 |
+

|
| 77 |
+
*Above: ROC Curves comparing the 3 models.*
|
| 78 |
+
|
| 79 |
+
**๐ Result:**
|
| 80 |
+
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
|
| 81 |
+
* It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.
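
The baseline-vs-boosting comparison can be sketched as follows (synthetic binary labels stand in for the median-split "Hit"/"Flop" target, so the AUC values here are not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: label 1 ~ "shares above the 1400 median", else 0.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# AUC is computed from predicted probabilities, not hard labels.
auc_baseline = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
auc_gb = roc_auc_score(y_te, gb.predict_proba(X_te)[:, 1])
```

Using `predict_proba` rather than `predict` matters: the ROC curve sweeps over all probability thresholds, which is what makes AUC threshold-independent.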
|
| 82 |
+
|
| 83 |
+
### What Drives Virality? (Interpretation)
|
| 84 |
+
We analyzed which features the model found most important.
|
| 85 |
+
|
| 86 |
+

|
| 87 |
+
*Above: ROC Curve (Left) and Feature Importance Plot (Right).*
|
| 88 |
+
|
| 89 |
+
**๐ก Key Insights:**
|
| 90 |
+
* **#1 Predictor (kw_avg_avg):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
|
| 91 |
+
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
|
| 92 |
+
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
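
Extracting the importance ranking is a one-liner once the model is fitted; the feature names below are a hypothetical subset of the real 61 columns, paired with synthetic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical subset of the dataset's columns, for illustration only.
names = ["kw_avg_avg", "n_tokens_content", "num_imgs", "is_weekend",
         "global_subjectivity", "cluster_vibe"]
X, y = make_classification(n_samples=1000, n_features=len(names),
                           n_informative=4, random_state=0)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1 across features.
importances = pd.Series(gb.feature_importances_, index=names)
ranking = importances.sort_values(ascending=False)
print(ranking)
```

Note that impurity-based importances can favor high-cardinality numeric features; permutation importance is a common cross-check.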
|
| 93 |
+
|
| 94 |
+
---
|
| 95 |
+
|
| 96 |
+
## โ๏ธ Phase 5: Final Evaluation
|
| 97 |
+
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).
|
| 98 |
+
|
| 99 |
+
* **Final AUC:** ~0.75
|
| 100 |
+
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.
|
| 101 |
+
|
| 102 |
+
---
|
| 103 |
+
|
| 104 |
+
## ๐ฎ Bonus: The Viral-O-Meter
|
| 105 |
+
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
|
| 106 |
+
|
| 107 |
+
---
|
| 108 |
+
|
| 109 |
+
## ๐ Files in this Repo
|
| 110 |
+
* `notebook.ipynb`: The complete Python code for the pipeline.
|
| 111 |
+
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
|
| 112 |
+
* `README.md`: Project documentation.
|
| 113 |
+
"""