# 🚀 Predicting Viral News: A Data Science Pipeline
**Author:** Matan Kriel
## 📌 Project Overview
Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.
The goal was to transform raw data into actionable insights by:
1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
2. **Regression Analysis:** Attempting to predict the exact share count.
3. **Classification Analysis:** Successfully predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".
---
## 🛠️ The Dataset
* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
* **Target:** `shares` (Number of social media shares).
---
## 🧹 Phase 1: Data Handling & EDA
We began by cleaning the dataset (stripping whitespace from columns, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we sorted data by date (`timedelta`) to prevent "data leakage" from the future into the training set.
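A minimal sketch of this split, assuming a pandas DataFrame `df` with the UCI column names (in this dataset a *larger* `timedelta` means an *older* article); the 80/20 ratio here is illustrative, not necessarily the notebook's exact choice:

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Split chronologically: train on older articles, test on newer ones."""
    # Larger timedelta = older article, so descending order = oldest first.
    df_sorted = df.sort_values("timedelta", ascending=False)
    cut = int(len(df_sorted) * train_frac)
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]
```

Because the test set contains only articles newer than everything in the training set, no "future" information can leak backwards.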
### Correlation Analysis
We analyzed the relationship between content features (images, links, sentiment) and the target variable.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/SXWeSP_FB4xkGCfGMp64F.png)
*Above: Correlation Heatmap showing feature relationships.*
**📉 Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
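The check itself is short in pandas; a sketch, assuming the cleaned DataFrame `df` still contains the `shares` column:

```python
import pandas as pd

def correlations_with_shares(df: pd.DataFrame) -> pd.Series:
    """Rank features by absolute Pearson correlation with `shares`."""
    corr = df.corr(numeric_only=True)["shares"].drop("shares")
    # Sort by magnitude so strong negative correlations also surface.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)
```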
---
## 📊 Phase 2: Regression Model Strategy
To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
**The 3 Models Compared:**
1. **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *Sentiment* vs. *Subjectivity*).
3. **Gradient Boosting Regressor:** Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/U1FQ43jtqphLJnGa0cwWF.png)
*Results of this comparison are detailed in Phase 4.*
---
## 🧪 Phase 3: Feature Engineering (Clustering)
To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
1. **Sentiment:** (Positive vs. Negative)
2. **Subjectivity:** (Opinion vs. Fact)
### Choosing the Optimal 'k'
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/DaPKkYm1rCC6ShOKSQJmw.png)
*Above: Elbow Method (Left) and Silhouette Score (Right).*
**Decisions & Logic:**
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
---
## 📉 Phase 4: Regression Results
With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (Root Mean Squared Error) and **R2 Score**.
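The comparison loop, as a sketch: it assumes `X_train`/`X_test` and targets from the time-based split, and the hyperparameters shown are illustrative defaults rather than the notebook's tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

def compare_regressors(X_train, y_train, X_test, y_test):
    """Fit each model and report RMSE / R2 on the held-out set."""
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {"RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
                         "R2": float(r2_score(y_test, pred))}
    return results
```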
**๐Ÿ† Result:**
* All models struggled to predict the exact number (Low R2 scores across the board).
* **Gradient Boosting** performed best, achieving a lower error than the Linear Regression baseline.
* **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/s3ORI1PVyYMymPVc6t8oN.png)
---
## 🚀 Phase 5: Classification Analysis (The Solution)
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
**Threshold:** Median split (>1400 shares).
We repeated the comparison process, pitting **Logistic Regression** (Baseline) against **Random Forest** and **Gradient Boosting**.
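A sketch of the setup, assuming raw share counts carried over from the earlier split; 1400 is the dataset median, and the hyperparameters shown are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def compare_classifiers(X_train, shares_train, X_test, shares_test, threshold=1400):
    """Binarize `shares` at the threshold and compare models by test AUC."""
    y_train = (np.asarray(shares_train) > threshold).astype(int)
    y_test = (np.asarray(shares_test) > threshold).astype(int)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    }
    # AUC is computed from predicted probabilities, not hard labels.
    return {name: roc_auc_score(y_test,
                                m.fit(X_train, y_train).predict_proba(X_test)[:, 1])
            for name, m in models.items()}
```

Using AUC rather than accuracy keeps the comparison threshold-free, which matters when editors may want to tune the alert cutoff later.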
### Model Showdown
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/89myhhhLgf_LA4oDIqto0.png)
*Above: ROC Curves comparing the 3 models.*
**๐Ÿ† Result:**
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the baseline (AUC 0.64), showing that the model learned complex non-linear patterns.
### What Drives Virality? (Interpretation)
We analyzed which features the model found most important.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/nI0qJkf_gsC7ckR-vJAs8.png)
**💡 Key Insights:**
* **#1 Predictor (kw_avg_avg):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
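Reading the importances off the fitted ensemble is straightforward; a sketch, where `feature_names` is assumed to match the training columns in order:

```python
import pandas as pd

def top_features(model, feature_names, n=10):
    """Return the n most important features of a fitted tree ensemble."""
    return (pd.Series(model.feature_importances_, index=feature_names)
              .sort_values(ascending=False)
              .head(n))
```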
---
## ⚖️ Phase 6: Final Evaluation
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).
* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.
![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/0PpsAHMoGwCVUJ8TY2rBQ.png)
---
## 🎮 Bonus: The Viral-O-Meter
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
---
## 📂 Files in this Repo
* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.
Video Link: https://youtu.be/Al665qltkDg
---
license: mit
---