---
license: mit
---

# 📰 Predicting Viral News: A Data Science Pipeline

**Author:** [Your Name]
**Course:** [Course Name/Number]
|
| 4 |
|
| 5 |
+
## ๐ Project Overview
|
| 6 |
+
In the fast-paced world of digital media, predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.
|
| 7 |
+
|
| 8 |
+
The goal was to transform raw data into actionable insights by:
|
| 9 |
+
1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
|
| 10 |
+
2. **Regression Analysis:** Attempting to predict the exact share count.
|
| 11 |
+
3. **Classification Analysis:** Successfully predicting if an article will be a "Hit" (>1400 shares) or a "Flop".
|
| 12 |
+
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
## ๐ ๏ธ The Dataset
|
| 16 |
+
* **Source:** UCI Machine Learning Repository (Online News Popularity).
|
| 17 |
+
* **Size:** ~39,000 articles.
|
| 18 |
+
* **Features:** 61 columns (Content, Sentiment, Time, Keywords).
|
| 19 |
+
* **Target:** `shares` (Number of social media shares).
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## ๐งน Phase 1: Data Handling & EDA
|
| 24 |
+
We began by cleaning the dataset (stripping whitespace from columns, removing duplicates) and performing a **Time-Based Split**. Since the goal is to predict *future* performance, we sorted data by date (`timedelta`) to prevent "data leakage" from the future into the training set.
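
A chronological hold-out like the one described can be sketched as follows (the helper name and the 80/20 ratio are illustrative assumptions, not the project's actual code; in the UCI dataset `timedelta` counts days since publication, so smaller values mean newer articles):

```python
import numpy as np
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timedelta",
                     test_frac: float = 0.2):
    """Sort chronologically and hold out the most recent rows as the test set."""
    # Clean column names (the raw CSV ships with leading spaces) and dedupe.
    df = df.rename(columns=str.strip).drop_duplicates()
    # Larger timedelta = older article, so sort descending to put oldest first.
    df = df.sort_values(time_col, ascending=False).reset_index(drop=True)
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]

# Toy example: 10 articles, newest have the smallest timedelta.
toy = pd.DataFrame({"timedelta": np.arange(10, 0, -1), "shares": np.arange(10)})
train, test = time_based_split(toy, test_frac=0.2)
```

Because the split is positional after sorting, the test set always contains the newest articles, mimicking "train on the past, evaluate on the future".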
|
| 25 |
+
|
| 26 |
+
### Correlation Analysis
|
| 27 |
+
We analyzed the relationship between content features (images, links, sentiment) and the target variable.
|
| 28 |
+
|
| 29 |
+

|
| 30 |
+
*Above: Correlation Heatmap showing feature relationships.*
|
| 31 |
+
|
| 32 |
+
**๐ Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
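
The correlation check behind the heatmap can be reproduced in a few lines; the frame below is synthetic stand-in data (the column names echo the UCI CSV, the values do not):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "n_tokens_content": rng.integers(100, 2000, 500),
    "num_imgs": rng.integers(0, 20, 500),
    "global_sentiment_polarity": rng.normal(0.1, 0.2, 500),
})
df["shares"] = rng.lognormal(mean=7, sigma=1, size=500)  # heavy-tailed, like real share counts

# Pearson correlation of every feature against the target.
corr_with_target = df.corr(numeric_only=True)["shares"].drop("shares")
print(corr_with_target.abs().sort_values(ascending=False))
```

On the real data, the largest absolute value is only about 0.06, which is what motivated the move to tree-based models.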
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## ๐งช Phase 2: Feature Engineering (Clustering)
|
| 37 |
+
To capture the subtle "tone" of an article, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
|
| 38 |
+
1. **Sentiment:** (Positive vs. Negative)
|
| 39 |
+
2. **Subjectivity:** (Opinion vs. Fact)
|
| 40 |
+
|
| 41 |
+
### Choosing the Optimal 'k'
|
| 42 |
+
We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.
|
| 43 |
+
|
| 44 |
+

|
| 45 |
+
*Above: Elbow Method (Left) and Silhouette Score (Right).*
|
| 46 |
+
|
| 47 |
+
**Decisions & Logic:**
|
| 48 |
+
* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
|
| 49 |
+
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
|
| 50 |
+
* **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature.
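
The k selection can be sketched like this; the two-dimensional "tone" data here is synthetic (four loose blobs standing in for the real sentiment/subjectivity columns), so the exact scores differ from the project's:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for (sentiment polarity, subjectivity) pairs:
# e.g. negative-fact, positive-fact, negative-opinion, positive-opinion.
rng = np.random.default_rng(42)
centers = [(-0.5, 0.2), (0.5, 0.2), (-0.5, 0.8), (0.5, 0.8)]
X = np.vstack([rng.normal(c, 0.08, size=(100, 2)) for c in centers])

wcss, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                 # plotted for the Elbow Method
    sil[k] = silhouette_score(X, km.labels_)

# Final choice: k=4, giving each article a categorical "vibe" label.
cluster_vibe = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

WCSS always decreases as k grows; the "elbow" is the point where the decrease flattens, which is why it is read off a plot rather than a single number.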
|
| 51 |
+
|
| 52 |
+
---
|
| 53 |
+
|
| 54 |
+
## ๐ Phase 3: Regression Analysis
|
| 55 |
+
**Goal:** Predict the exact number of shares (`log_shares`).
|
| 56 |
+
|
| 57 |
+
We trained three models to compare performance:
|
| 58 |
+
1. **Linear Regression** (Baseline)
|
| 59 |
+
2. **Random Forest Regressor**
|
| 60 |
+
3. **Gradient Boosting Regressor**
|
| 61 |
+
|
| 62 |
+
**๐ Result:**
|
| 63 |
+
* All models struggled to predict the exact number (Low R2).
|
| 64 |
+
* **Gradient Boosting** performed best, but the error margin was still too high for business use.
|
| 65 |
+
* **Pivot:** We concluded that predicting the *exact* share count is noisy due to viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
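
The three-way comparison follows a standard pattern; this sketch uses `make_regression` as a stand-in for the news features and log-transformed target (model settings are illustrative, not the tuned ones):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and log_shares target.
X, y = make_regression(n_samples=1000, n_features=20, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
# Fit each model and score it on the held-out fold.
r2 = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
      for name, m in models.items()}
print(r2)
```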
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
## ๐ Phase 4: Classification Analysis (The Solution)
|
| 70 |
+
**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
|
| 71 |
+
**Threshold:** Median split (>1400 shares).
|
| 72 |
+
|
| 73 |
+
We compared Logistic Regression (Baseline) against Tree-based models.
|
| 74 |
+
|
| 75 |
+
### Model Showdown
|
| 76 |
+

|
| 77 |
+
*Above: ROC Curves comparing the 3 models.*
|
| 78 |
+
|
| 79 |
+
**๐ Result:**
|
| 80 |
+
* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
|
| 81 |
+
* It significantly outperformed the Baseline (AUC 0.64), proving the model successfully learned complex non-linear patterns.
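
The baseline-vs-boosting comparison can be sketched as follows (synthetic binary labels stand in for the median-split "Hit"/"Flop" target, so the AUC values here are not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: label 1 ~ "shares above the 1400 median", else 0.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# AUC is computed from predicted probabilities, not hard labels.
auc_baseline = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
auc_gb = roc_auc_score(y_te, gb.predict_proba(X_te)[:, 1])
```

Using `predict_proba` rather than `predict` matters: the ROC curve sweeps over all probability thresholds, which is what makes AUC threshold-independent.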
|
| 82 |
+
|
| 83 |
+
### What Drives Virality? (Interpretation)
|
| 84 |
+
We analyzed which features the model found most important.
|
| 85 |
+
|
| 86 |
+

|
| 87 |
+
*Above: ROC Curve (Left) and Feature Importance Plot (Right).*
|
| 88 |
+
|
| 89 |
+
**๐ก Key Insights:**
|
| 90 |
+
* **#1 Predictor (kw_avg_avg):** The historical performance of keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again. This suggests a "Caching Effect" in audience interest.
|
| 91 |
+
* **Content vs. Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
|
| 92 |
+
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical metrics overpowered it in the final decision trees.
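
Extracting the importance ranking is a one-liner once the model is fitted; the feature names below are a hypothetical subset of the real 61 columns, paired with synthetic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical subset of the dataset's columns, for illustration only.
names = ["kw_avg_avg", "n_tokens_content", "num_imgs", "is_weekend",
         "global_subjectivity", "cluster_vibe"]
X, y = make_classification(n_samples=1000, n_features=len(names),
                           n_informative=4, random_state=0)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1 across features.
importances = pd.Series(gb.feature_importances_, index=names)
ranking = importances.sort_values(ascending=False)
print(ranking)
```

Note that impurity-based importances can favor high-cardinality numeric features; permutation importance is a common cross-check.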
|
| 93 |
+
|
| 94 |
+
---
|
| 95 |
+
|
| 96 |
+
## โ๏ธ Phase 5: Final Evaluation
|
| 97 |
+
We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).
|
| 98 |
+
|
| 99 |
+
* **Final AUC:** ~0.75
|
| 100 |
+
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.
|
| 101 |
+
|
| 102 |
+
---
|
| 103 |
+
|
| 104 |
+
## ๐ฎ Bonus: The Viral-O-Meter
|
| 105 |
+
To demonstrate the model's utility, we built an interactive **Gradio Dashboard** embedded in the notebook. This allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction on whether their draft will go viral.
|
| 106 |
+
|
| 107 |
+
---
|
| 108 |
+
|
| 109 |
+
## ๐ Files in this Repo
|
| 110 |
+
* `notebook.ipynb`: The complete Python code for the pipeline.
|
| 111 |
+
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
|
| 112 |
+
* `README.md`: Project documentation.
|
| 113 |
+
"""