# 🚀 Predicting Viral News: A Data Science Pipeline

**Author:** Matan Kriel

## 📌 Project Overview

Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end data science pipeline to analyze the **Online News Popularity** dataset. The goal was to transform raw data into actionable insights by:

1. **Engineering Features:** Creating a custom "Article Vibe" feature using clustering.
2. **Regression Analysis:** Attempting to predict the exact share count.
3. **Classification Analysis:** Successfully predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".

---

## 🛠️ The Dataset

* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (content, sentiment, time, keywords).
* **Target:** `shares` (number of social media shares).

---

## 🧹 Phase 1: Data Handling & EDA

We began by cleaning the dataset (stripping whitespace from column names, removing duplicates) and performing a **time-based split**. Since the goal is to predict *future* performance, we sorted the data by date (`timedelta`) to prevent data leakage from the future into the training set.

### Correlation Analysis

We analyzed the relationship between content features (images, links, sentiment) and the target variable.

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/SXWeSP_FB4xkGCfGMp64F.png)
*Above: Correlation heatmap showing feature relationships.*

**📉 Insight:** As the heatmap shows, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
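The cleaning and time-based split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's notebook: the column names (`timedelta`, `n_tokens_content`, `num_imgs`, `shares`) follow the real UCI dataset, but the values, the 80/20 cut, and the assumption that larger `timedelta` means an older article are stand-ins.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the UCI dataset; in this sketch we assume
# larger `timedelta` = older article (published longer ago).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "timedelta": rng.integers(8, 731, size=1000),
    "n_tokens_content": rng.integers(100, 2000, size=1000),
    "num_imgs": rng.integers(0, 20, size=1000),
    "shares": rng.lognormal(7, 1, size=1000).astype(int),
})

# Clean-up: strip whitespace from column names, drop exact duplicates.
df.columns = df.columns.str.strip()
df = df.drop_duplicates()

# Time-based split: oldest articles first, most recent 20% held out as
# the "future" test set so no future information leaks into training.
df = df.sort_values("timedelta", ascending=False).reset_index(drop=True)
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

# Correlation of each feature with the target, computed on train only.
corr = train.corr(numeric_only=True)["shares"].drop("shares")
print(corr)
```

The key point is that both the split boundary and the correlation analysis use only training-period data.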
---

## 📊 Phase 2: Regression Model Strategy

To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.

**The 3 Models Compared:**

1. **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *sentiment* vs. *subjectivity*).
3. **Gradient Boosting Regressor:** Selected as the "challenger" model, known for high accuracy in Kaggle-style competitions.

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/U1FQ43jtqphLJnGa0cwWF.png)
*Results of this comparison are detailed in Phase 4.*

---

## 🧪 Phase 3: Feature Engineering (Clustering)

To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means clustering** to group articles along two dimensions:

1. **Sentiment:** positive vs. negative.
2. **Subjectivity:** opinion vs. fact.

### Choosing the Optimal 'k'

We used the Elbow Method and silhouette analysis to determine the best number of clusters.

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/DaPKkYm1rCC6ShOKSQJmw.png)
*Above: Elbow Method (left) and silhouette score (right).*

**Decisions & Logic:**

* **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette Score:** While k=2 had a higher score, it was too broad (just positive/negative). k=4 maintained a strong score (~0.32) while providing the necessary granularity (e.g., *positive-opinion*, *neutral-fact*, etc.).
* **Action:** We assigned every article to one of these 4 clusters and added the cluster label as a categorical feature to improve our models.
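The clustering workflow above can be sketched as follows. This is a minimal example on uniformly sampled stand-in data (the real inputs are the dataset's sentiment and subjectivity columns); it computes WCSS (inertia) for the Elbow Method and silhouette scores over a range of k, then fits the final k=4 model chosen in the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for the two "tone" dimensions used in the project:
# sentiment polarity in [-1, 1] and subjectivity in [0, 1].
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(-1, 1, 500),  # sentiment polarity
    rng.uniform(0, 1, 500),   # subjectivity
])

# Elbow Method (WCSS/inertia) and silhouette score for each candidate k.
inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# The project settled on k=4; assign each article its "vibe" cluster.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
cluster_vibe = kmeans.labels_  # new categorical feature
```

In practice the two inputs should be put on comparable scales (e.g., with `StandardScaler`) before clustering, since K-Means is distance-based.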
---

## 📉 Phase 4: Regression Results

With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (root mean squared error) and the **R² score**.

**🏆 Result:**

* All models struggled to predict the exact share count (low R² scores across the board).
* **Gradient Boosting** performed best, reducing the error the most relative to the linear baseline.
* **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers, so we pivoted to **classification** to solve a more actionable business problem: *"Will this article be popular or not?"*

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/s3ORI1PVyYMymPVc6t8oN.png)

---

## 🚀 Phase 5: Classification Analysis (The Solution)

**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
**Threshold:** Median split (>1400 shares).

We repeated the comparison process, pitting **Logistic Regression** (baseline) against **Random Forest** and **Gradient Boosting**.

### Model Showdown

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/89myhhhLgf_LA4oDIqto0.png)
*Above: ROC curves comparing the 3 models.*

**🏆 Result:**

* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the baseline (AUC 0.64), showing that the model learned complex non-linear patterns.

### What Drives Virality? (Interpretation)

We analyzed which features the model found most important.

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/nI0qJkf_gsC7ckR-vJAs8.png)

**💡 Key Insights:**

* **#1 Predictor (`kw_avg_avg`):** The historical performance of an article's keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again, suggesting a "caching effect" in audience interest.
* **Content vs.
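The three-way classifier showdown can be sketched as below. This is a hedged stand-in, not the project's notebook: it uses `make_classification` instead of the news features, with the label playing the role of the median split `shares > 1400`; the model choices and the AUC comparison mirror the process described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in; in the real pipeline y = (shares > 1400).astype(int)
# and the split is time-based rather than random.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # scores feeding the ROC curve
    aucs[name] = roc_auc_score(y_te, proba)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

AUC is computed from predicted probabilities (not hard labels), which is what makes the ROC comparison between the three models meaningful.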
Context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster Vibe:** While our engineered cluster feature helped group articles, historical keyword metrics overpowered it in the final decision trees.

---

## ⚖️ Phase 6: Final Evaluation

We ran the winning **Gradient Boosting Classifier** on the **test set** (the "future" data held out from the start).

* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.

![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/0PpsAHMoGwCVUJ8TY2rBQ.png)

---

## 🎮 Bonus: The Viral-O-Meter

To demonstrate the model's utility, we built an interactive **Gradio** dashboard embedded in the notebook. It allows non-technical stakeholders (e.g., editors) to input article metrics and receive a real-time prediction of whether their draft will go viral.

---

## 📂 Files in this Repo

* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.

**Video Link:** https://youtu.be/Al665qltkDg

---
license: mit
---