# Predicting Viral News: A Data Science Pipeline

**Author:** Matan Kriel

## Project Overview

Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end data science pipeline to analyze the **Online News Popularity** dataset.

The goal was to transform raw data into actionable insights by:

1. **Engineering Features:** Creating a custom "Article Vibe" feature using clustering.
2. **Regression Analysis:** Attempting to predict the exact share count.
3. **Classification Analysis:** Predicting whether an article will be a "Hit" (>1400 shares) or a "Flop".
---

## The Dataset

* **Source:** UCI Machine Learning Repository (Online News Popularity).
* **Size:** ~39,000 articles.
* **Features:** 61 columns (content, sentiment, time, keywords).
* **Target:** `shares` (number of social media shares).
---

## Phase 1: Data Handling & EDA

We began by cleaning the dataset (stripping whitespace from column names, removing duplicates) and performing a **time-based split**. Since the goal is to predict *future* performance, we sorted the data chronologically (using `timedelta`) so that no information from the future could leak into the training set.
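The cleaning and time-based split can be sketched as follows. A tiny toy frame stands in for the real CSV; note that in this dataset `timedelta` counts days between publication and data collection, so *larger* values mean *older* articles:

```python
import pandas as pd

# Toy stand-in for the UCI CSV; the raw file ships with leading spaces
# in its column headers, reproduced here on purpose.
df = pd.DataFrame({
    " timedelta": [10, 300, 150, 600, 45],
    " shares": [1200, 5000, 800, 2100, 9400],
})
df.columns = df.columns.str.strip()   # strip whitespace from column names
df = df.drop_duplicates()

# Larger timedelta = older article, so sort descending: train on the
# past, hold out the most recent 20% as the "future" test set.
df = df.sort_values("timedelta", ascending=False)
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```

Because every article in `test` is newer than every article in `train`, the evaluation mimics real deployment: predicting tomorrow's shares from yesterday's data.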
### Correlation Analysis

We analyzed the relationship between content features (images, links, sentiment) and the target variable.



*Above: Correlation heatmap showing feature relationships.*

**Insight:** As seen in the heatmap, the linear correlation between individual features (such as `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the use of tree-based models over simple linear regression.
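The correlation check itself is a one-liner in pandas. Here is a sketch on a synthetic stand-in frame, deliberately built with a weak, noisy signal to mirror the ~0.06 maximum seen in the real heatmap:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the news frame: weak, noisy link to `shares`.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "n_tokens_content": rng.integers(100, 2000, n),
    "num_imgs": rng.integers(0, 20, n),
})
df["shares"] = 1400 + 0.05 * df["n_tokens_content"] + rng.normal(0, 3000, n)

# Correlation of every feature with the target, strongest first.
corr = df.corr()["shares"].drop("shares").sort_values(key=abs, ascending=False)
print(corr)
```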
---

## Phase 2: Regression Model Strategy

To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.

**The 3 Models Compared:**

1. **Linear Regression (baseline):** A simple linear model to establish the minimum performance benchmark.
2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *sentiment* vs. *subjectivity*).
3. **Gradient Boosting Regressor:** Selected as the "challenger" model, known for high accuracy in Kaggle-style competitions.



*Results of this comparison are detailed in Phase 4.*
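In scikit-learn, the three contenders line up as below. Hyperparameters here are illustrative defaults, not the tuned values from the notebook:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Baseline first, then the two tree ensembles; all share the same
# fit/predict interface, so the comparison loop is uniform.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
```

Keeping the models in a dict makes the later evaluation a simple loop over `models.items()`, fitting each on the training split and scoring on the held-out data.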
---

## Phase 3: Feature Engineering (Clustering)

To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means clustering** to group articles along two dimensions:

1. **Sentiment:** positive vs. negative.
2. **Subjectivity:** opinion vs. fact.

### Choosing the Optimal 'k'

We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.



*Above: Elbow Method (left) and Silhouette Score (right).*

**Decisions & Logic:**

* **The elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
* **Silhouette score:** While k=2 scored higher, it was too broad (just positive/negative). k=4 maintained a solid score (~0.32) while providing the necessary granularity (e.g., *positive-opinion*, *neutral-fact*, etc.).
* **Action:** We assigned every article to one of these 4 clusters and added the label as a categorical feature to improve our models.
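The elbow/silhouette sweep can be sketched as below. The toy 2-D "vibe" space stands in for the dataset's sentiment and subjectivity columns (assumed here to be `global_sentiment_polarity` and `global_subjectivity`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy 2-D vibe space: sentiment polarity in [-1, 1], subjectivity in [0, 1].
rng = np.random.default_rng(42)
X = rng.uniform([-1, 0], [1, 1], size=(500, 2))

wcss, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = km.inertia_                     # within-cluster sum of squares (elbow)
    sil[k] = silhouette_score(X, km.labels_)  # cohesion vs. separation

# After inspecting both curves, refit with the chosen k and attach the
# cluster label as the new `cluster_vibe` categorical feature.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
```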
---

## Phase 4: Regression Results

With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (root mean squared error) and the **R² score**.

**Result:**

* All models struggled to predict the exact share count (low R² scores across the board).
* **Gradient Boosting** performed best, reducing the error relative to the linear baseline.
* **Pivot decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers, so we pivoted to **classification** to answer a more actionable business question: *"Will this be popular or not?"*


---

## Phase 5: Classification Analysis (The Solution)

**Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).

**Threshold:** Median split (>1400 shares).

We repeated the comparison process, pitting **Logistic Regression** (baseline) against **Random Forest** and **Gradient Boosting**.
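Binarizing the target at the median is one line, and it guarantees a roughly balanced class split by construction (the median of `shares` in this dataset lands around 1400):

```python
import pandas as pd

# Tiny stand-in for the shares column; its median happens to be 1400.
shares = pd.Series([100, 900, 1400, 2000, 50000])
threshold = shares.median()          # ~1400 on the full dataset
y = (shares > threshold).astype(int) # 1 = "Viral", 0 = "Not Viral"
print(y.tolist())  # [0, 0, 0, 1, 1]
```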
### Model Showdown



*Above: ROC curves comparing the 3 models.*

**Result:**

* **Gradient Boosting** was the clear winner with an **AUC of ~0.75**.
* It significantly outperformed the baseline (AUC 0.64), showing that the model learned complex non-linear patterns.
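The AUC comparison loop looks like this. Synthetic data stands in for the time-split news features, so the printed AUC values are illustrative, not the ~0.75 and 0.64 reported above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in data for the Viral / Not Viral task.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

aucs = {}
for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("Gradient Boosting", GradientBoostingClassifier(random_state=42))]:
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]   # P(viral): the score behind the ROC curve
    aucs[name] = roc_auc_score(y_te, proba)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

AUC is the right metric here because it scores the *ranking* of articles by predicted virality, independent of any single decision threshold.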
### What Drives Virality? (Interpretation)

We analyzed which features the model found most important.



**Key Insights:**

* **#1 predictor (`kw_avg_avg`):** The historical performance of an article's keywords is the strongest predictor. If a topic was popular in the past, it is likely to be popular again, suggesting a "caching effect" in audience interest.
* **Content vs. context:** Structural features (like `is_weekend` or `num_imgs`) mattered less than the specific keywords used.
* **Cluster vibe:** While our engineered cluster feature helped group articles, historical keyword metrics dominated it in the final decision trees.
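The importance chart comes from the fitted model's `feature_importances_` attribute. A sketch on toy data, with the five names mirroring (a subset of) the real feature set:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names matching the columns discussed above;
# the data itself is synthetic, so the ranking here is illustrative.
names = ["kw_avg_avg", "is_weekend", "num_imgs",
         "global_subjectivity", "cluster_vibe"]
X, y = make_classification(n_samples=400, n_features=5, random_state=42)

clf = GradientBoostingClassifier(random_state=42).fit(X, y)
importances = (pd.Series(clf.feature_importances_, index=names)
                 .sort_values(ascending=False))
print(importances)  # importances are normalized to sum to 1
```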
---

## Phase 6: Final Evaluation

We ran the winning **Gradient Boosting Classifier** on the **test set** (the "future" data held out from the start).

* **Final AUC:** ~0.75
* **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.


---

## Bonus: The Viral-O-Meter

To demonstrate the model's utility, we built an interactive **Gradio** dashboard embedded in the notebook. It lets non-technical stakeholders (e.g., editors) enter article metrics and receive a real-time prediction of whether their draft will go viral.
---

## Files in this Repo

* `notebook.ipynb`: The complete Python code for the pipeline.
* `gradient_boosting_viral_predictor.joblib`: The saved final model.
* `README.md`: Project documentation.
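A round-trip sketch of how the `.joblib` artifact is produced and reused (toy data stands in for the engineered news features; only the filename is taken from this repo):

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Train a toy classifier and save it under the repo's artifact name.
X = np.random.RandomState(42).rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
clf = GradientBoostingClassifier(random_state=42).fit(X, y)
joblib.dump(clf, "gradient_boosting_viral_predictor.joblib")

# Later (e.g., inside the Viral-O-Meter app): load and score a draft.
loaded = joblib.load("gradient_boosting_viral_predictor.joblib")
proba = loaded.predict_proba(X[:5])[:, 1]   # probability of "Viral"
```

The loaded model expects feature vectors in the same column order used at training time, so any consumer (notebook or dashboard) must preserve that ordering.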
**Video Link:** https://youtu.be/Al665qltkDg
---

**License:** MIT