MatanKriel committed on
Commit
cd29087
·
verified ·
1 Parent(s): 963e56b

Update README.md

Files changed (1)
  1. README.md +45 -24
README.md CHANGED
@@ -1,9 +1,9 @@
- """# 🚀 Predicting Viral News: A Data Science Pipeline
- **Author:** [Your Name]
- **Course:** [Course Name/Number]

  ## 📌 Project Overview
- In the fast-paced world of digital media, predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.

  The goal was to transform raw data into actionable insights by:
  1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
@@ -26,54 +26,73 @@ We began by cleaning the dataset (stripping whitespace from columns, removing du
  ### Correlation Analysis
  We analyzed the relationship between content features (images, links, sentiment) and the target variable.

- ![Correlation Heatmap](image_620b38.png)
  *Above: Correlation Heatmap showing feature relationships.*

  **📉 Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.

  ---

- ## 🧪 Phase 2: Feature Engineering (Clustering)
- To capture the subtle "tone" of an article, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
  1. **Sentiment:** (Positive vs. Negative)
  2. **Subjectivity:** (Opinion vs. Fact)

  ### Choosing the Optimal 'k'
  We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.

- ![Elbow Method](image_612983.png)
  *Above: Elbow Method (Left) and Silhouette Score (Right).*

  **Decisions & Logic:**
  * **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
  * **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
- * **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature.

  ---

- ## 📉 Phase 3: Regression Analysis
- **Goal:** Predict the exact number of shares (`log_shares`).
-
- We trained three models to compare performance:
- 1. **Linear Regression** (Baseline)
- 2. **Random Forest Regressor**
- 3. **Gradient Boosting Regressor**

  **🏆 Result:**
- * All models struggled to predict the exact number (Low R2).
- * **Gradient Boosting** performed best, but the error margin was still too high for business use.
- * **Pivot:** We concluded that predicting the *exact* share count is noisy due to viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*

  ---

- ## 🚀 Phase 4: Classification Analysis (The Solution)
  **Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
  **Threshold:** Median split (>1400 shares).

- We compared Logistic Regression (Baseline) against Tree-based models.

  ### Model Showdown
- ![ROC Curve Comparison](image_5ed85f.png)
  *Above: ROC Curves comparing the 3 models.*

  **🏆 Result:**
@@ -93,12 +112,14 @@ We analyzed which features the model found most important.

  ---

- ## ⚖️ Phase 5: Final Evaluation
  We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).

  * **Final AUC:** ~0.75
  * **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.

  ---

  ## 🎮 Bonus: The Viral-O-Meter
@@ -110,7 +131,7 @@ To demonstrate the model's utility, we built an interactive **Gradio Dashboard**
  * `notebook.ipynb`: The complete Python code for the pipeline.
  * `gradient_boosting_viral_predictor.joblib`: The saved final model.
  * `README.md`: Project documentation.
- """
  ---
  license: mit
  ---
 
+ # 🚀 Predicting Viral News: A Data Science Pipeline
+ **Author:** Matan Kriel
+

  ## 📌 Project Overview
+ Predicting which articles will go "viral" is the holy grail for publishers. This project builds a complete end-to-end Data Science pipeline to analyze the **Online News Popularity** dataset.

  The goal was to transform raw data into actionable insights by:
  1. **Engineering Features:** Creating a custom "Article Vibe" feature using Clustering.
 
  ### Correlation Analysis
  We analyzed the relationship between content features (images, links, sentiment) and the target variable.

+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/SXWeSP_FB4xkGCfGMp64F.png)
+
  *Above: Correlation Heatmap showing feature relationships.*

  **📉 Insight:** As seen in the heatmap, the linear correlation between individual features (like `n_tokens_content` or `num_imgs`) and `shares` is extremely low (max ~0.06). This suggests that virality is **non-linear** and complex, justifying the need for advanced tree-based models over simple linear regression.
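
As a rough illustration of this check (not the notebook's actual code), the feature-vs-target correlation scan can be sketched in pandas. The data below is synthetic; only the column names (`n_tokens_content`, `num_imgs`, `global_sentiment_polarity`, `shares`) follow the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the Online News Popularity dataset;
# the real notebook loads the CSV instead.
df = pd.DataFrame({
    "n_tokens_content": rng.integers(100, 2000, n),
    "num_imgs": rng.integers(0, 20, n),
    "global_sentiment_polarity": rng.normal(0.1, 0.3, n),
})
# Shares are mostly noise here, mimicking the weak linear signal.
df["shares"] = 1400 + rng.exponential(1000, n).astype(int)

# Pearson correlation of every feature against the target.
corr_with_shares = df.corr()["shares"].drop("shares")
print(corr_with_shares.sort_values(ascending=False))
```

On the real data, every coefficient in this vector comes out near zero, which is what motivates the move to tree-based models.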

  ---

+ ## 📊 Phase 2: Regression Model Strategy
+ To tackle the difficult task of predicting exact share counts, we designed a rigorous comparison of three distinct regression algorithms. This allowed us to establish a baseline before attempting complex feature engineering.
+
+ **The 3 Models Compared:**
+ 1. **Linear Regression (Baseline):** A simple linear model to establish the minimum performance benchmark.
+ 2. **Random Forest Regressor:** Selected for its ability to handle non-linear relationships and interactions between features (e.g., *Sentiment* vs. *Subjectivity*).
+ 3. **Gradient Boosting Regressor:** Selected as the "Challenger" model, known for high precision in Kaggle-style competitions.
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/U1FQ43jtqphLJnGa0cwWF.png)
+
+ *Results of this comparison are detailed in Phase 4.*
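
A minimal sketch of this three-model setup in scikit-learn (synthetic features and target, default-ish hyperparameters, not the notebook's tuned values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))  # stand-in feature matrix
# Stand-in log-scale share target with a bit of real signal in column 0.
y = np.log1p(np.abs(X[:, 0] * 3 + rng.normal(size=400)) * 1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# The three candidates compared in the README.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
# .score() returns R2 on the held-out split.
scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
          for name, m in models.items()}
for name, r2 in scores.items():
    print(f"{name}: R2 = {r2:.3f}")
```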
+
+ ---
+
+ ## 🧪 Phase 3: Feature Engineering (Clustering)
+ To capture the subtle "tone" of an article, which raw numbers often miss, we engineered a new feature called `cluster_vibe`. We used **K-Means Clustering** to group articles based on two dimensions:
  1. **Sentiment:** (Positive vs. Negative)
  2. **Subjectivity:** (Opinion vs. Fact)

  ### Choosing the Optimal 'k'
  We used the Elbow Method and Silhouette Analysis to determine the best number of clusters.

+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/DaPKkYm1rCC6ShOKSQJmw.png)
+
  *Above: Elbow Method (Left) and Silhouette Score (Right).*

  **Decisions & Logic:**
  * **The Elbow:** We observed a distinct "bend" in the WCSS curve at **k=4**.
  * **Silhouette Score:** While k=2 had a higher score, it was too broad (just Pos/Neg). k=4 maintained a strong score (~0.32) while providing necessary granularity (e.g., *Positive-Opinion, Neutral-Fact, etc.*).
+ * **Action:** We assigned every article to one of these 4 clusters and added it as a categorical feature to improve our models.
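
The k-selection loop described above can be sketched as follows (synthetic sentiment/subjectivity points stand in for the dataset's text-derived columns):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Two toy blobs along the (sentiment, subjectivity) axes.
X = np.column_stack([
    np.concatenate([rng.normal(-0.5, 0.1, 100), rng.normal(0.5, 0.1, 100)]),
    np.concatenate([rng.normal(0.2, 0.1, 100), rng.normal(0.8, 0.1, 100)]),
])

# Elbow (inertia = WCSS) and silhouette score over a range of k.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2), round(silhouette_score(X, km.labels_), 3))

# The README settles on k=4; the labels become the `cluster_vibe` feature.
cluster_vibe = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

In the notebook, `cluster_vibe` would then be appended to the feature matrix as a categorical column.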

  ---

+ ## 📉 Phase 4: Regression Results
+ With our features engineered, we evaluated the three models defined in Phase 2 using **RMSE** (Root Mean Squared Error) and **R2 Score**.

  **🏆 Result:**
+ * All models struggled to predict the exact number (Low R2 scores across the board).
+ * **Gradient Boosting** performed best, minimizing the error more than the Linear Baseline.
+ * **Pivot Decision:** We concluded that predicting the *exact* share count is inherently noisy due to massive viral outliers. We decided to pivot to **Classification** to solve a more actionable business problem: *"Will this be popular or not?"*
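
For reference, the two metrics are computed like this (the prediction vectors here are made up, purely to show the calls):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy truth vs. predictions on a log-shares scale.
y_true = np.array([7.2, 7.9, 8.4, 7.0, 9.1])
y_pred = np.array([7.5, 7.6, 8.0, 7.3, 8.2])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # error in log-share units
r2 = r2_score(y_true, y_pred)                       # variance explained
print(f"RMSE={rmse:.3f}  R2={r2:.3f}")
```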
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/s3ORI1PVyYMymPVc6t8oN.png)

  ---

+ ## 🚀 Phase 5: Classification Analysis (The Solution)
  **Goal:** Classify articles as **Viral** (1) or **Not Viral** (0).
  **Threshold:** Median split (>1400 shares).

+ We repeated the comparison process, pitting **Logistic Regression** (Baseline) against **Random Forest** and **Gradient Boosting**.
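
A sketch of the median-split labeling and the three-way AUC comparison (synthetic data; in the real pipeline the median of `shares` is ~1400):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
shares = np.exp(7 + X[:, 0] + 0.5 * rng.normal(size=600))  # toy share counts

# Median split: an article is "viral" if its shares exceed the median.
y = (shares > np.median(shares)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
# ROC AUC from each model's predicted probability of the positive class.
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
print(aucs)
```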

  ### Model Showdown
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/89myhhhLgf_LA4oDIqto0.png)
+
  *Above: ROC Curves comparing the 3 models.*

  **🏆 Result:**

  ---

+ ## ⚖️ Phase 6: Final Evaluation
  We ran the winning **Gradient Boosting Classifier** on the **Test Set** (the "Future" data held out from the start).

  * **Final AUC:** ~0.75
  * **Conclusion:** The model is robust and generalizes well to unseen data. It is ready for deployment.
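
The final step, scoring the held-out set and persisting the model as the repo's `.joblib` artifact, might look like this (synthetic data; the filename matches the file listed below, everything else is illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Hold out a "future" test set, as the README describes.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Persist and reload the model the way the repo's artifact is used.
path = os.path.join(tempfile.mkdtemp(), "gradient_boosting_viral_predictor.joblib")
joblib.dump(clf, path)
reloaded = joblib.load(path)

auc = roc_auc_score(y_te, reloaded.predict_proba(X_te)[:, 1])
print(f"Test AUC: {auc:.3f}")
```

Loading the saved model with `joblib.load` is also how the Gradio dashboard below would serve predictions.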

+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67dfcd96d01eab4618a66f78/0PpsAHMoGwCVUJ8TY2rBQ.png)
+
  ---

  ## 🎮 Bonus: The Viral-O-Meter

  * `notebook.ipynb`: The complete Python code for the pipeline.
  * `gradient_boosting_viral_predictor.joblib`: The saved final model.
  * `README.md`: Project documentation.
+
  ---
  license: mit
  ---