odedf2001 committed on
Commit
361e19c
·
verified ·
1 Parent(s): 482c824

Create README.md

Files changed (1)
  1. README.md +315 -0
README.md CHANGED
@@ -0,0 +1,315 @@
---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# 🎬 Movie Revenue Prediction: Full ML Pipeline

This project builds a complete machine learning workflow on real movie metadata.
It covers data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models, and a full performance evaluation.

---

## 🧪 Part 0: Initial Research Questions (EDA)

Before any modeling, I asked a few basic questions about the dataset:

1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.

2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a “normal” runtime range (around 90–150 minutes), but runtime alone does not explain revenue.

3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates `original_language` by far, followed by a long tail of other languages (French, Spanish, Hindi, etc.).

These EDA steps helped build intuition before moving into modeling.

---

## 🧪 Main ML Research Questions

### **1️⃣ Can we accurately predict a movie’s revenue using metadata alone?**
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.

### **2️⃣ Which features have the strongest impact on movie revenue?**
We explore the importance of:
- budget
- vote count & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (`cluster_group`, `distance_to_centroid`)

### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
We convert revenue into a balanced binary target and apply classification models.

### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.

---

# 🧱 Part 1: Dataset & Basic Cleaning (Before Any Regression)

### 🔹 1. Loading the Data

- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)

### 🔹 2. Basic Cleaning

- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
  - `budget == 0`
  - `revenue == 0`
  - `runtime == 0`

This produced a smaller but more reliable dataset.
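The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual notebook code: the `clean_movies` helper and the tiny `raw` sample are hypothetical, and the filter keeps strictly positive values, which also drops rows whose numeric conversion failed.

```python
import pandas as pd


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce key columns to numeric, parse dates, and drop invalid rows."""
    df = df.copy()
    for col in ["budget", "revenue", "runtime", "popularity"]:
        # Unparseable strings become NaN instead of raising an error.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    # Assumption: keep only strictly positive values. Comparisons with NaN
    # are False, so rows that failed conversion are removed as well.
    mask = (df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)
    return df.loc[mask].reset_index(drop=True)


# Made-up sample standing in for movies_metadata.csv.
raw = pd.DataFrame({
    "budget": ["1000000", "0", "not_a_number", "5000000"],
    "revenue": ["3000000", "100", "200", "0"],
    "runtime": ["95", "100", "110", "120"],
    "popularity": ["1.5", "2.0", "3.1", "0.4"],
    "release_date": ["1999-03-31", "2005-01-01", "bad", "2010-07-16"],
})
cleaned = clean_movies(raw)
print(cleaned)
```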

---

# 📊 Part 2: Initial EDA (Before Any Model)

Key insights:

- **Budget vs Revenue**
  - Positive trend: higher budgets *tend* to go with higher revenue, but with large variability and many outliers.

  ![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/BOkbMfLzBaHIxgj8nU7MF.png)

- **Runtime vs Revenue**
  - No strong linear correlation. Being "very long" or "very short" does not guarantee success.

  ![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/NZQWe3X0kUNUXD3coeibM.png)

- **Original Language Distribution**
  - English is by far the most common original language; the dataset is dominated by English-language films.

  ![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/KCROsSBSS7zd9iQ2HIzjS.png)

These findings motivated the next steps: building a simple baseline model and then adding smarter features.

---

# 🧪 Part 3: Baseline Regression (Before Feature Engineering)

### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:

- `budget`
- `runtime`
- `vote_average`
- `vote_count`

### ⚙️ Model

- **Linear Regression** on the 4 basic features.
- Train/test split: 80% train / 20% test.

### 📊 Baseline Regression Results

Using only the basic features:

- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**

📌 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions, in the same units as `revenue`) show there is still a lot of noise and missing information, which is expected in movie revenue prediction.

This baseline serves as a reference point before introducing engineered features.
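A minimal sketch of this baseline setup with scikit-learn is shown below. The `fit_baseline` helper, the random seed, and the synthetic `demo` data are assumptions made for illustration, so the printed metrics will not match the numbers reported above. RMSE is computed as the square root of the mean squared error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_baseline(df, features, target="revenue", seed=42):
    """Fit a linear model on an 80/20 split and return test-set metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": r2_score(y_test, pred),
    }


# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "runtime": rng.uniform(80, 180, n),
    "vote_average": rng.uniform(3, 9, n),
    "vote_count": rng.integers(10, 10_000, n).astype(float),
})
demo["revenue"] = (
    2.5 * demo["budget"] + 2e4 * demo["vote_count"] + rng.normal(0, 3e7, n)
)

baseline_metrics = fit_baseline(
    demo, ["budget", "runtime", "vote_average", "vote_count"]
)
print(baseline_metrics)
```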

---

# 🧱 Part 4: Feature Engineering (Upgrading the Dataset)

To improve model performance, several new features were engineered:

### 🔹 New Numeric Features

- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = release year grouped by decade (e.g., 1980, 1990, 2000)
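The feature definitions above translate to pandas roughly as in this sketch. The `add_engineered_features` helper and the two-row `sample` are hypothetical. The sketch assumes that "length" means character count, that a missing overview counts as length 0, and that `budget` is non-zero after cleaning.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add profit, profit_ratio, overview_length, release_year, decade."""
    out = df.copy()
    out["profit"] = out["revenue"] - out["budget"]
    # Assumes budget != 0, which the cleaning step is meant to guarantee.
    out["profit_ratio"] = out["profit"] / out["budget"]
    # Assumption: length is the character count; missing overviews count as 0.
    out["overview_length"] = out["overview"].fillna("").str.len()
    out["release_year"] = out["release_date"].dt.year
    # Floor the year to the start of its decade, e.g. 1987 -> 1980.
    out["decade"] = (out["release_year"] // 10) * 10
    return out


# Made-up rows for illustration.
sample = pd.DataFrame({
    "budget": [10_000_000.0, 50_000_000.0],
    "revenue": [30_000_000.0, 40_000_000.0],
    "overview": ["A heist goes wrong.", None],
    "release_date": pd.to_datetime(["1987-06-12", "2004-11-05"]),
})
features = add_engineered_features(sample)
print(features[["profit", "profit_ratio", "overview_length", "decade"]])
```

Note that `profit` and `profit_ratio` are computed from `revenue`, so they carry the target itself into the feature set.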

### 🔹 Categorical Encoding

- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid the dummy variable trap).
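As a rough sketch of this encoding step, `pandas.get_dummies` with `drop_first=True` drops the first category of each column (in sorted order), which becomes the implicit reference level. The `encode_categoricals` helper and the sample rows are made up for illustration.

```python
import pandas as pd


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Map the adult flag to 1/0 and one-hot encode language and status."""
    out = df.copy()
    # Any value other than the strings "True"/"False" becomes NaN.
    out["adult"] = out["adult"].map({"True": 1, "False": 0})
    return pd.get_dummies(
        out,
        columns=["original_language", "status"],
        drop_first=True,  # drop one level per column as the reference
        dtype=int,
    )


# Made-up rows for illustration.
sample = pd.DataFrame({
    "adult": ["False", "True", "False"],
    "original_language": ["en", "fr", "es"],
    "status": ["Released", "Released", "Rumored"],
})
encoded = encode_categoricals(sample)
print(encoded)
```

In this sample, `en` and `Released` are the dropped reference levels, so a row with zeros in every `original_language_...` column is an English-language film.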

### 🔹 Scaling Numerical Features

Used `StandardScaler` to standardize numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,
  `popularity`, `profit`, `profit_ratio`, `overview_length`

Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
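A minimal sketch of the standardization step, on a made-up frame with only three of the columns. `StandardScaler` computes `z = (x - mean) / std` per column using the population standard deviation (`ddof=0`), which the `manual` line reproduces. Fitting the scaler on the whole frame, as done here, is a simplification; fitting on the training split only is the usual way to avoid leaking test statistics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_cols = ["budget", "runtime", "vote_average"]  # subset, for brevity

# Made-up values for illustration.
movies = pd.DataFrame({
    "budget": [1e6, 5e6, 2e7, 1e8],
    "runtime": [90.0, 105.0, 120.0, 150.0],
    "vote_average": [5.5, 6.8, 7.2, 8.1],
})

scaler = StandardScaler()
scaled = movies.copy()
scaled[numeric_cols] = scaler.fit_transform(movies[numeric_cols])

# Same transformation written out by hand.
manual = (movies[numeric_cols] - movies[numeric_cols].mean()) / movies[
    numeric_cols
].std(ddof=0)

print(scaled.round(3))
```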

---

# 🧩 Part 5: Clustering & PCA (Unsupervised Learning)

### 🔹 K-Means Clustering

- Features used:
  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group`, which assigns each movie to one of 4 clusters.

Rough interpretation of clusters:
- Cluster 0: low-budget, low-revenue films
- Cluster 1: mid-range films
- Cluster 2: big-budget / blockbuster-style movies
- Cluster 3: more unusual / outlier-like cases
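The clustering step can be sketched as follows. The data here are synthetic blobs from `make_blobs` standing in for the six scaled clustering features, and `n_init=10` and `random_state=0` are assumed settings, not values taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six clustering features.
X_raw, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X_raw)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_group = kmeans.fit_predict(X)  # one label in {0, 1, 2, 3} per row

print(np.bincount(cluster_group))          # cluster sizes
print(kmeans.cluster_centers_.round(2))    # one 6-dimensional center per cluster
```

Cluster ids are arbitrary labels, so a reading such as "Cluster 2 = blockbusters" has to come from inspecting the cluster centers or per-cluster feature means, and can change between runs with different seeds.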

### 🔹 PCA for Visualization

- Applied **PCA (n_components=2)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.

This allowed visual inspection of:
- Cluster separation
- Overlaps
- Global structure in the data

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/f7yf-UcFtEc-JSdSqtGKa.png)
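A minimal sketch of the 2D projection, again on synthetic blobs rather than the real `cluster_features`. The blob labels stand in for `cluster_group` here, and plotting is left out to keep the block dependency-light.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in: 6 features, 4 groups (labels play the role of cluster_group).
X, cluster_group = make_blobs(
    n_samples=300, centers=4, n_features=6, random_state=0
)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)           # shape (n_samples, 2)
pca1, pca2 = coords[:, 0], coords[:, 1]

# Share of the total variance kept by each of the two components.
print(pca.explained_variance_ratio_)
```

With matplotlib installed, `plt.scatter(pca1, pca2, c=cluster_group)` would give a plot of the kind shown above. The explained variance ratio indicates how much structure the 2D view can actually show.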

### 🔹 Distance to Centroid (Outlier Feature)

Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.

Interpretation:
- Small distance → movie is “typical” for its cluster.
- Large distance → movie is an outlier within its cluster.

This feature was later used as an additional signal for modeling.
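One way to compute this feature, sketched on synthetic data: look up each row's assigned center in `cluster_centers_` and take the Euclidean norm of the difference. The blob data and the K-Means settings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled clustering features.
X, _ = make_blobs(n_samples=200, centers=4, n_features=6, random_state=1)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
cluster_group = kmeans.labels_

# Center assigned to each row, then the row-wise Euclidean distance to it.
assigned_centers = kmeans.cluster_centers_[cluster_group]
distance_to_centroid = np.linalg.norm(X - assigned_centers, axis=1)

# Rows with the largest distances are the least typical for their cluster.
most_atypical = np.argsort(distance_to_centroid)[-5:]
print(distance_to_centroid[most_atypical].round(2))
```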

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/aFktxtXzdNarGtb5eDR2h.png)

---

# 🧱 Part 6: Advanced Regression (With Engineered Features)

### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.

### 🔹 Final Feature Set

Included:

- Base numeric:
  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:
  `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:
  `cluster_group`, `distance_to_centroid`
- One-Hot columns:
  All `original_language_...` and `status_...`

### 🔹 Models Trained

- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**

### 📊 Regression Results (With Engineered Features)

| Model             | MAE           | RMSE          | R²         |
|-------------------|---------------|---------------|------------|
| Linear Regression | ~0 (leakage)  | ~0            | **1.00**   |
| Random Forest     | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |

📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage**: `profit = revenue - budget`, so `revenue` can be reconstructed exactly as `budget + profit`, and `profit_ratio` is derived from `revenue` as well.
- **Random Forest** and **Gradient Boosting** were trained on the same feature set, so their scores are inflated by the same leakage. Comparing the two with each other is still informative, but neither score shows how well revenue can be predicted without revenue-derived features.
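The leakage effect is easy to reproduce on made-up data. In this sketch (synthetic numbers, in-sample fit for brevity), a linear model given `budget` and `profit` recovers `revenue = budget + profit` exactly, while the same model given only `budget` does much worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: revenue is a noisy multiple of budget.
rng = np.random.default_rng(0)
n = 300
budget = rng.uniform(1e6, 2e8, n)
revenue = budget * rng.lognormal(mean=0.5, sigma=1.0, size=n)
profit = revenue - budget  # derived from the target, hence leaky

X_leaky = np.column_stack([budget, profit])
X_clean = budget.reshape(-1, 1)

leaky = LinearRegression().fit(X_leaky, revenue)
clean = LinearRegression().fit(X_clean, revenue)

r2_leaky = leaky.score(X_leaky, revenue)  # R² with the leaky feature
r2_clean = clean.score(X_clean, revenue)  # R² with budget only

print("coefficients with profit:", leaky.coef_)
print("R² with profit:", r2_leaky, "| R² without:", r2_clean)
```

The fitted coefficients on `budget` and `profit` both come out at about 1, which is the identity `revenue = budget + profit` rather than anything learned about movies. Dropping `profit`, `profit_ratio`, and features built from them would be one way to get an honest estimate.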

### 🏆 Regression Winner

🔥 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE (Random Forest has the lower MAE)
- Best at capturing non-linear relationships

---

# 🧱 Part 7: Turning Regression into Classification

Instead of predicting the exact revenue, we converted the problem to a binary classification task:

- **Class 0:** revenue < median(revenue)
- **Class 1:** revenue ≥ median(revenue)
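The median split can be sketched in a couple of lines of pandas. The six revenue values are made up, and the variable names are illustrative.

```python
import pandas as pd

# Made-up revenues (in millions) for illustration.
revenue = pd.Series([5, 1, 9, 3, 7, 7], name="revenue")

threshold = revenue.median()
# Class 1 when revenue >= median, otherwise class 0.
high_revenue = (revenue >= threshold).astype(int)

print("median:", threshold)
print(high_revenue.value_counts().sort_index())
```

Values exactly equal to the median fall into class 1 under this rule, which is one reason the two classes can differ slightly in size.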

### 📊 Class Balance

```text
Class 1 (high revenue): 2687
Class 0 (low revenue): 2682
```

### 📊 Classification Results

#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**

#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**

#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
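As a quick consistency check on the numbers above, F1 is the harmonic mean of precision and recall, `F1 = 2PR / (P + R)`. The sketch below recomputes it from the reported, already rounded precision and recall, so it can only confirm agreement to three decimals. The `f1_from` helper is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results above.
reported = {
    "Logistic Regression": (0.984, 0.968, 0.976),
    "Random Forest": (0.988, 0.982, 0.985),
    "Gradient Boosting": (0.990, 0.990, 0.990),
}

recomputed = {
    name: round(f1_from(p, r), 3) for name, (p, r, _) in reported.items()
}

for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.3f}, recomputed {recomputed[name]:.3f}")
```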

---

## 🏆 Classification Winner
🔥 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance

---

## 📌 Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab

---

## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison

The strongest model in both the regression and classification tasks was **Gradient Boosting**, with the lowest RMSE and highest R² among the regressors and the highest accuracy, precision, recall, and F1 among the classifiers. The regression scores should be read with the data leakage note from Part 6 in mind.