---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# ๐ŸŽฌ Movie Revenue Prediction โ€” Full ML Pipeline

This project builds a complete machine learning workflow using real movie metadata.  
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models โ€” and full performance evaluation.

---

## ๐Ÿงช Part 0 โ€” Initial Research Questions (EDA)

Before any modeling, I asked a few basic questions about the dataset:

1๏ธโƒฃ **What is the relationship between budget and revenue?**  
- Hypothesis: Higher budget โ†’ higher revenue.  
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.

2๏ธโƒฃ **Is there a strong relationship between runtime and revenue?**  
- Hypothesis: Longer movies might earn more.  
- Result: No strong pattern. Most successful movies fall in a โ€œnormalโ€ runtime range (around 90โ€“150 minutes), but runtime alone does not explain revenue.

3๏ธโƒฃ **What are the most common original languages in the dataset?**  
- Result: English is by far the most common `original_language`, with a long tail of other languages (French, Spanish, Hindi, etc.).

These EDA steps helped build intuition before moving into modeling.

---

## ๐Ÿงช Main ML Research Questions

### **1๏ธโƒฃ Can we accurately predict a movieโ€™s revenue using metadata alone?**  
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.

### **2๏ธโƒฃ Which features have the strongest impact on movie revenue?**  
We explore the importance of:
- budget  
- vote counts & vote average  
- popularity  
- profit & profit ratio  
- release year & decade  
- cluster-based features (cluster_group, distance_to_centroid)

### **3๏ธโƒฃ Can we classify movies into โ€œhigh revenueโ€ vs. โ€œlow revenueโ€ groups effectively?**  
We convert revenue into a balanced binary target and apply classification models.

### **4๏ธโƒฃ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**  
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.

---

# ๐Ÿงฑ Part 1 โ€” Dataset & Basic Cleaning (Before Any Regression)

### ๐Ÿ”น 1. Loading the Data

- Dataset: `movies_metadata.csv` (from Kaggle)  
- Target variable: `revenue` (continuous)  

### ๐Ÿ”น 2. Basic Cleaning

- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
  - `budget == 0`
  - `revenue == 0`
  - `runtime == 0`

This produced a smaller but more reliable dataset.
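The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy frame standing in for `movies_metadata.csv`; the real notebook reads the Kaggle file with `pd.read_csv`, and the exact dtype handling may differ.

```python
import pandas as pd

# Toy stand-in for movies_metadata.csv (hypothetical values).
df = pd.DataFrame({
    "budget": ["100000", "0", "5000000"],
    "revenue": ["250000", "1000", "0"],
    "runtime": ["95", "110", "0"],
    "popularity": ["1.5", "2.0", "0.3"],
    "release_date": ["1995-10-30", "2001-05-18", "not-a-date"],
})

# Coerce string columns to numeric; unparseable values become NaN.
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date; invalid dates become NaT.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop clearly invalid rows (zero budget / revenue / runtime).
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)]
print(len(df))  # rows surviving the filter
```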

---

# ๐Ÿ“Š Part 2 โ€” Initial EDA (Before Any Model)

Key insights:

- **Budget vs Revenue**  
  - Positive trend: higher budgets *tend* to lead to higher revenue, but with big variability and outliers.
  ![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/BOkbMfLzBaHIxgj8nU7MF.png)

- **Runtime vs Revenue**  
  - No strong linear correlation. Being "very long" or "very short" does not guarantee success.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/NZQWe3X0kUNUXD3coeibM.png)

- **Original Language Distribution**  
  - English is by far the most common language; most of the dataset is dominated by English-language films.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/KCROsSBSS7zd9iQ2HIzjS.png)

These findings motivated the next steps: building a simple baseline model and then adding smarter features.

---

# ๐Ÿงช Part 3 โ€” Baseline Regression (Before Feature Engineering)

### ๐ŸŽฏ Goal  
Build a **simple baseline model** that predicts movie revenue using only a few basic features:

- `budget`
- `runtime`
- `vote_average`
- `vote_count`

### โš™๏ธ Model

- **Linear Regression** on the 4 basic features.
- Train/Test split: 80% train / 20% test.
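A minimal sketch of this baseline, using synthetic stand-ins for the four features (the real notebook uses the cleaned dataset columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the four baseline features.
budget = rng.uniform(1e6, 2e8, n)
runtime = rng.uniform(80, 180, n)
vote_average = rng.uniform(4, 9, n)
vote_count = rng.uniform(10, 10000, n)
X = np.column_stack([budget, runtime, vote_average, vote_count])
# Revenue loosely tied to budget plus noise, mimicking the observed trend.
y = 2.5 * budget + 1e4 * vote_count + rng.normal(0, 2e7, n)

# 80% train / 20% test split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  R2={r2:.3f}")
```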

### ๐Ÿ“Š Baseline Regression Results

Using only the basic features:

- **MAE โ‰ˆ 45,652,741**
- **RMSE โ‰ˆ 79,524,121**
- **Rยฒ โ‰ˆ 0.715**

๐Ÿ“Œ **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions) show there is still a lot of noise and missing information โ€” which is expected in movie revenue prediction.

This baseline serves as a reference point before introducing engineered features.

---

# ๐Ÿงฑ Part 4 โ€” Feature Engineering (Upgrading the Dataset)

To improve model performance, several new features were engineered:

### ๐Ÿ”น New Numeric Features

- `profit = revenue - budget`  
- `profit_ratio = profit / budget`  
- `overview_length` = length of the movie overview text  
- `release_year` = year extracted from `release_date`  
- `decade` = grouped release year by decade (e.g., 1980, 1990, 2000)

### ๐Ÿ”น Categorical Encoding

- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid dummy variable trap).

### ๐Ÿ”น Scaling Numerical Features

Used `StandardScaler` to standardize numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,  
  `popularity`, `profit`, `profit_ratio`, `overview_length`

Each feature was transformed to have:
- mean โ‰ˆ 0  
- standard deviation โ‰ˆ 1  
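The encoding and scaling steps can be sketched like this (toy values; the real frame has more columns, and `status` is encoded the same way as `original_language`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "budget": [1e6, 5e7, 2e8, 3e7],
    "runtime": [90, 120, 150, 100],
    "adult": ["True", "False", "False", "False"],
    "original_language": ["en", "fr", "en", "hi"],
})

# Map the string flag to 1/0.
df["adult"] = (df["adult"] == "True").astype(int)

# One-hot encode with drop_first=True to avoid the dummy variable trap.
df = pd.get_dummies(df, columns=["original_language"], drop_first=True)

# Standardize numeric columns to mean ~0 and std ~1.
num_cols = ["budget", "runtime"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(round(df["budget"].mean(), 6), round(df["budget"].std(ddof=0), 6))
```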

---

# ๐Ÿงฉ Part 5 โ€” Clustering & PCA (Unsupervised Learning)

### ๐Ÿ”น K-Means Clustering

- Features used:  
  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group` โ€” each movie assigned to one of 4 clusters.

Rough interpretation of clusters:
- Cluster 0 โ€” low-budget, low-revenue films  
- Cluster 1 โ€” mid-range films  
- Cluster 2 โ€” big-budget / blockbuster-style movies  
- Cluster 3 โ€” more unusual / outlier-like cases  
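A minimal K-Means sketch on synthetic stand-ins for the scaled cluster features (four well-separated groups here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for the scaled 6-column feature matrix.
X = np.vstack([
    rng.normal(0, 0.3, (50, 6)),    # e.g. low-budget-like group
    rng.normal(3, 0.3, (50, 6)),    # e.g. blockbuster-like group
    rng.normal(-3, 0.3, (50, 6)),
    rng.normal(6, 0.3, (50, 6)),
])

km = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_group = km.fit_predict(X)  # one cluster label per movie
print(np.unique(cluster_group))
```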

### ๐Ÿ”น PCA for Visualization

- Applied **PCA (n_components=2)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.

This allowed visual inspection of:
- Cluster separation  
- Overlaps  
- Global structure in the data  
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/f7yf-UcFtEc-JSdSqtGKa.png)
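The PCA projection itself is a two-liner; a sketch on random stand-in data (the real notebook fits it on `cluster_features` and then scatter-plots `pca1` vs `pca2` colored by `cluster_group`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (100, 6))  # stand-in for the scaled cluster features

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # columns become pca1 and pca2
pca1, pca2 = coords[:, 0], coords[:, 1]
print(coords.shape)  # (100, 2)
```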

### ๐Ÿ”น Distance to Centroid (Outlier Feature)

Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.

Interpretation:
- Small distance โ†’ movie is โ€œtypicalโ€ for its cluster.  
- Large distance โ†’ movie is an outlier within its cluster.

This feature was later used as an additional signal for modeling.
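Computing the distance-to-centroid feature can be sketched as below, assuming a fitted K-Means model (synthetic data; two clusters for brevity):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(5, 1, (60, 4))])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Euclidean distance from each point to its own cluster center.
distance_to_centroid = np.linalg.norm(
    X - km.cluster_centers_[labels], axis=1)
print(distance_to_centroid.shape)  # (120,)
```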

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/aFktxtXzdNarGtb5eDR2h.png)

---

# ๐Ÿงฑ Part 6 โ€” Advanced Regression (With Engineered Features)

### ๐ŸŽฏ Goal  
Use the engineered features + clustering-based features to improve regression performance.

### ๐Ÿ”น Final Feature Set

Included:

- Base numeric:  
  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:  
  `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:  
  `cluster_group`, `distance_to_centroid`
- One-Hot columns:  
  All `original_language_...` and `status_...`

### ๐Ÿ”น Models Trained

- **Linear Regression** (on the enriched feature set)  
- **Random Forest Regressor**  
- **Gradient Boosting Regressor**
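A sketch of the tree-based comparison on synthetic data with a deliberately non-linear target (the real run uses the enriched feature matrix; default hyperparameters assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
n = 400
X = rng.normal(0, 1, (n, 8))  # stand-in for the enriched feature matrix
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 0.3, n)  # non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for model in (RandomForestRegressor(random_state=42),
              GradientBoostingRegressor(random_state=42)):
    model.fit(X_tr, y_tr)
    results[type(model).__name__] = r2_score(y_te, model.predict(X_te))
print({k: round(v, 3) for k, v in results.items()})
```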

### ๐Ÿ“Š Regression Results (With Engineered Features)

| Model              | MAE        | RMSE       | Rยฒ       |
|--------------------|------------|------------|----------|
| Linear Regression  | ~0 (leakage) | ~0       | **1.00** |
| Random Forest      | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting  | **2,255,268** | **5,199,504** | **0.9988** |

๐Ÿ“Œ Note:  
- The **Linear Regression** result is unrealistically perfect due to **data leakage** (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.

### ๐Ÿ† Regression Winner

๐Ÿ”ฅ **Gradient Boosting Regressor**
- Highest Rยฒ  
- Lowest RMSE  
- Best at capturing non-linear relationships  

---

# ๐Ÿงฑ Part 7 โ€” Turning Regression into Classification

Instead of predicting the exact revenue, we converted the problem to a binary classification task:

- **Class 0:** revenue < median(revenue)  
- **Class 1:** revenue โ‰ฅ median(revenue)
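A median split like this yields a near-50/50 target by construction; sketched on a skewed synthetic revenue series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
revenue = pd.Series(rng.lognormal(17, 1.5, 1000))  # skewed, revenue-like

# Class 1 if revenue >= median, else Class 0.
high_revenue = (revenue >= revenue.median()).astype(int)
print(high_revenue.sum(), len(high_revenue) - high_revenue.sum())
```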

### ๐Ÿ“Š Class Balance

```text
Class 1 (high revenue): 2687
Class 0 (low revenue):  2682
```

### ๐Ÿ“Š Classification Results

#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**

#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**

#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
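The four metrics above come straight from sklearn; a sketch of the evaluation loop for the winning classifier, on a synthetic binary task (default hyperparameters assumed):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

rng = np.random.default_rng(6)
n = 600
X = rng.normal(0, 1, (n, 6))
# Mostly separable synthetic target standing in for the median-split label.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
print({k: round(v, 3) for k, v in metrics.items()})
```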

---

## ๐Ÿ† Classification Winner  
๐Ÿ”ฅ **Gradient Boosting Classifier**  
- Highest accuracy  
- Balanced precision & recall  
- Best overall performance  

---

## ๐Ÿ“Œ Tools Used
- Python  
- pandas / numpy  
- scikit-learn  
- seaborn / matplotlib  
- Google Colab  

---

## ๐ŸŽฏ Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing  
- Feature engineering  
- K-Means clustering  
- PCA visualization  
- Regression models  
- Classification models  
- Full evaluation and comparison  

The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best overall performance of the models compared.

---

๐ŸŽฅ Watch the full project here:

https://www.loom.com/share/303dfe317514455db992438357cf8cb4