---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# 🎬 Movie Revenue Prediction – Full ML Pipeline
This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models – and full performance evaluation.
---
## 🧪 Part 0 – Initial Research Questions (EDA)
Before any modeling, I asked a few basic questions about the dataset:
1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.
2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a “normal” runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates by far as the main `original_language`, with a long tail of other languages (French, Spanish, Hindi, etc.).
These EDA steps helped build intuition before moving into modeling.
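The first two questions boil down to simple correlations. A minimal sketch on toy data (standing in for the cleaned dataset; the column names match `movies_metadata.csv`):

```python
import pandas as pd

# Toy rows standing in for the cleaned movies_metadata.csv columns.
df = pd.DataFrame({
    "budget":  [10e6, 50e6, 100e6, 200e6, 5e6],
    "revenue": [30e6, 120e6, 250e6, 700e6, 2e6],
    "runtime": [95, 110, 130, 150, 88],
})

# Question 1: budget vs revenue (Pearson correlation).
print(df["budget"].corr(df["revenue"]))

# Question 2: runtime vs revenue.
print(df["runtime"].corr(df["revenue"]))
```

On the real dataset, the budget–revenue correlation is strongly positive while runtime adds little on its own.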
---
## 🧪 Main ML Research Questions
### **1️⃣ Can we accurately predict a movie's revenue using metadata alone?**
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
### **2️⃣ Which features have the strongest impact on movie revenue?**
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)
### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
We convert revenue into a balanced binary target and apply classification models.
### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
---
# 🧱 Part 1 – Dataset & Basic Cleaning (Before Any Regression)
### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)
### 🔹 2. Basic Cleaning
- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
- `budget == 0`
- `revenue == 0`
- `runtime == 0`
This produced a smaller but more reliable dataset.
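The cleaning steps above can be sketched with pandas; a toy frame stands in for the raw Kaggle CSV (column names as in the real file):

```python
import pandas as pd

# Toy rows standing in for the raw movies_metadata.csv.
raw = pd.DataFrame({
    "budget":  ["100000", "0", "not_a_number"],
    "revenue": ["500000", "250000", "0"],
    "runtime": ["120", "0", "95"],
    "release_date": ["1995-10-30", "bad-date", "2001-07-20"],
})

# 1. Coerce string columns to numeric (bad values become NaN).
for col in ["budget", "revenue", "runtime"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

# 2. Parse release_date as a datetime.
raw["release_date"] = pd.to_datetime(raw["release_date"], errors="coerce")

# 3. Drop clearly invalid rows (zero or unparseable budget/revenue/runtime).
clean = raw.dropna(subset=["budget", "revenue", "runtime"])
clean = clean[(clean["budget"] > 0) & (clean["revenue"] > 0) & (clean["runtime"] > 0)]
print(len(clean))  # only the first toy row survives
```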
---
# 📊 Part 2 – Initial EDA (Before Any Model)
Key insights:
- **Budget vs Revenue**
- Positive trend: higher budgets *tend* to lead to higher revenue, but with big variability and outliers.

- **Runtime vs Revenue**
- No strong linear correlation. Being "very long" or "very short" does not guarantee success.

- **Original Language Distribution**
- English is by far the most common language; most of the dataset is dominated by English-language films.

These findings motivated the next steps: building a simple baseline model and then adding smarter features.
---
# 🧪 Part 3 – Baseline Regression (Before Feature Engineering)
### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`
### ⚙️ Model
- **Linear Regression** on the 4 basic features.
- Train/Test split: 80% train / 20% test.
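This baseline setup can be sketched in a few lines; synthetic data stands in for the four movie features here, so the numbers below are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for budget, runtime, vote_average, vote_count.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),   # budget
    rng.uniform(80, 180, n),    # runtime
    rng.uniform(4, 9, n),       # vote_average
    rng.uniform(10, 1e4, n),    # vote_count
])
y = 2.5 * X[:, 0] + 1e4 * X[:, 3] + rng.normal(0, 1e7, n)  # toy revenue

# 80/20 split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("R²:", r2_score(y_test, pred))
```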
### 📊 Baseline Regression Results
Using only the basic features:
- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**
🔍 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions of dollars) show there is still a lot of noise and missing information – which is expected in movie revenue prediction.
This baseline serves as a reference point before introducing engineered features.
---
# 🧱 Part 4 – Feature Engineering (Upgrading the Dataset)
To improve model performance, several new features were engineered:
### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = grouped release year by decade (e.g., 1980, 1990, 2000)
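These engineered columns are each a one-liner in pandas; a sketch on two toy rows (assuming the cleaned frame from Part 1):

```python
import pandas as pd

# Two toy rows standing in for the cleaned dataset.
df = pd.DataFrame({
    "budget":  [50e6, 10e6],
    "revenue": [200e6, 5e6],
    "overview": ["A heist goes wrong.", "Two friends travel."],
    "release_date": pd.to_datetime(["1999-03-31", "2012-07-20"]),
})

df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]
df["overview_length"] = df["overview"].str.len()
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10   # e.g. 1999 -> 1990
print(df[["profit", "profit_ratio", "release_year", "decade"]])
```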
### 🔹 Categorical Encoding
- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid the dummy-variable trap).
### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,
`popularity`, `profit`, `profit_ratio`, `overview_length`
Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
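A sketch of the encoding and scaling steps on a toy frame (column names as in the project; only two numeric columns are scaled here for brevity):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "adult": ["True", "False", "False", "True"],
    "original_language": ["en", "fr", "en", "hi"],
    "budget": [10e6, 50e6, 100e6, 200e6],
    "runtime": [90, 110, 130, 150],
})

# Binary-encode `adult`.
df["adult"] = (df["adult"] == "True").astype(int)

# One-hot encode with drop_first=True to avoid the dummy-variable trap.
df = pd.get_dummies(df, columns=["original_language"], drop_first=True)

# Standardize the numeric columns so each has mean ~0 and std ~1.
scaler = StandardScaler()
df[["budget", "runtime"]] = scaler.fit_transform(df[["budget", "runtime"]])
print(df.columns.tolist())
```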
---
# 🧩 Part 5 – Clustering & PCA (Unsupervised Learning)
### 🔹 K-Means Clustering
- Features used:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group` – each movie assigned to one of 4 clusters.
Rough interpretation of clusters:
- Cluster 0 → low-budget, low-revenue films
- Cluster 1 → mid-range films
- Cluster 2 → big-budget / blockbuster-style movies
- Cluster 3 → more unusual / outlier-like cases
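The clustering step itself is short; a sketch on synthetic 2-D blobs (the real run uses the six scaled movie features listed above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four well-separated synthetic blobs standing in for the movie features.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0,  scale=0.3, size=(50, 2)),
    rng.normal(loc=5,  scale=0.3, size=(50, 2)),
    rng.normal(loc=10, scale=0.3, size=(50, 2)),
    rng.normal(loc=15, scale=0.3, size=(50, 2)),
])

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
cluster_group = km.labels_           # the new `cluster_group` feature
print(np.bincount(cluster_group))    # cluster sizes
```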
### 🔹 PCA for Visualization
- Applied **PCA (n_components=2)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.
This allowed visual inspection of:
- Cluster separation
- Overlaps
- Global structure in the data
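The projection step can be sketched in two lines; random stand-in features are used here, where the real pipeline would pass the scaled `cluster_features`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))   # stands in for the 6 scaled cluster features

# Reduce to two components for 2-D plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
pca1, pca2 = coords[:, 0], coords[:, 1]
print(coords.shape)  # (200, 2)
```

In the notebook, `pca1`/`pca2` are then scattered with the point color set by `cluster_group`.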

### 🔹 Distance to Centroid (Outlier Feature)
Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.
Interpretation:
- Small distance → the movie is “typical” for its cluster.
- Large distance → the movie is an outlier within its cluster.
This feature was later used as an additional signal for modeling.
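The distance feature is one `numpy` call once each point's own centroid is looked up; a sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic clusters standing in for the movie feature space.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(8, 1, (100, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
own_center = km.cluster_centers_[km.labels_]            # each row's own centroid
distance_to_centroid = np.linalg.norm(X - own_center, axis=1)

# Large distances flag within-cluster outliers.
print(distance_to_centroid.mean(), distance_to_centroid.max())
```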

---
# 🧱 Part 6 – Advanced Regression (With Engineered Features)
### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.
### 🔹 Final Feature Set
Included:
- Base numeric:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:
`profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:
`cluster_group`, `distance_to_centroid`
- One-Hot columns:
All `original_language_...` and `status_...`
### 🔹 Models Trained
- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**
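The tree-model comparison can be sketched as follows; synthetic non-linear data stands in for the enriched feature set, so the scores are illustrative, not the project's numbers:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with non-linear structure (plus irrelevant columns).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(600, 5))
y = 10 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scores = {}
for model in (RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = r2_score(y_te, model.predict(X_te))
print(scores)  # both ensembles capture the non-linearities
```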
### 📊 Regression Results (With Engineered Features)
| Model | MAE | RMSE | R² |
|--------------------|------------|------------|----------|
| Linear Regression | ~0 (leakage) | ~0 | **1.00** |
| Random Forest | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |
📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage** (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.
### 🏆 Regression Winner
🥇 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships
---
# 🧱 Part 7 – Turning Regression into Classification
Instead of predicting the exact revenue, we converted the problem into a binary classification task:
- **Class 0:** revenue < median(revenue)
- **Class 1:** revenue ≥ median(revenue)
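The median split is a one-liner; a sketch with hypothetical revenue values:

```python
import numpy as np

# Hypothetical revenues; a median split yields (near-)balanced classes.
revenue = np.array([4e5, 1e6, 5e6, 2e7, 8e7, 3e8])
high_revenue = (revenue >= np.median(revenue)).astype(int)
print(high_revenue.tolist())  # [0, 0, 0, 1, 1, 1]
```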
### 📊 Class Balance
```text
Class 1 (high revenue): 2687
Class 0 (low revenue):  2682
```
### 📊 Classification Results
#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**
#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**
#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
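The evaluation loop behind these tables can be sketched on a synthetic balanced binary problem (the scores below come from the stand-in data, not the movie dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic balanced binary classification problem.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(name, round(fn(y_te, pred), 3))
```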
---
## 🏆 Classification Winner
🥇 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance
---
## 🛠 Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab
---
## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison
The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best performance of the models compared here.
---
🎥 Watch the full project here:
https://www.loom.com/share/303dfe317514455db992438357cf8cb4