---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# 🎬 Movie Revenue Prediction – Full ML Pipeline
This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models – and full performance evaluation.
---
## 🧪 Part 0 – Initial Research Questions (EDA)
Before any modeling, I asked a few basic questions about the dataset:
1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.
2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a “normal” runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates by far as the main `original_language`, with a long tail of other languages (French, Spanish, Hindi, etc.).
These EDA steps helped build intuition before moving into modeling.
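The first two questions boil down to simple correlations. A minimal sketch on toy data (standing in for the cleaned dataset; the column names match `movies_metadata.csv`):

```python
import pandas as pd

# Toy rows standing in for the cleaned movies_metadata.csv columns.
df = pd.DataFrame({
    "budget":  [10e6, 50e6, 100e6, 200e6, 5e6],
    "revenue": [30e6, 120e6, 250e6, 700e6, 2e6],
    "runtime": [95, 110, 130, 150, 88],
})

# Question 1: budget vs revenue (Pearson correlation).
print(df["budget"].corr(df["revenue"]))

# Question 2: runtime vs revenue.
print(df["runtime"].corr(df["revenue"]))
```

On the real dataset, the budget–revenue correlation is strongly positive while runtime adds little on its own.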
---
## 🧪 Main ML Research Questions
### **1️⃣ Can we accurately predict a movie's revenue using metadata alone?**
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
### **2️⃣ Which features have the strongest impact on movie revenue?**
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)
### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
We convert revenue into a balanced binary target and apply classification models.
### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
---
# 🧱 Part 1 – Dataset & Basic Cleaning (Before Any Regression)
### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)
### 🔹 2. Basic Cleaning
- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
- `budget == 0`
- `revenue == 0`
- `runtime == 0`
This produced a smaller but more reliable dataset.
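The cleaning steps above can be sketched with pandas; a toy frame stands in for the raw Kaggle CSV (column names as in the real file):

```python
import pandas as pd

# Toy rows standing in for the raw movies_metadata.csv.
raw = pd.DataFrame({
    "budget":  ["100000", "0", "not_a_number"],
    "revenue": ["500000", "250000", "0"],
    "runtime": ["120", "0", "95"],
    "release_date": ["1995-10-30", "bad-date", "2001-07-20"],
})

# 1. Coerce string columns to numeric (bad values become NaN).
for col in ["budget", "revenue", "runtime"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

# 2. Parse release_date as a datetime.
raw["release_date"] = pd.to_datetime(raw["release_date"], errors="coerce")

# 3. Drop clearly invalid rows (zero or unparseable budget/revenue/runtime).
clean = raw.dropna(subset=["budget", "revenue", "runtime"])
clean = clean[(clean["budget"] > 0) & (clean["revenue"] > 0) & (clean["runtime"] > 0)]
print(len(clean))  # only the first toy row survives
```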
---
# 📊 Part 2 – Initial EDA (Before Any Model)
Key insights:
- **Budget vs Revenue**
- Positive trend: higher budgets *tend* to lead to higher revenue, but with big variability and outliers.

- **Runtime vs Revenue**
- No strong linear correlation. Being "very long" or "very short" does not guarantee success.

- **Original Language Distribution**
- English is by far the most common language; most of the dataset is dominated by English-language films.

These findings motivated the next steps: building a simple baseline model and then adding smarter features.
---
# 🧪 Part 3 – Baseline Regression (Before Feature Engineering)
### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`
### ⚙️ Model
- **Linear Regression** on the 4 basic features.
- Train/Test split: 80% train / 20% test.
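This baseline setup can be sketched in a few lines; synthetic data stands in for the four movie features here, so the numbers below are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for budget, runtime, vote_average, vote_count.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),   # budget
    rng.uniform(80, 180, n),    # runtime
    rng.uniform(4, 9, n),       # vote_average
    rng.uniform(10, 1e4, n),    # vote_count
])
y = 2.5 * X[:, 0] + 1e4 * X[:, 3] + rng.normal(0, 1e7, n)  # toy revenue

# 80/20 split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("R²:", r2_score(y_test, pred))
```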
### 📊 Baseline Regression Results
Using only the basic features:
- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**
🔍 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions of dollars) show there is still a lot of noise and missing information – which is expected in movie revenue prediction.
This baseline serves as a reference point before introducing engineered features.
---
# 🧱 Part 4 – Feature Engineering (Upgrading the Dataset)
To improve model performance, several new features were engineered:
### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = grouped release year by decade (e.g., 1980, 1990, 2000)
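These engineered columns are each a one-liner in pandas; a sketch on two toy rows (assuming the cleaned frame from Part 1):

```python
import pandas as pd

# Two toy rows standing in for the cleaned dataset.
df = pd.DataFrame({
    "budget":  [50e6, 10e6],
    "revenue": [200e6, 5e6],
    "overview": ["A heist goes wrong.", "Two friends travel."],
    "release_date": pd.to_datetime(["1999-03-31", "2012-07-20"]),
})

df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]
df["overview_length"] = df["overview"].str.len()
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10   # e.g. 1999 -> 1990
print(df[["profit", "profit_ratio", "release_year", "decade"]])
```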
### 🔹 Categorical Encoding
- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid the dummy-variable trap).
### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,
`popularity`, `profit`, `profit_ratio`, `overview_length`
Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
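A sketch of the encoding and scaling steps on a toy frame (column names as in the project; only two numeric columns are scaled here for brevity):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "adult": ["True", "False", "False", "True"],
    "original_language": ["en", "fr", "en", "hi"],
    "budget": [10e6, 50e6, 100e6, 200e6],
    "runtime": [90, 110, 130, 150],
})

# Binary-encode `adult`.
df["adult"] = (df["adult"] == "True").astype(int)

# One-hot encode with drop_first=True to avoid the dummy-variable trap.
df = pd.get_dummies(df, columns=["original_language"], drop_first=True)

# Standardize the numeric columns so each has mean ~0 and std ~1.
scaler = StandardScaler()
df[["budget", "runtime"]] = scaler.fit_transform(df[["budget", "runtime"]])
print(df.columns.tolist())
```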
---
# 🧩 Part 5 – Clustering & PCA (Unsupervised Learning)
### 🔹 K-Means Clustering
- Features used:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group` – each movie assigned to one of 4 clusters.
Rough interpretation of clusters:
- Cluster 0 → low-budget, low-revenue films
- Cluster 1 → mid-range films
- Cluster 2 → big-budget / blockbuster-style movies
- Cluster 3 → more unusual / outlier-like cases
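The clustering step itself is short; a sketch on synthetic 2-D blobs (the real run uses the six scaled movie features listed above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four well-separated synthetic blobs standing in for the movie features.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0,  scale=0.3, size=(50, 2)),
    rng.normal(loc=5,  scale=0.3, size=(50, 2)),
    rng.normal(loc=10, scale=0.3, size=(50, 2)),
    rng.normal(loc=15, scale=0.3, size=(50, 2)),
])

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
cluster_group = km.labels_           # the new `cluster_group` feature
print(np.bincount(cluster_group))    # cluster sizes
```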
### 🔹 PCA for Visualization
- Applied **PCA (n_components=2)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.
This allowed visual inspection of:
- Cluster separation
- Overlaps
- Global structure in the data
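The projection step can be sketched in two lines; random stand-in features are used here, where the real pipeline would pass the scaled `cluster_features`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))   # stands in for the 6 scaled cluster features

# Reduce to two components for 2-D plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
pca1, pca2 = coords[:, 0], coords[:, 1]
print(coords.shape)  # (200, 2)
```

In the notebook, `pca1`/`pca2` are then scattered with the point color set by `cluster_group`.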

### 🔹 Distance to Centroid (Outlier Feature)
Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.
Interpretation:
- Small distance → the movie is “typical” for its cluster.
- Large distance → the movie is an outlier within its cluster.
This feature was later used as an additional signal for modeling.
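The distance feature is one `numpy` call once each point's own centroid is looked up; a sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic clusters standing in for the movie feature space.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(8, 1, (100, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
own_center = km.cluster_centers_[km.labels_]            # each row's own centroid
distance_to_centroid = np.linalg.norm(X - own_center, axis=1)

# Large distances flag within-cluster outliers.
print(distance_to_centroid.mean(), distance_to_centroid.max())
```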

---
# 🧱 Part 6 – Advanced Regression (With Engineered Features)
### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.
### 🔹 Final Feature Set
Included:
- Base numeric:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:
`profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:
`cluster_group`, `distance_to_centroid`
- One-Hot columns:
All `original_language_...` and `status_...`
### 🔹 Models Trained
- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**
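The tree-model comparison can be sketched as follows; synthetic non-linear data stands in for the enriched feature set, so the scores are illustrative, not the project's numbers:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with non-linear structure (plus irrelevant columns).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(600, 5))
y = 10 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scores = {}
for model in (RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = r2_score(y_te, model.predict(X_te))
print(scores)  # both ensembles capture the non-linearities
```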
### 📊 Regression Results (With Engineered Features)
| Model | MAE | RMSE | R² |
|--------------------|------------|------------|----------|
| Linear Regression | ~0 (leakage) | ~0 | **1.00** |
| Random Forest | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |
📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage** (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.
### 🏆 Regression Winner
🥇 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships
---
# 🧱 Part 7 – Turning Regression into Classification
Instead of predicting the exact revenue, we converted the problem into a binary classification task:
- **Class 0:** revenue < median(revenue)
- **Class 1:** revenue ≥ median(revenue)
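The median split is a one-liner; a sketch with hypothetical revenue values:

```python
import numpy as np

# Hypothetical revenues; a median split yields (near-)balanced classes.
revenue = np.array([4e5, 1e6, 5e6, 2e7, 8e7, 3e8])
high_revenue = (revenue >= np.median(revenue)).astype(int)
print(high_revenue.tolist())  # [0, 0, 0, 1, 1, 1]
```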
### 📊 Class Balance
```text
Class 1 (high revenue): 2687
Class 0 (low revenue):  2682
```
### 📊 Classification Results
#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**
#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**
#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
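The evaluation loop behind these tables can be sketched on a synthetic balanced binary problem (the scores below come from the stand-in data, not the movie dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic balanced binary classification problem.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(name, round(fn(y_te, pred), 3))
```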
---
## 🏆 Classification Winner
🥇 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance
---
## 🛠 Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab
---
## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison
The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best performance of the models compared here.
---
🎥 Watch the full project here:
https://www.loom.com/share/303dfe317514455db992438357cf8cb4