---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# 🎬 Movie Revenue Prediction — Full ML Pipeline
This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models — and full performance evaluation.
---
## 🧪 Part 0 — Initial Research Questions (EDA)
Before any modeling, I asked a few basic questions about the dataset:
1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.
2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a “normal” runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates by far as the main original_language, with a long tail of other languages (French, Spanish, Hindi, etc.).
These EDA steps helped build intuition before moving into modeling.
---
## 🧪 Main ML Research Questions
### **1️⃣ Can we accurately predict a movie’s revenue using metadata alone?**
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
### **2️⃣ Which features have the strongest impact on movie revenue?**
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)
### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
We convert revenue into a balanced binary target and apply classification models.
### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
---
# 🧱 Part 1 — Dataset & Basic Cleaning (Before Any Regression)
### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)
### 🔹 2. Basic Cleaning
- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
- `budget == 0`
- `revenue == 0`
- `runtime == 0`
This produced a smaller but more reliable dataset.
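The cleaning steps above can be sketched as follows. This is a minimal illustration using a tiny in-memory frame in place of `movies_metadata.csv`, assuming the Kaggle column names:

```python
import pandas as pd

# Tiny stand-in for movies_metadata.csv (same column names).
df = pd.DataFrame({
    "budget": ["1000000", "0", "not_a_number"],
    "revenue": ["5000000", "2000000", "0"],
    "runtime": ["120", "0", "95"],
    "popularity": ["7.5", "1.2", "3.3"],
    "release_date": ["1999-03-31", "2005-07-15", "bad-date"],
})

# Coerce string columns to numeric; unparseable values become NaN.
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date; invalid dates become NaT.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop clearly invalid rows (zero budget/revenue/runtime or missing values).
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)].dropna()
print(len(df))  # only the first row survives
```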
---
# 📊 Part 2 — Initial EDA (Before Any Model)
Key insights:
- **Budget vs Revenue**
- Positive trend: higher budgets *tend* to lead to higher revenue, but with big variability and outliers.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/BOkbMfLzBaHIxgj8nU7MF.png)
- **Runtime vs Revenue**
- No strong linear correlation. Being "very long" or "very short" does not guarantee success.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/NZQWe3X0kUNUXD3coeibM.png)
- **Original Language Distribution**
- English is by far the most common original language; the dataset consists mostly of English-language films.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/KCROsSBSS7zd9iQ2HIzjS.png)
These findings motivated the next steps: building a simple baseline model and then adding smarter features.
---
# 🧪 Part 3 — Baseline Regression (Before Feature Engineering)
### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`
### ⚙️ Model
- **Linear Regression** on the 4 basic features.
- Train/Test split: 80% train / 20% test.
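A sketch of the baseline setup, with synthetic data standing in for the cleaned dataset (the real features are `budget`, `runtime`, `vote_average`, `vote_count`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the four basic features.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),   # budget
    rng.uniform(80, 180, n),    # runtime
    rng.uniform(4, 9, n),       # vote_average
    rng.uniform(10, 1e4, n),    # vote_count
])
y = 2.5 * X[:, 0] + rng.normal(0, 2e7, n)  # revenue loosely tied to budget

# 80% train / 20% test split, as in the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R²={r2:.3f}")
```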
### 📊 Baseline Regression Results
Using only the basic features:
- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**
📌 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions) show there is still a lot of noise and missing information — which is expected in movie revenue prediction.
This baseline serves as a reference point before introducing engineered features.
---
# 🧱 Part 4 — Feature Engineering (Upgrading the Dataset)
To improve model performance, several new features were engineered:
### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = grouped release year by decade (e.g., 1980, 1990, 2000)
### 🔹 Categorical Encoding
- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid dummy variable trap).
### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,
`popularity`, `profit`, `profit_ratio`, `overview_length`
Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
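The feature engineering and scaling steps above can be sketched like this, with a tiny frame standing in for the dataset (column names as used throughout this README):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "budget": [1e7, 5e7, 2e8],
    "revenue": [3e7, 1e8, 9e8],
    "overview": ["A short plot.", "Another plot here.", "Epic saga."],
    "release_date": pd.to_datetime(["1994-06-10", "2003-12-01", "2019-04-26"]),
    "adult": ["False", "True", "False"],
    "original_language": ["en", "fr", "en"],
})

# New numeric features
df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]
df["overview_length"] = df["overview"].str.len()
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10

# Categorical encoding
df["adult"] = (df["adult"] == "True").astype(int)
df = pd.get_dummies(df, columns=["original_language"], drop_first=True)

# Standardize numeric columns to mean ≈ 0, std ≈ 1
num_cols = ["budget", "profit", "profit_ratio", "overview_length"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df["decade"].tolist())  # [1990, 2000, 2010]
```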
---
# 🧩 Part 5 — Clustering & PCA (Unsupervised Learning)
### 🔹 K-Means Clustering
- Features used:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group` — each movie assigned to one of 4 clusters.
Rough interpretation of clusters:
- Cluster 0 — low-budget, low-revenue films
- Cluster 1 — mid-range films
- Cluster 2 — big-budget / blockbuster-style movies
- Cluster 3 — more unusual / outlier-like cases
### 🔹 PCA for Visualization
- Applied **PCA (n_components=2)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.
This allowed visual inspection of:
- Cluster separation
- Overlaps
- Global structure in the data
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/f7yf-UcFtEc-JSdSqtGKa.png)
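The clustering and PCA steps can be sketched as below, with synthetic stand-in features replacing the real six-column `cluster_features`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (budget, runtime, vote_average,
# vote_count, popularity, profit).
rng = np.random.default_rng(0)
cluster_features = rng.normal(size=(300, 6))
X = StandardScaler().fit_transform(cluster_features)

# K-Means with 4 clusters; labels become the cluster_group feature.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_group = kmeans.fit_predict(X)

# PCA down to 2 components for plotting (pca1, pca2).
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
pca1, pca2 = coords[:, 0], coords[:, 1]

print(sorted(set(cluster_group)))  # [0, 1, 2, 3]
print(coords.shape)                # (300, 2)
```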
### 🔹 Distance to Centroid (Outlier Feature)
Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.
Interpretation:
- Small distance → movie is “typical” for its cluster.
- Large distance → movie is an outlier within its cluster.
This feature was later used as an additional signal for modeling.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/aFktxtXzdNarGtb5eDR2h.png)
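The centroid-distance feature can be sketched as follows (2-D synthetic data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1)
labels = kmeans.fit_predict(X)

# Euclidean distance from each row to the centroid of its own cluster.
distance_to_centroid = np.linalg.norm(
    X - kmeans.cluster_centers_[labels], axis=1
)

# Larger distances flag within-cluster outliers.
print(distance_to_centroid.shape)  # (200,)
```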
---
# 🧱 Part 6 — Advanced Regression (With Engineered Features)
### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.
### 🔹 Final Feature Set
Included:
- Base numeric:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:
`profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:
`cluster_group`, `distance_to_centroid`
- One-Hot columns:
All `original_language_...` and `status_...`
### 🔹 Models Trained
- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**
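A sketch of training the two tree ensembles, with synthetic data standing in for the enriched feature matrix (in practice, leaky features like `profit = revenue - budget` should be excluded from the inputs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic features with a non-linear relationship to the target.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 8))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.3, 400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

scores = {}
for model in (
    RandomForestRegressor(n_estimators=100, random_state=7),
    GradientBoostingRegressor(random_state=7),
):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = r2_score(y_test, model.predict(X_test))

print({name: round(r2, 3) for name, r2 in scores.items()})
```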
### 📊 Regression Results (With Engineered Features)
| Model | MAE | RMSE | R² |
|--------------------|------------|------------|----------|
| Linear Regression | ~0 (leakage) | ~0 | **1.00** |
| Random Forest | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |
📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage** (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.
### 🏆 Regression Winner
🔥 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships
---
# 🧱 Part 7 — Turning Regression into Classification
Instead of predicting the exact revenue, we converted the problem to a binary classification task:
- **Class 0:** revenue < median(revenue)
- **Class 1:** revenue ≥ median(revenue)
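Splitting at the median yields a (near-)balanced binary target; a minimal sketch:

```python
import pandas as pd

# Toy revenue values standing in for the real column.
revenue = pd.Series([1e6, 5e6, 2e7, 8e7, 3e8, 9e8])

median_rev = revenue.median()
high_revenue = (revenue >= median_rev).astype(int)  # 1 = high, 0 = low

print(high_revenue.tolist())  # [0, 0, 0, 1, 1, 1]
```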
### 📊 Class Balance
```text
Class 1 (high revenue): 2687
Class 0 (low revenue): 2682
```
### 📊 Classification Results
#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**
#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**
#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
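The four metrics above can be computed as sketched below, on a synthetic binary problem standing in for the high/low-revenue task:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic binary classification data.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
clf = GradientBoostingClassifier(random_state=3).fit(X_train, y_train)
pred = clf.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
print({name: round(v, 3) for name, v in metrics.items()})
```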
---
## 🏆 Classification Winner
🔥 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance
---
## 📌 Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab
---
## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison
The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best performance of the models compared.
---
🎥 Watch the full project here:
https://www.loom.com/share/303dfe317514455db992438357cf8cb4